How to Run Baidu Unlimited-OCR Locally (All Methods)


Baidu Unlimited-OCR is a 3B-parameter, MIT-licensed OCR model that processes multi-page PDFs in a single pass. It’s free and runs locally, so your documents never leave your machine. This guide covers every way to run it, from the quickest setup (Ollama) to the highest-throughput serving (vLLM) to native Apple Silicon support (MLX).

Pick the method that matches your hardware and use case. I’ll cover all of them with working code.

Hardware Requirements

Before choosing a deployment method, know what you’re working with:

Setup                 | Minimum                     | Recommended
Full precision (BF16) | 12 GB VRAM GPU              | 24 GB VRAM (RTX 4090, A100)
GGUF Q8               | 8 GB RAM (CPU) or 8 GB VRAM | 16 GB RAM, modern CPU
GGUF Q4               | 6 GB RAM (CPU)              | 12 GB RAM
MLX 8-bit             | 8 GB Apple Silicon          | 16 GB Apple Silicon
NVFP4                 | 4 GB VRAM                   | 8 GB VRAM

The full model is 6.78 GB in BF16. Quantized versions are smaller: the GGUF files listed under Method 5 run from about 2.8 GB (Q3_K_M) to about 5.5 GB (Q8_0).
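To make the table easier to apply, here is a small lookup sketch. It is illustrative only: the function name `candidate_setups` and its arguments are my own, and the thresholds are simply the Minimum column copied from the table above.

```python
# Minimum requirements copied from the hardware table (GB).
# The names and structure are illustrative, not part of any Unlimited-OCR tooling.
MINIMUMS = [
    # (setup, kind of memory, minimum GB)
    ("Full precision (BF16)", "vram", 12),
    ("GGUF Q8", "vram", 8),
    ("GGUF Q8", "ram", 8),
    ("GGUF Q4", "ram", 6),
    ("MLX 8-bit", "unified", 8),
    ("NVFP4", "vram", 4),
]


def candidate_setups(vram_gb=0, ram_gb=0, unified_gb=0):
    """Return the setups whose minimum memory requirement is met."""
    available = {"vram": vram_gb, "ram": ram_gb, "unified": unified_gb}
    names = []
    for name, kind, minimum in MINIMUMS:
        if available[kind] >= minimum and name not in names:
            names.append(name)
    return names


# A desktop with an 8 GB GPU and 16 GB of system RAM
print(candidate_setups(vram_gb=8, ram_gb=16))
```

Meeting the minimum only means the model should load. The Recommended column is the better target if you plan to process long documents.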

Method 1: HuggingFace Transformers (Standard Python)

The most flexible option. Works anywhere Python runs with a capable GPU.

Installation

pip install transformers torch pillow accelerate

Basic Usage

from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

# Load model (downloads ~6.78 GB on first run)
model = AutoModel.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True
)

# Process a single image
image = Image.open("document.png")
inputs = tokenizer(
    images=image,
    text="<image>document parsing.",
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Multi-Page Processing

The key advantage of Unlimited-OCR: pass multiple pages in one call.

# Requires: pip install pdf2image, plus the Poppler utilities installed on your system
from pdf2image import convert_from_path

def process_pdf(model, tokenizer, pdf_path, max_pages=40):
    """Process a multi-page PDF in a single inference pass."""
    images = convert_from_path(pdf_path, dpi=150)

    if len(images) > max_pages:
        print(f"Warning: PDF has {len(images)} pages, processing first {max_pages}")
        images = images[:max_pages]

    # Pass all page images at once
    inputs = tokenizer(
        images=images,
        text="<image>" * len(images) + "document parsing.",
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=8192)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

text = process_pdf(model, tokenizer, "contract.pdf")
print(text)
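Note that process_pdf drops every page beyond max_pages. If you would rather keep them, one option is to split the page list into consecutive batches and call the model once per batch. The helper below is a sketch; `batch_pages` is a hypothetical name, and each batch is processed without the context of the others.

```python
def batch_pages(pages, max_pages=40):
    """Split a list of page images into consecutive batches of at most max_pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


# Stand-in page objects; in practice these would be the PIL images
# returned by pdf2image.convert_from_path().
pages = list(range(95))
batches = batch_pages(pages, max_pages=40)
print([len(b) for b in batches])  # [40, 40, 15]
```

Each batch can then go through the same tokenizer and generate calls as in process_pdf, with the per-batch results joined afterwards.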

When to Use This Method

  • You want full control over inference settings
  • You’re integrating into an existing Python pipeline
  • You’re doing research or experimentation
  • You need custom pre/post-processing

Method 2: vLLM (Production Serving)

vLLM gives you the highest throughput for serving Unlimited-OCR to multiple clients. It handles batching, memory management, and concurrent requests automatically.

Installation

pip install vllm

Start the Server

vllm serve baidu/Unlimited-OCR \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9

Send Requests

Once the server is running, send requests via the OpenAI-compatible API:

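import base64
import mimetypes
import requests

def ocr_with_vllm(image_path, server_url="http://localhost:8000"):
    """Send an image to the vLLM server for OCR."""
    # Match the data-URL MIME type to the file instead of assuming PNG
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "model": "baidu/Unlimited-OCR",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{image_b64}"}},
                        {"type": "text", "text": "document parsing."}
                    ]
                }
            ],
            "max_tokens": 4096
        },
        timeout=300,  # long pages can take a while to generate
    )
    # Surface HTTP errors here rather than failing on a missing "choices" key
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]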

result = ocr_with_vllm("invoice.png")
print(result)
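The same endpoint accepts several image parts in one message, which is how you would send a multi-page document. The sketch below only builds the request body; `build_ocr_payload` is a hypothetical helper, and I'm assuming the server was started with a per-prompt image limit high enough for your page count (vLLM controls this with its --limit-mm-per-prompt option).

```python
import base64
import mimetypes


def build_ocr_payload(image_paths, prompt="document parsing.", max_tokens=4096):
    """Build a chat-completions request body with one image part per page."""
    content = []
    for path in image_paths:
        # Guess the MIME type from the file extension; fall back to PNG.
        mime = mimetypes.guess_type(path)[0] or "image/png"
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
        )
    # The text instruction goes last, after all page images.
    content.append({"type": "text", "text": prompt})
    return {
        "model": "baidu/Unlimited-OCR",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

Post the returned dictionary as the json argument to the same /v1/chat/completions URL used in ocr_with_vllm.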

When to Use This Method

  • You’re serving OCR to multiple clients or services
  • Throughput matters more than simplicity
  • You have a dedicated GPU server
  • You want an OpenAI-compatible endpoint

Method 3: Ollama (Easiest Setup)

Ollama is the fastest way to get running. One command to download, one command to run.

Installation

Install Ollama from ollama.com:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Pull and Run

# Pull the model
ollama pull unlimited-ocr

# Run interactively
ollama run unlimited-ocr

Use from Python

import ollama
import base64

def ocr_with_ollama(image_path):
    """Process a document image with Ollama."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = ollama.chat(
        model="unlimited-ocr",
        messages=[{
            "role": "user",
            "content": "document parsing.",
            "images": [image_data]
        }]
    )
    return response["message"]["content"]

result = ocr_with_ollama("receipt.jpg")
print(result)
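For occasional use you will often want to point this at a folder of scans. Here is a minimal sketch of a batch runner. The name `ocr_folder` and the list of image suffixes are my own choices, and the OCR function is passed in, so the same loop works with any of the methods in this guide.

```python
from pathlib import Path

# File types treated as document images (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_folder(folder, ocr_fn, skip_existing=True):
    """Run ocr_fn on every image in folder and write the text to a .md file beside it."""
    written = []
    for image_path in sorted(Path(folder).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        out_path = image_path.with_suffix(".md")
        if skip_existing and out_path.exists():
            continue  # already processed on an earlier run
        out_path.write_text(ocr_fn(str(image_path)), encoding="utf-8")
        written.append(out_path)
    return written


# Usage with the function defined above (requires a running Ollama daemon):
# ocr_folder("./scans", ocr_with_ollama)
```

Skipping files that already have a .md output makes the loop safe to rerun after an interruption.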

When to Use This Method

  • You want the simplest possible setup
  • You’re running on a personal machine for occasional use
  • You want easy model version management
  • You’re prototyping and don’t need maximum throughput

Method 4: MLX (Apple Silicon Native)

If you’re on a Mac with an M1, M2, M3, or M4 chip, MLX gives you native performance using unified memory. No CUDA needed.

Installation

pip install mlx mlx-lm mlx-vlm pillow

Download the Model

# The 8-bit MLX quantization
huggingface-cli download sahilchachra/unlimited-ocr-8bit-mlx --local-dir ./unlimited-ocr-mlx

Usage

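# Assumes this checkpoint can be loaded by mlx-vlm (pip install mlx-vlm).
# mlx-lm's generate() is text-only and has no way to receive the image.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "sahilchachra/unlimited-ocr-8bit-mlx"

# Load the MLX model and its processor
model, processor = load(model_path)
config = load_config(model_path)

# Images are passed as a list of file paths or URLs
images = ["document.png"]

# Prepare the prompt with one image placeholder per image
prompt = apply_chat_template(
    processor, config, "document parsing.", num_images=len(images)
)

response = generate(model, processor, prompt, images, max_tokens=4096, verbose=False)
print(response)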

Performance on Apple Silicon

Rough benchmarks for single-page processing:

Chip           | 8-bit MLX              | Time per page
M1 (8 GB)      | Works, tight on memory | ~8-12 seconds
M1 Pro (16 GB) | Comfortable            | ~5-8 seconds
M2 Pro (16 GB) | Good                   | ~4-6 seconds
M3 Pro (18 GB) | Great                  | ~3-5 seconds
M4 Pro (24 GB) | Excellent              | ~2-4 seconds
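These numbers give a rough way to budget a larger job. The sketch below assumes processing time scales linearly with page count, which the single-page figures do not establish, so treat the result as a ballpark only. `estimate_minutes` is a hypothetical helper.

```python
# Per-page ranges in seconds, copied from the rough benchmark table above.
SECONDS_PER_PAGE = {
    "M1 (8 GB)": (8, 12),
    "M1 Pro (16 GB)": (5, 8),
    "M2 Pro (16 GB)": (4, 6),
    "M3 Pro (18 GB)": (3, 5),
    "M4 Pro (24 GB)": (2, 4),
}


def estimate_minutes(chip, pages):
    """Return (low, high) estimated minutes, assuming linear scaling with page count."""
    low, high = SECONDS_PER_PAGE[chip]
    return (pages * low / 60, pages * high / 60)


print(estimate_minutes("M2 Pro (16 GB)", 300))  # (20.0, 30.0)
```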

When to Use This Method

  • You have an Apple Silicon Mac
  • You don’t want to deal with CUDA/GPU setup
  • You want good performance from unified memory
  • You’re building macOS-native applications

Method 5: GGUF with llama.cpp (CPU-Friendly)

GGUF quantization lets you run on CPU without a GPU. Slower, but works on any machine.

Installation

Build llama.cpp with the Unlimited-OCR PR branch (support isn’t in upstream main yet):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout unlimited-ocr-support  # PR branch; stay on main if the PR has been merged
# Configure and build from the repo root so the binaries land in ./build/bin
cmake -B build
cmake --build build --config Release

Download GGUF Files

You need two files: the language model GGUF and the vision projector.

# Download from HuggingFace
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
    unlimited-ocr-Q4_K_M.gguf \
    unlimited-ocr-vision-projector-f16.gguf \
    --local-dir ./models/

Available quantizations:

Quantization | Size    | Quality         | Speed
Q8_0         | ~5.5 GB | Near lossless   | Slower
Q6_K         | ~4.5 GB | Very good       | Moderate
Q5_K_M       | ~4.0 GB | Good            | Moderate
Q4_K_M       | ~3.5 GB | Acceptable      | Faster
Q3_K_M       | ~2.8 GB | Noticeable loss | Fastest

Run Inference

./build/bin/llama-ocr \
    -m ./models/unlimited-ocr-Q4_K_M.gguf \
    --mmproj ./models/unlimited-ocr-vision-projector-f16.gguf \
    --image document.png \
    -p "document parsing." \
    -n 4096
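If you want to drive this from a script, a thin subprocess wrapper is enough. The sketch below reuses exactly the binary name and flags from the command above. The function names are my own, and I'm assuming the binary prints the recognized text to stdout.

```python
import subprocess


def build_llama_ocr_command(image_path, model_path, mmproj_path,
                            binary="./build/bin/llama-ocr",
                            prompt="document parsing.", max_tokens=4096):
    """Assemble the argument list for the llama-ocr invocation shown above."""
    return [
        binary,
        "-m", model_path,
        "--mmproj", mmproj_path,
        "--image", image_path,
        "-p", prompt,
        "-n", str(max_tokens),
    ]


def run_llama_ocr(image_path, model_path, mmproj_path, **kwargs):
    """Run the command and return whatever the binary prints to stdout (assumed to be the OCR text)."""
    cmd = build_llama_ocr_command(image_path, model_path, mmproj_path, **kwargs)
    # check=True raises CalledProcessError on a non-zero exit code
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.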

When to Use This Method

  • You don’t have a GPU
  • You need to run on minimal hardware (embedded, VPS, old machines)
  • You want the smallest possible model footprint
  • You’re deploying in constrained environments

Choosing the Right Method

Priority               | Best Method
Fastest setup          | Ollama
Highest throughput     | vLLM
Apple Silicon          | MLX
No GPU available       | GGUF (llama.cpp)
Full Python control    | Transformers
Production API serving | vLLM
Prototyping            | Ollama

Tips for Best Results

  1. Image quality matters: Higher DPI (150-300) gives better results. Don’t go above 300 DPI; it wastes memory without improving accuracy.
  2. Use the right prompt: "<image>document parsing." is the standard prompt. For specific extraction, try "<image>extract tables as HTML." or "<image>extract equations as LaTeX."
  3. Multi-page batching: For multi-page PDFs, pass all pages at once rather than one by one. The model performs better with full document context.
  4. Memory management: If you’re hitting OOM errors, reduce the number of pages per batch or use a more aggressive quantization.
  5. Output format: The model outputs Markdown by default. Tables come as HTML within the Markdown. Equations come as LaTeX.
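Since tables arrive as HTML inside the Markdown (tip 5), a common post-processing step is to pull them out for separate handling. Here is a minimal regex-based sketch. `split_tables` is a hypothetical helper, and it assumes tables are not nested; a real HTML parser would be the safer choice for messy output.

```python
import re

# Non-greedy match from an opening <table ...> tag to the first closing tag.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def split_tables(markdown):
    """Return (text_without_tables, tables) for Markdown containing inline HTML tables."""
    tables = TABLE_RE.findall(markdown)
    text = TABLE_RE.sub("", markdown)
    # Collapse the blank lines left behind where tables were removed.
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return text, tables


sample = "# Invoice\n\n<table><tr><td>Total</td><td>42</td></tr></table>\n\nThanks."
text, tables = split_tables(sample)
print(text)
print(len(tables))
```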

Comparison with Other Self-Hosted Options

For context on how Unlimited-OCR compares to other models you can run locally, see our best open-source OCR models comparison. If you’re considering DeepSeek Vision as an alternative for self-hosting, our self-hosting DeepSeek Vision guide covers the full setup process.

For a broader look at how Unlimited-OCR stacks up against paid alternatives like Mistral OCR 4, check our three-way OCR comparison.

FAQ

How much disk space does Unlimited-OCR need?

Full precision: 6.78 GB. GGUF Q4_K_M: ~3.5 GB. MLX 8-bit: ~5 GB. Plus a few hundred MB for the vision projector. Plan for 4-8 GB total depending on your chosen quantization.

Can I run it without a GPU?

Yes. GGUF quantization with llama.cpp runs entirely on CPU. It’s slower (30-60 seconds per page vs 2-5 seconds on GPU) but completely functional. Good for low-volume or batch overnight processing.

Does it work on Windows?

Yes, with caveats. Ollama and llama.cpp have Windows builds, and Transformers works on Windows with CUDA. vLLM does not support Windows natively, so run it under WSL2 with GPU passthrough. MLX targets Apple Silicon, so it is not an option on Windows.

How do I process a multi-page PDF?

Convert to images first (using pdf2image or PyMuPDF), then pass all images in a single inference call. The model handles up to ~40 pages in one pass with its 32K context window.

What if the model runs out of memory?

Options: 1) Use a more aggressive quantization (Q4 instead of Q8). 2) Reduce the number of pages per batch. 3) Lower the image resolution (try 100 DPI instead of 150). 4) Cap memory explicitly where the runtime allows it: with vLLM, lowering --gpu-memory-utilization or --max-model-len makes the server reserve less GPU memory.
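Option 2 can be automated. The sketch below halves the batch size whenever a batch runs out of memory and then carries on from the same page. `run_with_backoff` and `run_batch` are hypothetical names. It catches Python's built-in MemoryError to stay self-contained; with PyTorch on a GPU you would catch torch.cuda.OutOfMemoryError instead.

```python
def run_with_backoff(pages, run_batch, start_size=40, min_size=1):
    """Process pages in batches, halving the batch size whenever run_batch raises MemoryError."""
    results = []
    size = start_size
    i = 0
    while i < len(pages):
        batch = pages[i:i + size]
        try:
            results.append(run_batch(batch))
            i += len(batch)  # advance only after the batch succeeded
        except MemoryError:
            # With PyTorch on CUDA, catch torch.cuda.OutOfMemoryError here instead.
            if size <= min_size:
                raise  # cannot shrink any further
            size = max(min_size, size // 2)
    return results
```

The reduced batch size is kept for the rest of the run, so the job does not hit the same failure again on every batch.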

Is vLLM support fully merged?

Check the vLLM GitHub for the latest status. As of launch, Unlimited-OCR works with vLLM’s vision-language model support. The llama.cpp GGUF support requires a specific PR branch that may not be in upstream main yet.