Jun 24, 2026 · 7 min read

How to Run Baidu Unlimited-OCR Locally (All Methods)

Baidu Unlimited-OCR is a 3B-parameter, MIT-licensed OCR model that processes multi-page PDFs in a single pass. It’s free and runs locally, so your documents never leave your machine. This guide covers every way to run it, from the quickest setup (Ollama) to the fastest inference (vLLM) to native Apple Silicon support (MLX).

Pick the method that matches your hardware and use case. I’ll cover all of them with working code.

Hardware Requirements

Before choosing a deployment method, know what you’re working with:

Setup	Minimum	Recommended
Full precision (BF16)	12 GB VRAM GPU	24 GB VRAM (RTX 4090, A100)
GGUF Q8	8 GB RAM (CPU) or 8 GB VRAM	16 GB RAM, modern CPU
GGUF Q4	6 GB RAM (CPU)	12 GB RAM
MLX 8-bit	8 GB Apple Silicon	16 GB Apple Silicon
NVFP4	4 GB VRAM	8 GB VRAM

The full model is 6.78 GB in BF16. Quantized versions are smaller: the GGUF files listed under Method 5 range from about 2.8 GB (Q3_K_M) to about 5.5 GB (Q8_0).
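As a quick way to read the table above, here is a minimal sketch that lists which setups fit a given memory budget. The numbers are copied from the Minimum and Recommended columns. The name `setups_that_fit` is hypothetical, and the function deliberately ignores the RAM versus VRAM distinction, so treat the result as a first filter only.

```python
# Minimum / recommended memory in GB, copied from the hardware table above.
SETUPS = {
    "Full precision (BF16)": (12, 24),
    "GGUF Q8": (8, 16),
    "GGUF Q4": (6, 12),
    "MLX 8-bit": (8, 16),
    "NVFP4": (4, 8),
}

def setups_that_fit(memory_gb):
    """Return {setup: "recommended" | "minimum"} for every setup that fits."""
    fits = {}
    for name, (minimum, recommended) in SETUPS.items():
        if memory_gb >= recommended:
            fits[name] = "recommended"
        elif memory_gb >= minimum:
            fits[name] = "minimum"
    return fits

# Example: a machine with 8 GB available
for name, level in setups_that_fit(8).items():
    print(f"{name}: meets the {level} requirement")
```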

Method 1: HuggingFace Transformers (Standard Python)

The most flexible option. Works anywhere Python runs with a capable GPU.

Installation

pip install transformers torch pillow accelerate

Basic Usage

from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

# Load model (downloads ~6.78 GB on first run)
model = AutoModel.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True
)

# Process a single image
image = Image.open("document.png")
inputs = tokenizer(
    images=image,
    text="<image>document parsing.",
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Multi-Page Processing

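The key advantage of Unlimited-OCR: pass multiple pages in one call. The example below uses pdf2image (pip install pdf2image), which also needs the Poppler utilities installed on your system.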

from pdf2image import convert_from_path

def process_pdf(model, tokenizer, pdf_path, max_pages=40):
    """Process a multi-page PDF in a single inference pass."""
    images = convert_from_path(pdf_path, dpi=150)

    if len(images) > max_pages:
        print(f"Warning: PDF has {len(images)} pages, processing first {max_pages}")
        images = images[:max_pages]

    # Pass all page images at once
    inputs = tokenizer(
        images=images,
        text="<image>" * len(images) + "document parsing.",
        return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=8192)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

text = process_pdf(model, tokenizer, "contract.pdf")
print(text)
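The `process_pdf` function above drops every page past `max_pages`. If you would rather process the whole document, one option is to split the pages into consecutive batches and run each batch separately. This is a minimal sketch: `chunk_pages` and `build_prompt` are hypothetical helper names, and context is not shared across batches, so a table that spans a batch boundary may be parsed less cleanly.

```python
def chunk_pages(pages, max_pages=40):
    """Yield consecutive batches of at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(0, len(pages), max_pages):
        yield pages[start:start + max_pages]

def build_prompt(num_pages, instruction="document parsing."):
    """One <image> placeholder per page, followed by the instruction."""
    return "<image>" * num_pages + instruction

# Stand-in page objects; in practice these would be the PIL images
# returned by convert_from_path().
pages = [f"page-{i}" for i in range(95)]
for batch in chunk_pages(pages):
    print(len(batch), "pages in this batch")
```

Each batch can then be passed to the tokenizer and `model.generate` exactly as in `process_pdf`, with the per-batch results joined at the end.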

When to Use This Method

You want full control over inference settings
You’re integrating into an existing Python pipeline
You’re doing research or experimentation
You need custom pre/post-processing

Method 2: vLLM (Production Serving)

vLLM gives you the highest throughput for serving Unlimited-OCR to multiple clients. It handles batching, memory management, and concurrent requests automatically.

Installation

pip install vllm

Start the Server

vllm serve baidu/Unlimited-OCR \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
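The server loads the model before it starts answering requests, so scripts that launch it and immediately send work can fail. A minimal sketch of a readiness check, using only the standard library and polling the OpenAI-compatible `/v1/models` route; the function name `wait_for_server` and the default timings are assumptions, not part of vLLM.

```python
import time
import urllib.request

def wait_for_server(base_url="http://localhost:8000", timeout=120.0, interval=2.0):
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP error responses
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(interval)
    return False
```

Call `wait_for_server()` once after starting `vllm serve`, and only send OCR requests when it returns `True`.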

Send Requests

Once the server is running, send requests via the OpenAI-compatible API:

import base64
import requests

def ocr_with_vllm(image_path, server_url="http://localhost:8000"):
    """Send an image to the vLLM server for OCR."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "model": "baidu/Unlimited-OCR",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                        {"type": "text", "text": "document parsing."}
                    ]
                }
            ],
            "max_tokens": 4096
        }
    )

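    # Fail loudly on HTTP errors instead of raising a confusing KeyError below
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]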

result = ocr_with_vllm("invoice.png")
print(result)
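The function above sends one image and always labels it as PNG. To send several pages in one request, the same message format can carry multiple `image_url` entries. The sketch below only builds the JSON payload; `image_to_data_url` and `build_payload` are hypothetical helper names, and the MIME type is guessed from the file extension. Depending on the vLLM version and launch flags, the server may cap the number of images per prompt, so a multi-image request can be rejected until that limit is raised.

```python
import base64
import mimetypes

def image_to_data_url(image_path):
    """Encode an image file as a data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(str(image_path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_payload(image_paths, prompt="document parsing.", max_tokens=8192):
    """Build a chat-completions payload with one image_url entry per page."""
    content = [
        {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
        for path in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return {
        "model": "baidu/Unlimited-OCR",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be passed as the `json=` argument of the `requests.post` call shown above.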

When to Use This Method

You’re serving OCR to multiple clients or services
Throughput matters more than simplicity
You have a dedicated GPU server
You want an OpenAI-compatible endpoint

Method 3: Ollama (Easiest Setup)

Ollama is the fastest way to get running. One command to download, one command to run.

Installation

Install Ollama from ollama.com:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

Pull and Run

# Pull the model
ollama pull unlimited-ocr

# Run interactively
ollama run unlimited-ocr

Use from Python

import ollama
import base64

def ocr_with_ollama(image_path):
    """Process a document image with Ollama."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = ollama.chat(
        model="unlimited-ocr",
        messages=[{
            "role": "user",
            "content": "document parsing.",
            "images": [image_data]
        }]
    )
    return response["message"]["content"]

result = ocr_with_ollama("receipt.jpg")
print(result)
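For occasional use you will often have a folder of scans instead of a single file. Here is a minimal sketch that runs any single-image OCR function over a folder and writes one Markdown file per image. The name `ocr_folder` and the list of image extensions are assumptions; the OCR function is passed in as a parameter, so the same loop works with `ocr_with_ollama` above or with the vLLM client from Method 2.

```python
from pathlib import Path

# Extensions treated as images (an assumption; adjust to your scans).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}

def ocr_folder(folder, ocr_fn, out_suffix=".md"):
    """Run ocr_fn on every image in folder and save each result next to it."""
    written = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = ocr_fn(str(path))
        out_path = path.with_suffix(out_suffix)
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

For example, `ocr_folder("scans", ocr_with_ollama)` would write `scans/receipt.md` next to `scans/receipt.jpg`.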

When to Use This Method

You want the simplest possible setup
You’re running on a personal machine for occasional use
You want easy model version management
You’re prototyping and don’t need maximum throughput

Method 4: MLX (Apple Silicon Native)

If you’re on a Mac with an M1, M2, M3, or M4 chip, MLX gives you native performance using unified memory. No CUDA needed.

Installation

pip install mlx mlx-lm pillow

Download the Model

# The 8-bit MLX quantization
huggingface-cli download sahilchachra/unlimited-ocr-8bit-mlx --local-dir ./unlimited-ocr-mlx

Usage

import mlx.core as mx
from mlx_lm import load, generate
from PIL import Image

# Load the MLX model
model, tokenizer = load("sahilchachra/unlimited-ocr-8bit-mlx")

# Process an image
image = Image.open("document.png")

# Prepare the prompt with image
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "document parsing."}
    ]}],
    tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=4096)
print(response)

Performance on Apple Silicon

Rough benchmarks for single-page processing:

Chip (unified memory)	8-bit MLX fit	Time per page
M1 (8 GB)	Works, tight on memory	~8-12 seconds
M1 Pro (16 GB)	Comfortable	~5-8 seconds
M2 Pro (16 GB)	Good	~4-6 seconds
M3 Pro (18 GB)	Great	~3-5 seconds
M4 Pro (24 GB)	Excellent	~2-4 seconds

When to Use This Method

You have an Apple Silicon Mac
You don’t want to deal with CUDA/GPU setup
You want good performance from unified memory
You’re building macOS-native applications

Method 5: GGUF with llama.cpp (CPU-Friendly)

GGUF quantization lets you run on CPU without a GPU. Slower, but works on any machine.

Installation

Build llama.cpp with the Unlimited-OCR PR branch (support isn’t in upstream main yet):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
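git checkout unlimited-ocr-support  # PR branch; skip this if support has since been merged into main
# Configure and build from the repository root so binaries land in ./build/bin
cmake -B build
cmake --build build --config Release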

Download GGUF Files

You need two files: the language model GGUF and the vision projector.

# Download from HuggingFace
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
    unlimited-ocr-Q4_K_M.gguf \
    unlimited-ocr-vision-projector-f16.gguf \
    --local-dir ./models/

Available quantizations:

Quantization	Size	Quality	Speed
Q8_0	~5.5 GB	Near lossless	Slower
Q6_K	~4.5 GB	Very good	Moderate
Q5_K_M	~4.0 GB	Good	Moderate
Q4_K_M	~3.5 GB	Acceptable	Faster
Q3_K_M	~2.8 GB	Noticeable loss	Fastest

Run Inference

./build/bin/llama-ocr \
    -m ./models/unlimited-ocr-Q4_K_M.gguf \
    --mmproj ./models/unlimited-ocr-vision-projector-f16.gguf \
    --image document.png \
    -p "document parsing." \
    -n 4096
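If you want to call the CLI from a Python pipeline, a thin `subprocess` wrapper is usually enough. This is a sketch that mirrors the command above: the binary name and flags are taken from that command as-is, while the function names `build_llama_ocr_command` and `run_llama_ocr` are hypothetical.

```python
import subprocess

def build_llama_ocr_command(image, model, mmproj, prompt="document parsing.",
                            max_tokens=4096, binary="./build/bin/llama-ocr"):
    """Build the argument list for the llama-ocr command shown above."""
    return [
        binary,
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
        "-n", str(max_tokens),
    ]

def run_llama_ocr(image, model, mmproj, **kwargs):
    """Run the command and return its standard output as text."""
    cmd = build_llama_ocr_command(image, model, mmproj, **kwargs)
    # check=True raises CalledProcessError if the binary exits with an error
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.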

When to Use This Method

You don’t have a GPU
You need to run on minimal hardware (embedded, VPS, old machines)
You want the smallest possible model footprint
You’re deploying in constrained environments

Choosing the Right Method

Priority	Best Method
Fastest setup	Ollama
Highest throughput	vLLM
Apple Silicon	MLX
No GPU available	GGUF (llama.cpp)
Full Python control	Transformers
Production API serving	vLLM
Prototyping	Ollama

Tips for Best Results

Image quality matters: Higher DPI (150-300) gives better results. Don’t go above 300 DPI; it wastes memory without improving accuracy.
Use the right prompt: "<image>document parsing." is the standard prompt. For specific extraction, try "<image>extract tables as HTML." or "<image>extract equations as LaTeX."
Multi-page batching: For multi-page PDFs, pass all pages at once rather than one by one. The model performs better with full document context.
Memory management: If you’re hitting OOM errors, reduce the number of pages per batch or use a more aggressive quantization.
Output format: The model outputs Markdown by default. Tables come as HTML within the Markdown. Equations come as LaTeX.
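Because tables arrive as HTML inside the Markdown, it can be handy to pull them out for separate handling, for example to load them into a spreadsheet. A minimal sketch, assuming each table is wrapped in a plain `<table>...</table>` element and that tables are not nested; `split_tables` is a hypothetical helper name.

```python
import re

# Non-greedy match from an opening <table ...> tag to the next </table>.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)

def split_tables(markdown_text):
    """Return (text_without_tables, list_of_table_html_strings)."""
    tables = TABLE_RE.findall(markdown_text)
    remaining = TABLE_RE.sub("", markdown_text)
    return remaining, tables

sample = "# Invoice\n\n<table><tr><td>Total</td><td>42</td></tr></table>\n\nThanks."
text, tables = split_tables(sample)
print(f"Found {len(tables)} table(s)")
```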

Comparison with Other Self-Hosted Options

For context on how Unlimited-OCR compares to other models you can run locally, see our best open-source OCR models comparison. If you’re considering DeepSeek Vision as an alternative for self-hosting, our self-hosting DeepSeek Vision guide covers the full setup process.

For a broader look at how Unlimited-OCR stacks up against paid alternatives like Mistral OCR 4, check our three-way OCR comparison.

FAQ

How much disk space does Unlimited-OCR need?

The full-precision model is 6.78 GB, GGUF Q4_K_M is about 3.5 GB, and MLX 8-bit is about 5 GB. Add a few hundred MB for the vision projector. Plan for 4 to 8 GB total, depending on the quantization you choose.

Can I run it without a GPU?

Yes. GGUF quantization with llama.cpp runs entirely on CPU. It’s slower (30-60 seconds per page vs 2-5 seconds on GPU) but completely functional. Good for low-volume or batch overnight processing.
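To put those per-page times in perspective for overnight batches, a quick back-of-the-envelope calculation helps. The 8-hour window is an assumption for illustration; the per-page times are the ranges quoted above, and `pages_in_window` is a hypothetical helper.

```python
def pages_in_window(hours, seconds_per_page):
    """How many pages fit in a time window at a fixed per-page cost."""
    return int(hours * 3600 // seconds_per_page)

WINDOW_HOURS = 8  # assumed overnight window

# CPU: 30-60 seconds per page
print("CPU:", pages_in_window(WINDOW_HOURS, 60), "to", pages_in_window(WINDOW_HOURS, 30), "pages")
# GPU: 2-5 seconds per page
print("GPU:", pages_in_window(WINDOW_HOURS, 5), "to", pages_in_window(WINDOW_HOURS, 2), "pages")
```

Under these assumptions a CPU-only machine gets through roughly 480 to 960 pages overnight, compared with 5,760 to 14,400 on a GPU.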

Does it work on Windows?

Yes. Ollama and llama.cpp have Windows builds, and Transformers works on Windows with CUDA. vLLM does not officially support native Windows, so run it under WSL2 with GPU passthrough. MLX is macOS only.

How do I process a multi-page PDF?

Convert to images first (using pdf2image or PyMuPDF), then pass all images in a single inference call. The model handles up to ~40 pages in one pass with its 32K context window.

What if the model runs out of memory?

Options: 1) Use a more aggressive quantization (Q4 instead of Q8). 2) Reduce the number of pages per batch. 3) Lower the image resolution (try 100 DPI instead of 150). 4) Use a method with better memory management (vLLM handles this automatically).
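Option 2 can be automated. The sketch below starts with all pages in one batch and halves the batch size every time inference runs out of memory. The function name `ocr_with_backoff` is hypothetical, and the inference function and the exception types are passed in as parameters; with PyTorch you would typically pass `torch.cuda.OutOfMemoryError` as the exception type, which is an assumption about your setup.

```python
def ocr_with_backoff(pages, infer_fn, oom_errors=(MemoryError,), min_batch=1):
    """Process pages in batches, halving the batch size on out-of-memory errors."""
    results = []
    batch_size = len(pages)
    start = 0
    while start < len(pages):
        batch = pages[start:start + batch_size]
        try:
            results.append(infer_fn(batch))
        except oom_errors:
            if batch_size <= min_batch:
                raise  # even the smallest batch does not fit
            batch_size = max(min_batch, batch_size // 2)
            continue  # retry the same pages with a smaller batch
        start += len(batch)
    return results
```

Once a batch size succeeds it is reused for the remaining pages, so the cost of the failed attempts is paid only once per document.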

Is vLLM support fully merged?

Check the vLLM GitHub for the latest status. As of launch, Unlimited-OCR works with vLLM’s vision-language model support. The llama.cpp GGUF support requires a specific PR branch that may not be in upstream main yet.
