Baidu Unlimited-OCR is a 3B-parameter, MIT-licensed OCR model that processes multi-page PDFs in a single pass. It’s completely free and runs locally, meaning your documents never leave your machine. This guide covers five ways to run it, from the quickest setup (Ollama) to the highest-throughput serving (vLLM) to native Apple Silicon support (MLX).
Pick the method that matches your hardware and use case. I’ll cover all of them with working code.
Hardware Requirements
Before choosing a deployment method, know what you’re working with:
| Setup | Minimum | Recommended |
|---|---|---|
| Full precision (BF16) | 12 GB VRAM GPU | 24 GB VRAM (RTX 4090, A100) |
| GGUF Q8 | 8 GB RAM (CPU) or 8 GB VRAM | 16 GB RAM, modern CPU |
| GGUF Q4 | 6 GB RAM (CPU) | 12 GB RAM |
| MLX 8-bit | 8 GB Apple Silicon | 16 GB Apple Silicon |
| NVFP4 | 4 GB VRAM | 8 GB VRAM |
The full model is 6.78 GB in BF16. Quantized GGUF versions range from roughly 2.8 GB (Q3_K_M) to 5.5 GB (Q8_0), depending on the quantization level.
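As a quick way to read the table above, here is a small sketch that shortlists setups by the memory you have. The numbers are the table’s minimums; the name `setups_that_fit` is made up for illustration, and the function deliberately ignores the VRAM vs. RAM distinction, so treat the result as a shortlist rather than a guarantee.

```python
# Minimum memory (GB) per setup, copied from the hardware table above.
# Note: this lumps GPU VRAM and system RAM together for simplicity.
MINIMUM_GB = {
    "Full precision (BF16)": 12,
    "GGUF Q8": 8,
    "GGUF Q4": 6,
    "MLX 8-bit": 8,
    "NVFP4": 4,
}


def setups_that_fit(available_gb):
    """Return setups whose minimum fits, largest requirement first."""
    fits = [(gb, name) for name, gb in MINIMUM_GB.items() if gb <= available_gb]
    # sorted() is stable, so setups with equal minimums keep table order.
    return [name for gb, name in sorted(fits, key=lambda item: -item[0])]
```

For example, `setups_that_fit(8)` lists the 8 GB options first, then the lighter quantizations.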
Method 1: HuggingFace Transformers (Standard Python)
The most flexible option. Works anywhere Python runs with a capable GPU.
Installation
pip install transformers torch pillow accelerate
Basic Usage
from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch

# Load model (downloads ~6.78 GB on first run)
model = AutoModel.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "baidu/Unlimited-OCR",
    trust_remote_code=True,
)

# Process a single image
image = Image.open("document.png")
inputs = tokenizer(
    images=image,
    text="<image>document parsing.",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Multi-Page Processing
The key advantage of Unlimited-OCR: pass multiple pages in one call.
# Requires: pip install pdf2image (plus the Poppler utilities on your system)
from pdf2image import convert_from_path


def process_pdf(model, tokenizer, pdf_path, max_pages=40):
    """Process a multi-page PDF in a single inference pass."""
    # Render every PDF page to a PIL image
    images = convert_from_path(pdf_path, dpi=150)
    if len(images) > max_pages:
        print(f"Warning: PDF has {len(images)} pages, processing first {max_pages}")
        images = images[:max_pages]

    # Pass all page images at once, with one <image> placeholder per page
    inputs = tokenizer(
        images=images,
        text="<image>" * len(images) + "document parsing.",
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=8192)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result


text = process_pdf(model, tokenizer, "contract.pdf")
print(text)
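The `process_pdf` function above drops everything past `max_pages`. If you would rather keep the whole document, one option is to split the pages into batches and run each batch separately. This is a minimal sketch: `chunk_pages` and `process_long_pdf` are illustrative names, and `run_batch` stands in for whatever performs the actual OCR call on a list of page images.

```python
def chunk_pages(pages, max_pages=40):
    """Yield consecutive batches of at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    for start in range(0, len(pages), max_pages):
        yield pages[start:start + max_pages]


def process_long_pdf(pages, run_batch, max_pages=40):
    """Run run_batch on each chunk and join the outputs in page order.

    run_batch is a placeholder: a callable that takes a list of page
    images and returns the OCR text for that batch.
    """
    return "\n\n".join(run_batch(batch) for batch in chunk_pages(pages, max_pages))
```

The trade-off is that the model no longer sees context across batch boundaries, so a table that spans two batches may come out split.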
When to Use This Method
- You want full control over inference settings
- You’re integrating into an existing Python pipeline
- You’re doing research or experimentation
- You need custom pre/post-processing
Method 2: vLLM (Production Serving)
vLLM gives you the highest throughput for serving Unlimited-OCR to multiple clients. It handles batching, memory management, and concurrent requests automatically.
Installation
pip install vllm
Start the Server
vllm serve baidu/Unlimited-OCR \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
Send Requests
Once the server is running, send requests via the OpenAI-compatible API:
import base64
import requests


def ocr_with_vllm(image_path, server_url="http://localhost:8000"):
    """Send an image to the vLLM server for OCR."""
    # Encode the image as base64 so it can travel inside the JSON body
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "model": "baidu/Unlimited-OCR",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                        {"type": "text", "text": "document parsing."},
                    ],
                }
            ],
            "max_tokens": 4096,
        },
        timeout=300,  # OCR on a dense page can take a while
    )
    # Fail loudly on HTTP errors instead of raising a confusing KeyError below
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


result = ocr_with_vllm("invoice.png")
print(result)
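The request above sends one PNG. To send several pages in one request, or images that are not PNGs, the payload can be built separately. A sketch under a few assumptions: `to_data_url` and `build_payload` are illustrative helpers, the MIME type is guessed from the file name, and the server has to allow more than one image per prompt (vLLM caps this; see its `--limit-mm-per-prompt` option).

```python
import base64
import mimetypes


def to_data_url(filename, raw_bytes):
    """Encode image bytes as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {filename}")
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_payload(images, prompt="document parsing.", max_tokens=4096):
    """Build a chat-completions body from (filename, bytes) pairs.

    Images come first, in page order, followed by the text prompt,
    mirroring the single-image request above.
    """
    content = [
        {"type": "image_url", "image_url": {"url": to_data_url(name, data)}}
        for name, data in images
    ]
    content.append({"type": "text", "text": prompt})
    return {
        "model": "baidu/Unlimited-OCR",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

Pass the returned dict as the `json=` argument to `requests.post`, exactly as in `ocr_with_vllm`.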
When to Use This Method
- You’re serving OCR to multiple clients or services
- Throughput matters more than simplicity
- You have a dedicated GPU server
- You want an OpenAI-compatible endpoint
Method 3: Ollama (Easiest Setup)
Ollama is the fastest way to get running. One command to download, one command to run.
Installation
Install Ollama from ollama.com:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
Pull and Run
# Pull the model
ollama pull unlimited-ocr
# Run interactively
ollama run unlimited-ocr
Use from Python
import ollama
import base64


def ocr_with_ollama(image_path):
    """Process a document image with Ollama."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = ollama.chat(
        model="unlimited-ocr",
        messages=[{
            "role": "user",
            "content": "document parsing.",
            "images": [image_data],
        }],
    )
    return response["message"]["content"]


result = ocr_with_ollama("receipt.jpg")
print(result)
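For occasional use you will often want to run a whole folder of scans, not a single file. Here is a small sketch that walks a directory and applies any single-image OCR function to each image. `ocr_folder` is an illustrative name, and the suffix list is my assumption about which formats you have.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to match your scans.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_folder(folder, ocr_fn, suffixes=IMAGE_SUFFIXES):
    """Run ocr_fn on every image in folder and return {filename: text}.

    ocr_fn takes a file path string and returns the extracted text,
    for example the ocr_with_ollama function defined above.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            results[path.name] = ocr_fn(str(path))
    return results
```

Usage would look like `ocr_folder("scans/", ocr_with_ollama)`. Because the OCR function is passed in, the same helper works with the vLLM client too.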
When to Use This Method
- You want the simplest possible setup
- You’re running on a personal machine for occasional use
- You want easy model version management
- You’re prototyping and don’t need maximum throughput
Method 4: MLX (Apple Silicon Native)
If you’re on a Mac with an M1, M2, M3, or M4 chip, MLX gives you native performance using unified memory. No CUDA needed.
Installation
pip install mlx mlx-lm pillow
Download the Model
# The 8-bit MLX quantization
huggingface-cli download sahilchachra/unlimited-ocr-8bit-mlx --local-dir ./unlimited-ocr-mlx
Usage
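from mlx_lm import load, generate
from PIL import Image

# Load the MLX model from the directory downloaded above
# (passing the repo ID "sahilchachra/unlimited-ocr-8bit-mlx" also works and fetches it into the Hugging Face cache)
model, tokenizer = load("./unlimited-ocr-mlx")

# Process an image
image = Image.open("document.png")

# Prepare the prompt with image
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "document parsing."}
    ]}],
    tokenize=False
)

# Note: mlx-lm targets text-only models, and generate() receives only the prompt string here.
# If the output ignores the image, check the model card for the intended loader (for example mlx-vlm).
response = generate(model, tokenizer, prompt=prompt, max_tokens=4096)
print(response)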
Performance on Apple Silicon
Rough benchmarks for single-page processing:
| Chip | Fit with 8-bit MLX | Time per page |
|---|---|---|
| M1 (8 GB) | Works, tight on memory | ~8-12 seconds |
| M1 Pro (16 GB) | Comfortable | ~5-8 seconds |
| M2 Pro (16 GB) | Good | ~4-6 seconds |
| M3 Pro (18 GB) | Great | ~3-5 seconds |
| M4 Pro (24 GB) | Excellent | ~2-4 seconds |
When to Use This Method
- You have an Apple Silicon Mac
- You don’t want to deal with CUDA/GPU setup
- You want good performance from unified memory
- You’re building macOS-native applications
Method 5: GGUF with llama.cpp (CPU-Friendly)
GGUF quantization lets you run on CPU without a GPU. Slower, but works on any machine.
Installation
Build llama.cpp with the Unlimited-OCR PR branch (support isn’t in upstream main yet):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout unlimited-ocr-support # Check for merged PR
# Configure and build from the repo root so the binaries land in ./build/bin
cmake -B build
cmake --build build --config Release
Download GGUF Files
You need two files: the language model GGUF and the vision projector.
# Download from HuggingFace
huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
    unlimited-ocr-Q4_K_M.gguf \
    unlimited-ocr-vision-projector-f16.gguf \
    --local-dir ./models/
Available quantizations:
| Quantization | Size | Quality | Speed |
|---|---|---|---|
| Q8_0 | ~5.5 GB | Near lossless | Slower |
| Q6_K | ~4.5 GB | Very good | Moderate |
| Q5_K_M | ~4.0 GB | Good | Moderate |
| Q4_K_M | ~3.5 GB | Acceptable | Faster |
| Q3_K_M | ~2.8 GB | Noticeable loss | Fastest |
Run Inference
./build/bin/llama-ocr \
    -m ./models/unlimited-ocr-Q4_K_M.gguf \
    --mmproj ./models/unlimited-ocr-vision-projector-f16.gguf \
    --image document.png \
    -p "document parsing." \
    -n 4096
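If you want to call the CLI from a Python pipeline, a thin `subprocess` wrapper is usually enough. This sketch reuses the binary name and flags from the command above without adding any; the helper names `build_llama_ocr_cmd` and `run_llama_ocr` are mine, and I am assuming the tool prints the recognised text to stdout.

```python
import subprocess


def build_llama_ocr_cmd(image, model, mmproj, prompt="document parsing.",
                        max_tokens=4096, binary="./build/bin/llama-ocr"):
    """Build the argument list for the llama-ocr command shown above."""
    return [
        binary,
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
        "-n", str(max_tokens),
    ]


def run_llama_ocr(image, model, mmproj, **kwargs):
    """Run the command and return its stdout (assumed to hold the OCR text)."""
    cmd = build_llama_ocr_cmd(image, model, mmproj, **kwargs)
    # check=True raises CalledProcessError if the binary exits non-zero.
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.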
When to Use This Method
- You don’t have a GPU
- You need to run on minimal hardware (embedded, VPS, old machines)
- You want the smallest possible model footprint
- You’re deploying in constrained environments
Choosing the Right Method
| Priority | Best Method |
|---|---|
| Fastest setup | Ollama |
| Highest throughput | vLLM |
| Apple Silicon | MLX |
| No GPU available | GGUF (llama.cpp) |
| Full Python control | Transformers |
| Production API serving | vLLM |
| Prototyping | Ollama |
Tips for Best Results
- Image quality matters: Higher DPI (150-300) gives better results. Don’t go above 300 DPI; it wastes memory without improving accuracy.
- Use the right prompt: "<image>document parsing." is the standard prompt. For specific extraction, try "<image>extract tables as HTML." or "<image>extract equations as LaTeX."
- Multi-page batching: For multi-page PDFs, pass all pages at once rather than one by one. The model performs better with full document context.
- Memory management: If you’re hitting OOM errors, reduce the number of pages per batch or use a more aggressive quantization.
- Output format: The model outputs Markdown by default. Tables come as HTML within the Markdown. Equations come as LaTeX.
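Two of the tips above are easy to wrap in helpers: picking the prompt, and pulling the HTML tables back out of the Markdown output. A minimal sketch, with illustrative names (`build_prompt`, `extract_html_tables`) and the prompt strings taken from the tips. The table regex is deliberately simple and will not handle nested tables.

```python
import re

# Prompt texts from the tips above, keyed by made-up task names.
PROMPTS = {
    "parse": "document parsing.",
    "tables": "extract tables as HTML.",
    "equations": "extract equations as LaTeX.",
}

# Non-greedy match from an opening <table ...> to the next </table>.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def build_prompt(task="parse", n_pages=1):
    """Return the prompt with one <image> placeholder per page."""
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    return "<image>" * n_pages + PROMPTS[task]


def extract_html_tables(markdown_text):
    """Return every <table>...</table> block found in the model output."""
    return TABLE_RE.findall(markdown_text)
```

From there each table string can go to an HTML parser of your choice, for example to convert it into CSV.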
Comparison with Other Self-Hosted Options
For context on how Unlimited-OCR compares to other models you can run locally, see our best open-source OCR models comparison. If you’re considering DeepSeek Vision as an alternative for self-hosting, our self-hosting DeepSeek Vision guide covers the full setup process.
For a broader look at how Unlimited-OCR stacks up against paid alternatives like Mistral OCR 4, check our three-way OCR comparison.
FAQ
How much disk space does Unlimited-OCR need?
Full precision: 6.78 GB. GGUF Q4_K_M: ~3.5 GB. MLX 8-bit: ~5 GB. Plus a few hundred MB for the vision projector. Plan for 4-8 GB total depending on your chosen quantization.
Can I run it without a GPU?
Yes. GGUF quantization with llama.cpp runs entirely on CPU. It’s slower (30-60 seconds per page vs 2-5 seconds on GPU) but completely functional. Good for low-volume or batch overnight processing.
Does it work on Windows?
Yes, with caveats. Ollama and llama.cpp have Windows builds, and Transformers works on Windows with CUDA. vLLM does not officially support native Windows, so run it under WSL2 with GPU passthrough. MLX is macOS only.
How do I process a multi-page PDF?
Convert to images first (using pdf2image or PyMuPDF), then pass all images in a single inference call. The model handles up to ~40 pages in one pass with its 32K context window.
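To get a feel for what 40 pages in a 32K window means, here is a back-of-the-envelope sketch. It assumes the context window is shared between the page inputs and the generated text (which is how `--max-model-len` works in vLLM) and borrows the 8192-token output budget from the multi-page example. How many tokens one page image actually consumes is not something I can state for this model, so `per_page_token_budget` only tells you the ceiling.

```python
def per_page_token_budget(pages, context_len=32768, reserved_output=8192):
    """Input tokens available per page once the output budget is set aside."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    if reserved_output >= context_len:
        raise ValueError("reserved_output must be smaller than context_len")
    return (context_len - reserved_output) // pages
```

With the defaults, 40 pages leaves roughly 614 input tokens per page. If your pages are dense, fewer pages per call or a smaller output reservation gives each page more room.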
What if the model runs out of memory?
Options: 1) Use a more aggressive quantization (Q4 instead of Q8). 2) Reduce the number of pages per batch. 3) Lower the image resolution (try 100 DPI instead of 150). 4) Serve through vLLM, which manages the KV cache for you within the --gpu-memory-utilization budget.
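Reducing the pages per batch can be automated: try the full batch, and halve the batch size whenever memory runs out. The sketch below shows that idea. `process_with_backoff` and `run_batch` are illustrative names, and the default exception is the built-in `MemoryError`; on a CUDA setup you would pass the out-of-memory exception type your framework raises instead.

```python
def process_with_backoff(pages, run_batch, batch_size=40, oom_errors=(MemoryError,)):
    """Process pages in batches, halving the batch size on out-of-memory errors.

    run_batch is a placeholder callable: it takes a list of pages and
    returns the OCR result for that batch.
    """
    results = []
    start = 0
    while start < len(pages):
        batch = pages[start:start + batch_size]
        try:
            results.append(run_batch(batch))
        except oom_errors:
            if batch_size == 1:
                raise  # a single page does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry the same pages with the smaller batch
        start += len(batch)
    return results
```

Once the batch size has shrunk it stays small for the rest of the document, which avoids hitting the same failure on every batch.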
Is vLLM support fully merged?
Check the vLLM GitHub for the latest status. As of launch, Unlimited-OCR works with vLLM’s vision-language model support. The llama.cpp GGUF support requires a specific PR branch that may not be in upstream main yet.