Baidu released Unlimited-OCR on June 22, 2026, and it immediately became the most interesting free OCR model available. It has 3 billion parameters, is MIT-licensed, processes multi-page PDFs in a single forward pass, and runs on consumer hardware. Because inference is local, your documents never leave your device.
That last point matters more than people realize. Every time you send an invoice, contract, or medical document to a cloud OCR API, you’re trusting that provider with sensitive data. Unlimited-OCR eliminates that tradeoff entirely.
Let me walk you through everything this model can do, how it works, and when it makes sense over paid alternatives.
What Makes It Special
Multi-Page PDF in One Pass
Most OCR models process documents page by page. You split a PDF into individual images, run each one through the model, then stitch the results back together. This loses cross-page context (tables that span pages, references between sections, running headers).
Unlimited-OCR takes a different approach. Thanks to its 32K context window and a constant-KV-cache design, it ingests up to roughly 40 pages in a single inference pass (fewer when pages are dense). The model sees the entire document at once, which means:
- Tables that span page breaks are handled correctly
- Document structure is preserved across pages
- No manual splitting and reassembly needed
- Context from earlier pages informs later page extraction
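Documents longer than the single-pass limit still need to be split, just into far fewer pieces than one per page. The sketch below is a minimal illustration of that batching; `batch_pages` is a hypothetical helper, and the default of 40 is simply the approximate limit quoted above, which in practice depends on content density.

```python
def batch_pages(num_pages: int, max_pages_per_pass: int = 40) -> list[range]:
    """Split page indices 0..num_pages-1 into consecutive batches.

    max_pages_per_pass is an assumption taken from the article's
    approximate 40-page figure; lower it for dense documents.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    if max_pages_per_pass < 1:
        raise ValueError("max_pages_per_pass must be at least 1")
    return [
        range(start, min(start + max_pages_per_pass, num_pages))
        for start in range(0, num_pages, max_pages_per_pass)
    ]


# A 95-page document becomes three passes instead of 95 single-page calls.
batches = batch_pages(95)
print([(b.start, b.stop) for b in batches])
```

Cross-page context is still lost at each batch boundary, so it is worth placing boundaries at natural breaks (chapter or section starts) when the document allows it.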
Architecture
Unlimited-OCR builds on the DeepSeek-OCR architecture:
- Vision encoder: SAM + CLIP DeepEncoder. This is the “eyes” of the model, converting document images into high-compression visual tokens.
- Text decoder: DeepSeek-V2 MoE (Mixture of Experts). The language model that generates structured text output from the visual tokens.
- Reference Sliding Window Attention: A memory-efficient attention mechanism that keeps the size of the KV cache constant even as more pages are added. This is the key innovation that enables multi-page processing without memory explosion.
The combination of high-compression visual encoding and constant-memory attention means you can feed in dozens of pages without needing proportionally more GPU memory.
Structured Output
Unlimited-OCR doesn’t just dump raw text. It produces structured output:
- Tables are converted to HTML markup
- Equations are rendered as LaTeX
- Layout is preserved with bounding box coordinates
- Reading order follows natural document flow
This means the output is immediately usable for downstream tasks without extensive post-processing.
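As an illustration of how little post-processing the HTML tables need, the sketch below turns table markup into rows of cell strings using only the Python standard library. The sample markup is invented for the example; the exact HTML the model emits (attributes, nesting, merged cells) is not specified here, and this parser does not handle nested tables or `rowspan`/`colspan`.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[list[list[str]]] = []
        self._row: list[str] | None = None
        self._cell: list[str] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html: str) -> list[list[list[str]]]:
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical OCR output fragment.
sample = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>2</td></tr></table>"
)
print(extract_tables(sample))
```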
Model Specs
| Specification | Value |
|---|---|
| Parameters | 3B |
| License | MIT |
| Context window | 32,768 tokens |
| Model size | 6.78 GB (full precision, BF16) |
| Architecture | SAM+CLIP encoder + DeepSeek-V2 MoE decoder |
| Max pages (single pass) | ~40 (varies by content density) |
| Output formats | Markdown, HTML (tables), LaTeX (equations) |
| Release date | June 22, 2026 |
Supported Deployment Options
One of the best things about Unlimited-OCR is how many ways you can run it:
- Hugging Face Transformers: Standard Python integration
- vLLM: High-throughput serving for production
- SGLang: Alternative serving framework
- Ollama: One-command local deployment
- llama.cpp: CPU-friendly inference (with GGUF quantization)
- MLX: Native Apple Silicon optimization
Available Quantizations
- Full precision (BF16): 6.78 GB, best quality
- GGUF (multiple quant levels): 2-5 GB, good quality, runs on CPU
- MLX 8-bit: Optimized for Apple Silicon Macs
- NVFP4: 4-bit quantization for NVIDIA GPUs, smallest VRAM footprint
For detailed setup instructions across all methods, see our guide to running Baidu Unlimited-OCR locally.
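The sizes above follow roughly from parameter count times bits per parameter. The sketch below is only that arithmetic, assuming exactly 3 billion parameters and ignoring quantization scales, metadata, and any layers kept at higher precision, so real files will be somewhat larger.

```python
# Nominal bit widths; real quantized formats add per-block overhead.
BITS_PER_PARAM = {"bf16": 16, "8-bit": 8, "4-bit": 4}


def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: ~{weight_size_gb(3e9, bits):.2f} GB")
```

The BF16 estimate comes out a little under the published 6.78 GB, which suggests the "3B" figure is rounded or that the download includes more than the bare weights; the article does not say which.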
Hardware Requirements
Minimum (quantized)
- 8 GB RAM (CPU inference with GGUF Q4)
- Or: Apple Silicon Mac with 8 GB unified memory (MLX)
- Or: GPU with 4 GB VRAM (NVFP4 quantization)
Recommended (full precision)
- GPU with 12+ GB VRAM (RTX 3060 12GB or better)
- Or: Apple Silicon Mac with 16 GB unified memory
- 16 GB system RAM
Production (high throughput)
- GPU with 24+ GB VRAM (RTX 4090, A100)
- vLLM or SGLang serving
- Multiple GPUs for parallel processing
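The tiers above can be read as a simple decision rule. The following sketch encodes them; `suggest_variant` is a hypothetical helper, the thresholds are copied from the lists above, and the order of preference (GPU first, then CPU) is an assumption.

```python
from typing import Optional


def suggest_variant(nvidia_vram_gb: float = 0.0,
                    ram_gb: float = 0.0,
                    apple_silicon: bool = False) -> Optional[str]:
    """Pick a model variant from the article's hardware tiers.

    Returns None when the machine is below the stated minimums.
    On Apple Silicon, ram_gb means unified memory.
    """
    if apple_silicon:
        if ram_gb >= 16:
            return "BF16 (full precision)"
        if ram_gb >= 8:
            return "MLX 8-bit"
        return None
    if nvidia_vram_gb >= 12:
        return "BF16 (full precision)"
    if nvidia_vram_gb >= 4:
        return "NVFP4"
    if ram_gb >= 8:
        return "GGUF Q4 (CPU)"
    return None


print(suggest_variant(nvidia_vram_gb=12, ram_gb=16))  # desktop with an RTX 3060 12GB
print(suggest_variant(ram_gb=8, apple_silicon=True))  # base Apple Silicon Mac
print(suggest_variant(ram_gb=8))                      # CPU-only laptop
```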
Quick Start with Transformers
from transformers import AutoModel, AutoTokenizer
from PIL import Image
model = AutoModel.from_pretrained("baidu/Unlimited-OCR", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("baidu/Unlimited-OCR", trust_remote_code=True)
# Process a single image
image = Image.open("document.png")
inputs = tokenizer(images=image, text="<image>document parsing.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
What It Excels At
Based on testing and community reports:
Great for:
- Invoices and receipts (structured tables, numbers)
- Academic papers (equations, citations, multi-column)
- Contracts and legal documents (dense text, multiple pages)
- Forms with mixed content (checkboxes, handwriting, printed text)
- CJK documents (Chinese, Japanese, Korean)
Acceptable for:
- Low-resolution scans (quality degrades but still usable)
- Handwritten documents (limited but improving)
- Complex mixed-language documents
Struggles with:
- Very low quality (fax-quality scans, heavy noise)
- Unusual scripts not well-represented in training data
- Documents with heavy graphical elements overlapping text
Privacy and Security
This is the killer feature for many organizations:
- Nothing leaves your device. Once the model weights are downloaded, inference involves no API calls, no cloud processing, and no data transmission.
- Air-gapped deployments are fully supported since inference is local.
- GDPR compliance is simplified because document data never crosses organizational boundaries.
- No vendor lock-in: The MIT license lets you modify, distribute, and deploy the model freely, as long as you keep the copyright and license notice.
For teams handling sensitive documents (medical records, legal contracts, financial statements, government documents), this privacy guarantee is worth more than any accuracy improvement from cloud services.
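With the Transformers route, one way to make the "no network" guarantee explicit is to download the weights once and then force offline mode. The sketch below assumes the weights are already in the local Hugging Face cache. `HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE`, and `local_files_only` are standard Hugging Face settings; the function names are hypothetical, and the model ID is the one used in the quick start.

```python
import os


def enable_offline_mode() -> None:
    """Tell Hugging Face libraries not to touch the network.

    Call this before importing transformers, because the libraries
    read these variables at import time.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


def load_model_offline(model_id: str = "baidu/Unlimited-OCR"):
    """Load model and tokenizer from the local cache only (assumes a prior download)."""
    enable_offline_mode()
    from transformers import AutoModel, AutoTokenizer  # imported after the env vars are set

    model = AutoModel.from_pretrained(
        model_id, trust_remote_code=True, local_files_only=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, local_files_only=True
    )
    return model, tokenizer


enable_offline_mode()
```

Note that `trust_remote_code=True` runs Python code shipped with the model repository, so security-sensitive teams may want to review that code and pin a specific revision before deploying.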
How It Compares
vs. Mistral OCR 4: Mistral wins on raw accuracy (72% win rate in blind tests, top OlmOCRBench score) and language support (170 languages vs. 40+). But Mistral costs $4 per 1,000 pages and requires cloud or enterprise self-hosting. Unlimited-OCR is free and truly local. For a detailed breakdown, see our Mistral OCR 4 vs DeepSeek Vision vs Baidu Unlimited-OCR comparison.
vs. DeepSeek Vision: DeepSeek is a general-purpose vision model, not a dedicated OCR system. It’s larger, more capable for general questions, but doesn’t match Unlimited-OCR for raw document extraction speed or multi-page handling. See our DeepSeek Vision complete guide for more on its capabilities.
vs. Tesseract: Tesseract is the classic open-source OCR engine. It works, but its output is essentially recognized text with positions; it does not reconstruct tables or equations. Unlimited-OCR gives you tables as HTML, equations as LaTeX, and layout-aware output. It’s a generational leap in quality.
For a full comparison of all open-source OCR options in 2026, see our best open-source OCR models roundup.
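The cost side of the comparison is simple arithmetic. The sketch below uses the $4 per 1,000 pages figure quoted above; the $1,600 hardware price is a made-up example, and the calculation ignores electricity, maintenance, and engineering time.

```python
import math


def cloud_cost_usd(pages: int, usd_per_1k_pages: float = 4.0) -> float:
    """Cost of processing `pages` at a per-1,000-page cloud rate."""
    return pages / 1000 * usd_per_1k_pages


def break_even_pages(hardware_cost_usd: float, usd_per_1k_pages: float = 4.0) -> int:
    """Pages after which a one-time hardware purchase is cheaper than the cloud rate."""
    return math.ceil(hardware_cost_usd / usd_per_1k_pages * 1000)


print(cloud_cost_usd(250_000))    # cloud bill for a 250,000-page backlog
print(break_even_pages(1600.0))   # hypothetical $1,600 GPU workstation
```

Under these assumptions the hardware pays for itself at a few hundred thousand pages, which is why the local option is most attractive for bulk or ongoing workloads.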
Use Cases
- Private document processing: Law firms, healthcare, finance. Process sensitive documents without any cloud dependency.
- RAG ingestion: Feed structured document content into retrieval systems. The HTML table output is particularly useful for vector databases.
- Bulk digitization: Libraries, archives, organizations with large paper backlogs. Zero per-page cost means budget-friendly at any scale.
- Edge deployment: Run OCR on IoT devices, kiosks, or in-field equipment without internet connectivity.
- Development and prototyping: Free and fast to iterate with during development before deciding if you need a cloud service for production.
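For the RAG ingestion case, one practical detail is keeping each HTML table in a single chunk so that rows are not separated from their headers. The sketch below is one possible approach, not a prescribed pipeline: it assumes the OCR output is text with paragraphs separated by blank lines and tables wrapped in `<table>...</table>`, and the function names and `max_chars` limit are illustrative.

```python
import re

# Capturing group so re.split keeps the tables in the result.
TABLE_RE = re.compile(r"(<table\b.*?</table>)", re.DOTALL | re.IGNORECASE)


def split_blocks(text: str) -> list[str]:
    """Split text into paragraphs, keeping each <table>...</table> whole."""
    blocks: list[str] = []
    for i, part in enumerate(TABLE_RE.split(text)):
        if i % 2 == 1:  # odd indices are the captured tables
            blocks.append(part)
        else:
            blocks.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return blocks


def chunk_blocks(blocks: list[str], max_chars: int = 1000) -> list[str]:
    """Greedily pack blocks into chunks; an oversized block gets its own chunk."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = block if not current else current + "\n\n" + block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Intro paragraph.\n\n"
    "<table><tr><td>a</td></tr>\n\n<tr><td>b</td></tr></table>\n\n"
    "Closing paragraph."
)
for chunk in chunk_blocks(split_blocks(doc), max_chars=30):
    print(repr(chunk))
```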
Limitations
- Not the quality leader: Mistral OCR 4 and premium services produce better results on hard documents. For most standard documents the difference is negligible, but for edge cases it matters.
- Language coverage: 40+ languages is good but not 170. If you process documents in less common scripts, this might not cover you.
- No managed API: You manage the infrastructure yourself. That means handling scaling, monitoring, and updates.
- No confidence scores: Unlike Mistral, there’s no built-in confidence score to flag uncertain extractions.
- Requires GPU for speed: CPU inference works but is slow. Real-time processing needs a GPU.
Getting Started
1. Check that your hardware meets the minimum requirements
2. Choose a deployment method (Ollama is easiest, vLLM is fastest)
3. Download the model (~6.78 GB at full precision, less for quantized variants)
4. Test with a sample document
5. Integrate into your pipeline
The full setup walkthrough covering all deployment methods is in our local deployment guide.
FAQ
Is Baidu Unlimited-OCR really free?
Yes. It is MIT-licensed, with no API fees and no usage limits. You only pay for the hardware to run it. If you already have a computer with a GPU, or a modern Apple Silicon Mac, there’s no additional cost.
Can it run on a Mac?
Yes. The MLX quantization is optimized for Apple Silicon. An M1 Mac with 16 GB of unified memory runs it comfortably. 8 GB works with the 8-bit quantization but leaves less room for other applications.
How does multi-page processing work?
You pass all pages as a sequence of images in a single inference call. The model’s constant-KV-cache design and 32K context window allow it to process up to ~40 pages without running out of memory. Denser pages reduce the maximum count.
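Since the pages go in as an ordered sequence, the order of the image files matters. A plain alphabetical sort puts `page_10.png` before `page_2.png`, so a natural sort is safer. The sketch below only prepares the ordered list of files; the file naming scheme is an assumption, and how the list is then handed to the model depends on the deployment method.

```python
import re
from pathlib import Path


def natural_key(path: Path) -> list:
    """Sort key that compares digit runs as numbers: page_2 < page_10."""
    return [
        int(tok) if tok.isdigit() else tok.lower()
        for tok in re.split(r"(\d+)", path.name)
    ]


def ordered_pages(directory: str, pattern: str = "*.png") -> list[Path]:
    """Return page image paths from `directory` in natural page order."""
    return sorted(Path(directory).glob(pattern), key=natural_key)


# Hypothetical file names, sorted without touching the filesystem.
names = [Path("page_10.png"), Path("page_2.png"), Path("page_1.png")]
print([p.name for p in sorted(names, key=natural_key)])
```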
Is it better than Tesseract?
For most use cases, yes, significantly. Unlimited-OCR understands document structure (tables, equations, layout), while Tesseract recognizes text without reconstructing tables or equations. Accuracy on complex documents is also substantially better. Tesseract’s advantage is being extremely lightweight and requiring no GPU.
Can I use it commercially?
Yes. The MIT license permits commercial use. You can integrate it into products, modify the code, and distribute it freely, provided you retain the copyright and license notice.
What about updates and support?
Baidu maintains the model on Hugging Face and ModelScope. Community support is available through GitHub issues and Hugging Face discussions. There’s no commercial support tier, but the open-source community is active.