Traditional OCR tools like Tesseract work fine for clean, well-formatted text. But real-world documents are messy. Rotated receipts, handwritten notes in margins, tables that don’t follow a grid, logos mixed with text. This is where vision models crush dedicated OCR engines, and DeepSeek Vision does it at a fraction of what you’d pay GPT-4o.
This guide walks through building a complete document processing pipeline. We’ll handle invoices, receipts, forms, and tables, then output structured JSON and CSV. Everything runs with Python and the DeepSeek API.
For API basics and setup instructions, see our Python tutorial. For a broader look at DeepSeek Vision’s capabilities, check the complete guide.
Why DeepSeek Vision for OCR?
Here’s the honest truth: GPT-4o is slightly more accurate on complex documents. Maybe 2-3% better on handwritten text and heavily formatted layouts. But DeepSeek V4-Flash costs $0.14 per million input tokens compared to GPT-4o’s $2.50. That’s an 18x price difference.
For most document processing, that small accuracy gap doesn’t matter. An invoice has standard fields. A receipt lists items and totals. The model gets these right 95%+ of the time with either provider. So why pay 18x more?
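The math works out clearly: if you’re processing 10,000 invoices per month at roughly 1,200 input and 800 output tokens each, V4-Flash costs about $4. GPT-4o would run about $110 for the same workload, because output tokens widen the gap beyond the 18x difference in input price.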
Pipeline Architecture
Here’s what we’re building:
Input Folder (images) -> Preprocessor -> DeepSeek API -> Parser -> Output (JSON/CSV)
          |                   |               |             |
          v                   v               v             v
      Validate         Resize/Convert    Retry Logic  Validate Schema
     extensions            to JPEG       Rate Limit   Write results
The full pipeline handles:
- Multiple document types (invoices, receipts, forms)
- Automatic retry with exponential backoff
- Rate limiting to stay within API quotas
- Cost tracking per batch
- Structured output validation
- Failed document quarantine
Setup
import os
import json
import csv
import time
import base64
from pathlib import Path
from dataclasses import dataclass, field
from openai import OpenAI, RateLimitError, APIError, APITimeoutError
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)
Document Type Prompts
Different document types need different extraction prompts. Being specific about what you want dramatically improves accuracy:
PROMPTS = {
"invoice": """Extract all information from this invoice image.
Return JSON with exactly these fields:
{
"invoice_number": "",
"date": "",
"due_date": "",
"vendor_name": "",
"vendor_address": "",
"bill_to": "",
"line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
"subtotal": 0,
"tax": 0,
"total": 0,
"currency": "",
"payment_terms": ""
}
Use null for fields you cannot find. Numbers should be floats, not strings.""",
"receipt": """Extract all information from this receipt image.
Return JSON with exactly these fields:
{
"store_name": "",
"store_address": "",
"date": "",
"time": "",
"items": [{"name": "", "price": 0}],
"subtotal": 0,
"tax": 0,
"total": 0,
"payment_method": "",
"card_last_four": ""
}
Use null for fields you cannot find. Numbers should be floats.""",
"form": """Extract all filled fields from this form image.
Return JSON with:
{
"form_title": "",
"fields": [{"label": "", "value": "", "field_type": "text/checkbox/date/signature"}],
"checkboxes": [{"label": "", "checked": true/false}],
"signatures_present": true/false,
"date_signed": ""
}""",
"table": """Extract the table data from this image.
Return JSON with:
{
"headers": ["col1", "col2", ...],
"rows": [["val1", "val2", ...], ...],
"notes": "any text above or below the table"
}
Preserve the exact column order and all rows."""
}
Core Processing Function
@dataclass
class ProcessingResult:
    file: str
    doc_type: str
    data: dict | None = None
    error: str | None = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")
def process_document(
    image_path: str,
    doc_type: str,
    model: str = "deepseek-v4-flash",
    max_retries: int = 3
) -> ProcessingResult:
    # Unknown document types fall back to the generic form prompt
    prompt = PROMPTS.get(doc_type, PROMPTS["form"])
    image_data = encode_image(image_path)
    filename = Path(image_path).name
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    # Assumes the preprocessor already converted the image to JPEG
                                    "url": f"data:image/jpeg;base64,{image_data}"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=3000,
                temperature=0.1
            )
            # An empty reply becomes "" and is retried as a JSON error below
            content = (response.choices[0].message.content or "").strip()
            usage = response.usage
            # Parse JSON from response
            # Strip markdown code fences if present
            if content.startswith("```"):
                content = content.split("\n", 1)[-1].rsplit("```", 1)[0]
            data = json.loads(content)
            input_tokens = usage.prompt_tokens
            output_tokens = usage.completion_tokens
            cost = calculate_cost(input_tokens, output_tokens, model)
            return ProcessingResult(
                file=filename,
                doc_type=doc_type,
                data=data,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
                cost=cost
            )
        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error=f"Invalid JSON response: {e}"
                )
            time.sleep(1)
        except RateLimitError:
            # Exponential backoff: 5s, 10s, 20s
            wait = 2 ** attempt * 5
            print(f" Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except APITimeoutError:
            if attempt == max_retries - 1:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error="API timeout after all retries"
                )
            time.sleep(3)
        except APIError as e:
            # Only HTTP status errors carry status_code; connection errors do not
            status = getattr(e, "status_code", None)
            if status is not None and status >= 500:
                time.sleep(3)
            else:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error=f"API error: {e.message}"
                )
    return ProcessingResult(
        file=filename,
        doc_type=doc_type,
        error="Max retries exceeded"
    )
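The fence stripping above only covers replies that start with a code fence. Models sometimes wrap the JSON in a sentence or two instead, so a more forgiving parser can help. Here is a minimal sketch; `extract_json` is a hypothetical helper, not part of the pipeline above, and it assumes the reply contains a single JSON object.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json(text: str):
    """Return the JSON payload from a model reply, tolerating fences and chatter."""
    text = text.strip()
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span in the reply
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return json.loads(text[start:end + 1])
    raise ValueError("no JSON object found in model response")

fenced_reply = FENCE + 'json\n{"total": 12.5}\n' + FENCE
chatty_reply = 'Here is the result:\n{"total": 12.5}\nLet me know if you need more.'
```

Both sample replies parse to the same dictionary. A reply with no JSON at all still raises `ValueError`, which the retry loop can treat like any other parse failure, since `json.JSONDecodeError` is a subclass of `ValueError`.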
Cost Calculation
# USD per million tokens
PRICING = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    prices = PRICING[model]
    return (
        (input_tokens / 1_000_000) * prices["input"] +
        (output_tokens / 1_000_000) * prices["output"]
    )
Let’s put some real numbers on this. A typical invoice image uses about 1,200 input tokens (roughly 1,100 for the image and 90 for the prompt) and generates roughly 800 output tokens. With V4-Flash:
- Per invoice: $0.000168 + $0.000224 = $0.000392
- 1,000 invoices: $0.39
- 10,000 invoices: $3.92
Compare that to GPT-4o at roughly $0.011 per invoice ($110 for 10,000). The savings are substantial.
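To check the arithmetic, here is a small worked example. It uses only figures quoted in this article: the V4-Flash prices, the 1,200/800 token estimate, and the rough $0.011 per-invoice figure for GPT-4o. The function and variable names are just for illustration.

```python
FLASH_PRICES = {"input": 0.14, "output": 0.28}  # USD per million tokens, from this article

def invoice_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Cost in USD for one request at per-million-token prices."""
    return (
        input_tokens / 1_000_000 * prices["input"]
        + output_tokens / 1_000_000 * prices["output"]
    )

per_invoice = invoice_cost(1_200, 800, FLASH_PRICES)
monthly = per_invoice * 10_000

GPT4O_PER_INVOICE = 0.011  # the article's rough estimate, not a measured value
ratio = GPT4O_PER_INVOICE / per_invoice

print(f"Per invoice: ${per_invoice:.6f}")
print(f"10,000 invoices: ${monthly:.2f}")
print(f"GPT-4o / V4-Flash per invoice: {ratio:.0f}x")
```

On these numbers the per-invoice gap is closer to 28x than 18x, because extraction jobs generate a lot of output tokens relative to their input.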
Batch Processing Pipeline
Here’s the full batch processor that ties everything together:
@dataclass
class BatchResult:
    results: list[ProcessingResult] = field(default_factory=list)
    total_cost: float = 0.0
    success_count: int = 0
    error_count: int = 0

def process_folder(
    folder: str,
    doc_type: str,
    model: str = "deepseek-v4-flash",
    delay: float = 0.3,
    output_dir: str = "./output"
) -> BatchResult:
    image_extensions = {".jpg", ".jpeg", ".png", ".webp", ".tiff"}
    folder_path = Path(folder)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    image_files = sorted([
        f for f in folder_path.iterdir()
        if f.suffix.lower() in image_extensions
    ])
    if not image_files:
        print(f"No images found in {folder}")
        return BatchResult()
    print(f"Processing {len(image_files)} {doc_type} images with {model}")
    print(f"Estimated cost: ${len(image_files) * 0.0004:.4f} (V4-Flash)")
    print("-" * 50)
    batch = BatchResult()
    for i, image_file in enumerate(image_files):
        result = process_document(str(image_file), doc_type, model)
        batch.results.append(result)
        batch.total_cost += result.cost
        if result.error:
            batch.error_count += 1
            print(f" [{i+1}/{len(image_files)}] FAIL {image_file.name}: {result.error}")
        else:
            batch.success_count += 1
            print(f" [{i+1}/{len(image_files)}] OK {image_file.name} (${result.cost:.6f})")
        # Pause between requests to stay under the rate limit
        time.sleep(delay)
    # Save results
    save_json(batch, output_path / f"{doc_type}_results.json")
    if doc_type in ("invoice", "receipt"):
        save_csv(batch, output_path / f"{doc_type}_results.csv", doc_type)
    print("-" * 50)
    print(f"Done. {batch.success_count} success, {batch.error_count} errors")
    print(f"Total cost: ${batch.total_cost:.4f}")
    return batch
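The batch loop counts failures but leaves the failed files where they are. The failed document quarantine from the feature list could look something like this. It is a minimal sketch: `quarantine` and the `./quarantine` default are hypothetical names, and the caller needs the full source path, since `ProcessingResult.file` only stores the filename.

```python
import shutil
import tempfile
from pathlib import Path

def quarantine(source_file: str, error: str, quarantine_dir: str = "./quarantine") -> Path:
    """Move a failed document aside and store the error message next to it."""
    target_dir = Path(quarantine_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    source = Path(source_file)
    target = target_dir / source.name
    shutil.move(str(source), str(target))
    # Keep the reason for the failure alongside the file for human review
    (target_dir / (source.name + ".error.txt")).write_text(error, encoding="utf-8")
    return target

# Small demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "inbox" / "scan_017.jpg"
    bad.parent.mkdir()
    bad.write_bytes(b"not really an image")
    moved = quarantine(str(bad), "Invalid JSON response", str(Path(tmp) / "quarantine"))
    moved_ok = moved.exists() and not bad.exists()
    note = (moved.parent / "scan_017.jpg.error.txt").read_text(encoding="utf-8")
```

In `process_folder`, the natural place to call it would be the `if result.error:` branch, passing `str(image_file)` and `result.error`.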
Output Formats
JSON Output
def save_json(batch: BatchResult, output_path: Path):
    output = {
        "summary": {
            "total": len(batch.results),
            "success": batch.success_count,
            "errors": batch.error_count,
            "total_cost": round(batch.total_cost, 6)
        },
        "results": [
            {
                "file": r.file,
                "doc_type": r.doc_type,
                "data": r.data,
                "error": r.error,
                "cost": round(r.cost, 6)
            }
            for r in batch.results
        ]
    }
    with open(output_path, "w") as f:
        json.dump(output, f, indent=2)
    print(f"Saved JSON: {output_path}")
CSV Output for Invoices and Receipts
def save_csv(batch: BatchResult, output_path: Path, doc_type: str):
    # Only documents that parsed successfully go into the CSV
    successful = [r for r in batch.results if r.data]
    if not successful:
        return
    if doc_type == "invoice":
        headers = ["file", "invoice_number", "date", "vendor_name", "total", "currency"]
        rows = []
        for r in successful:
            rows.append([
                r.file,
                r.data.get("invoice_number", ""),
                r.data.get("date", ""),
                r.data.get("vendor_name", ""),
                r.data.get("total", ""),
                r.data.get("currency", "")
            ])
    elif doc_type == "receipt":
        headers = ["file", "store_name", "date", "total", "payment_method"]
        rows = []
        for r in successful:
            rows.append([
                r.file,
                r.data.get("store_name", ""),
                r.data.get("date", ""),
                r.data.get("total", ""),
                r.data.get("payment_method", "")
            ])
    else:
        return
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    print(f"Saved CSV: {output_path}")
Running the Pipeline
if __name__ == "__main__":
    # Process a folder of invoices
    results = process_folder(
        folder="./documents/invoices",
        doc_type="invoice",
        model="deepseek-v4-flash",
        delay=0.3,
        output_dir="./output"
    )
    # Process receipts separately
    results = process_folder(
        folder="./documents/receipts",
        doc_type="receipt",
        model="deepseek-v4-flash",
        delay=0.3,
        output_dir="./output"
    )
Handling Difficult Documents
Some documents need extra attention. Here are patterns for tricky cases:
Multi-page Documents
If you have multi-page PDFs converted to images, process them together:
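def process_multipage(image_paths: list[str], doc_type: str) -> ProcessingResult:
    model = "deepseek-v4-pro"  # Use Pro for multi-page reasoning
    filename = Path(image_paths[0]).name  # Report the result under the first page's name
    content = [
        {"type": "text", "text": f"These images are pages of a single {doc_type}. " + PROMPTS[doc_type]}
    ]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=5000,
        temperature=0.1
    )
    # Parse and return, with the same fence stripping as process_document
    text = (response.choices[0].message.content or "").strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        return ProcessingResult(file=filename, doc_type=doc_type, error=f"Invalid JSON response: {e}")
    usage = response.usage
    return ProcessingResult(
        file=filename, doc_type=doc_type, data=data,
        input_tokens=usage.prompt_tokens, output_tokens=usage.completion_tokens,
        cost=calculate_cost(usage.prompt_tokens, usage.completion_tokens, model)
    )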
Tables with Complex Formatting
Tables are where vision models truly outperform traditional OCR. Tesseract struggles with merged cells, color-coded rows, and headers that span columns. DeepSeek handles these naturally because it “sees” the table structure rather than parsing character positions:
def extract_table_to_csv(image_path: str, output_csv: str):
    result = process_document(image_path, "table", model="deepseek-v4-pro")
    if result.data and "headers" in result.data and "rows" in result.data:
        with open(output_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(result.data["headers"])
            writer.writerows(result.data["rows"])
        print(f"Table extracted to {output_csv}")
    else:
        print("Failed to extract table structure")
Rate Limiting Strategy
DeepSeek’s rate limits depend on your plan. Here’s a simple fixed-interval limiter that works for most setups. It spaces requests evenly instead of allowing bursts, so it is simpler than a true token bucket:
class RateLimiter:
    def __init__(self, requests_per_minute: int = 50):
        self.rpm = requests_per_minute
        self.interval = 60.0 / requests_per_minute
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough to keep requests at least `interval` apart
        now = time.time()
        elapsed = now - self.last_request
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_minute=50)
# Use it before each API call
limiter.wait()
response = client.chat.completions.create(...)
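The `RateLimiter` above enforces a minimum gap between requests. If your quota tolerates short bursts, a token bucket is the usual alternative: tokens refill at a steady rate up to a cap, and each request spends one. A minimal sketch follows. The `TokenBucket` class, its parameters, and the injectable `clock` and `sleep` are assumptions made for illustration (they keep the demo deterministic) and are not part of any DeepSeek SDK.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_minute: float, capacity: int,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.sleep = sleep
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def acquire(self):
        """Block until one token is available, then spend it."""
        self._refill()
        while self.tokens < 1:
            self.sleep((1 - self.tokens) / self.rate)
            self._refill()
        self.tokens -= 1

# Deterministic demo with a fake clock, so nothing actually sleeps
fake_now = [0.0]
waits = []

def fake_sleep(seconds):
    waits.append(seconds)
    fake_now[0] += seconds

bucket = TokenBucket(rate_per_minute=60, capacity=3,
                     clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(4):
    bucket.acquire()
```

With a capacity of 3, the first three calls go through immediately and the fourth waits one second for a refill. In real use you would leave `clock` and `sleep` at their defaults and call `bucket.acquire()` before each API request.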
For high-volume processing (10,000+ documents), consider running multiple API keys in parallel or using async requests with asyncio and the AsyncOpenAI client.
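For the async route, the core idea is to cap how many requests are in flight at once. Here is a minimal sketch using `asyncio.Semaphore`. `run_bounded` and `fake_extract` are hypothetical names, and `fake_extract` stands in for a real coroutine that would call `AsyncOpenAI(...).chat.completions.create(...)`.

```python
import asyncio

async def run_bounded(worker, items, limit: int = 5):
    """Run worker(item) for every item with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))

async def fake_extract(path: str) -> dict:
    # Placeholder for the real API call; just simulates network latency
    await asyncio.sleep(0.01)
    return {"file": path, "ok": True}

results = asyncio.run(run_bounded(fake_extract, ["a.jpg", "b.jpg", "c.jpg"], limit=2))
```

The right `limit` depends on your plan’s rate limits, so treat the default of 5 as a starting guess and not as a recommendation.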
When to Use V4-Pro vs V4-Flash
After processing thousands of documents with both models, here’s my take:
V4-Flash (use for 90% of tasks):
- Standard invoices and receipts
- Printed text extraction
- Simple forms with clear labels
- Tables with regular structure
V4-Pro (use when Flash struggles):
- Handwritten documents
- Multi-page reasoning (connecting info across pages)
- Complex layouts with overlapping elements
- Documents in multiple languages
- When you need the model to infer missing information
The accuracy difference is usually 2-5% in Pro’s favor. But at roughly 12x the price, you should start with Flash and only upgrade the specific document types that show high error rates.
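One way to decide which document types to upgrade is to track failures per type and flag the ones above a threshold. The sketch below assumes you log a `(doc_type, succeeded)` pair per document. The 10% threshold and 20-sample minimum are arbitrary illustrative values, and `types_to_upgrade` is a hypothetical helper.

```python
from collections import Counter

def types_to_upgrade(outcomes, threshold: float = 0.10, min_samples: int = 20) -> list[str]:
    """Return document types whose failure rate on Flash exceeds the threshold."""
    totals, failures = Counter(), Counter()
    for doc_type, succeeded in outcomes:
        totals[doc_type] += 1
        if not succeeded:
            failures[doc_type] += 1
    return sorted(
        doc_type
        for doc_type, count in totals.items()
        # Ignore types with too few samples to judge
        if count >= min_samples and failures[doc_type] / count > threshold
    )

outcomes = (
    [("invoice", True)] * 95 + [("invoice", False)] * 5
    + [("form", True)] * 14 + [("form", False)] * 6
)
upgrade = types_to_upgrade(outcomes)
```

Here invoices fail 5% of the time and stay on Flash, while forms fail 30% of the time and get flagged for Pro.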
For a detailed accuracy comparison across different document types, see our benchmark comparison of DeepSeek vs GPT-4o vs Gemini.
FAQ
How accurate is DeepSeek Vision for OCR compared to Tesseract?
For printed text in good lighting, Tesseract and DeepSeek Vision are both above 95% accurate. The difference shows up with messy real-world documents: rotated images, mixed layouts, handwriting, and tables. DeepSeek Vision handles these cases far better because it understands document structure visually rather than trying to segment character positions.
What’s the maximum document size I can process?
The API accepts images up to 20MB. For very high-resolution scans (like 600dpi), you’ll want to resize to around 1500x2000 pixels. You won’t lose meaningful OCR accuracy, and you’ll save on upload time and token costs. The model’s internal processing resizes anyway.
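A preprocessing step along these lines could do the resize and JPEG conversion. This is a sketch under assumptions: it relies on Pillow (a third-party package, imported lazily so the size helper works without it), and `target_size` and `to_jpeg_bytes` are hypothetical names. The 1500x2000 box comes from the answer above.

```python
import io

def target_size(width: int, height: int, max_w: int = 1500, max_h: int = 2000) -> tuple[int, int]:
    """Scale (width, height) down to fit inside max_w x max_h, keeping aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))

def to_jpeg_bytes(image_path: str, max_w: int = 1500, max_h: int = 2000, quality: int = 90) -> bytes:
    """Load an image, shrink it if needed, and re-encode it as JPEG."""
    from PIL import Image  # Pillow; install separately

    with Image.open(image_path) as img:
        rgb = img.convert("RGB")  # JPEG has no alpha channel
        resized = rgb.resize(target_size(*rgb.size, max_w, max_h))
        buffer = io.BytesIO()
        resized.save(buffer, format="JPEG", quality=quality)
        return buffer.getvalue()

# An A4 page scanned at 600 dpi is about 4960 x 7016 pixels
a4_600dpi = target_size(4960, 7016)
```

The bytes returned by `to_jpeg_bytes` can be base64-encoded and sent with the `image/jpeg` data URL that `process_document` already uses.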
Can I process PDFs directly?
Not directly. You’ll need to convert PDF pages to images first. pdf2image (Python package wrapping poppler) is the standard approach. Convert at 200-300 DPI for good results without excessive file sizes. Each page becomes a separate image that you process individually or as a group.
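Here is a minimal conversion sketch using pdf2image’s `convert_from_path`. It assumes pdf2image and poppler are installed. `page_filename` and `pdf_to_images` are hypothetical helper names, and 200 DPI is the low end of the range suggested above.

```python
from pathlib import Path

def page_filename(pdf_path: str, page_number: int) -> str:
    """Build a sortable image name such as 'invoice_page_003.jpg'."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.jpg"

def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to a JPEG file and return the saved paths in page order."""
    # Imported lazily: pdf2image is third-party and needs poppler on the system
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_filename(pdf_path, number)
        page.convert("RGB").save(target, "JPEG")
        saved.append(str(target))
    return saved

example_name = page_filename("scans/invoice.pdf", 3)
```

The returned list can be passed to `process_multipage` for grouped processing, or looped over with `process_document` page by page.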
How do I handle documents in non-Latin scripts?
DeepSeek Vision handles Chinese, Japanese, Korean, Arabic, Hindi, and most other scripts well. It’s particularly strong with CJK characters, unsurprisingly given DeepSeek’s training data. For Arabic and Hebrew (right-to-left), make sure your output handling preserves text direction in the JSON.
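One practical detail when saving non-Latin output: by default Python’s `json` module escapes every non-ASCII character, which is valid JSON but hard to read or diff. A short sketch of the difference, using made-up sample values:

```python
import json

record = {"vendor_name": "株式会社サンプル", "bill_to": "شركة المثال"}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the original characters

def save_readable(data: dict, path: str) -> None:
    """Write JSON with the original script preserved, explicitly as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```

Both strings decode to the same data. The choice only affects how the file looks on disk, so `ensure_ascii=False` with an explicit `encoding="utf-8"` is a reasonable default if people will inspect the output.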
What about data privacy for sensitive documents?
All data sent to the API is processed on DeepSeek’s servers in China. If you’re handling HIPAA-protected health records, financial documents under strict compliance, or data subject to GDPR restrictions on cross-border transfer, you should consider self-hosting DeepSeek Vision locally instead. The open-weight models give you full control over where data goes.
How do I validate the extracted data?
Build a validation layer after extraction. Check that required fields exist, amounts parse as numbers, dates follow expected formats, and totals match line item sums. Flag documents that fail validation for human review rather than silently accepting potentially wrong data. A 95% automation rate with 5% human review is much better than 100% automation with silent errors.
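As a starting point, the checks described above could be sketched like this for invoices. The required-field list, the accepted date formats, and the one-cent tolerance are assumptions you should adapt to your own documents. `validate_invoice` is a hypothetical helper that returns a list of problems, with an empty list meaning the document passed.

```python
from datetime import datetime

REQUIRED_FIELDS = ("invoice_number", "date", "vendor_name", "total")  # assumed
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")                   # assumed

def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def _parses(value: str, fmt: str) -> bool:
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    problems = []
    for name in REQUIRED_FIELDS:
        if data.get(name) in (None, ""):
            problems.append(f"missing field: {name}")
    for name in ("subtotal", "tax", "total"):
        value = data.get(name)
        if value is not None and not _is_number(value):
            problems.append(f"not a number: {name}")
    date = data.get("date")
    if isinstance(date, str) and date and not any(_parses(date, f) for f in DATE_FORMATS):
        problems.append(f"unrecognized date format: {date}")
    # Cross-check the arithmetic only when every value involved is numeric
    subtotal, tax, total = data.get("subtotal"), data.get("tax"), data.get("total")
    item_totals = [i.get("total") for i in data.get("line_items") or [] if isinstance(i, dict)]
    if item_totals and all(_is_number(t) for t in item_totals) and _is_number(subtotal):
        if abs(sum(item_totals) - subtotal) > tolerance:
            problems.append("line items do not sum to subtotal")
    if all(_is_number(v) for v in (subtotal, tax, total)):
        if abs(subtotal + tax - total) > tolerance:
            problems.append("subtotal plus tax does not equal total")
    return problems

good = {"invoice_number": "INV-1", "date": "2024-03-15", "vendor_name": "Acme",
        "line_items": [{"total": 20.0}, {"total": 5.5}],
        "subtotal": 25.5, "tax": 2.04, "total": 27.54}
bad = {"invoice_number": None, "date": "13/45/2024", "vendor_name": "Acme",
       "line_items": [{"total": 20.0}], "subtotal": 25.5, "tax": 2.04, "total": "27.54"}
```

Documents with a non-empty problem list are the ones to route to human review.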