DeepSeek Vision for OCR and Document Processing (Batch Pipeline Guide)


Traditional OCR tools like Tesseract work fine for clean, well-formatted text. But real-world documents are messy: rotated receipts, handwritten notes in margins, tables that don’t follow a grid, logos mixed with text. This is where vision models crush dedicated OCR engines, and DeepSeek Vision does it at a fraction of what you’d pay for GPT-4o.

This guide walks through building a complete document processing pipeline. We’ll handle invoices, receipts, forms, and tables, then output structured JSON and CSV. Everything runs with Python and the DeepSeek API.

For API basics and setup instructions, see our Python tutorial. For a broader look at DeepSeek Vision’s capabilities, check the complete guide.

Why DeepSeek Vision for OCR?

Here’s the honest truth: GPT-4o is slightly more accurate on complex documents. Maybe 2-3% better on handwritten text and heavily formatted layouts. But DeepSeek V4-Flash costs $0.14 per million input tokens compared to GPT-4o’s $2.50. That’s roughly an 18x difference in input pricing.

For most document processing, that small accuracy gap doesn’t matter. An invoice has standard fields. A receipt lists items and totals. The model gets these right 95%+ of the time with either provider. So why pay 18x more?

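The math works out clearly: if you’re processing 10,000 documents per month, V4-Flash costs roughly $4. GPT-4o would run about $110 for the same workload once output tokens are counted; the Cost Calculation section below works through the per-invoice numbers.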

Pipeline Architecture

Here’s what we’re building:

Input Folder (images) -> Preprocessor -> DeepSeek API -> Parser -> Output (JSON/CSV)
     |                       |                |              |
     v                       v                v              v
  Validate           Resize/Convert     Retry Logic    Validate Schema
  extensions         to JPEG             Rate Limit    Write results

The full pipeline handles:

  • Multiple document types (invoices, receipts, forms)
  • Automatic retry with exponential backoff
  • Rate limiting to stay within API quotas
  • Cost tracking per batch
  • Structured output validation
  • Failed document quarantine

Setup

import os
import json
import csv
import time
import base64
from pathlib import Path
from dataclasses import dataclass, field
from openai import OpenAI, RateLimitError, APIError, APITimeoutError

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)

Document Type Prompts

Different document types need different extraction prompts. Being specific about what you want dramatically improves accuracy:

PROMPTS = {
    "invoice": """Extract all information from this invoice image.
Return JSON with exactly these fields:
{
  "invoice_number": "",
  "date": "",
  "due_date": "",
  "vendor_name": "",
  "vendor_address": "",
  "bill_to": "",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
  "subtotal": 0,
  "tax": 0,
  "total": 0,
  "currency": "",
  "payment_terms": ""
}
Use null for fields you cannot find. Numbers should be floats, not strings.""",

    "receipt": """Extract all information from this receipt image.
Return JSON with exactly these fields:
{
  "store_name": "",
  "store_address": "",
  "date": "",
  "time": "",
  "items": [{"name": "", "price": 0}],
  "subtotal": 0,
  "tax": 0,
  "total": 0,
  "payment_method": "",
  "card_last_four": ""
}
Use null for fields you cannot find. Numbers should be floats.""",

    "form": """Extract all filled fields from this form image.
Return JSON with:
{
  "form_title": "",
  "fields": [{"label": "", "value": "", "field_type": "text/checkbox/date/signature"}],
  "checkboxes": [{"label": "", "checked": true/false}],
  "signatures_present": true/false,
  "date_signed": ""
}""",

    "table": """Extract the table data from this image.
Return JSON with:
{
  "headers": ["col1", "col2", ...],
  "rows": [["val1", "val2", ...], ...],
  "notes": "any text above or below the table"
}
Preserve the exact column order and all rows."""
}

Core Processing Function

@dataclass
class ProcessingResult:
    file: str
    doc_type: str
    data: dict | None = None
    error: str | None = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def process_document(
    image_path: str,
    doc_type: str,
    model: str = "deepseek-v4-flash",
    max_retries: int = 3
) -> ProcessingResult:
    prompt = PROMPTS.get(doc_type, PROMPTS["form"])
    image_data = encode_image(image_path)
    filename = Path(image_path).name

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{image_data}"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=3000,
                temperature=0.1
            )

            # message.content can be None; normalise to a stripped string
            content = (response.choices[0].message.content or "").strip()
            usage = response.usage

            # Parse JSON from response
            # Strip markdown code fences (with or without a language tag) if present
            if content.startswith("```"):
                content = content.partition("\n")[2].rsplit("```", 1)[0]

            data = json.loads(content)

            input_tokens = usage.prompt_tokens
            output_tokens = usage.completion_tokens
            cost = calculate_cost(input_tokens, output_tokens, model)

            return ProcessingResult(
                file=filename,
                doc_type=doc_type,
                data=data,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
                cost=cost
            )

        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error=f"Invalid JSON response: {e}"
                )
            time.sleep(1)

        except RateLimitError:
            wait = 2 ** attempt * 5
            print(f"  Rate limited. Waiting {wait}s...")
            time.sleep(wait)

        except APITimeoutError:
            if attempt == max_retries - 1:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error="API timeout after all retries"
                )
            time.sleep(3)

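        except APIError as e:
            # Only APIStatusError subclasses carry status_code; connection
            # errors do not, so read it defensively and treat them as transient
            status = getattr(e, "status_code", None)
            if status is None or status >= 500:
                time.sleep(3)
            else:
                return ProcessingResult(
                    file=filename,
                    doc_type=doc_type,
                    error=f"API error: {e.message}"
                )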

    return ProcessingResult(
        file=filename,
        doc_type=doc_type,
        error="Max retries exceeded"
    )
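
One caveat with the function above: the data URL always declares image/jpeg, while the batch processor further down also accepts PNG, WebP, and TIFF files. Below is a minimal sketch of a helper that picks the MIME type from the file extension instead. The names build_data_url and MIME_TYPES are hypothetical, and whether the API accepts every one of these formats is an assumption worth checking against your account.

```python
import base64
from pathlib import Path

# Hypothetical mapping covering the extensions the batch processor accepts.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".tiff": "image/tiff",
}

def build_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    path = Path(image_path)
    suffix = path.suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"Unsupported image type: {suffix or '(none)'}")
    encoded = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{MIME_TYPES[suffix]};base64,{encoded}"
```

In process_document, the f-string that builds the "url" value could then be replaced by build_data_url(image_path).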

Cost Calculation

PRICING = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    prices = PRICING[model]
    return (
        (input_tokens / 1_000_000) * prices["input"] +
        (output_tokens / 1_000_000) * prices["output"]
    )

Let’s put some real numbers on this. A typical invoice image uses about 1,200 input tokens (90 for the image, ~1,100 for the prompt) and generates roughly 800 output tokens. With V4-Flash:

  • Per invoice: $0.000168 + $0.000224 = $0.000392
  • 1,000 invoices: $0.39
  • 10,000 invoices: $3.92

Compare that to GPT-4o at roughly $0.011 per invoice ($110 for 10,000). The savings are substantial.

Batch Processing Pipeline

Here’s the full batch processor that ties everything together:

@dataclass
class BatchResult:
    results: list[ProcessingResult] = field(default_factory=list)
    total_cost: float = 0.0
    success_count: int = 0
    error_count: int = 0

def process_folder(
    folder: str,
    doc_type: str,
    model: str = "deepseek-v4-flash",
    delay: float = 0.3,
    output_dir: str = "./output"
) -> BatchResult:
    image_extensions = {".jpg", ".jpeg", ".png", ".webp", ".tiff"}
    folder_path = Path(folder)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    image_files = sorted([
        f for f in folder_path.iterdir()
        if f.suffix.lower() in image_extensions
    ])

    if not image_files:
        print(f"No images found in {folder}")
        return BatchResult()

    print(f"Processing {len(image_files)} {doc_type} images with {model}")
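    # Estimate from the typical invoice profile (1,200 input / 800 output tokens)
    print(f"Estimated cost: ${len(image_files) * calculate_cost(1200, 800, model):.4f} ({model})")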
    print("-" * 50)

    batch = BatchResult()

    for i, image_file in enumerate(image_files):
        result = process_document(str(image_file), doc_type, model)

        batch.results.append(result)
        batch.total_cost += result.cost

        if result.error:
            batch.error_count += 1
            print(f"  [{i+1}/{len(image_files)}] FAIL {image_file.name}: {result.error}")
        else:
            batch.success_count += 1
            print(f"  [{i+1}/{len(image_files)}] OK   {image_file.name} (${result.cost:.6f})")

        time.sleep(delay)

    # Save results
    save_json(batch, output_path / f"{doc_type}_results.json")
    if doc_type in ("invoice", "receipt"):
        save_csv(batch, output_path / f"{doc_type}_results.csv", doc_type)

    print("-" * 50)
    print(f"Done. {batch.success_count} success, {batch.error_count} errors")
    print(f"Total cost: ${batch.total_cost:.4f}")

    return batch
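
The batch loop records failures but leaves the source files where they were. One way to implement the "failed document quarantine" step from the architecture list is to move those files into a separate folder for human review. This is a sketch under assumptions: quarantine_failures and the ./quarantine default are hypothetical names, not part of the pipeline above.

```python
import shutil
from pathlib import Path

def quarantine_failures(
    failed_files: list[str],
    source_dir: str,
    quarantine_dir: str = "./quarantine",  # hypothetical default location
) -> list[Path]:
    """Move files that failed extraction into a separate folder for review."""
    target = Path(quarantine_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in failed_files:
        src = Path(source_dir) / name
        if src.exists():  # skip files that were already moved or deleted
            dest = target / name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

After a run, something like quarantine_failures([r.file for r in batch.results if r.error], folder) would clear the failures out of the input folder so a rerun only touches documents that still need processing.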

Output Formats

JSON Output

def save_json(batch: BatchResult, output_path: Path):
    output = {
        "summary": {
            "total": len(batch.results),
            "success": batch.success_count,
            "errors": batch.error_count,
            "total_cost": round(batch.total_cost, 6)
        },
        "results": [
            {
                "file": r.file,
                "doc_type": r.doc_type,
                "data": r.data,
                "error": r.error,
                "cost": round(r.cost, 6)
            }
            for r in batch.results
        ]
    }

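    # Write UTF-8 and keep non-ASCII text (CJK, Arabic, accents) readable
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)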

    print(f"Saved JSON: {output_path}")

CSV Output for Invoices

def save_csv(batch: BatchResult, output_path: Path, doc_type: str):
    successful = [r for r in batch.results if r.data]
    if not successful:
        return

    if doc_type == "invoice":
        headers = ["file", "invoice_number", "date", "vendor_name", "total", "currency"]
        rows = []
        for r in successful:
            rows.append([
                r.file,
                r.data.get("invoice_number", ""),
                r.data.get("date", ""),
                r.data.get("vendor_name", ""),
                r.data.get("total", ""),
                r.data.get("currency", "")
            ])
    elif doc_type == "receipt":
        headers = ["file", "store_name", "date", "total", "payment_method"]
        rows = []
        for r in successful:
            rows.append([
                r.file,
                r.data.get("store_name", ""),
                r.data.get("date", ""),
                r.data.get("total", ""),
                r.data.get("payment_method", "")
            ])
    else:
        return

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

    print(f"Saved CSV: {output_path}")

Running the Pipeline

if __name__ == "__main__":
    # Process a folder of invoices
    results = process_folder(
        folder="./documents/invoices",
        doc_type="invoice",
        model="deepseek-v4-flash",
        delay=0.3,
        output_dir="./output"
    )

    # Process receipts separately
    results = process_folder(
        folder="./documents/receipts",
        doc_type="receipt",
        model="deepseek-v4-flash",
        delay=0.3,
        output_dir="./output"
    )

Handling Difficult Documents

Some documents need extra attention. Here are patterns for tricky cases:

Multi-page Documents

If you have multi-page PDFs converted to images, process them together:

def process_multipage(image_paths: list[str], doc_type: str) -> ProcessingResult:
    content = [
        {"type": "text", "text": f"These images are pages of a single {doc_type}. " + PROMPTS[doc_type]}
    ]

    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })

    response = client.chat.completions.create(
        model="deepseek-v4-pro",  # Use Pro for multi-page reasoning
        messages=[{"role": "user", "content": content}],
        max_tokens=5000,
        temperature=0.1
    )

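    # Parse the reply the same way process_document does
    text = (response.choices[0].message.content or "").strip()
    if text.startswith("```"):
        text = text.partition("\n")[2].rsplit("```", 1)[0]

    filename = Path(image_paths[0]).name  # report under the first page's name
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        return ProcessingResult(file=filename, doc_type=doc_type,
                                error=f"Invalid JSON response: {e}")

    usage = response.usage
    return ProcessingResult(
        file=filename, doc_type=doc_type, data=data,
        input_tokens=usage.prompt_tokens, output_tokens=usage.completion_tokens,
        cost=calculate_cost(usage.prompt_tokens, usage.completion_tokens, "deepseek-v4-pro"),
    )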

Tables with Complex Formatting

Tables are where vision models truly outperform traditional OCR. Tesseract struggles with merged cells, color-coded rows, and headers that span columns. DeepSeek handles these naturally because it “sees” the table structure rather than parsing character positions:

def extract_table_to_csv(image_path: str, output_csv: str):
    result = process_document(image_path, "table", model="deepseek-v4-pro")

    if result.data and "headers" in result.data and "rows" in result.data:
        with open(output_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(result.data["headers"])
            writer.writerows(result.data["rows"])
        print(f"Table extracted to {output_csv}")
    else:
        print(f"Failed to extract table structure: {result.error or 'missing headers/rows'}")

Rate Limiting Strategy

DeepSeek’s rate limits depend on your plan. Here’s a simple fixed-interval throttle (lighter than a true token bucket, since it allows no bursts) that works for most setups:

class RateLimiter:
    def __init__(self, requests_per_minute: int = 50):
        self.rpm = requests_per_minute
        self.interval = 60.0 / requests_per_minute
        self.last_request = 0.0

    def wait(self):
        now = time.time()
        elapsed = now - self.last_request
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_request = time.time()

limiter = RateLimiter(requests_per_minute=50)

# Use it before each API call
limiter.wait()
response = client.chat.completions.create(...)

For high-volume processing (10,000+ documents), consider running multiple API keys in parallel or using async requests with asyncio and the AsyncOpenAI client.
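
A minimal sketch of the async approach, using only asyncio so the concurrency pattern is visible on its own. The worker argument is a hypothetical coroutine that you would write yourself, for example an async version of process_document built on AsyncOpenAI; process_concurrently and max_concurrency are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable

async def process_concurrently(
    paths: list[str],
    worker: Callable[[str], Awaitable[dict]],  # hypothetical async extractor
    max_concurrency: int = 5,
) -> list[dict]:
    """Run worker over all paths with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> dict:
        async with semaphore:
            try:
                return await worker(path)
            except Exception as exc:
                # Record the failure so one bad document does not abort the batch
                return {"file": path, "error": str(exc)}

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(guarded(p) for p in paths))
```

The semaphore caps how many requests are open at once, which is usually the simplest lever for staying under a requests-per-minute quota. Start with a low max_concurrency and raise it while watching for rate-limit errors.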

When to Use V4-Pro vs V4-Flash

After processing thousands of documents with both models, here’s my take:

V4-Flash (use for 90% of tasks):

  • Standard invoices and receipts
  • Printed text extraction
  • Simple forms with clear labels
  • Tables with regular structure

V4-Pro (use when Flash struggles):

  • Handwritten documents
  • Multi-page reasoning (connecting info across pages)
  • Complex layouts with overlapping elements
  • Documents in multiple languages
  • When you need the model to infer missing information

The accuracy difference is usually 2-5% in Pro’s favor. But at roughly 12x the price, you should start with Flash and only upgrade specific document types that show high error rates.

For a detailed accuracy comparison across different document types, see our benchmark comparison of DeepSeek vs GPT-4o vs Gemini.

FAQ

How accurate is DeepSeek Vision for OCR compared to Tesseract?

For printed text in good lighting, Tesseract and DeepSeek Vision are both above 95% accurate. The difference shows up with messy real-world documents: rotated images, mixed layouts, handwriting, and tables. DeepSeek Vision handles these cases far better because it understands document structure visually rather than trying to segment character positions.

What’s the maximum document size I can process?

The API accepts images up to 20MB. For very high-resolution scans (like 600dpi), you’ll want to resize to around 1500x2000 pixels. You won’t lose meaningful OCR accuracy, and you’ll save on upload time and token costs. The model’s internal processing resizes anyway.
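
Here is a sketch of that resize step. fit_within is plain arithmetic; resize_for_ocr assumes Pillow is installed (pip install Pillow). Both function names are hypothetical, and the 1500x2000 box is simply the figure suggested above.

```python
def fit_within(
    width: int, height: int, max_width: int = 1500, max_height: int = 2000
) -> tuple[int, int]:
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images already inside the box are returned unchanged (never upscaled).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))

def resize_for_ocr(src_path: str, dest_path: str) -> None:
    """Downscale a scan and save it as JPEG. Requires Pillow."""
    from PIL import Image  # imported lazily so fit_within works without Pillow

    with Image.open(src_path) as original:
        new_size = fit_within(*original.size)
        rgb = original.convert("RGB")  # JPEG cannot store alpha or palette modes
        resized = rgb.resize(new_size, Image.Resampling.LANCZOS)
        resized.save(dest_path, "JPEG", quality=90)
```

Running this before encode_image also means every file really is a JPEG by the time it reaches the API, which matches the "Resize/Convert to JPEG" step in the architecture diagram.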

Can I process PDFs directly?

Not directly. You’ll need to convert PDF pages to images first. pdf2image (Python package wrapping poppler) is the standard approach. Convert at 200-300 DPI for good results without excessive file sizes. Each page becomes a separate image that you process individually or as a group.
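
A short sketch of that conversion, assuming pdf2image and poppler are installed. convert_from_path is pdf2image's function; pdf_to_images, page_image_path, and the file naming scheme are hypothetical choices made for this example.

```python
from pathlib import Path

def page_image_path(pdf_path: str, page_number: int, out_dir: str) -> Path:
    """Build a per-page output name such as out/invoice_p001.jpg."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_p{page_number:03d}.jpg"

def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 250) -> list[str]:
    """Render each PDF page to a JPEG file. Requires pdf2image and poppler."""
    from pdf2image import convert_from_path  # lazy import: optional dependency

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    # convert_from_path returns one PIL image per page, in page order
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "JPEG")
        saved.append(str(target))
    return saved
```

The returned list of paths can be fed to process_document one page at a time, or passed as a group to process_multipage.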

How do I handle documents in non-Latin scripts?

DeepSeek Vision handles Chinese, Japanese, Korean, Arabic, Hindi, and most other scripts well. It’s particularly strong with CJK characters, unsurprisingly given DeepSeek’s training data. For Arabic and Hebrew (right-to-left), make sure your output handling preserves text direction in the JSON.

What about data privacy for sensitive documents?

All data sent to the API is processed on DeepSeek’s servers in China. If you’re handling HIPAA-protected health records, financial documents under strict compliance, or data subject to GDPR restrictions on cross-border transfer, you should consider self-hosting DeepSeek Vision locally instead. The open-weight models give you full control over where data goes.

How do I validate the extracted data?

Build a validation layer after extraction. Check that required fields exist, amounts parse as numbers, dates follow expected formats, and totals match line item sums. Flag documents that fail validation for human review rather than silently accepting potentially wrong data. A 95% automation rate with 5% human review is much better than 100% automation with silent errors.
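
A minimal sketch of such a validation layer for the invoice schema used in this guide. The required-field list, the ISO date format (YYYY-MM-DD), and the one-cent tolerance are all assumptions to adapt; validate_invoice is a hypothetical helper, not part of any library.

```python
from datetime import datetime

# Assumed minimum set of fields an invoice must carry to be accepted.
REQUIRED_INVOICE_FIELDS = ("invoice_number", "date", "vendor_name", "total")

def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the invoice passed."""
    problems = []

    for name in REQUIRED_INVOICE_FIELDS:
        if data.get(name) in (None, ""):
            problems.append(f"missing field: {name}")

    for name in ("subtotal", "tax", "total"):
        value = data.get(name)
        if value is not None and not _is_number(value):
            problems.append(f"not a number: {name}")

    date = data.get("date")
    if date:
        try:
            datetime.strptime(date, "%Y-%m-%d")  # assumed date format
        except (TypeError, ValueError):
            problems.append(f"unexpected date format: {date}")

    items = data.get("line_items") or []
    subtotal = data.get("subtotal")
    if items and _is_number(subtotal):
        item_sum = sum(
            i["total"] for i in items
            if isinstance(i, dict) and _is_number(i.get("total"))
        )
        if abs(item_sum - subtotal) > tolerance:
            problems.append(
                f"line items sum to {item_sum:.2f}, subtotal is {subtotal:.2f}"
            )
    return problems
```

Any result with a non-empty problem list can be routed to human review instead of being written straight to the CSV.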
