Reliable Data Extraction with LLMs: From Messy Text to Clean Data


LLMs are excellent at extracting structured data from messy text: emails, PDFs, web pages, support tickets, invoices, resumes. But getting reliable, production-grade extraction requires more than a prompt. You need schemas, validation, retry logic, and cost-aware model routing.

The extraction pipeline

Unstructured text → Preprocessing → LLM + schema → Validation → Retry if invalid → Clean data

Each step matters. Skip validation and you'll get hallucinated fields. Skip preprocessing and you'll waste tokens on irrelevant content. Skip retry logic and your pipeline breaks on the first ambiguous document.

Step 1: Define a strict schema

The schema is your contract. It tells the LLM exactly what structure to produce and gives you a validation target.

import { z } from 'zod';

const InvoiceSchema = z.object({
  vendor: z.string().min(1),
  invoice_number: z.string(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),  // ISO format
  line_items: z.array(z.object({
    description: z.string(),
    quantity: z.number().positive(),
    unit_price: z.number().nonnegative(),
  })).min(1),
  total: z.number().positive(),
  currency: z.enum(["USD", "EUR", "GBP", "JPY", "CAD"]),
  confidence: z.number().min(0).max(1),  // Let the LLM self-report confidence
});

Tips for schema design:

  • Add constraints (.min(1), .positive(), regex patterns): they catch hallucinations
  • Include a confidence field: it lets you flag low-confidence extractions for human review
  • Use enums for categorical fields: this prevents creative variations ("US Dollars" vs "USD")
  • Make optional fields explicit with .optional(): don't let the LLM invent data for missing fields

Step 2: Preprocess the input

Raw documents often contain noise that wastes tokens and confuses the model.

import re

def preprocess(text: str, max_tokens: int = 4000) -> str:
    # Collapse runs of blank lines and repeated spaces
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)

    # Remove headers/footers if PDF
    # (remove_repeated_headers is a helper you supply)
    text = remove_repeated_headers(text)

    # Truncate if too long (keep beginning and end, the most relevant parts for invoices)
    max_chars = max_tokens * 4  # rough estimate: about 4 characters per token
    if len(text) > max_chars:
        half = max_chars // 2
        text = text[:half] + "\n...[truncated]...\n" + text[-half:]

    return text.strip()
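
The `remove_repeated_headers` helper called in `preprocess` is not shown in the article. One possible sketch, assuming headers and footers show up as short lines repeated verbatim across pages (the repeat threshold and length cutoff are illustrative assumptions, not a standard API):

```python
from collections import Counter

def remove_repeated_headers(text: str, min_repeats: int = 3, max_len: int = 80) -> str:
    """Drop short lines that recur at least `min_repeats` times (likely page headers/footers)."""
    lines = text.split("\n")
    # Count non-empty lines, ignoring surrounding whitespace
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {
        line for line, n in counts.items()
        if n >= min_repeats and len(line) <= max_len
    }
    return "\n".join(line for line in lines if line.strip() not in repeated)

sample = (
    "ACME Inc. | Confidential\nItem A 10.00\n"
    "ACME Inc. | Confidential\nItem B 5.00\n"
    "ACME Inc. | Confidential\nTotal 15.00"
)
cleaned = remove_repeated_headers(sample)
```

A heuristic like this would also drop legitimate lines that happen to repeat three or more times, so the thresholds are worth tuning against real documents.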

For PDFs, use a layout-aware parser (like pdfplumber or unstructured) rather than raw text extraction. Table structure matters for invoices and financial documents.

Step 3: Extract with structured outputs

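Most LLM providers now offer a structured output mode. Some, such as OpenAI's strict JSON Schema mode, constrain the response to your schema; others, such as a plain JSON mode, only guarantee syntactically valid JSON.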

OpenAI / GPT:

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "schema": invoice_json_schema,
            "strict": True
        }
    },
    messages=[
        {"role": "system", "content": EXTRACTION_PROMPT},
        {"role": "user", "content": preprocessed_text}
    ]
)
data = json.loads(response.choices[0].message.content)
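
The `invoice_json_schema` value passed above is not defined in the article. A hand-written sketch that mirrors the Zod schema from Step 1 might look like the following. It assumes the provider's strict mode wants every property listed in `required` and `additionalProperties` set to `false` on each object. Value constraints such as the date pattern or positive numbers are left to the validation step here, since support for those keywords varies by provider.

```python
# Hypothetical JSON Schema mirroring the Zod InvoiceSchema from Step 1.
line_item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
    },
    "required": ["description", "quantity", "unit_price"],
    "additionalProperties": False,
}

invoice_json_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "description": "ISO date, YYYY-MM-DD"},
        "line_items": {"type": "array", "items": line_item_schema},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "CAD"]},
        "confidence": {"type": "number", "description": "Self-reported, 0 to 1"},
    },
    "required": [
        "vendor", "invoice_number", "date", "line_items",
        "total", "currency", "confidence",
    ],
    "additionalProperties": False,
}
```

Keeping the schema as a plain dictionary means the same object can be reused as the tool `input_schema` for providers that take structured output through tool calls.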

Anthropic / Claude:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    messages=[{"role": "user", "content": f"{EXTRACTION_PROMPT}\n\nDocument:\n{text}"}],
    # Here a forced tool call carries the structured output
    tools=[{
        "name": "extract_invoice",
        "description": "Extract invoice data",
        "input_schema": invoice_json_schema
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"}
)
# Pick the tool_use block rather than assuming it is first in the content list
data = next(block.input for block in response.content if block.type == "tool_use")

DeepSeek (cost-effective):

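# `client` is an OpenAI-compatible client configured for DeepSeek's endpoint
response = client.chat.completions.create(
    model="deepseek-chat",
    response_format={"type": "json_object"},  # valid JSON only; the schema is not enforced
    messages=[
        {"role": "system", "content": f"Extract data as JSON matching this schema: {schema_description}"},
        {"role": "user", "content": text}
    ]
)
data = json.loads(response.choices[0].message.content)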

See our structured outputs guide for provider-specific details.

Step 4: Validate and retry

Never trust LLM output without validation. Even with structured output mode, the content can be wrong (hallucinated values, wrong dates, impossible totals).

import math

from pydantic import ValidationError

MAX_RETRIES = 2

def extract_with_retry(text: str, schema, retries=MAX_RETRIES):
    # `schema` is a Pydantic model class; `call_llm` wraps one of the provider calls above
    for attempt in range(retries + 1):
        response = call_llm(text, schema)

        try:
            data = schema.model_validate_json(response.content)

            # Business logic validation (a tolerance avoids false failures from float rounding)
            expected = sum(item.quantity * item.unit_price for item in data.line_items)
            if not math.isclose(data.total, expected, abs_tol=0.01):
                raise ValueError(f"Total {data.total} doesn't match line items ({expected})")

            return data

        except (ValidationError, ValueError) as e:
            if attempt == retries:
                return None  # Flag for human review

            # Retry with error context
            text = f"{text}\n\nPrevious extraction had errors: {e}\nPlease fix and try again."

    return None

Key insight: Include the validation error in the retry prompt. The LLM can usually fix its own mistakes when told what went wrong.
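
To see the error-feedback loop run end to end without an API key, here is a self-contained sketch. The stub `stub_llm`, the dictionary-based validation, and the 0.01 tolerance are assumptions made for illustration; in a real pipeline a schema model and a provider call take their place. Amounts go through `Decimal` so that float rounding (for example `3 * 0.1`) does not trigger a spurious mismatch.

```python
import json
from decimal import Decimal

def validate_invoice(raw: str) -> dict:
    """Parse JSON and check that the total equals the sum of the line items."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    expected = sum(
        Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        for item in data["line_items"]
    )
    if abs(Decimal(str(data["total"])) - expected) > Decimal("0.01"):
        raise ValueError(f"total {data['total']} != sum of line items {expected}")
    return data

def extract_with_feedback(text: str, call_llm, retries: int = 2):
    """Call the model, validate, and retry with the validation error appended."""
    prompt = text
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            return validate_invoice(raw)
        except (ValueError, KeyError) as e:
            # Rebuild the prompt from the original text so it does not grow unbounded
            prompt = f"{text}\n\nPrevious extraction had errors: {e}\nPlease fix and try again."
    return None  # caller should flag the document for human review

# Usage with a stub that returns a wrong total first, then a corrected one.
replies = iter([
    '{"total": 10.0, "line_items": [{"quantity": 3, "unit_price": 0.1}]}',
    '{"total": 0.3, "line_items": [{"quantity": 3, "unit_price": 0.1}]}',
])
prompts_seen = []

def stub_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)

result = extract_with_feedback("Invoice: 3 x widget @ 0.10", stub_llm)
```

Real invoices often include tax, shipping, or discounts, so a production check would likely compare against a subtotal field or widen the rule accordingly.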

Step 5: Cost-aware model routing

Not all extractions need GPT-4. Route based on complexity:

def choose_model(text: str, schema_complexity: int) -> str:
    # Simple extractions: name, email, date, single values
    if schema_complexity <= 3 and len(text) < 500:
        return "deepseek-chat"  # $0.27/1M input tokens
    
    # Medium: nested objects, multiple fields
    if schema_complexity <= 8:
        return "gpt-4o-mini"  # $0.15/1M input tokens
    
    # Complex: ambiguous text, many nested arrays, requires reasoning
    return "gpt-4o"  # $2.50/1M input tokens
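
The `schema_complexity` argument to `choose_model` is left undefined in the article. One hypothetical way to derive it is to count the nodes in the JSON Schema tree, so that nested objects and arrays raise the score. The scoring rule below is an assumption for illustration, and the thresholds in `choose_model` would need calibrating against it.

```python
def schema_complexity(schema: dict) -> int:
    """Score one point per schema node: each field, nested object, and array."""
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + sum(schema_complexity(child) for child in children)
    if kind == "array":
        return 1 + schema_complexity(schema.get("items", {}))
    return 1  # scalar field (string, number, boolean, ...)

contact_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
}
nested_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                },
            },
        },
    },
}

simple_score = schema_complexity(contact_schema)
nested_score = schema_complexity(nested_schema)
```

With this rule the flat contact schema scores 3 and the nested one scores 6, so they would fall on either side of the first threshold in `choose_model`.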

For high-volume extraction (thousands of documents), consider self-hosting a model like Qwen 3.5, which has zero per-token cost after infrastructure.

Production patterns

Batch processing with progress tracking

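import asyncio

async def extract_batch(documents: list[str], schema):
    results = []
    for i, doc in enumerate(documents):
        # extract_with_retry is synchronous, so run it in a worker thread
        result = await asyncio.to_thread(extract_with_retry, doc, schema)
        results.append({
            "document_index": i,
            "data": result.model_dump() if result else None,
            "needs_review": result is None or result.confidence < 0.8
        })
        print(f"Processed {i + 1}/{len(documents)}")  # progress tracking
    return results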
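
The loop above handles one document at a time. For larger batches, a bounded-concurrency variant may be worth considering. The sketch below is an assumption-laden illustration: `extract_one` stands in for any synchronous extraction function, `max_concurrency` and `on_progress` are hypothetical parameters, and a failed document is recorded as needing review rather than aborting the batch.

```python
import asyncio

async def extract_batch_concurrent(documents, extract_one, max_concurrency=5, on_progress=None):
    """Run extract_one over documents with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    done = 0

    async def worker(index, doc):
        nonlocal done
        async with semaphore:
            try:
                # Run the blocking call in a thread so the event loop stays free
                result = await asyncio.to_thread(extract_one, doc)
            except Exception:
                result = None  # one bad document should not stop the batch
        done += 1
        if on_progress:
            on_progress(done, len(documents))
        return {"document_index": index, "data": result, "needs_review": result is None}

    # gather returns results in the same order as the input documents
    return await asyncio.gather(*(worker(i, d) for i, d in enumerate(documents)))

def fake_extract(doc: str) -> dict:
    if doc == "bad":
        raise ValueError("unparseable document")
    return {"length": len(doc)}

progress = []
results = asyncio.run(
    extract_batch_concurrent(
        ["abc", "bad", "hello"],
        fake_extract,
        max_concurrency=2,
        on_progress=lambda finished, total: progress.append((finished, total)),
    )
)
```

The concurrency limit should stay below the provider's rate limits; the value 5 is only a placeholder.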

Confidence-based human-in-the-loop

extraction = extract_invoice(text)

if extraction.confidence >= 0.95:
    save_to_database(extraction)  # Auto-approve
elif extraction.confidence >= 0.7:
    queue_for_review(extraction)  # Human verifies
else:
    queue_for_manual_entry(text)  # Too uncertain

Handling multi-page documents

For long documents (contracts, reports), split into sections and extract from each:

def extract_from_long_document(pages: list[str], schema):
    # Extract from each page independently
    page_results = [extract(page, schema) for page in pages]
    
    # Merge results (deduplicate, resolve conflicts)
    return merge_extractions(page_results)
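
The `merge_extractions` function is not defined in the article. A minimal sketch, assuming each page result is a dictionary (or `None` when extraction failed) and that a given field has the same type on every page: list fields are concatenated without duplicates, scalar fields keep the first non-null value, and disagreements are recorded under a hypothetical `_conflicts` key for later review.

```python
def merge_extractions(page_results: list) -> dict:
    """Merge per-page dicts: union list fields, keep first scalar, record conflicts."""
    merged = {}
    conflicts = {}
    for page in page_results:
        if not page:
            continue  # skip pages where extraction failed
        for key, value in page.items():
            if value is None:
                continue
            if isinstance(value, list):
                bucket = merged.setdefault(key, [])
                for item in value:
                    if item not in bucket:  # deduplicate repeated items
                        bucket.append(item)
            elif key not in merged:
                merged[key] = value
            elif merged[key] != value:
                # Keep the first value, but remember every differing one
                conflicts.setdefault(key, [merged[key]]).append(value)
    merged["_conflicts"] = conflicts
    return merged

pages = [
    {"vendor": "ACME", "total": None, "line_items": [{"description": "A"}]},
    None,
    {"vendor": "Acme Corp", "total": 15.0,
     "line_items": [{"description": "A"}, {"description": "B"}]},
]
merged = merge_extractions(pages)
```

Keeping the first value is an arbitrary tie-break; preferring the page with the higher confidence score would be a reasonable alternative.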

GDPR considerations

If extracting personal data (names, emails, addresses), ensure compliance:

  • Your LLM provider must have a Data Processing Agreement (DPA)
  • Consider self-hosting for sensitive data
  • Log what data was extracted and by which model
  • Implement data retention policies on extracted results
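
For the logging point in the list above, one option is to record which fields were extracted and by which model without copying the personal values themselves into the log. The sketch below is illustrative only: the record layout and the choice to identify the document by a SHA-256 hash are assumptions, and whether that satisfies your obligations is a question for your own compliance review.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(document_text: str, extracted: dict, model: str) -> dict:
    """Describe an extraction without storing the extracted values."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Hash identifies the source document without keeping its content
        "document_sha256": hashlib.sha256(document_text.encode("utf-8")).hexdigest(),
        # Field names only, and only those that actually had a value
        "fields_extracted": sorted(k for k, v in extracted.items() if v is not None),
    }

record = audit_record(
    "Invoice from jane@example.com",
    {"vendor": "ACME", "email": "jane@example.com", "phone": None},
    model="gpt-4o-mini",
)
log_line = json.dumps(record)  # one JSON object per log line
```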

See our GDPR compliance guide for AI and AI GDPR developer guide.

FAQ

How accurate is LLM extraction compared to traditional NER/regex?

For well-structured documents (invoices, receipts), LLMs can reach 95-99% accuracy, which is significantly better than regex on varied formats. For ambiguous text (emails, chat logs), accuracy depends on prompt quality and schema design. Traditional NER is faster but only works for predefined entity types.

Should I fine-tune a model for extraction?

Only if you have thousands of labeled examples AND the base model isn't accurate enough. For most use cases, good prompts + structured outputs + validation achieve production-grade accuracy without the cost of fine-tuning.

How do I handle extraction from images/scans?

Use a multimodal model (GPT-4o, Claude) that accepts images directly, or run OCR first (Tesseract, AWS Textract) and extract from the resulting text. Multimodal models handle layout better but cost more per call.

What's the token cost for extracting from a typical invoice?

A one-page invoice is roughly 500-1000 input tokens plus 200-500 output tokens. Counting input tokens at the rates listed in Step 5, that is about $0.002 per invoice with GPT-4o and about $0.0002 with DeepSeek; output tokens are billed separately and add to this. At scale (10,000 invoices/month), model choice matters significantly.
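
The arithmetic behind those figures can be sketched as follows. The input prices are the ones quoted in the routing code in Step 5; the output price is left as a parameter because the article does not list one, and the 1,000-token invoice is the upper end of the estimate above.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m=0.0):
    """Return the dollar cost of one call, given prices per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Input-only cost of a 1,000-token invoice at the rates quoted in Step 5
gpt4o_per_invoice = estimate_cost(1000, 0, 2.50)
deepseek_per_invoice = estimate_cost(1000, 0, 0.27)

# Monthly input cost for 10,000 invoices
gpt4o_monthly = gpt4o_per_invoice * 10_000
deepseek_monthly = deepseek_per_invoice * 10_000
```

Under those assumptions the monthly input bill comes to about $25 for GPT-4o and about $2.70 for DeepSeek, before output tokens and retries are counted.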