Reliable Data Extraction with LLMs: From Messy Text to Clean Data
LLMs are excellent at extracting structured data from messy text: emails, PDFs, web pages, support tickets, invoices, resumes. But getting reliable, production-grade extraction requires more than a prompt. You need schemas, validation, retry logic, and cost-aware model routing.
The extraction pipeline
Unstructured text → Preprocessing → LLM + schema → Validation → Retry if invalid → Clean data
Each step matters. Skip validation and you'll get hallucinated fields. Skip preprocessing and you'll waste tokens on irrelevant content. Skip retry logic and your pipeline breaks on the first ambiguous document.
Step 1: Define a strict schema
The schema is your contract. It tells the LLM exactly what structure to produce and gives you a validation target.
import { z } from 'zod';

const InvoiceSchema = z.object({
  vendor: z.string().min(1),
  invoice_number: z.string(),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/), // ISO 8601 date (YYYY-MM-DD)
  line_items: z.array(z.object({
    description: z.string(),
    quantity: z.number().positive(),
    unit_price: z.number().nonnegative(), // zero is allowed, e.g. for free items
  })).min(1),
  total: z.number().positive(),
  currency: z.enum(["USD", "EUR", "GBP", "JPY", "CAD"]),
  confidence: z.number().min(0).max(1), // Let the LLM self-report confidence
});
Tips for schema design:
- Add constraints (.min(1), .positive(), regex patterns): they catch hallucinations
- Include a confidence field: it lets you flag low-confidence extractions for human review
- Use enums for categorical fields: this prevents creative variations ("US Dollars" vs "USD")
- Make optional fields explicit with .optional(): don't let the LLM invent data for missing fields
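To make the effect of these constraints concrete, here is a minimal stdlib-only Python sketch of the same kind of checks. The function name `check_invoice_fields` and both sample payloads are hypothetical; in a real pipeline a schema library such as Zod or Pydantic would do this work.

```python
import re

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "JPY", "CAD"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def check_invoice_fields(data: dict) -> list[str]:
    """Return a list of constraint violations (empty means the payload passed)."""
    problems = []
    if not str(data.get("vendor", "")).strip():
        problems.append("vendor must be a non-empty string")
    if not ISO_DATE.match(str(data.get("date", ""))):
        problems.append("date must be YYYY-MM-DD")
    if data.get("currency") not in ALLOWED_CURRENCIES:
        problems.append("currency must be one of the allowed codes")
    total = data.get("total")
    if not isinstance(total, (int, float)) or total <= 0:
        problems.append("total must be a positive number")
    return problems

# Hypothetical model outputs: the second one uses free-form date and currency values.
good = {"vendor": "Acme", "date": "2025-03-01", "currency": "USD", "total": 120.0}
bad = {"vendor": "Acme", "date": "March 1, 2025", "currency": "US Dollars", "total": 120.0}

print(check_invoice_fields(good))
print(check_invoice_fields(bad))
```

The second payload is plausible-looking output that a loose schema would accept; the regex and the enum are what turn it into a detectable failure.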
Step 2: Preprocess the input
Raw documents often contain noise that wastes tokens and confuses the model.
import re

def preprocess(text: str, max_tokens: int = 4000) -> str:
    # Collapse runs of blank lines and repeated spaces
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    # Remove headers/footers if PDF (remove_repeated_headers is a helper you supply)
    text = remove_repeated_headers(text)
    # Truncate if too long (keep beginning and end, the most relevant parts of an invoice)
    if len(text) > max_tokens * 4:  # rough estimate: about 4 characters per token
        half = max_tokens * 2
        text = text[:half] + "\n...[truncated]...\n" + text[-half:]
    return text.strip()
For PDFs, use a layout-aware parser (like pdfplumber or unstructured) rather than raw text extraction. Table structure matters for invoices and financial documents.
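The preprocessing function calls `remove_repeated_headers` without defining it. One possible sketch follows, under two assumptions that are not from the article: pages are separated by a form feed character (as some PDF-to-text tools emit), and any line that appears on at least `min_pages` pages is a header or footer. Both parameter names are hypothetical.

```python
from collections import Counter

def remove_repeated_headers(text: str, page_sep: str = "\f", min_pages: int = 3) -> str:
    """Drop lines that appear on at least `min_pages` distinct pages."""
    pages = text.split(page_sep)
    if len(pages) < min_pages:
        return text  # too few pages to tell headers from content
    # Count on how many distinct pages each non-empty line appears.
    seen_on = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            seen_on[line] += 1
    repeated = {line for line, n in seen_on.items() if n >= min_pages}
    # Rebuild each page without the repeated lines, then join pages with newlines.
    cleaned = [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]
    return "\n".join(cleaned)

doc = "ACME Corp\nItem A\fACME Corp\nItem B\fACME Corp\nItem C"
print(remove_repeated_headers(doc))
```

Exact matching will miss footers that change per page, such as "Page 2 of 7"; stripping digits before counting is one way to extend the heuristic.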
Step 3: Extract with structured outputs
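Most LLM providers now support structured output modes. The strictest ones constrain the response to your JSON Schema; plain JSON modes only guarantee syntactically valid JSON, so check which one you are getting.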
OpenAI / GPT:
import json
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "schema": invoice_json_schema,
            "strict": True
        }
    },
    messages=[
        {"role": "system", "content": EXTRACTION_PROMPT},
        {"role": "user", "content": preprocessed_text}
    ]
)
data = json.loads(response.choices[0].message.content)
Anthropic / Claude:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    messages=[{"role": "user", "content": f"{EXTRACTION_PROMPT}\n\nDocument:\n{text}"}],
    # Claude uses tool_use for structured outputs
    tools=[{
        "name": "extract_invoice",
        "description": "Extract invoice data",
        "input_schema": invoice_json_schema
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"}
)
# Pick the tool_use block rather than assuming it is first in the content list
data = next(block.input for block in response.content if block.type == "tool_use")
DeepSeek (cost-effective):
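import json
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; DEEPSEEK_API_KEY is a placeholder for your key
client = OpenAI(api_key=DEEPSEEK_API_KEY, base_url="https://api.deepseek.com")
response = client.chat.completions.create(
    model="deepseek-chat",
    response_format={"type": "json_object"},  # valid JSON only; the schema is not enforced
    messages=[
        # JSON mode expects the prompt itself to ask for JSON
        {"role": "system", "content": f"Extract data as JSON matching this schema: {schema_description}"},
        {"role": "user", "content": text}
    ]
)
data = json.loads(response.choices[0].message.content)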
See our structured outputs guide for provider-specific details.
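The snippets above pass an `invoice_json_schema` that is never shown. A sketch of what it might look like, mirroring the Zod schema from Step 1, is below. It assumes a strict mode that wants every property listed in `required` and `additionalProperties` set to false on each object, which is how OpenAI's strict mode is documented to behave. Numeric ranges and the date regex are deliberately left out here on the assumption that Step 4 enforces them, since support for those keywords varies by provider.

```python
import json

# Schema for one entry of line_items.
line_item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
    },
    "required": ["description", "quantity", "unit_price"],
    "additionalProperties": False,
}

invoice_json_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "description": "ISO date, YYYY-MM-DD"},
        "line_items": {"type": "array", "items": line_item_schema},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "CAD"]},
        "confidence": {"type": "number"},
    },
    "required": [
        "vendor", "invoice_number", "date", "line_items",
        "total", "currency", "confidence",
    ],
    "additionalProperties": False,
}

# A text form of the schema, usable as schema_description in a plain JSON-mode prompt.
schema_text = json.dumps(invoice_json_schema, indent=2)
```

The same dictionary can serve all three providers: as `schema` for OpenAI, as `input_schema` for Claude, and serialized into the prompt for JSON-mode models.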
Step 4: Validate and retry
Never trust LLM output without validation. Even with structured output mode, the content can be wrong (hallucinated values, wrong dates, impossible totals).
from pydantic import ValidationError

MAX_RETRIES = 2

def extract_with_retry(text: str, schema, retries=MAX_RETRIES):
    prompt = text
    for attempt in range(retries + 1):
        response = call_llm(prompt, schema)
        try:
            data = schema.model_validate_json(response.content)
            # Business logic validation (compare with a tolerance: floats rarely match exactly)
            computed = sum(item.quantity * item.unit_price for item in data.line_items)
            if abs(data.total - computed) > 0.01:
                raise ValueError(f"Total {data.total} doesn't match line items ({computed:.2f})")
            return data
        except (ValidationError, ValueError) as e:
            # Retry with error context, rebuilt from the original text each time
            prompt = f"{text}\n\nPrevious extraction had errors: {e}\nPlease fix and try again."
    return None  # Retries exhausted: flag for human review
Key insight: Include the validation error in the retry prompt. The LLM can usually fix its own mistakes when told what went wrong.
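The feedback loop can be exercised without any API calls. The sketch below swaps the real model for a hypothetical `fake_llm` stub that returns a wrong total until it sees the error message in its prompt; `validate` and `extract_with_feedback` are illustrative names, not part of any library.

```python
import json

def validate(payload: dict) -> None:
    """Raise ValueError when the total disagrees with the line items."""
    computed = sum(i["quantity"] * i["unit_price"] for i in payload["line_items"])
    if abs(payload["total"] - computed) > 0.01:
        raise ValueError(f"total {payload['total']} != sum of line items {computed:.2f}")

def fake_llm(prompt: str) -> str:
    """Stand-in for a model: wrong total at first, corrected once it sees feedback."""
    total = 30.0 if "Previous extraction had errors" in prompt else 25.0
    return json.dumps({"line_items": [{"quantity": 3, "unit_price": 10.0}], "total": total})

def extract_with_feedback(text: str, llm, retries: int = 2):
    """Return (payload, number_of_calls); payload is None if every attempt failed."""
    prompt = text
    for attempt in range(retries + 1):
        try:
            payload = json.loads(llm(prompt))  # malformed JSON raises a ValueError subclass
            validate(payload)
            return payload, attempt + 1
        except (ValueError, KeyError) as err:
            # Always rebuild from the original text so feedback does not pile up.
            prompt = f"{text}\n\nPrevious extraction had errors: {err}\nPlease fix and try again."
    return None, retries + 1

result, calls = extract_with_feedback("Invoice: 3 widgets at $10", fake_llm)
print(result, calls)
```

A stub like this is also a convenient unit test fixture: it lets you check the retry budget and the review fallback without spending tokens.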
Step 5: Cost-aware model routing
Not all extractions need GPT-4. Route based on complexity:
def choose_model(text: str, schema_complexity: int) -> str:
    # Simple extractions: name, email, date, single values
    if schema_complexity <= 3 and len(text) < 500:
        return "deepseek-chat"  # $0.27/1M input tokens
    # Medium: nested objects, multiple fields
    if schema_complexity <= 8:
        return "gpt-4o-mini"  # $0.15/1M input tokens
    # Complex: ambiguous text, many nested arrays, requires reasoning
    return "gpt-4o"  # $2.50/1M input tokens
For high-volume extraction (thousands of documents), consider self-hosting a model like Qwen 3.5: zero per-token cost after infrastructure.
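The routing function takes a `schema_complexity` score, but the article does not say how to compute it. One plausible heuristic, offered purely as an assumption, is to walk the JSON Schema and charge each field its nesting depth, so deeply nested fields count for more than top-level ones.

```python
def schema_complexity(schema: dict, depth: int = 1) -> int:
    """Score a JSON Schema: each field costs its nesting depth."""
    score = 0
    for prop in schema.get("properties", {}).values():
        score += depth
        if prop.get("type") == "object":
            score += schema_complexity(prop, depth + 1)
        elif prop.get("type") == "array" and isinstance(prop.get("items"), dict):
            score += schema_complexity(prop["items"], depth + 1)
    return score

# Two hypothetical schemas: a flat contact record and a small nested invoice.
contact = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
}
invoice = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                },
            },
        },
    },
}

print(schema_complexity(contact), schema_complexity(invoice))
```

Whatever scoring you choose, calibrate the thresholds against a labeled sample: the cutoffs of 3 and 8 only mean something relative to the scale that produces the score.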
Production patterns
Batch processing with progress tracking
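import asyncio

async def extract_batch(documents: list[str], schema):
    results = []
    for i, doc in enumerate(documents):
        # extract_with_retry is synchronous, so run it in a worker thread
        result = await asyncio.to_thread(extract_with_retry, doc, schema)
        results.append({
            "document_index": i,
            "data": result.model_dump() if result else None,
            "needs_review": result is None or result.confidence < 0.8
        })
        print(f"Processed {i + 1}/{len(documents)}")  # progress tracking
    return results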
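The loop above handles one document at a time. For larger batches, a common variation is to run several extractions concurrently while capping the number in flight, which helps stay under provider rate limits. The sketch below assumes an async extractor; `extract_concurrently` and `fake_extract` are hypothetical names used only for illustration.

```python
import asyncio

async def extract_concurrently(documents, extract_one, max_concurrency: int = 5):
    """Run `extract_one` over all documents, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(index: int, doc: str):
        async with semaphore:
            try:
                data = await extract_one(doc)
            except Exception:
                data = None  # one bad document should not abort the batch
        return {"document_index": index, "data": data, "needs_review": data is None}

    # gather() returns results in the order the tasks were passed in
    return await asyncio.gather(*(worker(i, d) for i, d in enumerate(documents)))

async def fake_extract(doc: str):
    """Hypothetical extractor used only for this demo."""
    await asyncio.sleep(0)
    if not doc:
        raise ValueError("empty document")
    return {"length": len(doc)}

results = asyncio.run(
    extract_concurrently(["abc", "", "hello"], fake_extract, max_concurrency=2)
)
print(results)
```

The empty document fails inside its own worker and comes back flagged for review, while the other two results keep their original positions.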
Confidence-based human-in-the-loop
extraction = extract_invoice(text)
if extraction.confidence >= 0.95:
    save_to_database(extraction)  # Auto-approve
elif extraction.confidence >= 0.7:
    queue_for_review(extraction)  # Human verifies
else:
    queue_for_manual_entry(text)  # Too uncertain
Handling multi-page documents
For long documents (contracts, reports), split into sections and extract from each:
def extract_from_long_document(pages: list[str], schema):
    # Extract from each page independently
    page_results = [extract(page, schema) for page in pages]
    # Merge results (deduplicate, resolve conflicts)
    return merge_extractions(page_results)
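What `merge_extractions` does is left open. A minimal sketch of one merge policy follows, assuming each page result is a plain dict (or None for a failed page) and that a field keeps the same type across pages. Under this policy the first non-empty scalar wins, lists are concatenated without duplicates, and disagreements are recorded under a hypothetical `_conflicts` key rather than silently dropped.

```python
from typing import Optional

def merge_extractions(page_results: list[Optional[dict]]) -> dict:
    """Merge per-page dicts: first non-empty scalar wins, lists are concatenated."""
    merged: dict = {}
    conflicts: dict = {}
    for page in page_results:
        for key, value in (page or {}).items():
            if value in (None, "", []):
                continue  # nothing extracted for this field on this page
            if isinstance(value, list):
                existing = merged.setdefault(key, [])
                for item in value:
                    if item not in existing:  # deduplicate repeated items
                        existing.append(item)
            elif key not in merged:
                merged[key] = value
            elif merged[key] != value:
                conflicts.setdefault(key, []).append(value)
    merged["_conflicts"] = conflicts  # surface disagreements for human review
    return merged

pages = [
    {"vendor": "Acme", "total": None, "line_items": [{"description": "A"}]},
    None,  # a page where extraction failed
    {"vendor": "ACME Inc.", "total": 50.0,
     "line_items": [{"description": "A"}, {"description": "B"}]},
]
merged = merge_extractions(pages)
print(merged)
```

"First value wins" is only one option; for fields like totals, which usually appear on the last page, preferring the last non-empty value may be the better rule.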
GDPR considerations
If extracting personal data (names, emails, addresses), ensure compliance:
- Your LLM provider must have a Data Processing Agreement (DPA)
- Consider self-hosting for sensitive data
- Log what data was extracted and by which model
- Implement data retention policies on extracted results
See our GDPR compliance guide for AI and AI GDPR developer guide.
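The logging point can be sketched as an audit record written per extraction. The version below is an illustration, not compliance advice: it records which fields were extracted and by which model, but not the values themselves. The function name, the record layout, and the choice to identify the document by a SHA-256 hash are all assumptions, and whether a hash is acceptable in your setting is a question for your own compliance review.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(document_text: str, extracted: dict, model: str) -> str:
    """Build a JSON log line describing an extraction without storing the values."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # A hash identifies the source document without copying its contents.
        "document_sha256": hashlib.sha256(document_text.encode("utf-8")).hexdigest(),
        # Field names only: the extracted values stay out of the log.
        "fields_extracted": sorted(k for k, v in extracted.items() if v is not None),
    }
    return json.dumps(record)

line = audit_record("Dear Jane Doe ...", {"name": "Jane Doe", "email": None}, "gpt-4o")
print(line)
```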
FAQ
How accurate is LLM extraction compared to traditional NER/regex?
For well-structured documents (invoices, receipts), LLMs achieve 95-99% accuracy, significantly better than regex for varied formats. For ambiguous text (emails, chat logs), accuracy depends on prompt quality and schema design. Traditional NER is faster but only works for predefined entity types.
Should I fine-tune a model for extraction?
Only if you have thousands of labeled examples AND the base model isn't accurate enough. For most use cases, good prompts + structured outputs + validation achieves production-grade accuracy without fine-tuning cost.
How do I handle extraction from images/scans?
Use a multimodal model (GPT-4o, Claude) that accepts images directly, or run OCR first (Tesseract, AWS Textract) and extract from the resulting text. Multimodal models handle layout better but cost more per call.
What's the token cost for extracting from a typical invoice?
A one-page invoice is roughly 500-1000 tokens input + 200-500 tokens output. At GPT-4o rates: ~$0.002 per invoice. At DeepSeek rates: ~$0.0002 per invoice. At scale (10,000 invoices/month), model choice matters significantly.
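The arithmetic behind estimates like these is easy to script. The sketch below uses the input prices quoted in Step 5; output prices are not given in this article, so `output_price_per_m` defaults to zero and should be filled in from your provider's current pricing page. The function name is hypothetical.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float = 0.0) -> float:
    """Cost in dollars for one call, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Input-side cost only, for a 1000-token invoice, using the input prices from Step 5.
gpt4o_input = estimate_cost(1000, 0, input_price_per_m=2.50)
deepseek_input = estimate_cost(1000, 0, input_price_per_m=0.27)

# Difference in input cost alone across 10,000 invoices per month.
monthly_gap = 10_000 * (gpt4o_input - deepseek_input)
print(f"{gpt4o_input:.5f} {deepseek_input:.5f} {monthly_gap:.2f}")
```

Output tokens are typically billed at a higher rate than input tokens, so the real per-invoice figure will be above the input-only number once the output price is set.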