LLMs are excellent at extracting structured data from messy text — emails, PDFs, web pages, logs. Here’s how to do it reliably in production.
The pattern
Unstructured text → LLM + schema → Validated structured data → Your database/API
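The four stages above can be sketched as one small function. This is a minimal illustration, not a production implementation: `run_extraction`, `fake_llm`, `require_total`, and the in-memory `database` list are hypothetical names introduced here so the example runs without a provider or a real database.

```python
import json
from typing import Callable


def run_extraction(
    text: str,
    call_llm: Callable[[str], str],
    validate: Callable[[dict], dict],
    store: Callable[[dict], None],
) -> dict:
    """Unstructured text -> LLM + schema -> validated data -> storage."""
    raw_json = call_llm(text)                # LLM + schema: model returns a JSON string
    record = validate(json.loads(raw_json))  # validation: raises ValueError on bad data
    store(record)                            # only validated data reaches storage
    return record


# Hypothetical stand-ins for the model, the validator, and the database.
def fake_llm(text: str) -> str:
    return json.dumps({"vendor": "Acme", "total": 12.5})


def require_total(record: dict) -> dict:
    if not isinstance(record.get("total"), (int, float)):
        raise ValueError("total must be a number")
    return record


database: list[dict] = []
result = run_extraction(
    "Invoice from Acme, total 12.50", fake_llm, require_total, database.append
)
```

Passing the stages in as callables keeps the model, the validator, and the storage layer independently swappable, which matters once you start routing between providers.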
Step 1: Define your schema
import { z } from 'zod';

const InvoiceSchema = z.object({
  vendor: z.string(),
  invoice_number: z.string(),
  date: z.string(),
  line_items: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unit_price: z.number()
  })),
  total: z.number(),
  currency: z.enum(["USD", "EUR", "GBP"])
});
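If the extraction call is made from Python, the same shape has to be available as a plain JSON Schema. The dict below is a hand-written equivalent of the Zod schema, offered as a sketch. The variable name `invoice_schema` and the choice to set `additionalProperties` to `False` are assumptions: strict structured-output modes generally expect closed objects with every property listed as required, but check your provider's rules.

```python
# Hand-written JSON Schema mirroring the Zod InvoiceSchema field for field.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
                "additionalProperties": False,
            },
        },
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": [
        "vendor",
        "invoice_number",
        "date",
        "line_items",
        "total",
        "currency",
    ],
    "additionalProperties": False,
}
```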
Step 2: Extract with structured outputs
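from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# invoice_schema: a JSON Schema dict equivalent to InvoiceSchema from Step 1
response = client.chat.completions.create(
    model="gpt-5.4",  # or DeepSeek for cost savings
    response_format={
        "type": "json_schema",
        # the json_schema object needs a name alongside the schema itself
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the provided text. Be precise with numbers.",
        },
        {"role": "user", "content": raw_invoice_text},
    ],
)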
See our structured outputs guide for provider-specific setup.
Step 3: Validate and handle errors
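from pydantic import ValidationError

# Assumes InvoiceSchema here is a Pydantic model mirroring the Zod schema from Step 1.
raw = response.choices[0].message.content  # the JSON string returned by the model
try:
    data = InvoiceSchema.model_validate_json(raw)
except ValidationError as e:
    # Retry with error context
    retry_response = client.chat.completions.create(
        model="gpt-5.4",
        # ... stands for the original messages from Step 2
        messages=[..., {"role": "user", "content": f"Fix these validation errors: {e}"}]
    )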
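A single retry can itself return invalid data, so in practice the validate-and-retry step is usually a bounded loop. The sketch below uses only the standard library. `validate_invoice`, `extract_with_retry`, `scripted_llm`, and the default of three attempts are all assumptions made for illustration, and the validator checks only top-level fields, not each line item.

```python
import json
from typing import Callable

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def validate_invoice(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field in ("vendor", "invoice_number", "date"):
        if not isinstance(record.get(field), str):
            errors.append(f"{field}: expected string")
    if not isinstance(record.get("line_items"), list):
        errors.append("line_items: expected array")
    total = record.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        errors.append("total: expected number")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency: expected one of USD, EUR, GBP")
    return errors


def extract_with_retry(
    text: str, call_llm: Callable[[list[dict]], str], max_attempts: int = 3
) -> dict:
    messages = [
        {"role": "system", "content": "Extract invoice data from the provided text. Be precise with numbers."},
        {"role": "user", "content": text},
    ]
    errors = ["no attempts made"]
    for _ in range(max_attempts):
        raw = call_llm(messages)
        try:
            record = json.loads(raw)
            errors = validate_invoice(record) if isinstance(record, dict) else ["expected a JSON object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc}"]
        if not errors:
            return record
        # Feed the failed output and its errors back so the model can correct itself.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": "Fix these validation errors: " + "; ".join(errors)})
    raise ValueError(f"extraction failed after {max_attempts} attempts: {errors}")


# Hypothetical scripted model: the first reply has a string total, the second is valid.
replies = iter([
    '{"vendor": "Acme", "invoice_number": "INV-1", "date": "2024-01-05", "line_items": [], "total": "12.50", "currency": "USD"}',
    '{"vendor": "Acme", "invoice_number": "INV-1", "date": "2024-01-05", "line_items": [], "total": 12.5, "currency": "USD"}',
])
calls = []


def scripted_llm(messages: list[dict]) -> str:
    calls.append(len(messages))  # record how much context each attempt received
    return next(replies)


invoice = extract_with_retry("Acme invoice INV-1, total 12.50 USD", scripted_llm)
```

Capping the attempts keeps a persistently malformed document from burning tokens indefinitely; after the cap, the record can go to a manual review queue instead.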
Cost optimization
Extraction is a perfect use case for model routing:
- Simple extractions (name, email, date): DeepSeek at $0.27 per 1M tokens
- Complex extractions (nested data, ambiguous text): Claude or GPT-5
- High volume: Self-hosted Qwen 3.5 for no per-token API fees (you still pay for the hardware)
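The routing rules above can be expressed as a small decision function. This is a sketch under stated assumptions: the `Job` fields, the tier labels, the five-field cutoff, the 100,000 documents-per-day threshold, and the rule that volume takes precedence over complexity are all illustrative choices, not figures from a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    fields: int        # number of fields to extract
    nested: bool       # schema contains nested objects or arrays
    docs_per_day: int  # expected volume


def choose_tier(job: Job, high_volume_threshold: int = 100_000) -> str:
    """Pick a model tier for an extraction job (thresholds are assumptions)."""
    if job.docs_per_day >= high_volume_threshold:
        return "self-hosted"  # e.g. a self-hosted open-weights model
    if job.nested or job.fields > 5:
        return "frontier"     # e.g. Claude or GPT-5
    return "budget"           # e.g. DeepSeek


simple = choose_tier(Job(fields=3, nested=False, docs_per_day=500))
complex_job = choose_tier(Job(fields=6, nested=True, docs_per_day=500))
bulk = choose_tier(Job(fields=3, nested=False, docs_per_day=250_000))
```

In a real system you would likely tune these cutoffs against measured accuracy per tier and not rely on fixed numbers.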
With MCP
Build an MCP server that extracts data from any source:
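server.tool('extract_from_url', { url: z.string(), schema: z.string() }, async ({ url, schema }) => {
  // fetchAndExtract and llm.extract are app-specific helpers, not part of the MCP SDK
  const text = await fetchAndExtract(url);        // fetch the page and reduce it to plain text
  const result = await llm.extract(text, schema); // run the LLM extraction against the given schema
  return { content: [{ type: 'text', text: JSON.stringify(result) }] };
});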
Now any AI host can extract structured data from web pages.
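The tool above depends on a helper that turns a fetched page into plain text before the model sees it. The original does not show that helper, so the following is only a guess at its text-reduction half, written in Python with the standard library. `html_to_text` is a hypothetical name, and fetching the URL is deliberately left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Invoice INV-1</h1><p>Total: 12.50 USD</p></body></html>"
)
text = html_to_text(page)
```

Stripping markup before extraction reduces token usage and keeps scripts and styles from being mistaken for document content.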
For GDPR compliance
If you extract personal data, make sure you have a data processing agreement (DPA) in place with your LLM provider, or self-host the extraction model. See our GDPR compliance guide.