Reliable Data Extraction with LLMs — From Messy Text to Clean Data


LLMs are excellent at extracting structured data from messy text — emails, PDFs, web pages, logs. Here’s how to do it reliably in production.

The pattern

Unstructured text → LLM + schema → Validated structured data → Your database/API

Step 1: Define your schema

import { z } from 'zod';

const InvoiceSchema = z.object({
  vendor: z.string(),
  invoice_number: z.string(),
  date: z.string(),
  line_items: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unit_price: z.number()
  })),
  total: z.number(),
  currency: z.enum(["USD", "EUR", "GBP"])
});
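Step 2 passes an `invoice_schema` object to the API from Python, so the zod schema needs a JSON Schema counterpart. Below is a hand-written sketch of what that counterpart could look like. The `additionalProperties: False` entries are an assumption aimed at strict structured-output modes, and the name `line_item_schema` is introduced here purely for readability.

```python
# Hand-written JSON Schema mirror of the zod InvoiceSchema (a sketch, not generated output).
line_item_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
    },
    # zod object fields are required unless marked optional
    "required": ["description", "quantity", "unit_price"],
    "additionalProperties": False,  # assumption: reject keys the schema does not name
}

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "line_items": {"type": "array", "items": line_item_schema},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["vendor", "invoice_number", "date", "line_items", "total", "currency"],
    "additionalProperties": False,  # assumption, as above
}
```

Keeping the two definitions in sync by hand is error-prone, so in a real project you would probably generate one from the other.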

Step 2: Extract with structured outputs

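from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# invoice_schema: the JSON Schema equivalent of InvoiceSchema from Step 1
response = client.chat.completions.create(
    model="gpt-5.4",  # or DeepSeek for cost savings
    response_format={
        "type": "json_schema",
        # "name" is required by the API; "invoice" is an arbitrary label
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
    messages=[{
        "role": "system",
        "content": "Extract invoice data from the provided text. Be precise with numbers."
    }, {
        "role": "user",
        "content": raw_invoice_text
    }]
)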

See our structured outputs guide for provider-specific setup.

Step 3: Validate and handle errors

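import json
from jsonschema import validate, ValidationError

# InvoiceSchema from Step 1 is a zod (TypeScript) schema, so on the Python side
# we check the same JSON Schema that Step 2 sent, using the jsonschema package.
try:
    data = json.loads(response.choices[0].message.content)
    validate(instance=data, schema=invoice_schema)
except (json.JSONDecodeError, ValidationError) as e:
    # Retry with error context; "..." stands for the earlier messages
    retry_response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[..., {"role": "user", "content": f"Fix these validation errors: {e}"}]
    )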
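A single retry may not be enough, and an unbounded one can loop forever. The sketch below wraps the validate-and-retry idea in a bounded loop. It is an illustration under stated assumptions: `extract_with_retries`, `call_llm`, and `check` are hypothetical names, `call_llm` takes a message list and returns the model's raw text, and `check` raises `ValueError` on invalid data (wrap your validator if it raises something else).

```python
import json
from typing import Callable

def extract_with_retries(
    text: str,
    call_llm: Callable[[list], str],   # hypothetical: messages in, raw model text out
    check: Callable[[dict], None],     # hypothetical: raises ValueError if data is invalid
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate, and feed errors back until valid or out of attempts."""
    messages = [
        {"role": "system",
         "content": "Extract invoice data from the provided text. Be precise with numbers."},
        {"role": "user", "content": text},
    ]
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(messages)
        try:
            data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
            check(data)
            return data
        except ValueError as e:
            last_error = e
            # Show the model its own output plus the error so it can correct itself.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"Fix these validation errors: {e}"},
            ]
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error
```

Passing the model call and the validator in as functions keeps the loop independent of any one provider SDK and makes it easy to test with a fake model.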

Cost optimization

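Extraction is a good fit for model routing: send each document to the cheapest model that can handle it, and escalate to a stronger model only when needed.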
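One possible routing rule, sketched below, is to try models from cheapest to most capable and stop at the first one whose output passes validation. This rule is an assumption for illustration, not a prescription from this article. `route_extraction` and `extract` are hypothetical names, and `extract(model, text)` is assumed to return validated data or `None` on failure.

```python
from typing import Callable, Optional

def route_extraction(
    text: str,
    models: list,                                     # model names, cheapest first
    extract: Callable[[str, str], Optional[dict]],    # hypothetical: (model, text) -> data or None
) -> tuple:
    """Return (model_used, data) from the first model whose output validates."""
    for model in models:
        data = extract(model, text)
        if data is not None:
            return model, data
    raise RuntimeError("no model produced valid output")
```

With this shape, most documents never reach the expensive model, and the returned model name lets you log how often escalation happens.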

With MCP

Build an MCP server that extracts data from any source:

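// fetchAndExtract and llm.extract are your own helpers, not part of the MCP SDK
server.tool('extract_from_url', { url: z.string(), schema: z.string() }, async ({ url, schema }) => {
  const text = await fetchAndExtract(url);         // download the page and reduce it to plain text
  const result = await llm.extract(text, schema);  // run the LLM extraction against the caller's schema
  return { content: [{ type: 'text', text: JSON.stringify(result) }] };
});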

Now any MCP-compatible AI host can call this tool to extract structured data from web pages.
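The handler relies on a `fetchAndExtract` helper that the article does not define. As a rough idea of what such a helper might do, here is a standard-library Python sketch that downloads a page and keeps only its visible text. The names `html_to_text` and `fetch_and_extract` are hypothetical, and real pages often need a more robust HTML-to-text step than this.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Reduce an HTML document to its text nodes, joined by single spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)

def fetch_and_extract(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its text (sketch: no retries, no size limits)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```

Stripping markup before calling the model keeps the prompt short, which matters for both cost and accuracy on long pages.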

For GDPR compliance

If you extract personal data, make sure you have a data processing agreement (DPA) in place with your LLM provider, or self-host the extraction model. See our GDPR compliance guide.
