In this blog post, Extracting Structured Data with OpenAI for Real-World Pipelines, we will turn unstructured content into trustworthy, structured JSON that you can store, query, and automate against.
Whether you process invoices, support emails, resumes, or contracts, the goal is the same: capture key fields accurately and repeatably. We’ll start with a high-level view of how it works, then move into practical steps, robust prompting, and production-ready code patterns in JavaScript and Python.
What structured extraction means and why it matters
Structured extraction converts messy text (PDFs, emails, chat logs) into a predictable shape (think JSON). For example, from an invoice you might extract vendor name, invoice number, dates, totals, and line items. From a support ticket, you might extract customer, product, category, and severity.
Why it matters:
- Search and analytics: Query fields directly instead of scraping text every time.
- Automation: Trigger workflows when a field changes (e.g., auto-create a payment).
- Data quality: Validate fields, enforce types, and catch anomalies early.
The technology behind it
OpenAI’s models are strong at reading context and following instructions. Two capabilities make structured extraction reliable:
- Tool (function) calling: You define a function whose parameters are described by a JSON Schema. The model “calls” that function by returning a JSON payload that conforms to your schema, which gives you typed, structured outputs.
- JSON-only responses: You can instruct the model to return only JSON, making parsing straightforward. Pair this with validation and you get both flexibility and control.
In short, the model interprets the text, fills in your schema, and you validate the result. Deterministic structure, non-deterministic content—safely harnessed.
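If you prefer the JSON-only route over tool calling, the Chat Completions API supports a JSON mode via response_format. Here is a minimal sketch; the field list in the system prompt is illustrative, and you still need to describe the shape you want:

import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// JSON mode constrains the model to return a single valid JSON object.
// The prompt must still spell out the fields and mention JSON explicitly.
const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content: "Return only JSON with keys vendor_name, invoice_number, and total. Use null for unknown values."
    },
    { role: "user", content: "Invoice No: INV-2087\nVendor: Acme Parts Pty Ltd\nTotal: 297.75" }
  ]
});

const record = JSON.parse(completion.choices[0].message.content);
console.log(record);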
Design your schema first
Start by deciding exactly what you want to capture. Keep it minimal, typed, and explicit about unknowns.
{
  "type": "object",
  "properties": {
    "vendor_name": {"type": "string"},
    "invoice_number": {"type": "string"},
    "invoice_date": {"type": "string", "format": "date"},
    "due_date": {"type": "string", "format": "date"},
    "currency": {"type": "string", "enum": ["AUD", "USD", "EUR", "GBP"]},
    "total": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"}
        },
        "required": ["description", "quantity", "unit_price"]
      }
    }
  },
  "required": ["vendor_name", "invoice_number", "invoice_date", "currency", "total"]
}
Guidelines:
- Use nullable fields when values may be missing (see the snippet after this list).
- Keep numbers as numbers, dates as dates (ISO 8601), and enums tight.
- Avoid optional fields that you don’t need—the more focused, the better.
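One common way to express a nullable field in JSON Schema is a type union, so the model can legitimately return null instead of guessing a due date:

"due_date": {"type": ["string", "null"], "format": "date"}

Combined with keeping the field out of the required list, this lets validation pass when the document simply does not contain the value.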
Prompting patterns that work
Great extraction is 50% schema, 50% instruction. A reliable pattern:
- Tell the model what the document is and what you need.
- State handling for unknown or conflicting information (return null, not guesses).
- Insist on valid JSON only—no extra text.
- Provide 1–2 short examples if your data is quirky.
System: You extract structured data from business documents.
- Output must be valid JSON matching the provided schema.
- Use null if a field is unknown or not present.
- Do not add extra keys.
- Do not include any explanation outside JSON.
User: Extract fields from this invoice text:
"""
Invoice No: INV-2087
Vendor: Acme Parts Pty Ltd
Invoice Date: 2025-04-11
Due: 2025-05-11
Currency: AUD
Items: Widget A x 5 @ 19.95; Widget B x 2 @ 99.00
Total: 297.75
"""
Code example with tool calling (JavaScript)
This pattern uses Chat Completions with a function tool so the model returns typed arguments.
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const schema = {
  type: "object",
  properties: {
    vendor_name: { type: "string" },
    invoice_number: { type: "string" },
    invoice_date: { type: "string" },
    due_date: { type: "string" },
    currency: { type: "string" },
    total: { type: "number" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" }
        },
        required: ["description", "quantity", "unit_price"]
      }
    }
  },
  required: ["vendor_name", "invoice_number", "invoice_date", "currency", "total"]
};

async function extractInvoice(text) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "You extract structured data. Use null for unknown fields."
      },
      { role: "user", content: text }
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "record_invoice",
          description: "Return structured invoice fields",
          parameters: schema
        }
      }
    ],
    tool_choice: "auto"
  });

  const message = completion.choices[0].message;
  const call = message.tool_calls && message.tool_calls[0];
  if (!call) throw new Error("Model did not return a tool call");
  return JSON.parse(call.function.arguments);
}

// Example usage
const text = `Invoice No: INV-2087\nVendor: Acme Parts Pty Ltd\nInvoice Date: 2025-04-11\nDue: 2025-05-11\nCurrency: AUD\nItems: Widget A x 5 @ 19.95; Widget B x 2 @ 99.00\nTotal: 297.75`;
extractInvoice(text).then(console.log);
Code example with tool calling (Python)
from openai import OpenAI
import json

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "due_date": {"type": "string"},
        "currency": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"}
                },
                "required": ["description", "quantity", "unit_price"]
            }
        }
    },
    "required": ["vendor_name", "invoice_number", "invoice_date", "currency", "total"]
}

messages = [
    {"role": "system", "content": "You extract structured data. Use null for unknown fields."},
    {"role": "user", "content": "Invoice No: INV-2087\nVendor: Acme Parts Pty Ltd\nInvoice Date: 2025-04-11\nDue: 2025-05-11\nCurrency: AUD\nItems: Widget A x 5 @ 19.95; Widget B x 2 @ 99.00\nTotal: 297.75"}
]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=[{
        "type": "function",
        "function": {
            "name": "record_invoice",
            "description": "Return structured invoice fields",
            "parameters": schema
        }
    }],
    tool_choice="auto"
)

call = resp.choices[0].message.tool_calls[0]
record = json.loads(call.function.arguments)
print(record)
Validate and post-process
Always validate model output before using it. A schema validator helps you catch mistakes early.
JavaScript validation with Ajv
// npm i ajv
import Ajv from "ajv";
const ajv = new Ajv({ allErrors: true, strict: false });
const validate = ajv.compile(schema);
const data = await extractInvoice(text);
if (!validate(data)) {
  console.error(validate.errors);
  // handle fallback, retry with stricter prompt, or alert for review
}
Python validation with jsonschema
# pip install jsonschema
from jsonschema import validate, ValidationError
try:
    validate(instance=record, schema=schema)
except ValidationError as e:
    print("Validation failed:", e.message)
    # fallback, retry, or route to human review
Post-processing tips (a short sketch follows the list):
- Normalize dates to YYYY-MM-DD.
- Round currency to two decimals; verify totals equal sum(line_items).
- Use regexes to cross-check fields like invoice numbers or ABNs.
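A minimal post-processing pass might look like the following. The checkInvoice helper name and the one-cent tolerance are illustrative choices, not part of any library:

function checkInvoice(record) {
  // Normalize the invoice date to YYYY-MM-DD if the model returned another format.
  const parsed = new Date(record.invoice_date);
  if (!Number.isNaN(parsed.getTime())) {
    record.invoice_date = parsed.toISOString().slice(0, 10);
  }

  // Round currency to two decimals.
  record.total = Math.round(record.total * 100) / 100;

  // Verify the total equals the sum of line items, within one cent.
  const itemSum = (record.line_items || []).reduce(
    (sum, item) => sum + item.quantity * item.unit_price,
    0
  );
  if (Math.abs(itemSum - record.total) > 0.01) {
    throw new Error(`Total ${record.total} does not match line items ${itemSum.toFixed(2)}`);
  }
  return record;
}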
Handling long or messy documents
- Chunking: Split large docs by sections (headings, pages). Extract per chunk, then reconcile.
- Priority zones: Use heuristics (e.g., sections near “Invoice”, “Total”) to bias extraction.
- Vision/OCR: If you have images or scanned PDFs, run OCR first, or use a multimodal model that can read images and text.
- Conflict resolution: If chunks disagree, prefer the most recent date or the chunk with higher confidence (e.g., presence of currency and totals together).
// Pseudo-chunking
const chunks = splitByPageOrHeading(docText);
const partials = await Promise.all(chunks.map(extractInvoice));
const merged = reconcile(partials); // e.g., pick non-null fields, verify totals
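splitByPageOrHeading and reconcile are placeholders. One plausible sketch, assuming form feeds mark page breaks and a prefer-first-non-null merge strategy:

// Naive splitter: treat form feeds (page breaks) and blank lines before
// ALL-CAPS headings as section boundaries; tune this to your documents.
function splitByPageOrHeading(docText) {
  return docText
    .split(/\f+|\n{2,}(?=[A-Z][A-Z ]{3,})/)
    .map((chunk) => chunk.trim())
    .filter(Boolean);
}

// Merge partial extractions: rank chunks that contain both a currency and a
// total first, then take the first non-null value seen for each field.
function reconcile(partials) {
  const ranked = [...partials].sort(
    (a, b) =>
      Number(b.currency != null && b.total != null) -
      Number(a.currency != null && a.total != null)
  );
  const merged = {};
  for (const partial of ranked) {
    for (const [key, value] of Object.entries(partial)) {
      if (merged[key] == null && value != null) merged[key] = value;
    }
  }
  return merged;
}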
Reliability patterns
- Few-shot grounding: Include a minimal example of input → output to reduce ambiguity.
- Null over guess: Encourage nulls when uncertain; better for data quality.
- Retries with variation: On validation failure, retry with a nudge (e.g., “Total must equal sum of items”), as sketched after this list.
- Human-in-the-loop: Route edge cases to review; log diffs between model and human corrections to improve prompts.
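A simple retry wrapper, reusing the extractInvoice function and the Ajv validate helper from earlier; the attempt count and the wording of the nudge are arbitrary starting points:

async function extractWithRetry(text, maxAttempts = 3) {
  let hint = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const data = await extractInvoice(hint ? `${text}\n\nNote: ${hint}` : text);
    if (validate(data)) return data;
    // Feed the validation errors back as a nudge for the next attempt.
    hint =
      `Previous output was invalid: ${JSON.stringify(validate.errors)}. ` +
      "Return valid JSON only, use null for unknown fields, and make total equal the sum of line items.";
  }
  throw new Error("Extraction failed validation after retries; route to human review");
}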
Cost and model choices
- Start small: For extraction, lighter models like gpt-4o-mini are often sufficient and cost-effective.
- Upgrade when needed: If you see frequent nulls or errors on complex docs, try a stronger model (e.g., gpt-4o).
- Batching and streaming: Process documents in parallel within rate limits; use backoff with jitter, as sketched below.
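One way to wrap any OpenAI call with exponential backoff and jitter. This sketch assumes the SDK exposes an HTTP status code on thrown errors, and the retry counts and delays are arbitrary starting points:

async function withBackoff(fn, maxRetries = 5, baseMs = 500) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Retry only on rate limits (429) or transient server errors (5xx).
      const retryable = err && (err.status === 429 || err.status >= 500);
      if (!retryable || attempt === maxRetries) throw err;
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs; // exponential + jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// documents is assumed to be an array of raw text blobs; for large batches,
// cap concurrency with a small worker pool instead of a bare Promise.all.
const results = await Promise.all(
  documents.map((doc) => withBackoff(() => extractInvoice(doc)))
);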
Security and governance
- Redact PII you don’t need before sending to the model (a rough sketch follows this list).
- Log only what’s required; mask secrets in observability tools.
- If data residency matters, consider a regional deployment option that meets your compliance needs (for Australian workloads, ensure your provider supports AU regions).
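A very rough redaction pass, assuming email addresses and Australian mobile numbers are the PII you need to strip; production pipelines usually warrant a dedicated PII detection step:

function redact(text) {
  return text
    // Email addresses
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")
    // Australian mobile numbers such as 04xx xxx xxx or +61 4xx xxx xxx (rough heuristic)
    .replace(/(\+61\s?|0)4\d{2}\s?\d{3}\s?\d{3}/g, "[PHONE]");
}

const safeText = redact(rawDocumentText); // rawDocumentText comes from your ingestion step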
Putting it all together
- Define a tight JSON schema.
- Write a clear, firm prompt (nulls over guesses, JSON only).
- Use tool calling to enforce structure.
- Validate outputs; add post-processing rules.
- Handle long docs with chunking and reconciliation.
- Monitor quality, add retries, and include human review for edge cases.
- Optimize cost and ensure security/compliance.
Conclusion
Structured extraction doesn’t have to be fragile. With a focused schema, crisp instructions, and OpenAI’s tool calling, you can turn unstructured text into reliable JSON and wire it into your operational systems. Start small, validate everything, and iterate toward the accuracy your business needs.