How to reliably parse JSON from LLM responses

You asked the model for a JSON object with invoice data. The prompt was clear: "Return only valid JSON. No explanation." What came back was a markdown code fence, two sentences of commentary, a JSON object, and below that a friendly note explaining every field. In production, at 02:00, with a customer's data pipeline at a standstill. If you build anything on LLM APIs, you already know this pain. LLMs are not JSON serializers. They are text generators that usually produce valid JSON, until they don't. This article covers the five ways they break it and the battle-tested patterns for handling each one.

The 5 ways LLMs break JSON

These are not edge cases. Each of them will hit you in production, usually right after you stop checking for it.

Markdown code fences: The model wraps the JSON in ```json\n...\n``` because its training data is full of docs and README files that present JSON that way.
Trailing commentary: After the closing brace, the model adds a sentence or a paragraph: "Note: the total field is in USD."
Truncation: Long outputs get cut off in the middle of an object when the response hits the token limit, leaving you with structurally broken JSON that has no closing braces.
Hallucinated keys: The model invents field names that are not in your schema. You asked for invoice_number and got invoiceNumber, invoice_no, and ref_id, sometimes in the same response.
Wrong types: Numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", arrays as comma-separated strings. These are type-coercion bugs in disguise.
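
As a rough illustration of how these five differ, the sketch below feeds one made-up response per failure mode to json.loads. The sample strings are assumptions for demonstration, not real model output. The first three are rejected loudly; the last two parse without complaint, which is what makes them harder to catch.

```python
import json

FENCE = "`" * 3  # a markdown fence, built here to keep the sample strings readable

# Hypothetical sample responses, one per failure mode described above
samples = {
    "code fence": FENCE + 'json\n{"total": 49.99}\n' + FENCE,
    "trailing commentary": '{"total": 49.99}\nNote: the total is in USD.',
    "truncation": '{"line_items": [{"qty": 4',
    "hallucinated keys": '{"invoiceNumber": "INV-1"}',
    "wrong types": '{"total": "49.99", "paid": "true"}',
}

def classify(text: str) -> str:
    """Report whether json.loads rejects the text or accepts it silently."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "parse error"
    return "parses"

results = {name: classify(text) for name, text in samples.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The split matters for what follows: syntax-level breakage can be handled by cleanup and repair, while key and type problems only show up when you check the parsed value against what you asked for.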

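Pattern 1: Stripping markdown code fences

This is the most common breakage and the easiest to fix. A simple regex strips the fence whether the language tag is json, JSON, or absent altogether. Run it before any other processing: it costs next to nothing and prevents a large class of errors. Note that the pattern is anchored to the start and end of the text, so it only fires when the fence wraps the entire response; prose around the fence is the job of Pattern 2.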

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

javascript

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
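
The anchored patterns above only match when the fence is the whole response. A minimal sketch of an unanchored variant follows, for replies where the fence sits between a greeting and a sign-off. The name extract_fenced_block and the sample reply are hypothetical, and the sketch assumes the opening fence is followed by a newline.

```python
import re

FENCE = "`" * 3  # a markdown fence, built here instead of typed literally

# Matches the first fenced block anywhere in the text; the language tag is optional
FENCE_RE = re.compile(FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_fenced_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    match = FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()

# Hypothetical reply with prose on both sides of the fence
reply = (
    "Sure! Here is the data:\n\n"
    + FENCE + 'json\n{"vendor": "Acme Supplies"}\n' + FENCE
    + "\n\nAnything else?"
)
print(extract_fenced_block(reply))
```

When no fence is present the text passes through unchanged apart from whitespace trimming, so the function stays safe to run on every response.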

Pattern 2: Extracting JSON with regex

When the model adds text before or after the JSON object ("Here is the extracted data:", "Let me know if you need adjustments."), stripping fences is not enough. You need to find the outermost {...} block and pull it out. The trick is a greedy match: it runs from the first { to the last }, so nested objects stay inside the match. The flip side is that a stray brace in the surrounding prose ends up inside the match too. Note that this approach targets objects ({}); if the top level of your schema is an array, match on [ and ] instead.

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the outermost JSON object (first { through last }) from a
    string that may contain surrounding prose or commentary.
    """
    # Find the first { and last } to grab the outermost object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")
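
If braces can appear in the surrounding prose, one alternative is to let the standard library's decoder find where the JSON value ends. The sketch below uses json.JSONDecoder.raw_decode, which parses one value starting at a given index and ignores whatever follows. The function name first_json_value and the sample reply are hypothetical.

```python
import json
from typing import Any

def first_json_value(text: str) -> Any:
    """Return the first decodable JSON object or array in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses a single value starting at index and stops where it ends
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # this opener did not start valid JSON; try the next one
        return value
    raise ValueError("No decodable JSON value found")

# Hypothetical reply whose trailing prose contains braces
reply = 'Here you go: {"total": 1849.95, "tags": ["a", "b"]} Use {braces} with care.'
print(first_json_value(reply))
```

This only helps when the JSON itself is valid, and it returns the first value that decodes, so a bracketed aside such as [1] earlier in the prose would win. Truncated or otherwise broken output still needs the repair step.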

Pattern 3: Using json-repair for structural errors

Truncation and small structural errors (a missing closing brace, an unquoted key, a trailing comma) are where regex extraction falls short. The json-repair library is built for exactly this. It applies a series of heuristics to recover as much valid structure as possible from broken JSON, much like browsers tolerate malformed HTML. Install it with pip install json-repair and put it in your parse pipeline as the last line of defense before you give up on a response.

python

import json
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
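    # First pass: clean up fences and extract the JSON substring
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped)
    if not cleaned:
        # Truncated output has no closing brace for the regex to anchor on,
        # so fall back to everything from the first { or [ onward
        starts = [i for i in (stripped.find("{"), stripped.find("[")) if i != -1]
        if not starts:
            return None
        cleaned = stripped[min(starts):]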

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Returns {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#          "line_items": [{"description": "Office chairs", "qty": 4}]}

Manual debugging tip: When you are investigating a specific broken response, paste it into the JSON Fixer to see exactly what json-repair does with it, or use the JSON Validator to pinpoint the exact line and character position of the syntax error before you decide whether to repair or re-prompt.
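
Repairing a truncated response recovers structure, not the data that was cut off, so it can be worth flagging truncation before deciding between repair and re-prompt. The sketch below is a stdlib-only heuristic under that assumption: it reports a response as truncated when a string or a bracket is still open at the end. The name looks_truncated is hypothetical; where your API reports why generation stopped, that is the more direct signal.

```python
def looks_truncated(text: str) -> bool:
    """Heuristic: True when text ends inside a string or with unclosed brackets."""
    depth = 0          # number of { and [ currently open
    in_string = False  # whether the scan is inside a double-quoted string
    escaped = False    # whether the previous character was a backslash in a string
    for char in text:
        if in_string:
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char in "{[":
            depth += 1
        elif char in "}]":
            depth -= 1
    return in_string or depth > 0

print(looks_truncated('{"line_items": [{"qty": 4'))  # cut off mid-object
print(looks_truncated('{"note": "a } inside a string"}'))  # complete
```

A True result is a reasonable trigger for logging the response and retrying with a higher token limit instead of accepting a partially repaired object.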

Pattern 4: Retrying with explicit prompting

Sometimes the best parser is the model itself. If the output is so mangled that json-repair cannot do anything with it (hallucinated keys, a completely wrong structure, a response that is more prose than data), send the broken output back to the model together with the parse error and ask it to correct its own mistake. Models are surprisingly good at this. Keep the number of retries low (2 to 3 at most) and track attempts to prevent infinite loops.

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")

Pattern 5: Skip the parsing and use Structured Outputs instead

If you control the model call and can use newer APIs, structured outputs eliminate most of this complexity. OpenAI Structured Outputs (available on GPT-4o and later) and Gemini's response schema both constrain the model's output at the token-generation level: tokens that would violate the schema are suppressed during decoding, so a completed response cannot be malformed JSON. That guarantee covers completed responses only; a response cut off by the token limit, or a refusal, still needs handling. The downside: you give up some of the model's flexibility, and these APIs can add a little overhead per call. For high-volume extraction pipelines they are usually worth it.

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
    return completion.choices[0].message.parsed

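# Hypothetical sample input; in practice this is the text of the document to extract from
document_text = "Invoice INV-2024-0192 from Acme Supplies, issued 2024-03-01. Total due: 1249.99 USD."

invoice = extract_invoice_structured(document_text)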
print(f"Invoice {invoice.invoice_number}: ${invoice.total:.2f} {invoice.currency}")

A production-ready parser (Python)

This is what a production extraction function looks like when you combine the defensive parsing patterns in a single utility. It is the version I actually run in services that process thousands of LLM responses per day. It strips fences, pulls out the JSON substring, tries a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your foundation.

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def extract_json_substring(text: str) -> str | None:
    match = re.search(r'\{.*\}', text, re.DOTALL) or re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

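    # Step 2: extract JSON substring (handles prose before/after)
    json_str = extract_json_substring(text)
    if not json_str:
        # Truncated output may have no closing brace for the regex to anchor on,
        # so keep everything from the first { or [ and let the repair step handle it
        starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
        if not starts:
            raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")
        json_str = text[min(starts):]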

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
            # json_repair can hand back an empty string when nothing is recoverable
            if repaired is not None and repaired != "":
                parsed = repaired
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")
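
One way to keep a utility like this honest is to replay saved responses through it whenever the parser changes. The sketch below is a small regression harness under that assumption. The names run_regression and CASES are hypothetical, and json.loads stands in for parse_llm_json so the example runs without extra packages.

```python
import json
from typing import Any, Callable

def run_regression(parser: Callable[[str], Any], cases: list[tuple[str, Any]]) -> list[str]:
    """Run parser over (raw_response, expected) pairs and return failure descriptions."""
    failures = []
    for raw, expected in cases:
        try:
            actual = parser(raw)
        except Exception as exc:  # record the failure instead of aborting the run
            failures.append(f"{raw[:40]!r}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"{raw[:40]!r}: got {actual!r}")
    return failures

# Hypothetical saved responses paired with the value each should parse to
CASES = [
    ('{"total": 10}', {"total": 10}),
    ('Sure! {"total": 10} Hope that helps.', {"total": 10}),
]

# json.loads is only a stand-in here; pass your own parser in real use
failures = run_regression(json.loads, CASES)
for failure in failures:
    print(failure)
```

With the stand-in parser the second case fails, because plain json.loads cannot skip the surrounding prose; swapping in the full utility is the point of the exercise.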

JavaScript version

The same logic in JavaScript. For the repair step, the closest equivalent to json_repair is JSON5 for tolerant parsing of near-valid JSON, or you can write a lightweight repair wrapper yourself. JSON5 accepts trailing commas and unquoted keys but does not repair truncated output. For client-side work, JSON.parse() with a proper try/catch and a regex fallback covers the vast majority of production cases.

javascript

// npm install json5   (optional, for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Wrap-up

LLMs break JSON in five predictable ways, and each has a predictable fix. Markdown fences and surrounding prose are cosmetic: a couple of regexes handle them reliably. Structural damage from truncation or small formatting errors is what json_repair is built for. When the structure is right but the content is wrong (wrong keys, wrong types), that is a prompt problem, and a retry loop that feeds the error message back to the model is your best tool. And if you can use Structured Outputs, do it: it eliminates the problem at the source instead of treating the symptoms. For ad-hoc debugging when a specific response misbehaves, the JSON Fixer and JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging hours on.
