How to reliably parse JSON from LLM responses

You asked the model for a JSON object containing invoice data. The prompt was clear: "Return only valid JSON. No explanation." What came back was a markdown code fence, two sentences of commentary, a JSON object, and then a polite note at the bottom explaining every field. In production, at 2 a.m., with a client's data pipeline blocked. If you are building anything on top of LLM APIs, you already know this pain. LLMs are not JSON serializers. They are text generators that usually produce valid JSON, until they don't. This article covers the five ways they break it and the battle-tested patterns for handling each one.

The 5 ways LLMs break JSON

These are not edge cases. Every one of them will happen to you in production, usually right when you stop checking for them.

Markdown code fences: The model wraps the JSON in ```json\n...\n``` because its training data is full of docs and README files that present JSON that way.
Trailing commentary: The model appends a sentence or a paragraph after the closing brace: "Note: the total field is in USD."
Truncation: Long outputs get cut off mid-object when the response hits the token limit, leaving you with structurally broken JSON and no closing braces.
Hallucinated keys: The model invents field names that are not in your schema. You asked for invoice_number; you got invoiceNumber, invoice_no, and ref_id, sometimes in the same response.
Wrong types: Numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", arrays as comma-separated strings. These are coercion bugs in disguise.
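
To make the five failure modes concrete, here is a small sketch that feeds one sample of each to json.loads. The sample strings are invented for illustration. Note that the last two parse without any error, which is exactly why they tend to slip through.

```python
import json

# Hypothetical sample responses, one per failure mode (invented for illustration)
SAMPLES = {
    "code_fence": '```json\n{"total": 49.99}\n```',
    "trailing_commentary": '{"total": 49.99}\nNote: the total field is in USD.',
    "truncation": '{"vendor": "Acme Supplies", "line_items": [{"qty": 4',
    "hallucinated_keys": '{"invoiceNumber": "INV-2024-0192"}',
    "wrong_types": '{"total": "49.99", "paid": "true"}',
}

def classify(text: str) -> str:
    """Return 'syntax error' if json.loads rejects the text, else 'parses'."""
    try:
        json.loads(text)
        return "parses"
    except json.JSONDecodeError:
        return "syntax error"

results = {name: classify(text) for name, text in SAMPLES.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The first three are syntax problems that cleanup and repair can address. The last two are content problems that only a schema check or a retry can catch.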

Pattern 1: Stripping markdown code fences

This is the most common breakage and the easiest to fix. A simple regex removes the fence when the whole response is a fenced block, whether the language tag is json, JSON, or missing altogether. Run it before any other processing: it costs next to nothing and prevents a broad class of errors.

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
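
The regex above is anchored with ^ and $, so it only strips the fence when the fence is the entire response. As a sketch of one possible variant, the function below searches for the first fenced block anywhere in the text, so it also copes with prose before or after the fence. The name extract_fenced_block is hypothetical, and the approach assumes the JSON itself does not contain a triple backtick.

```python
import re

# Matches the first fenced block anywhere in the text; the language tag is optional
FENCE_RE = re.compile(r"```[A-Za-z0-9_-]*[ \t]*\r?\n(.*?)```", re.DOTALL)

def extract_fenced_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()

response = (
    "Sure! Here is the data:\n\n"
    "```json\n"
    '{"invoice_number": "INV-2024-0192", "total": 1249.99}\n'
    "```\n\n"
    "Let me know if you need changes."
)
print(extract_fenced_block(response))
```

When no fence is found the input is returned unchanged (minus surrounding whitespace), so the function is safe to call on every response.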

Pattern 2: Extracting JSON with regex

When the model adds text before or after the JSON object ("Here is the extracted data:", "Let me know if you need changes."), stripping fences is not enough. You need to find the outermost {...} block and extract it. The trick is a greedy match: it spans from the first { to the last } in the response, so nested objects stay intact. The flip side is that it over-captures if the trailing prose itself contains a closing brace. Note that this approach targets objects ({}); if your schema is an array, swap the braces in the pattern for square brackets.

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the outermost JSON object (or array) from a string that may
    contain surrounding prose or commentary.
    """
    # Greedy match: spans from the first { to the last } in the text
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")
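
Because the greedy regex spans from the first { to the last } in the whole response, it breaks when the trailing prose happens to contain a brace. One alternative, sketched here with the standard library's json.JSONDecoder.raw_decode, is to try decoding at each opening bracket and return the first position that yields a complete value. The name first_json_value is a hypothetical helper, not something from the original pipeline.

```python
import json

def first_json_value(text: str):
    """Return the first complete JSON object or array embedded in text, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one value starting at `start` and ignores what follows
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; try the next bracket
    return None

# The trailing "{}" in the prose would confuse a first-{ to last-} regex
response = 'Here you go: {"vendor": "Acme Supplies", "total": 1849.95} (use {} for empty)'
print(first_json_value(response))
```

This only ever returns valid JSON, so truncated or malformed output still needs a repair step. It will also pick up a bracketed fragment in the leading prose if that fragment happens to be valid JSON, such as [1].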

Pattern 3: Using json-repair for structural errors

Truncation and small structural errors (a missing closing brace, an unquoted key, a trailing comma) are where regex extraction fails. The json-repair library was built for exactly this. It applies a series of heuristics to recover as much valid structure as possible from broken JSON, much as browsers tolerate malformed HTML. Install it with pip install json-repair and put it in your parsing pipeline as the last line of defense before giving up on a response.

python

import json
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
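    # First pass: clean up fences and extract the JSON substring.
    # The extraction regex needs a closing brace, which truncated output
    # may lack, so fall back to the fence-stripped text in that case.
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped) or stripped
    if not cleaned:
        return None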

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Returns {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#          "line_items": [{"description": "Office chairs", "qty": 4}]}

Tip for manual debugging: when you are investigating a specific broken response, paste it into the JSON Fixer to see exactly what json-repair does to it, or use the JSON Validator to pinpoint the exact line and position of the syntax error before deciding whether to repair or re-run the prompt.
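
A repaired truncated object parses cleanly, but it is silently missing whatever the model had not emitted yet, so it can help to know whether the input was cut off before accepting the result. When the API exposes a finish reason (the OpenAI Chat Completions API reports a finish_reason of "length" when the token limit was hit), that is the more reliable signal. The sketch below is a text-only heuristic for when you only have the string; looks_truncated is a hypothetical helper, not part of json-repair.

```python
def looks_truncated(text: str) -> bool:
    """Heuristic: True if brackets are left open or a string is unterminated."""
    depth = 0
    in_string = False
    escaped = False
    for char in text:
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char in "{[":
            depth += 1
        elif char in "}]":
            depth -= 1
    return in_string or depth > 0

truncated = '{"vendor": "Acme Supplies", "line_items": [{"qty": 4'
complete = '{"vendor": "Acme Supplies", "note": "uses } and ] inside a string"}'
print(looks_truncated(truncated), looks_truncated(complete))
```

Run it on the extracted JSON substring rather than the full response, since double quotes in surrounding prose would throw off the string tracking. A positive result is a reasonable cue to retry with a higher token limit instead of trusting the repaired object.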

Pattern 4: Retrying with explicit prompting

Sometimes the best parser is the model itself. If the output is mangled beyond what json-repair can fix (hallucinated keys, a completely wrong structure, a response that is more prose than data), send the broken output back to the model together with the parse error and ask it to fix its own mistake. Models are surprisingly good at this. Keep the retry count low (2 to 3 at most) and track attempts to avoid infinite loops.

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")
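
The retry loop above only fires on syntax errors, but hallucinated keys and wrong types parse cleanly. A minimal sketch of a content check whose output could be sent back to the model in the same kind of retry message follows. EXPECTED_FIELDS and find_schema_problems are hypothetical names, and the field list is abbreviated from the prompt's schema.

```python
# Hypothetical, abbreviated field spec: name -> accepted Python type(s)
EXPECTED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
    "currency": str,
}

def find_schema_problems(data: dict, expected: dict = EXPECTED_FIELDS) -> list[str]:
    """Describe missing keys, wrongly typed values, and unexpected keys."""
    problems = []
    for key, accepted in expected.items():
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it unless bool is expected
        stray_bool = isinstance(value, bool) and accepted is not bool
        if stray_bool or not isinstance(value, accepted):
            problems.append(f"wrong type for {key}: got {type(value).__name__}")
    for key in data:
        if key not in expected:
            problems.append(f"unexpected key: {key}")
    return problems

parsed = {"invoiceNumber": "INV-2024-0192", "vendor": "Acme Supplies",
          "total": "1849.95", "currency": "USD"}
problems = find_schema_problems(parsed)
for problem in problems:
    print(problem)
```

If the list is non-empty, it can be joined into the user message in place of the parse error, so the model is told exactly which keys and types to correct.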

Pattern 5: Skip parsing and use Structured Outputs instead

If you control the model call and can use newer APIs, structured outputs remove most of this complexity. OpenAI Structured Outputs (available on GPT-4o and later) and Gemini's response schema both constrain the model's output at the token-generation level: tokens that would violate the schema are suppressed during decoding, so a completed response cannot be malformed JSON. A response can still be cut short by the token limit or replaced by a refusal, so keep a check for those cases. The trade-off: you give up some of the model's flexibility, and these APIs can cost slightly more per call (check your provider's current pricing). For high-volume extraction pipelines, they are usually worth it.

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
    return completion.choices[0].message.parsed

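# Sample input for illustration; in practice this is the text extracted from the invoice
document_text = """Invoice INV-2024-0192
Vendor: Acme Supplies
Issue date: 2024-03-01
4 x Office chairs @ 299.99 USD
Total: 1199.96 USD"""

invoice = extract_invoice_structured(document_text)
print(f"Invoice {invoice.invoice_number}: ${invoice.total:.2f} {invoice.currency}")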

A production-ready parser (Python)

Here is what a production extraction function looks like when you combine the first three defensive patterns with optional schema validation in a single utility. This is the version I actually run in services that process thousands of LLM responses a day. It strips fences, extracts the JSON substring, attempts a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your foundation.

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def extract_json_substring(text: str) -> str | None:
    match = re.search(r'\{.*\}', text, re.DOTALL) or re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

    # Step 2 — extract JSON substring (handles prose before/after)
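    json_str = extract_json_substring(text)
    if not json_str:
        # Truncated output may have no closing bracket for the regex to find;
        # keep everything from the first opening bracket so Step 4 can repair it
        start = min((i for i in (text.find("{"), text.find("[")) if i != -1), default=-1)
        if start == -1:
            raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")
        json_str = text[start:]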

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
            if repaired is not None:
                parsed = repaired
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")
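
A utility like this is easiest to trust once it has been run against saved responses that broke earlier versions. The sketch below shows one possible shape for such a regression check. The run_regression name and the CASES list are hypothetical, and json.loads stands in for parse_llm_json so the example runs without extra packages.

```python
import json
from typing import Any, Callable

def run_regression(parser: Callable[[str], Any],
                   cases: list[tuple[str, Any]]) -> list[str]:
    """Run parser over (raw_response, expected) pairs and describe each failure."""
    failures = []
    for index, (raw, expected) in enumerate(cases):
        try:
            actual = parser(raw)
        except Exception as exc:  # any exception counts as a failed case
            failures.append(f"case {index}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"case {index}: got {actual!r}")
    return failures

# Hypothetical saved responses; in practice load these from your own logs
CASES = [
    ('{"total": 1849.95}', {"total": 1849.95}),
    ('```json\n{"total": 1849.95}\n```', {"total": 1849.95}),
]

# json.loads is only a stand-in here; pass parse_llm_json in a real project
failures = run_regression(json.loads, CASES)
print(failures)
```

With the stand-in parser the fenced case fails, which is the point: swapping in the real parser should bring the failure list down to empty, and any new bad response from production becomes one more entry in CASES.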

JavaScript version

The same logic in JavaScript. For the repair step, the closest equivalent to json_repair is JSON5, which parses near-valid JSON tolerantly, or you can write a lightweight repair wrapper yourself. Keep in mind that JSON5 accepts trailing commas, unquoted keys, and single-quoted strings, but it does not reconstruct truncated output. For client-side work, JSON.parse() with a good try/catch and a regex fallback covers the vast majority of production cases.

// npm install json5   (optional — for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Wrapping up

LLMs break JSON in five predictable ways, and each one has a predictable fix. Markdown fences and surrounding prose are cosmetic: a couple of regexes handle them reliably. Structural damage from truncation or small formatting errors is what json_repair was built for. When the structure is correct but the content is wrong (wrong keys, wrong types), it is a prompting problem, and a retry loop that sends the error message back to the model is the best tool. And if you can use Structured Outputs, do: it removes the problem at the root instead of treating the symptoms. For ad-hoc debugging when a specific response misbehaves, the JSON Fixer and the JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging hours on.
