How to Reliably Parse JSON from LLM Responses

You asked the model for a JSON object containing invoice data. The prompt was clear: "Return only valid JSON. No explanation." What came back was a markdown code block, two sentences of commentary, a JSON object, and then a helpful note at the end explaining each field. In production, at 2 a.m., with a customer's data pipeline stalled. If you are building anything on top of LLM APIs, you already know this pain. LLMs are not JSON serializers. They are text generators that usually produce valid JSON, until they don't. This article covers the five ways they break it and the battle-tested patterns for handling each one.

The 5 ways LLMs break JSON

These are not edge cases. Every one of them will hit you in production, usually right when you stop watching for it.

Markdown code blocks: The model wraps the JSON in ```json\n...\n``` because its training data is full of docs and READMEs that present JSON that way.
Trailing commentary: The model adds a sentence or paragraph after the closing brace: "Note: the total field is in USD."
Truncation: Long outputs get cut off mid-object when the response hits the token limit, leaving you with structurally broken JSON and no closing braces.
Hallucinated keys: The model invents field names that are not in your schema. You asked for invoice_number and got invoiceNumber, invoice_no, and ref_id, sometimes in the same response.
Wrong types: Numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", arrays as comma-separated strings. These are type-coercion bugs in disguise.
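The five failure modes do not all fail in the same way. As a rough illustration (the sample strings below are made up for this sketch, not taken from real model output), the first three make `json.loads` raise, while the last two parse cleanly and only surface later as bad data:

```python
import json

# Hypothetical samples, one per failure mode listed above
samples = {
    "markdown fence": '```json\n{"total": 49.99}\n```',
    "trailing commentary": '{"total": 49.99}\nNote: the total is in USD.',
    "truncation": '{"vendor": "Acme Supplies", "line_items": [{"qty": 4',
    "hallucinated key": '{"invoiceNumber": "INV-2024-0192"}',
    "wrong type": '{"total": "49.99", "paid": "true"}',
}

def parse_outcome(text: str) -> str:
    """Return 'ok' if json.loads accepts the text, else the error class name."""
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError as exc:
        return type(exc).__name__

outcomes = {name: parse_outcome(text) for name, text in samples.items()}
for name, outcome in outcomes.items():
    print(f"{name:20} -> {outcome}")
```

The practical consequence is that a try/except around the parser catches only the syntactic failures; wrong keys and wrong types need a separate validation step.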

Pattern 1: Strip markdown code blocks

This is the most common breakage and the easiest to fix. A simple regex removes the fence whether the language tag is json, JSON, or missing entirely. Run this before any other processing: it costs almost nothing and prevents a large class of errors.

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

javascript

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
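The regex above is anchored to the start and end of the response, so it only fires when the fence is the entire reply. When the model puts a sentence before or after the fence, a non-anchored search is one option. The following is a minimal sketch under that assumption; `extract_fenced_block` and `FENCE_RE` are names introduced here for illustration, and only the first fenced block is returned:

```python
import re

# Matches the first fenced block anywhere in the text; the language tag is optional
FENCE_RE = re.compile(r"```[A-Za-z0-9_-]*[ \t]*\n(.*?)```", re.DOTALL)

def extract_fenced_block(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none is found."""
    match = FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()

reply = 'Sure! Here is the data:\n\n```json\n{"vendor": "Acme Supplies"}\n```\n\nLet me know if you need changes.'
print(extract_fenced_block(reply))
```

If the reply has no fence at all, the function falls through and returns the text unchanged apart from trimming, so it is safe to run unconditionally.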

Pattern 2: Extract JSON with regex

When the model adds text before or after the JSON object ("Here is the extracted data:", "Let me know if you need changes."), stripping the fences is not enough. You need to find the outermost {...} block and pull it out. The trick is a greedy match, which runs from the first { to the last } and therefore keeps nested objects intact. It does assume the surrounding prose contains no stray braces. Note that this approach targets objects ({}); if your schema is an array, match on square brackets instead.

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the outermost {...} span (first { to last }) from a string
    that may contain surrounding prose or commentary.
    """
    # Find the first { and last } to grab the outermost object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")

Pattern 3: Use json-repair for structural errors

Truncation and minor structural errors (a missing closing brace, an unquoted key, a trailing comma) are where regex extraction falls short. The json-repair library was built for exactly this. It applies a series of heuristics to recover as much valid structure as possible from broken JSON, much as browsers tolerate malformed HTML. Install it with pip install json-repair, then slot it into your parsing pipeline as the last line of defense before giving up on a response.

python

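import json
import re
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
    # First pass: clean up fences and extract the JSON substring
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped)
    if not cleaned:
        # Truncated output may contain no closing brace at all, so the regex
        # finds nothing; keep everything from the first opening brace/bracket
        start = re.search(r'[\[{]', stripped)
        if not start:
            return None
        cleaned = stripped[start.start():]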

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Returns {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#          "line_items": [{"description": "Office chairs", "qty": 4}]}

Tip for manual debugging: When you are investigating a specific broken response, paste it into the JSON Fixer to see exactly what json-repair does with it, or use the JSON Validator to pinpoint the exact line and character position of the syntax error before deciding whether to repair or re-prompt.
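To make the idea of a repair heuristic concrete, here is a toy version of one: closing whatever brackets and strings a truncated response left open. This is a sketch written for this article, not how json-repair is implemented, and `close_truncated_json` is a hypothetical name. It only handles output that was cut off after a complete value or inside a string value.

```python
import json

def close_truncated_json(text: str) -> str:
    """Append the closers that a truncated JSON document is missing.

    Toy heuristic: it does not handle output cut off after a dangling
    key, colon, or comma.
    """
    stack = []          # closers still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    # Close an unterminated string first, then unwind the open containers
    closed = (text + '"') if in_string else text.rstrip()
    return closed + "".join(reversed(stack))

truncated = '{"invoice_number": "INV-2024-0192", "line_items": [{"description": "Office chairs", "qty": 4'
print(json.loads(close_truncated_json(truncated)))
```

A real repair library has to cover many more cases (dangling commas, half-written keys, unquoted values), which is the argument for using one rather than growing this function.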

Pattern 4: Retry with explicit prompting

Sometimes the best parser is the model itself. If the output is mangled beyond what json-repair can fix (hallucinated keys, a completely wrong structure, a response that is more prose than data), send the broken output back to the model along with the error and ask it to correct its own mistake. Models often fix their output when shown the exact error. Keep the retry count low (2 to 3 at most) so the loop cannot run unbounded, and log each attempt.

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")
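Not every hallucinated key or wrong type needs a round trip to the model. When the variants are predictable, a small local normalization step can absorb them before a retry is spent. The sketch below is illustrative only: the alias map and the `paid` and `tags` fields are hypothetical and not part of the invoice schema above.

```python
# Hypothetical alias map: variants the model might invent -> the schema's field name
KEY_ALIASES = {
    "invoiceNumber": "invoice_number",
    "invoice_no": "invoice_number",
    "ref_id": "invoice_number",
}

def normalize_invoice(data: dict) -> dict:
    """Rename known key variants and coerce stringly-typed values."""
    result = {}
    for key, value in data.items():
        canonical = KEY_ALIASES.get(key, key)
        # Keep the first value seen if several aliases map to the same field
        result.setdefault(canonical, value)

    # "49.99" -> 49.99; raises ValueError for non-numeric strings such as "n/a"
    if isinstance(result.get("total"), str):
        result["total"] = float(result["total"])

    # "true"/"false" -> True/False; other strings are left untouched
    paid = result.get("paid")
    if isinstance(paid, str) and paid.lower() in ("true", "false"):
        result["paid"] = paid.lower() == "true"

    # "a, b" -> ["a", "b"]
    tags = result.get("tags")
    if isinstance(tags, str):
        result["tags"] = [t.strip() for t in tags.split(",") if t.strip()]
    return result

raw = {"invoiceNumber": "INV-2024-0192", "total": "49.99", "paid": "true", "tags": "office, chairs"}
print(normalize_invoice(raw))
```

Anything the normalizer cannot map is left in place, so a schema check afterwards still decides whether the response is acceptable or should go back to the model.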

Pattern 5: Skip parsing and use Structured Outputs instead

If you control the model call and can use newer APIs, structured outputs remove most of this complexity. OpenAI Structured Outputs (available on GPT-4o and later) and Gemini's response schema both constrain the model's output at the token-generation level: tokens that would violate the schema are suppressed during decoding, so the model cannot emit syntactically malformed JSON. A response can still be cut off at the token limit or replaced by a refusal, so keep a check for those cases. The trade-off: you give up some flexibility in the output, and there can be extra overhead per call. For high-volume extraction pipelines, it is usually worth it.

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
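    message = completion.choices[0].message
    if message.parsed is None:
        # A refusal (or an unparseable reply) leaves .parsed empty
        raise ValueError(f"No structured output returned: {message.refusal}")
    return message.parsed

# Placeholder input for illustration; replace with your real document text
document_text = "Invoice INV-2024-0192 from Acme Supplies, issued 2024-03-01. Total: 1849.95 USD."
invoice = extract_invoice_structured(document_text)
print(f"Invoice {invoice.invoice_number}: ${invoice.total:.2f} {invoice.currency}")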

A production-ready parser (Python)

This is what a production-ready extraction function looks like when you combine the defensive patterns above into a single utility. It is the version I actually run in services that process thousands of LLM responses per day. It strips fences, extracts the JSON substring, attempts a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your baseline.

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

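def extract_json_substring(text: str) -> str | None:
    # Start at whichever opening bracket comes first so arrays of objects stay intact
    start = re.search(r'[\[{]', text)
    if not start:
        return None
    closer = '}' if start.group() == '{' else ']'
    end = text.rfind(closer)
    # Truncated output may lack a closing bracket; keep the tail for json_repair
    return text[start.start():end + 1] if end > start.start() else text[start.start():]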

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

    # Step 2 — extract JSON substring (handles prose before/after)
    json_str = extract_json_substring(text)
    if not json_str:
        raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
            if repaired not in (None, ""):  # json_repair may return "" when nothing is recoverable
                parsed = repaired
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")

JavaScript version

The same logic in JavaScript. For the repair step, the closest equivalent used here is JSON5, which parses near-valid JSON tolerantly (trailing commas, unquoted keys, single quotes). It does not recover truncated output, so for that case you would write a lightweight repair wrapper yourself. For client-side work, JSON.parse() with a good try/catch and a regex fallback covers the vast majority of production cases.

javascript

// npm install json5   (optional — for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Wrapping up

LLMs break JSON in five predictable ways, and each has a predictable fix. Markdown fences and surrounding prose are cosmetic: a couple of regexes handle them reliably. Structural damage from truncation or minor formatting errors is what json_repair was built for. When the structure is right but the content is wrong (bad keys, incorrect types), that is a prompting problem, and a retry loop that feeds the error message back to the model is your best tool. And if you can use Structured Outputs, do: it removes the problem at the source instead of treating the symptoms. For ad-hoc debugging when a specific response misbehaves, the JSON Fixer and the JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging hours on.
