Parsing JSON from LLM Responses Reliably

You asked the model for a JSON object containing invoice data. The prompt was clear: "Return only valid JSON. No explanation." What came back was a markdown code fence, two sentences of commentary, a JSON object, and a helpful note at the bottom explaining every field. In production, at 2 a.m., with a customer's data pipeline stalled.

If you are building anything on top of LLM APIs, you already know this pain. LLMs are not JSON serializers. They are text generators that usually produce valid JSON, until they don't. This article covers the five ways they break it and battle-tested patterns for handling each one.

5 Ways LLMs Break JSON

These are not edge cases. Each one will happen to you in production, usually right after you stop checking for it.

Markdown code fences: The model wraps the JSON in ```json\n...\n``` because its training data is full of docs and READMEs that present JSON this way.
Trailing commentary: The model adds a sentence or a paragraph after the closing brace: "Note: the total field is in USD."
Truncation: When the response hits the token limit, long output is cut off mid-object, leaving you with structurally broken JSON and no closing braces.
Hallucinated keys: The model invents field names that are not in your schema. You asked for invoice_number and got invoiceNumber, invoice_no, and ref_id, sometimes in the same response.
Wrong types: Numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", arrays as comma-separated strings. These are type-coercion bugs in disguise.
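
To make the list concrete, the sketch below feeds one hand-written sample of each failure mode to `json.loads`. The sample strings are illustrative assumptions, not captured model output. The first three fail loudly, while the last two parse without complaint, which is why they call for validation rather than parsing fixes.

```python
import json

FENCE = "`" * 3  # triple backtick, built this way so the sample nests cleanly

# Hypothetical samples, one per failure mode (not real model output)
SAMPLES = {
    "code_fence": f'{FENCE}json\n{{"total": 49.99}}\n{FENCE}',
    "trailing_commentary": '{"total": 49.99}\nNote: the total field is in USD.',
    "truncation": '{"vendor": "Acme", "line_items": [{"qty": 4',
    "hallucinated_keys": '{"invoiceNumber": "INV-1", "ref_id": "INV-1"}',
    "wrong_types": '{"total": "49.99", "paid": "true", "tags": "a,b"}',
}

def classify(text: str) -> str:
    """Return 'parse_error' if json.loads rejects the text, else 'parses'."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "parse_error"
    return "parses"

results = {name: classify(text) for name, text in SAMPLES.items()}
for name, outcome in results.items():
    print(f"{name:20} {outcome}")
```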

Pattern 1: Strip Markdown Code Fences

This is the most common failure and the easiest to fix. A simple regex removes the fence whether the language tag is json, JSON, or missing entirely. Run it before any other processing: it costs almost nothing and prevents a broad class of errors. Note that the pattern is anchored, so it only applies when the fence wraps the entire response.

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

javascript

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
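
The anchored regex above matches only when the whole response is one fenced block. A hedged variant, assuming you also want to handle a fence surrounded by prose or a fence whose closing marker was cut off, might look like the sketch below. `strip_code_fences_loose` is a hypothetical name, not part of the code above.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to avoid nesting fences here

# Opening fence with an optional language tag, then the body up to the
# closing fence or, if the closing fence is missing, to the end of the text.
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)(?:" + re.escape(FENCE) + r"|\Z)",
    re.DOTALL,
)

def strip_code_fences_loose(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = _FENCE_RE.search(text)
    return match.group(1).strip() if match else text.strip()
```

This version requires a newline after the opening fence, so a fence written entirely on one line is returned unchanged and left for the extraction step to deal with.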

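Pattern 2: Extract JSON with a Regex

When the model adds text before or after the JSON object ("Here is the extracted data:", "Let me know if you need changes."), stripping fences is not enough. You need to find and extract the outermost {...} block. The trick is a greedy match: it spans from the first { to the last }, so nested objects stay intact. The same greediness means a stray } in trailing prose ends up inside the match, so always parse the result rather than trusting it. Also note that this code tries objects ({}) first and falls back to arrays only when no object is found; if the top level of your schema is an array, look for [ first.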

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the outermost JSON object (first '{' to last '}') from a string
    that may contain surrounding prose or commentary.
    """
    # Find the first { and last } to grab the outermost object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")
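
Because the greedy regex takes everything between the first `{` and the last `}`, a brace in the trailing prose or a top-level array of objects yields an invalid substring. One alternative sketch, assuming the embedded JSON itself is well-formed, lets the standard library's `json.JSONDecoder.raw_decode` decide where the value ends. `extract_first_json` is a hypothetical helper name.

```python
import json
from typing import Any

_decoder = json.JSONDecoder()

def extract_first_json(text: str) -> Any:
    """
    Return the first JSON object or array embedded in text, or None.
    Tries each '{' or '[' in order and lets the decoder find the end of
    the value, so braces in the surrounding prose do not matter.
    """
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            value, _end = _decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next candidate
        return value
    return None
```

Two caveats: bracketed prose that happens to be valid JSON, such as a "[1]" citation marker, is returned as if it were the payload, and truncated output never decodes, so this complements the repair step rather than replacing it.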

Pattern 3: Use json-repair for Structural Errors

Truncation and small structural errors (a missing closing brace, an unquoted key, a trailing comma) are where regex extraction falls short. The json-repair library is built for exactly this job. Much like browsers tolerate broken HTML, it applies a series of heuristics to salvage as much valid structure as possible from broken JSON. Install it with pip install json-repair, then add it to your parse pipeline as the last line of defense before giving up on a response.

python

import json
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
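    # First pass: clean up fences and extract the JSON substring
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped)
    if not cleaned:
        # Fully truncated output has no closing brace for the regex to match,
        # so keep everything from the first opening brace or bracket instead
        starts = [i for i in (stripped.find("{"), stripped.find("[")) if i != -1]
        if not starts:
            return None
        cleaned = stripped[min(starts):]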

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Returns {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#          "line_items": [{"description": "Office chairs", "qty": 4}]}
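
Repair recovers structure, not data: whatever came after the cut in the truncated example above, such as the remaining line items and the total, is gone. A rough way to choose between repairing and re-prompting, sketched under the assumption that unclosed brackets or an unterminated string indicate a token-limit cutoff, is to check bracket balance first. `looks_truncated` is a hypothetical helper.

```python
def looks_truncated(text: str) -> bool:
    """
    Heuristic: True if the text ends inside a string or with unclosed
    braces/brackets. Scanning starts at the first '{' or '['.
    """
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        return False  # no JSON-like content to judge
    closers = {"{": "}", "[": "]"}
    stack = []          # closing characters we still expect, innermost last
    in_string = False
    escaped = False
    for char in text[min(starts):]:
        if in_string:
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char in closers:
            stack.append(closers[char])
        elif stack and char == stack[-1]:
            stack.pop()
            if not stack:
                return False  # outermost value closed; trailing prose is fine
    return in_string or bool(stack)
```

If the API reports why generation stopped, prefer that signal. With the OpenAI chat completions API used later in this article, a `finish_reason` of "length" on the choice means the token limit was hit.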

Manual debugging tip: When investigating a specific broken response, paste it into JSON Fixer to see exactly what json-repair does to it, or use JSON Validator to pinpoint the exact line and character position of the syntax error before deciding whether to repair or re-prompt.

Pattern 4: Retry with Explicit Prompting

Sometimes the best parser is the model itself. If the output is more scrambled than json-repair can fix (hallucinated keys, a completely wrong structure, a response that is more prose than data), send the broken output back to the model together with the parse error and ask it to correct its own mistake. Models are surprisingly good at this. Keep the retry count low (2 to 3 at most) and track attempts to prevent infinite loops. Note that the loop below retries only on parse errors; catching hallucinated keys this way requires a validation step that raises as well.

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")
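
The loop above retries only when parsing fails, so hallucinated keys and wrong types, which parse cleanly, slip through. Below is a sketch of a shape check whose findings could be fed back to the model in the same way. `EXPECTED_FIELDS`, `find_schema_problems`, and `build_retry_message` are hypothetical names, and the field list is abbreviated from the prompt above.

```python
# Expected top-level fields and their Python types (abbreviated, illustrative)
EXPECTED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "line_items": list,
    "total": (int, float),
    "currency": str,
}

def find_schema_problems(data: object) -> list[str]:
    """Return human-readable problems; an empty list means the shape is OK."""
    if not isinstance(data, dict):
        return [f"expected a JSON object, got {type(data).__name__}"]
    problems = []
    for key, expected in EXPECTED_FIELDS.items():
        if key not in data:
            problems.append(f"missing required key '{key}'")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[key], bool) or not isinstance(data[key], expected):
            problems.append(f"'{key}' has type {type(data[key]).__name__}")
    for key in sorted(data.keys() - EXPECTED_FIELDS.keys()):
        problems.append(f"unexpected key '{key}'")
    return problems

def build_retry_message(problems: list[str]) -> dict[str, str]:
    """Format the problems as a user message for the retry loop."""
    bullet_list = "\n".join(f"- {p}" for p in problems)
    return {
        "role": "user",
        "content": (
            "The JSON parsed, but it does not match the requested schema:\n"
            f"{bullet_list}\n"
            "Return ONLY the corrected JSON object with exactly the requested keys."
        ),
    }
```

Inside the retry loop, one option is to call `find_schema_problems` right after `json.loads` and, if the list is non-empty, append the assistant's reply plus `build_retry_message(problems)` to `messages` instead of returning.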

Pattern 5: Skip Parsing and Use Structured Outputs Instead

If you control the model call and can afford to use newer APIs, structured outputs remove most of this complexity. OpenAI Structured Outputs (available with GPT-4o and later) and Gemini's response schema both constrain the model's output at the token-generation level: tokens that would violate the schema are suppressed during decoding, so a completed response conforms to it. That guarantee covers completed responses only. A response can still be cut off by the token limit or replaced by a refusal, so keep checks for both. The downside: you give up some of the model's flexibility, and these APIs can add overhead per call (OpenAI, for example, documents extra latency on the first request with a new schema). For high-volume extraction pipelines it is usually worth it.

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
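    message = completion.choices[0].message
    if message.parsed is None:
        # The model can refuse, in which case there is nothing to parse
        raise ValueError(f"No structured output returned: {message.refusal}")
    return message.parsed

# Placeholder input so the example runs end to end
document_text = "Invoice INV-2024-0192 from Acme Supplies, issued 2024-03-01. Total: 1249.99 USD."
invoice = extract_invoice_structured(document_text)
print(f"Invoice {invoice.invoice_number}: ${invoice.total:.2f} {invoice.currency}")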
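
If you pass a hand-written JSON Schema instead of a Pydantic model, OpenAI's strict mode, as far as I understand its documentation, expects every object to set `additionalProperties` to false and to list all of its properties under `required`. The sketch below checks a schema for those two conditions before it is sent. `strict_schema_violations` and `INVOICE_JSON_SCHEMA` are hypothetical names, the schema is abbreviated, and the check covers only these two rules.

```python
def strict_schema_violations(schema: dict, path: str = "$") -> list[str]:
    """Walk a JSON Schema dict and list objects that break the two rules."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {', '.join(missing)}")
        for name, sub in props.items():
            problems.extend(strict_schema_violations(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_schema_violations(schema["items"], f"{path}[]"))
    return problems

INVOICE_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "qty": {"type": "integer"},
                },
                "required": ["description", "qty"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["invoice_number", "total", "line_items"],
    "additionalProperties": False,
}

# Shape of the response_format argument for raw JSON Schema mode
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "invoice", "strict": True, "schema": INVOICE_JSON_SCHEMA},
}
```

Running the check in a unit test keeps schema mistakes out of production requests, where they would otherwise surface as API errors.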

A Production-Ready Parser (Python)

Here is what a production extraction function looks like when you combine the defensive patterns above into a single utility. This is the version I actually run in services that process thousands of LLM responses a day. It strips fences, extracts the JSON substring, attempts a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your baseline.

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def extract_json_substring(text: str) -> str | None:
    match = re.search(r'\{.*\}', text, re.DOTALL) or re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

    # Step 2 — extract JSON substring (handles prose before/after)
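    json_str = extract_json_substring(text)
    if not json_str:
        # Fully truncated output may lack any closing brace; keep the tail for repair
        starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
        if not starts:
            raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")
        json_str = text[min(starts):]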

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
            # Accept only containers; anything else means nothing usable was recovered
            if isinstance(repaired, (dict, list)):
                parsed = repaired
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")
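
Schema validation as used above would reject a response such as `"total": "1849.95"`, because a string is not a number. If you would rather coerce the obvious cases than re-prompt, a pre-validation pass along these lines is one option. This is a sketch, not part of the parser above: `coerce_types` and the `expected` mapping are hypothetical, and only top-level fields are handled.

```python
def coerce_types(data: dict, expected: dict[str, type]) -> dict:
    """Return a copy with string-encoded numbers, booleans, and lists converted."""
    result = dict(data)
    for key, target in expected.items():
        value = result.get(key)
        if not isinstance(value, str):
            continue  # already a non-string value, or the key is missing
        text = value.strip()
        try:
            if target is bool and text.lower() in ("true", "false"):
                result[key] = text.lower() == "true"
            elif target is int:
                result[key] = int(text)
            elif target is float:
                result[key] = float(text)
            elif target is list:
                # Comma-separated string back into a list of trimmed items
                result[key] = [part.strip() for part in text.split(",") if part.strip()]
        except ValueError:
            pass  # leave the original value; schema validation will flag it
    return result
```

Run it between parsing and `jsonschema.validate`. Values that cannot be converted are left untouched so the validation error still surfaces. Be aware that `float` also accepts strings such as "nan" and "inf", which you may want to reject separately.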

JavaScript Version

The same logic in JavaScript. For the repair step, JSON5 gives you tolerant parsing of near-valid JSON (trailing commas, unquoted keys, single quotes), though unlike json_repair it will not recover truncated output. For that, the jsonrepair npm package is a closer equivalent, or you can write a lightweight repair wrapper yourself. For client-side work, JSON.parse() with a good try/catch and a regex fallback covers the vast majority of production cases.

javascript

// npm install json5   (optional — for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Wrapping Up

LLMs break JSON in five predictable ways, and each has a predictable fix. Markdown fences and surrounding prose are cosmetic: a couple of regexes handle them reliably. Structural damage from truncation or minor formatting errors is what json_repair was built for. When the structure is right but the content is wrong (bad keys, wrong types), that is a prompting problem, and a retry loop that feeds the error message back to the model is your best tool. And if you can use Structured Outputs, do: it removes the problem at the source instead of treating symptoms. For ad-hoc debugging when a particular response misbehaves, JSON Fixer and JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging hours on.
