How to Reliably Parse JSON from LLM Responses

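I asked a model for a JSON object containing invoice data. The prompt was clear: "Return only valid JSON. No explanation." What came back was a markdown code fence, two sentences of commentary, the JSON object, and, helpfully, a note at the bottom explaining each field. In production, at 2 a.m., with a customer data pipeline stalled. If you have built anything on top of an LLM API, you already know this pain. An LLM is not a JSON serializer. It is a text generator that usually produces valid JSON, and sometimes does not. This article covers five ways LLMs break JSON and a proven pattern for handling each.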

5 Ways LLMs Break JSON

These are not edge cases. You will hit all five in production, usually right after you stop checking for them.

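Markdown code fences: because the training data is full of documentation and README files that present JSON this way, the model wraps JSON in ```json\n...\n```.
Trailing commentary: the model appends a sentence or paragraph after the closing brace, such as "Note: the total field is in USD."
Truncation: when a long output hits the token limit, it is cut off mid-object, leaving structurally broken JSON with no closing brace.
Hallucinated keys: the model invents field names that are not in the schema. You ask for invoice_number and get back invoiceNumber, invoice_no, or ref_id, sometimes within the same response.
Wrong types: numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", and arrays as comma-separated strings. These are type coercion bugs in disguise.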
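
To make the five failure modes concrete, here is a minimal sketch that feeds one sample of each into json.loads. The sample strings and the try_parse helper are invented for illustration. The point to notice is that the last two modes parse without any error, which is why they tend to slip through.

```python
import json

# Hypothetical samples, one per failure mode (invented for illustration)
samples = {
    "code_fence": '```json\n{"total": 49.99}\n```',
    "trailing_commentary": '{"total": 49.99}\nNote: total is in USD.',
    "truncation": '{"vendor": "Acme", "line_items": [{"qty": 4',
    "hallucinated_key": '{"invoiceNumber": "INV-1"}',
    "wrong_type": '{"total": "49.99", "paid": "true"}',
}

def try_parse(text: str) -> str:
    """Return 'ok' if json.loads accepts the text, else 'error'."""
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError:
        return "error"

results = {name: try_parse(text) for name, text in samples.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The first three samples fail loudly at parse time. Hallucinated keys and wrong types are syntactically valid JSON, so they only surface when something checks the parsed data against the expected schema.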

Pattern 1: Strip Markdown Code Fences

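This is the most common breakage and the easiest to fix. A simple regex strips the fence whether the language tag is json, JSON, or absent. Run it before any other processing: it costs almost nothing and prevents a large class of errors. Note that the pattern is anchored at both ends, so it only applies when the entire response is a single fenced block.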

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

javascript

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
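
Because the pattern above is anchored with ^ and $, it does nothing when the fence sits in the middle of a longer reply. The following is a minimal sketch of an unanchored variant, assuming the first fenced block is the one you want. The function name strip_embedded_fence and the sample reply are hypothetical.

```python
import re

def strip_embedded_fence(text: str) -> str:
    """Return the body of the first fenced block, or the trimmed text if none."""
    # Unanchored: the fence may appear anywhere in the response.
    # The optional language tag is any run of letters, digits, "_" or "-".
    match = re.search(r"```[A-Za-z0-9_-]*[ \t]*\r?\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

reply = 'Sure! Here it is:\n\n```json\n{"total": 1249.99}\n```\n\nAnything else?'
print(strip_embedded_fence(reply))
```

One limitation worth keeping in mind: if a JSON string value itself contains three backticks, the lazy match stops early. For that case the substring extraction in the next pattern is the safer tool.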

Pattern 2: Extract JSON with a Regex

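When the model adds text before or after the JSON ("Here is the extracted data:", "Let me know if you need changes."), stripping fences is not enough. You need to find the outermost {...} block and pull it out. The key is a greedy match, which spans from the first opening brace to the last closing brace and so keeps nested objects intact. The same greediness means it will over-capture if the trailing prose itself contains a closing brace. Note also that this approach targets objects ({}). If your schema is an array, match on square brackets instead.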

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the span from the first "{" to the last "}" (or "[" to "]")
    from a string that may contain surrounding prose or commentary.
    """
    # Find the first { and last } to grab the outermost object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")
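
The greedy regex grabs everything up to the last closing brace, so stray braces in the trailing prose end up inside the match. One alternative, sketched here with the standard library's json.JSONDecoder.raw_decode, is to try decoding from each opening brace or bracket and stop at the first position that yields a valid value. The function name first_json_value and the sample reply are hypothetical. This swaps the regex for a decoder-based scan and is not what the code above does.

```python
import json
from typing import Any

def first_json_value(text: str) -> Any:
    """Return the first decodable JSON object or array in text, else None."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index and
            # ignores whatever follows it
            value, _end = decoder.raw_decode(text, index)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next candidate
    return None

reply = 'Result: {"vendor": "Acme", "tags": ["a", "b"]} Use {braces} with care.'
print(first_json_value(reply))
```

The trade-off is that this only succeeds on valid JSON, so it does not help with truncated output. It also rescans from every candidate position, which is fine for typical response sizes but quadratic in the worst case.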

Pattern 3: Use json-repair for Structural Errors

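For truncation and minor structural errors (a missing closing brace, unquoted keys, trailing commas), regex extraction is not enough. The json-repair library was built for exactly this. Much as a browser leniently handles malformed HTML, it applies a series of heuristics to recover as much valid structure as possible from broken JSON. Install it with pip install json-repair and slot it into your parsing pipeline as the last line of defense before giving up on a response.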

python

import json
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
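    # First pass: clean up fences and extract the JSON substring
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped)
    if not cleaned:
        # Truncated output can lack any closing brace, so the regex finds
        # nothing; fall back to everything from the first "{" or "[" onward
        starts = [i for i in (stripped.find("{"), stripped.find("[")) if i != -1]
        if not starts:
            return None
        cleaned = stripped[min(starts):]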

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Returns {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#          "line_items": [{"description": "Office chairs", "qty": 4}]}

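Manual debugging tip: when investigating a specific broken response, paste it into JSON Fixer to see exactly what json-repair does to it, or use JSON Validator to pinpoint the exact line and character position of the syntax error before deciding whether to repair or re-prompt.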
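
A repaired truncation silently drops whatever was cut off, so it can help to detect truncation before accepting the repaired result. The sketch below counts delimiters that were opened but never closed, skipping anything inside string literals. The function name unclosed_delimiters is hypothetical, and this is a heuristic rather than a parser. If your API exposes a stop reason (OpenAI's chat completions report a finish_reason of "length" when the token limit is hit), that signal is more direct.

```python
def unclosed_delimiters(text: str) -> int:
    """Count braces/brackets opened but never closed, ignoring string contents."""
    depth = 0
    in_string = False
    escaped = False
    for char in text:
        if in_string:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif char == "\\":
                escaped = True       # next character is escaped
            elif char == '"':
                in_string = False    # closing quote
        elif char == '"':
            in_string = True
        elif char in "{[":
            depth += 1
        elif char in "}]":
            depth -= 1
    return depth

truncated = '{"vendor": "Acme", "line_items": [{"description": "Chairs {4}", "qty": 4'
print(unclosed_delimiters(truncated))
```

A positive count suggests the response was cut off, in which case re-requesting with a higher token limit may be a better choice than trusting the repaired fragment.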

Pattern 4: Retry with Explicit Prompting

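Sometimes the best parser is the model itself. If the output is broken beyond what json-repair can fix (hallucinated keys, a completely wrong structure, a response that reads more like prose than data), send the broken output back to the model along with the parse error and ask it to correct its own mistake. Models are surprisingly good at this. Keep the retry count low (2-3 at most) and track attempts to avoid an infinite loop.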

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")
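
The retry loop above only fires on parse errors, yet hallucinated keys and wrong types parse cleanly. One way to bring them into the same loop is to check the parsed dict against the expected fields and turn any mismatches into a message for the model. The following is a minimal sketch under that assumption. EXPECTED_FIELDS and find_schema_problems are hypothetical names, and the field list is a reduced version of the invoice schema.

```python
# Hypothetical expected schema: field name -> accepted Python type(s)
EXPECTED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
}

def find_schema_problems(data: dict) -> list[str]:
    """Describe missing keys, wrong types, and unexpected keys in plain text."""
    problems = []
    for key, expected in EXPECTED_FIELDS.items():
        if key not in data:
            problems.append(f"missing required key '{key}'")
        # bool is a subclass of int in Python, so reject it explicitly
        # (none of the fields in this sketch are booleans)
        elif isinstance(data[key], bool) or not isinstance(data[key], expected):
            problems.append(
                f"key '{key}' has type {type(data[key]).__name__}, which is not allowed"
            )
    for key in data:
        if key not in EXPECTED_FIELDS:
            problems.append(f"unexpected key '{key}'")
    return problems

parsed = {"invoiceNumber": "INV-2024-0192", "vendor": "Acme Supplies", "total": "1849.95"}
problems = find_schema_problems(parsed)
print(problems)
```

If the list is non-empty, joining it into a single string and appending it as a user message, the same way the parse error is appended above, gives the model something specific to correct.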

Pattern 5: Skip Parsing and Use Structured Outputs Instead

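If you control the model call and can use the newer APIs, structured outputs remove most of this complexity. OpenAI Structured Outputs (available from GPT-4o onward) and Gemini's response schema both constrain the model's output at the token generation level: tokens that would violate the schema are suppressed during decoding, so a completed response cannot deviate from the schema. A response can still be cut short by the token limit or replaced by a refusal, so handle both. The trade-offs: you give up some model creativity, and these APIs can cost slightly more per call. For high-volume extraction pipelines, it is usually worth it.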

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
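    message = completion.choices[0].message
    if message.parsed is None:
        # parsed can be None, for example when the model refuses the request
        raise ValueError(f"No structured output returned: {message.refusal}")
    return message.parsed

# Placeholder input for illustration; replace with your own document text
document_text = "Invoice INV-2024-0192 from Acme Supplies. Total due: 1249.99 USD."
invoice = extract_invoice_structured(document_text)
print(f"Invoice {invoice.invoice_number}: ${invoice.total:.2f} {invoice.currency}")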

Production-Ready Parser (Python)

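Here is what a production extraction function looks like when it combines the four defensive patterns into a single utility. This is the version I actually run in a service that processes thousands of LLM responses a day. It strips fences, extracts the JSON substring, attempts a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your foundation.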

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

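def extract_json_substring(text: str) -> str | None:
    match = re.search(r'\{.*\}', text, re.DOTALL) or re.search(r'\[.*\]', text, re.DOTALL)
    if match:
        return match.group(0)
    # Truncated output may lack a closing delimiter; keep everything from the
    # first opening brace/bracket so json_repair can still attempt a fix
    start = re.search(r'[{\[]', text)
    return text[start.start():] if start else None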

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

    # Step 2 — extract JSON substring (handles prose before/after)
    json_str = extract_json_substring(text)
    if not json_str:
        raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
            if repaired is not None:
                parsed = repaired
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")

JavaScript Version

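Here is the same logic in JavaScript. For the repair step, the closest alternative to json_repair is JSON5, which leniently parses near-valid JSON, or you can write a lightweight repair wrapper yourself. Note that JSON5 tolerates syntax such as trailing commas and unquoted keys but does not reconstruct truncated output. For client-side work, a good try/catch around JSON.parse() plus a regex fallback covers most production cases.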

javascript

// npm install json5   (optional — for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Wrapping Up

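LLMs break JSON in five predictable ways, and each has a predictable fix. Markdown fences and surrounding prose are cosmetic: a few regexes handle them reliably. Structural damage from truncation or minor formatting errors is what json_repair was built for. When the structure is right but the content is wrong (wrong keys, wrong types), that is a prompting problem, and a retry loop that feeds the error message back to the model is the best tool. And if you can use Structured Outputs, do: it removes the problem at the source instead of treating symptoms. For ad hoc debugging when a specific response misbehaves, JSON Fixer and JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging time on.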
