How to Reliably Parse JSON from LLM Responses

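I asked a model for a JSON object containing invoice data. The prompt was explicit: "Return only valid JSON. No explanation." What came back was a markdown code fence, two sentences of commentary, the JSON object, and at the bottom a helpful note explaining each field. This was production, at 2 a.m., with a customer's data pipeline stalled. If you build anything on top of an LLM API, you already know this pain. An LLM is not a JSON serializer. It is a text generator that usually produces valid JSON, until it doesn't. This article covers five ways LLMs break JSON and battle-tested patterns for handling each of them.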

Five Ways LLMs Break JSON

These are not edge cases. In production, every one of them will happen, usually right after you stop checking.

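Markdown code fences: the training data is full of documentation and README files that present JSON this way, so the model wraps its JSON in ```json\n...\n```.
Trailing commentary: the model appends a sentence or a paragraph after the closing brace, such as "Note: the total field is in USD."
Truncation: a long output is cut off mid-object when the response hits the token limit, leaving structurally broken JSON with no closing braces.
Hallucinated keys: the model invents field names that are not in your schema. You ask for invoice_number and get back invoiceNumber, invoice_no, or ref_id, sometimes within the same response.
Wrong types: numbers arrive as strings ("49.99" instead of 49.99), booleans as "true", arrays as comma-separated strings. These are type coercion bugs in disguise.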
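The last two failure modes produce JSON that parses cleanly but has the wrong shape, so it helps to see what a cleanup step could look like. The following is a minimal sketch, not a general solution: the alias table, the `paid` and `tags` fields, and the function name `normalize_invoice` are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical alias table: key variants a model might invent -> the schema's name.
KEY_ALIASES = {
    "invoiceNumber": "invoice_number",
    "invoice_no": "invoice_number",
    "ref_id": "invoice_number",
}

def normalize_invoice(data: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known key aliases and coerce a few common wrong types."""
    out: Dict[str, Any] = {}
    for key, value in data.items():
        canonical = KEY_ALIASES.get(key, key)
        # If two aliases collide on one field, keep the first value seen.
        out.setdefault(canonical, value)

    # Number that arrived as a string: "49.99" -> 49.99
    if isinstance(out.get("total"), str):
        out["total"] = float(out["total"])  # raises ValueError if not numeric

    # Boolean that arrived as a string: "true" -> True (hypothetical field)
    if isinstance(out.get("paid"), str):
        out["paid"] = out["paid"].strip().lower() == "true"

    # Array that arrived as a comma-separated string (hypothetical field)
    if isinstance(out.get("tags"), str):
        out["tags"] = [t.strip() for t in out["tags"].split(",") if t.strip()]

    return out

messy = {"invoiceNumber": "INV-2024-0192", "total": "49.99", "paid": "true", "tags": "office, chairs"}
print(normalize_invoice(messy))
# → {'invoice_number': 'INV-2024-0192', 'total': 49.99, 'paid': True, 'tags': ['office', 'chairs']}
```

Silent coercion like this can hide real problems, so it is probably best reserved for fields where the intended value is unambiguous.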

Pattern 1: Strip Markdown Code Fences

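This is the most common breakage and the easiest to fix. A simple regex strips the fence whether the language tag is json, JSON, or missing entirely. Run it before any other processing: it costs next to nothing and prevents a large class of errors. Note that the pattern is anchored at both ends, so it only applies when the entire response is a single fenced block; prose around the fence is the job of Pattern 2.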

python

import json
import re

def strip_code_fences(text: str) -> str:
    """Remove markdown code fences from LLM output."""
    # Handles ```json, ```JSON, ``` (no lang tag), etc.
    pattern = r'^```(?:json|JSON)?\s*\n?(.*?)\n?```$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match:
        return match.group(1).strip()
    return text.strip()

# Example: model returned a fenced block
raw = """
```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1249.99,
  "currency": "USD"
}
```
"""

clean = strip_code_fences(raw)
invoice = json.loads(clean)  # now safe

javascript

function stripCodeFences(text) {
  // Handles ```json, ```JSON, bare ``` (no lang), etc.
  const match = text.trim().match(/^```(?:json|JSON)?\s*\n?([\s\S]*?)\n?```$/s);
  return match ? match[1].trim() : text.trim();
}

// raw response contains a triple-backtick fence (shown here as a single-quoted string)
const raw = '```json\n{\n  "invoice_number": "INV-2024-0192",\n  "vendor": "Acme Supplies",\n  "total": 1249.99\n}\n```';

const clean = stripCodeFences(raw);
const invoice = JSON.parse(clean); // safe
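Because the regex above is anchored with ^ and $, it leaves the text unchanged when the model puts a sentence before or after the fence. One possible variant, sketched here under the assumption that only the first fenced block matters, searches for a fence anywhere in the response. The names `FENCE_RE` and `first_fenced_block` are illustrative and not part of the original code.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built this way so the example contains no literal fence

# Unanchored pattern: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(FENCE + r"[A-Za-z0-9_-]*[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)

def first_fenced_block(text: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if there is no fence."""
    match = FENCE_RE.search(text)
    return match.group(1).strip() if match else None

reply = "Here you go:\n\n" + FENCE + 'json\n{"total": 1249.99}\n' + FENCE + "\n\nAnything else?"
print(first_fenced_block(reply))
# → {"total": 1249.99}
```

Returning None instead of the original text lets the caller decide whether to fall through to substring extraction.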

Pattern 2: Extract JSON with a Regex

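When the model adds text before or after the JSON object ("Here is the extracted data:", "Let me know if you need any changes."), stripping fences is not enough. You need to find the outermost {...} block and pull it out. The trick is a greedy match: it spans from the first { to the last }, so nested objects stay intact. It does assume that the surrounding prose contains no braces of its own. This approach targets objects ({}); if your schema is an array, swap in square brackets accordingly.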

python

import re
import json

def extract_json_object(text: str) -> str | None:
    """
    Extract the outermost JSON object (first "{" through last "}") from a
    string that may contain surrounding prose or commentary.
    """
    # Find the first { and last } to grab the outermost object
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if not match:
        # Fall back to array extraction if no object found
        match = re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

# Model returned prose + JSON + footnote
raw_response = """
Based on the document you provided, here is the structured data:

{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99},
    {"description": "Standing desk", "qty": 1, "unit_price": 649.99}
  ],
  "total": 1849.95
}

Note: unit prices are pre-tax. Let me know if you need the tax breakdown.
"""

json_str = extract_json_object(raw_response)
if json_str:
    invoice = json.loads(json_str)
    print(f"Parsed invoice: {invoice['invoice_number']}")
else:
    raise ValueError("No JSON object found in LLM response")
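The greedy match runs from the first { to the last }, so it breaks if the trailing commentary happens to contain a brace. As an alternative sketch, the standard library's json.JSONDecoder.raw_decode can parse a value starting at a given index and ignore whatever follows it. The helper name `first_json_value` is an assumption for this example, and the approach only finds complete values, so it does not help with truncation.

```python
import json
from typing import Any, Optional

def first_json_value(text: str) -> Optional[Any]:
    """Return the first dict or list that parses cleanly, scanning left to right."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value at `start` and ignores trailing text
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this opener; try the next one
        return value
    return None

reply = 'Result: {"total": 1849.95, "items": [{"qty": 4}]} Note: use {} for empty objects.'
print(first_json_value(reply))
# → {'total': 1849.95, 'items': [{'qty': 4}]}
```

On this input the greedy regex would capture everything up to the final } in the note and fail to parse. The scan retries at every opening bracket, which is fine for typical response sizes but worth keeping in mind for very long texts.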

Pattern 3: Use json-repair for Structural Errors

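Truncation and small structural errors (missing closing braces, unquoted keys, trailing commas) are more than regex extraction can handle. The json-repair library was built for exactly this. It applies a set of heuristics to recover as much valid structure as possible from broken JSON, much as a browser tolerates malformed HTML. Install it with pip install json-repair and wire it into your parsing pipeline as the last line of defense before giving up on a response.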

python

import json
import json_repair  # pip install json-repair

def parse_with_repair(text: str) -> dict | list | None:
    """
    Attempt standard parse first; fall back to json_repair for
    structurally broken responses (truncation, missing braces, etc.).
    """
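    # First pass: clean up fences and extract the JSON substring.
    # Truncated output may have no closing brace for the regex to match,
    # so fall back to the fence-stripped text and let the repair step try it.
    stripped = strip_code_fences(text)
    cleaned = extract_json_object(stripped) or stripped
    if not cleaned:
        return None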

    # Second pass: try the fast standard parse
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Third pass: let json_repair reconstruct broken structure
    try:
        repaired = json_repair.repair_json(cleaned, return_objects=True)
        return repaired if repaired else None
    except Exception:
        return None

# Works even on truncated output from a token-limited response
truncated = """
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "line_items": [
    {"description": "Office chairs", "qty": 4
"""

result = parse_with_repair(truncated)
# Expected result once the truncated text reaches json_repair:
# {"invoice_number": "INV-2024-0192", "vendor": "Acme Supplies",
#  "line_items": [{"description": "Office chairs", "qty": 4}]}
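A repaired response is still missing whatever was cut off, so it can be useful to know that truncation happened before trusting the result. The sketch below is one stdlib-only way to flag it by tracking which brackets and strings are still open at the end of the text. The function name `unclosed_brackets` is hypothetical, and this is a heuristic for logging or for deciding to re-prompt, not a validator.

```python
from typing import List

def unclosed_brackets(text: str) -> List[str]:
    """Return what is still open at the end of text (empty list if balanced)."""
    stack: List[str] = []
    in_string = False
    escaped = False
    for char in text:
        if in_string:
            if escaped:
                escaped = False          # this character was escaped; skip it
            elif char == "\\":
                escaped = True           # next character is escaped
            elif char == '"':
                in_string = False        # closing quote
            continue
        if char == '"':
            in_string = True
        elif char in "{[":
            stack.append(char)
        elif char in "}]" and stack:
            stack.pop()
    if in_string:
        stack.append('"')                # text ended inside a string literal
    return stack

truncated = '{"vendor": "Acme Supplies", "line_items": [{"description": "Office chairs", "qty": 4'
print(unclosed_brackets(truncated))
# → ['{', '[', '{']
```

A non-empty result suggests the response was cut short, in which case re-running the request with a higher output token limit may be safer than accepting a partial object.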

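Manual debugging tip: when you are investigating a specific broken response, paste it into JSON Fixer to see exactly what json-repair does with it, or use a JSON validator to pinpoint the exact line and character position of the syntax error before you decide whether to repair or re-prompt.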

Pattern 4: Retry with an Explicit Prompt

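Sometimes the best parser is the model itself. If the output is too mangled for json-repair to fix (hallucinated keys, a completely wrong structure, a response with more prose than data), send the broken output back to the model together with the parse error and ask it to correct its own mistake. Models are surprisingly good at this. Keep the retry count low (two or three at most) and track attempts to avoid an infinite loop. The loop in this section retries only when parsing fails; catching hallucinated keys or wrong types additionally requires checking the parsed object against your schema and feeding those errors back in the same way.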

python

import json
from openai import OpenAI

client = OpenAI()

def call_model(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return response.choices[0].message.content

def extract_invoice_data(document_text: str, max_retries: int = 3) -> dict:
    """Extract structured invoice data with automatic retry on parse failure."""
    system_prompt = """Extract invoice data and return ONLY a JSON object with these fields:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": string (YYYY-MM-DD),
  "due_date": string (YYYY-MM-DD) or null,
  "line_items": [{"description": string, "qty": number, "unit_price": number}],
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": string (ISO 4217)
}
Return ONLY the JSON object. No markdown. No explanation."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Extract invoice data from:\n\n{document_text}"}
    ]

    for attempt in range(max_retries):
        raw = call_model(messages)

        try:
            cleaned = extract_json_object(strip_code_fences(raw))
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError) as e:
            if attempt == max_retries - 1:
                raise ValueError(
                    f"Failed to parse JSON after {max_retries} attempts. "
                    f"Last error: {e}. Last response: {raw[:200]}"
                )

            # Feed the error back — the model often corrects itself
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": (
                    f"That response caused a JSON parse error: {e}\n"
                    f"Please return ONLY a valid JSON object. No markdown fences, "
                    f"no commentary, just the raw JSON."
                )
            })

    raise ValueError("Unexpected exit from retry loop")
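The loop above only reacts to parse errors, while hallucinated keys and wrong types parse without complaint. One way to extend the same idea, sketched here with the standard library only, is to check the parsed object's shape and turn the problems into the retry message. `EXPECTED_FIELDS`, `find_shape_problems`, and `build_feedback` are hypothetical names; a real pipeline would more likely reuse its JSON Schema for this check.

```python
from typing import Any, Dict, List

# Hypothetical expected shape: field name -> accepted Python type(s).
EXPECTED_FIELDS: Dict[str, Any] = {
    "invoice_number": str,
    "vendor": str,
    "total": (int, float),
}

def find_shape_problems(data: Dict[str, Any]) -> List[str]:
    """List missing fields, wrong types, and unexpected keys."""
    problems: List[str] = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing required field '{name}'")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            problems.append(f"field '{name}' has type {type(data[name]).__name__}")
    for name in data:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field '{name}'")
    return problems

def build_feedback(problems: List[str]) -> str:
    """Format the problems as a follow-up user message for the retry loop."""
    bullet_list = "\n".join(f"- {p}" for p in problems)
    return (
        "That JSON parsed, but it does not match the requested schema:\n"
        f"{bullet_list}\n"
        "Please return ONLY a corrected JSON object."
    )

bad = {"invoiceNumber": "INV-2024-0192", "vendor": "Acme Supplies", "total": "49.99"}
print(build_feedback(find_shape_problems(bad)))
```

The resulting string can be appended as the user message in place of the parse-error message whenever the problem list is non-empty.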

Pattern 5: Skip Parsing and Use Structured Outputs Instead

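If you control the model call and can afford to adopt a newer API, structured outputs eliminate most of this complexity. OpenAI Structured Outputs (available on GPT-4o and later) and Gemini's response schema both constrain the model's output at the token-generation level: invalid tokens are suppressed during decoding, so a response that runs to completion conforms to the schema. The guarantee does not cover refusals or responses cut off by the token limit, so those cases still need handling. The trade-offs: you give up some of the model's flexibility, and these APIs can be slightly more expensive per call. For high-volume extraction pipelines, it is usually worth it.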

python

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    description: str
    qty: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    issue_date: str          # YYYY-MM-DD
    total: float
    currency: str            # ISO 4217
    line_items: list[LineItem]

def extract_invoice_structured(document_text: str) -> Invoice:
    """
    Extract invoice using OpenAI Structured Outputs.
    The API guarantees the response matches the Invoice schema —
    no manual parsing or repair needed.
    """
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "Extract invoice data from the provided document."
            },
            {"role": "user", "content": document_text}
        ],
        response_format=Invoice
    )
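    message = completion.choices[0].message
    if message.parsed is None:
        # The model can still refuse, in which case there is nothing to parse
        raise ValueError(f"No structured output returned: {message.refusal}")
    return message.parsed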

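# Placeholder input for illustration; in practice this is the raw invoice text
document_text = "Invoice INV-2024-0192 from Acme Supplies, issued 2024-03-01. Total: 1249.99 USD."
invoice = extract_invoice_structured(document_text)
print(f"Invoice {invoice.invoice_number}: {invoice.total:.2f} {invoice.currency}")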

A Production-Ready Parser (Python)

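Here is a production extraction function that combines the defensive patterns into a single utility (the retry loop stays in the caller). This is the version I actually run in a service that processes thousands of LLM responses per day. It strips fences, extracts the JSON substring, attempts a clean parse, falls back to json_repair, and optionally validates against a JSON Schema before returning. If you are not using structured outputs, this is your foundation.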

python

import re
import json
from typing import Any
import json_repair        # pip install json-repair
import jsonschema         # pip install jsonschema

def strip_code_fences(text: str) -> str:
    match = re.search(r'^```(?:\w+)?\s*\n?(.*?)\n?```$', text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def extract_json_substring(text: str) -> str | None:
    match = re.search(r'\{.*\}', text, re.DOTALL) or re.search(r'\[.*\]', text, re.DOTALL)
    return match.group(0) if match else None

def parse_llm_json(text: str, schema: dict | None = None) -> Any:
    """
    Robustly parse JSON from LLM output.

    Steps:
      1. Strip markdown code fences
      2. Extract outermost JSON object/array (handles surrounding prose)
      3. Fast-path: standard json.loads
      4. Slow-path: json_repair for structurally broken responses
      5. Optional: validate against a JSON Schema

    Args:
        text:   Raw text returned by the LLM
        schema: Optional JSON Schema dict to validate the parsed result

    Returns:
        Parsed Python object (dict or list)

    Raises:
        ValueError: If parsing fails after all recovery attempts
        jsonschema.ValidationError: If schema validation fails
    """
    if not text or not text.strip():
        raise ValueError("LLM returned an empty response")

    # Step 1 — strip fences
    text = strip_code_fences(text)

    # Step 2 — extract JSON substring (handles prose before/after)
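    json_str = extract_json_substring(text)
    if not json_str:
        # Truncated output may open a bracket and never close it;
        # pass everything from the first opener on to the repair step.
        opener = re.search(r'[{\[]', text)
        if not opener:
            raise ValueError(f"No JSON object or array found in response: {text[:200]!r}")
        json_str = text[opener.start():]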

    # Step 3 — standard parse (fast path, no overhead)
    parsed = None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError as original_error:
        # Step 4 — repair and retry
        try:
            repaired = json_repair.repair_json(json_str, return_objects=True)
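            # json_repair may hand back a non-container value (such as an
            # empty string) when nothing is recoverable; accept only dict/list
            if isinstance(repaired, (dict, list)):
                parsed = repaired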
        except Exception as repair_error:
            raise ValueError(
                f"JSON parse failed and repair also failed.\n"
                f"Parse error: {original_error}\n"
                f"Repair error: {repair_error}\n"
                f"Input (first 500 chars): {json_str[:500]!r}"
            ) from original_error

    if parsed is None:
        raise ValueError(f"Parsing returned None for input: {json_str[:200]!r}")

    # Step 5 — optional schema validation
    if schema is not None:
        jsonschema.validate(parsed, schema)  # raises ValidationError on mismatch

    return parsed


# --- Usage ---

INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "vendor", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor":         {"type": "string"},
        "total":          {"type": "number"},
        "currency":       {"type": "string"},
        "line_items":     {"type": "array"}
    }
}

llm_response = """
Sure! Here's the structured data:

```json
{
  "invoice_number": "INV-2024-0192",
  "vendor": "Acme Supplies",
  "total": 1849.95,
  "currency": "USD",
  "line_items": [
    {"description": "Office chairs", "qty": 4, "unit_price": 299.99}
  ]
}
```

Let me know if you need any changes!
"""

invoice = parse_llm_json(llm_response, schema=INVOICE_SCHEMA)
print(f"Vendor: {invoice['vendor']}, Total: ${invoice['total']}")
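A parser like this earns its keep only if it keeps working on the responses that broke things in the past. The following is a minimal sketch of a regression harness that runs any parser over saved responses; `run_regression` and the `CASES` list are hypothetical, and json.loads stands in for the parser so the example has no third-party dependencies.

```python
import json
from typing import Any, Callable, List, Tuple

def run_regression(
    parser: Callable[[str], Any],
    cases: List[Tuple[str, str, Any]],
) -> List[str]:
    """Run parser over (name, raw_text, expected) cases; return failure descriptions."""
    failures: List[str] = []
    for name, raw, expected in cases:
        try:
            actual = parser(raw)
        except Exception as exc:
            failures.append(f"{name}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"{name}: got {actual!r}")
    return failures

# Illustrative cases; in practice these would be saved production responses.
CASES = [
    ("clean", '{"total": 1}', {"total": 1}),
    ("trailing prose", '{"total": 1} Hope this helps!', {"total": 1}),
]

print(run_regression(json.loads, CASES))
# → ['trailing prose: raised JSONDecodeError']
```

Swapping json.loads for parse_llm_json should make the second case pass, and every new production failure can be appended to the case list as it is found.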

JavaScript Version

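Here is the same logic in JavaScript. For the repair step, the closest equivalent to json_repair is JSON5, which parses near-valid JSON tolerantly, or you can write a lightweight repair wrapper of your own. JSON5 accepts trailing commas, unquoted keys, and comments, but it does not recover truncated output. For client-side work, JSON.parse() combined with a proper try/catch and a regex fallback covers the large majority of production cases.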

javascript

// npm install json5   (optional — for tolerant parsing of near-valid JSON)
import JSON5 from 'json5';

function stripCodeFences(text) {
  const match = text.trim().match(/^```(?:\w+)?\s*\n?([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : text.trim();
}

function extractJsonSubstring(text) {
  // Greedy match for outermost object or array
  const objectMatch = text.match(/\{[\s\S]*\}/);
  if (objectMatch) return objectMatch[0];
  const arrayMatch = text.match(/\[[\s\S]*\]/);
  return arrayMatch ? arrayMatch[0] : null;
}

/**
 * Robustly parse JSON from LLM output.
 * Steps: strip fences → extract substring → JSON.parse → JSON5 fallback
 *
 * @param {string} text - Raw LLM response text
 * @returns {object|Array} Parsed JavaScript value
 * @throws {Error} If all parse attempts fail
 */
function parseLlmJson(text) {
  if (!text || !text.trim()) {
    throw new Error('LLM returned an empty response');
  }

  // Step 1 — strip markdown fences
  let cleaned = stripCodeFences(text);

  // Step 2 — extract JSON substring (skip surrounding prose)
  const jsonStr = extractJsonSubstring(cleaned);
  if (!jsonStr) {
    throw new Error(`No JSON object or array found in response: ${text.slice(0, 200)}`);
  }

  // Step 3 — standard JSON.parse (fast path)
  try {
    return JSON.parse(jsonStr);
  } catch (stdError) {
    // Step 4 — JSON5 tolerant parser (handles trailing commas, unquoted keys, etc.)
    try {
      return JSON5.parse(jsonStr);
    } catch (json5Error) {
      throw new Error(
        `JSON parse failed.\nStandard error: ${stdError.message}\nJSON5 error: ${json5Error.message}\nInput: ${jsonStr.slice(0, 300)}`
      );
    }
  }
}

// --- Usage ---

const llmResponse = `
Here is the product data you requested:

\`\`\`json
{
  "product_id": "SKU-8821-B",
  "name": "Ergonomic Office Chair",
  "price": 299.99,
  "in_stock": true,
  "tags": ["furniture", "ergonomic", "office"]
}
\`\`\`

Let me know if you need the full catalog!
`;

const product = parseLlmJson(llmResponse);
console.log(`Product: ${product.name} — $${product.price}`);
// → Product: Ergonomic Office Chair — $299.99

Summary

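LLMs break JSON in five predictable ways, and each has a predictable fix. Markdown fences and surrounding prose are cosmetic: a few regexes handle them reliably. Structural damage from truncation or minor formatting errors is what json_repair was built for. When the structure is right but the content is wrong (bad keys, wrong types), that is a prompt problem, and a retry loop that sends the error message back to the model is the best tool. And if Structured Outputs are available to you, use them: they eliminate the problem at the source instead of treating symptoms. For ad hoc debugging when a particular response misbehaves, JSON Fixer and JSON Formatter will save you time. Build the parse_llm_json utility once, test it against your worst historical responses, and move on. There are better problems to spend your debugging time on.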
