Using JSON Schema with OpenAI Structured Outputs

You shipped a feature that calls GPT-4 to extract structured data from user-submitted invoices. In development it works perfectly: the model returns a clean JSON object every time. Then in production, at 2 a.m., a Sentry alert arrives: JSON.parse: unexpected token. The model decided to prepend "Sure! Here's the JSON you asked for:" to the actual payload. A week later, same feature, different bug: the model returns totalAmount instead of total_amount, and the downstream database write silently drops the field. If you have been relying on prompting to work around unreliable LLM output, OpenAI Structured Outputs is the fix you have been waiting for.

Structured Outputs, released by OpenAI in August 2024, lets you supply a JSON Schema through the response_format parameter and receive a response that is guaranteed to conform to that schema. It differs from the older JSON mode ({"type": "json_object"}), which only ensured that the output was valid JSON, not that it had any particular shape. It is also distinct from function calling, which routes the model's output into a tool call but adds its own layer of ceremony. Structured Outputs is the cleanest path: you describe the shape you want and you get back exactly that shape, barring a refusal or a truncated response. Under the hood, OpenAI uses constrained decoding: the model's token sampling is guided by your schema, so it cannot emit a token that would make the response invalid.

Your first structured output

python

from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "user",
            "content": "Extract the vendor name, invoice number, and total amount from this text: "
                       "Invoice #INV-2024-0892 from Acme Supplies Ltd. Total due: $1,450.00"
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extract",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "vendor_name":     {"type": "string"},
                    "invoice_number":  {"type": "string"},
                    "total_amount":    {"type": "number"}
                },
                "required": ["vendor_name", "invoice_number", "total_amount"],
                "additionalProperties": False
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
print(data)
# {"vendor_name": "Acme Supplies Ltd", "invoice_number": "INV-2024-0892", "total_amount": 1450.0}

Three things to note here. First, the model is gpt-4o-2024-08-06: Structured Outputs requires a model that explicitly supports it (the -2024-08-06 snapshot or later for GPT-4o, or gpt-4o-mini). Second, response_format.type is "json_schema", not "json_object". Third, "strict": True is what gives you the guarantee; without it you are back in best-effort territory. The name field is a label that the model sees; it has no effect on parsing, but it keeps your API logs readable.
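Because a strict schema lists every property in required and sets additionalProperties to false, the parsed object has exactly the keys the schema names. One way to take advantage of that is to load the response into a typed record. The sketch below is illustrative only: InvoiceExtract and parse_invoice are hypothetical names, not part of the OpenAI SDK, and the sample string stands in for response.choices[0].message.content.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceExtract:
    """Mirrors the three properties of the invoice_extract schema."""
    vendor_name: str
    invoice_number: str
    total_amount: float


def parse_invoice(content: str) -> InvoiceExtract:
    # Keyword unpacking raises TypeError if a key is missing or unexpected,
    # so any drift between the schema and the dataclass fails loudly.
    return InvoiceExtract(**json.loads(content))


# Stand-in for response.choices[0].message.content
sample = (
    '{"vendor_name": "Acme Supplies Ltd", '
    '"invoice_number": "INV-2024-0892", "total_amount": 1450.0}'
)
invoice = parse_invoice(sample)
print(invoice.vendor_name, invoice.total_amount)
```

Note that the dataclass does not check value types at runtime; it relies on the schema for that.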

Designing the JSON Schema

Here is a more realistic schema for a product-catalog extraction task, the kind you would use to pull structured data out of unstructured product descriptions, e-commerce listings, or PDF datasheets. Use the JSON Schema Generator to build and validate your schema visually before wiring it into your API calls.

json

{
  "type": "object",
  "properties": {
    "product_name": {
      "type": "string",
      "description": "The full commercial name of the product"
    },
    "sku": {
      "type": "string",
      "description": "Stock keeping unit identifier"
    },
    "price_usd": {
      "type": "number",
      "description": "Price in US dollars, numeric only"
    },
    "in_stock": {
      "type": "boolean"
    },
    "categories": {
      "type": "array",
      "items": { "type": "string" }
    },
    "dimensions": {
      "type": "object",
      "properties": {
        "width_cm":  { "type": "number" },
        "height_cm": { "type": "number" },
        "depth_cm":  { "type": "number" }
      },
      "required": ["width_cm", "height_cm", "depth_cm"],
      "additionalProperties": false
    }
  },
  "required": [
    "product_name", "sku", "price_usd",
    "in_stock", "categories", "dimensions"
  ],
  "additionalProperties": false
}

In strict mode, every property must appear in required. You cannot have optional fields. If a field might not exist in the source data, use a union type such as {"type": ["string", "null"]} and still include it in required.
additionalProperties must be false on every object. This applies recursively: your nested objects need it too, not just the root.
Supported types in strict mode: string, number, integer, boolean, null, array, object. Type unions (["string", "null"]) are allowed.
$ref and $defs are supported in strict mode, including recursive schemas, so a reusable definition does not have to be copied inline. The root schema must still be an object, and it cannot be an anyOf at the top level.
Add description fields generously. The model reads them. Writing "Price in US dollars, numeric only; do not include currency symbols" gets you cleaner output than hoping the model guesses.
Enums work. {"type": "string", "enum": ["pending", "shipped", "delivered"]} is fully supported, and the model will only ever emit one of those three values.
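The first two rules above are easy to break on a nested object, so it can help to check a schema locally before sending it. The following is a minimal sketch of such a check, assuming schemas built only from properties and items; find_strict_violations is a hypothetical helper, not an OpenAI API, and it covers just the required and additionalProperties rules.

```python
from typing import Any


def find_strict_violations(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Return human-readable problems that break the strict-mode rules above."""
    problems: list[str] = []

    if schema.get("type") == "object" or "properties" in schema:
        properties = schema.get("properties", {})
        # Rule: every property must be listed in "required".
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {', '.join(missing)}")
        # Rule: additionalProperties must be exactly false on every object.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for name, subschema in properties.items():
            problems.extend(find_strict_violations(subschema, f"{path}.{name}"))

    # Array item schemas are subject to the same rules.
    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(find_strict_violations(items, f"{path}[]"))

    return problems


bad_schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "dimensions": {
            "type": "object",
            "properties": {"width_cm": {"type": "number"}},
            "required": ["width_cm"],
            # additionalProperties is missing on this nested object
        },
    },
    "required": ["sku"],  # "dimensions" is missing here
    "additionalProperties": False,
}

for problem in find_strict_violations(bad_schema):
    print(problem)
# $: not in required: dimensions
# $.dimensions: additionalProperties must be false
```

Running a check like this in a unit test catches the mistake before the API rejects the schema at request time.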

Strict mode vs. non-strict

python

# Strict mode — guaranteed conformance, tighter schema rules
response_format_strict = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_extract",
        "strict": True,   # <-- the key flag
        "schema": product_schema
    }
}

# Non-strict — more schema flexibility, best-effort conformance
response_format_lenient = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_extract",
        "strict": False,
        "schema": product_schema
    }
}

With strict: True, OpenAI preprocesses your schema the first time it is used and caches the constrained decoder. The first call with a new schema takes a little longer; subsequent calls with the same schema are fast. What you get in return: the model's output is structurally guaranteed, so you can call json.loads() and access fields directly without defensive checks. What you give up: optional properties, open-ended objects, and a root-level anyOf, along with the validation keywords that strict mode does not accept. Non-strict mode accepts a wider range of JSON Schema features but falls back to best effort: the model tries to follow the schema but is not constrained at the token level. For production extraction pipelines, use strict mode. The schema restrictions are manageable once you understand them.
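If you already have a lenient schema with optional fields, the strict-mode restrictions can be applied mechanically. The sketch below shows one possible conversion, assuming schemas built only from type, enum, properties, and items; to_strict and its helpers are hypothetical names, not part of any SDK. Fields that were optional become required and nullable, so downstream code should treat null as "absent".

```python
import copy
from typing import Any


def to_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` reshaped to follow the strict-mode rules."""
    result = copy.deepcopy(schema)  # never mutate the caller's schema
    _tighten(result)
    return result


def _tighten(node: dict[str, Any]) -> None:
    properties = node.get("properties")
    if isinstance(properties, dict):
        already_required = set(node.get("required", []))
        for name, subschema in properties.items():
            if name not in already_required:
                _make_nullable(subschema)  # optional becomes required + nullable
            _tighten(subschema)
        node["required"] = list(properties)
        node["additionalProperties"] = False
    items = node.get("items")
    if isinstance(items, dict):
        _tighten(items)


def _make_nullable(node: dict[str, Any]) -> None:
    declared = node.get("type")
    if declared is None:
        return  # no plain "type" keyword: left unchanged in this sketch
    types = declared if isinstance(declared, list) else [declared]
    if "null" not in types:
        node["type"] = [*types, "null"]
    # An enum must list null explicitly, or null would never validate.
    if "enum" in node and None not in node["enum"]:
        node["enum"] = [*node["enum"], None]


lenient = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "notes": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "shipped"]},
    },
    "required": ["sku"],
}

strict = to_strict(lenient)
print(strict["required"])                     # ['sku', 'notes', 'status']
print(strict["properties"]["notes"]["type"])  # ['string', 'null']
```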

Nested objects and arrays

Nested structures work well, but every nested object needs its own "additionalProperties": false and its own "required" array listing all of its properties. A common mistake is to apply the strict rules to the root object and forget the children; OpenAI will reject the schema with a validation error.

python

from openai import OpenAI
import json

client = OpenAI()

order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer": {
            "type": "object",
            "properties": {
                "name":  {"type": "string"},
                "email": {"type": "string"}
            },
            "required": ["name", "email"],
            "additionalProperties": False   # required on nested objects too
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity":    {"type": "integer"},
                    "unit_price":  {"type": "number"}
                },
                "required": ["description", "quantity", "unit_price"],
                "additionalProperties": False  # required on array item schemas too
            }
        },
        "total_usd": {"type": "number"}
    },
    "required": ["order_id", "customer", "line_items", "total_usd"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{
        "role": "user",
        "content": (
            "Parse this order: Order #ORD-5531 for Jane Smith ([email protected]). "
            "2x Wireless Keyboard at $49.99 each, 1x USB Hub at $29.99. Total: $129.97"
        )
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "order_extract",
            "strict": True,
            "schema": order_schema
        }
    }
)

order = json.loads(response.choices[0].message.content)
for item in order["line_items"]:
    print(f"${item['unit_price']:.2f} x{item['quantity']}  {item['description']}")

Handling refusals

Even with Structured Outputs, the model can refuse to answer, typically when the prompt triggers a content policy (for example, asking it to extract data from something harmful). When that happens, finish_reason is "stop", but message.content is null and message.refusal holds the refusal text. If you do not check for this, json.loads(None) raises a TypeError. Also watch for finish_reason == "length": if the response was truncated by max_tokens, the JSON will be incomplete and unparseable regardless of Structured Outputs.

python

import json
from openai import OpenAI

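client = OpenAI()

# Same invoice schema as in the first example
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name":    {"type": "string"},
        "invoice_number": {"type": "string"},
        "total_amount":   {"type": "number"}
    },
    "required": ["vendor_name", "invoice_number", "total_amount"],
    "additionalProperties": False
}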

def extract_invoice(raw_text: str) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": f"Extract invoice fields: {raw_text}"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extract",
                "strict": True,
                "schema": invoice_schema
            }
        },
        max_tokens=1024
    )

    choice = response.choices[0]

    if choice.finish_reason == "length":
        raise ValueError("Response truncated — increase max_tokens or simplify your schema")

    if choice.message.refusal:
        # Model refused to answer — log and return None rather than crashing
        print(f"Model refused: {choice.message.refusal}")
        return None

    return json.loads(choice.message.content)


result = extract_invoice("Invoice #2024-441 from BuildRight Inc., due $3,200 by Dec 15")
if result:
    print(result["vendor_name"], result["total_amount"])

Using the same schema for validation

An underused pattern: use the same JSON Schema you pass to OpenAI to validate data arriving from other sources too, such as webhooks, file uploads, and third-party APIs. This gives you a single source of truth for the shape of your data. In Python, use the jsonschema library. In Node.js, use Ajv. You can also paste your schema into the JSON Validator for a quick manual sanity check without writing code.

python

import json
import jsonschema
from jsonschema import validate, ValidationError

# The same schema used in your OpenAI call
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor_name":    {"type": "string"},
        "invoice_number": {"type": "string"},
        "total_amount":   {"type": "number"}
    },
    "required": ["vendor_name", "invoice_number", "total_amount"],
    "additionalProperties": False
}

def validate_invoice(data: dict) -> bool:
    try:
        validate(instance=data, schema=invoice_schema)
        return True
    except ValidationError as e:
        print(f"Validation failed: {e.message}")
        print(f"  Path: {' -> '.join(str(p) for p in e.path)}")
        return False


# Validate a payload from a webhook: same schema, zero extra work
def handle_webhook(request_body: str) -> tuple[int, str]:
    """Return an (HTTP status, body) pair; adapt this to your web framework."""
    webhook_payload = json.loads(request_body)
    if not validate_invoice(webhook_payload):
        return 400, "Invalid invoice payload"
    return 200, "OK"


# Validate the OpenAI output too, for belt-and-suspenders safety
def process_llm_output(openai_response) -> None:
    llm_output = json.loads(openai_response.choices[0].message.content)
    if not validate_invoice(llm_output):
        raise ValueError("LLM output failed schema validation; check schema definition")
    print(f"Processing invoice {llm_output['invoice_number']} for ${llm_output['total_amount']}")

Build your schema faster: use the JSON Schema Generator to create and refine your schema visually. Paste a sample JSON object and it generates a starter schema automatically. Copy the result straight into your OpenAI response_format.

JavaScript / Node.js version

The OpenAI Node.js SDK mirrors the Python API almost exactly. strict sits inside the json_schema object in the same way; the main difference is that you parse the response with JSON.parse(). Ajv is the Node.js counterpart to Python's jsonschema library: it compiles each schema into a validation function and has excellent TypeScript support.

javascript

import OpenAI from "openai";
import Ajv from "ajv";

const client = new OpenAI();
const ajv = new Ajv();

const invoiceSchema = {
  type: "object",
  properties: {
    vendor_name:    { type: "string" },
    invoice_number: { type: "string" },
    total_amount:   { type: "number" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          amount:      { type: "number" }
        },
        required: ["description", "amount"],
        additionalProperties: false
      }
    }
  },
  required: ["vendor_name", "invoice_number", "total_amount", "line_items"],
  additionalProperties: false
};

const validateInvoice = ajv.compile(invoiceSchema);

async function extractInvoice(rawText) {
  const response = await client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    messages: [{ role: "user", content: `Extract invoice fields: ${rawText}` }],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "invoice_extract",
        strict: true,
        schema: invoiceSchema
      }
    }
  });

  const choice = response.choices[0];

  if (choice.message.refusal) {
    throw new Error(`Model refused: ${choice.message.refusal}`);
  }

  const data = JSON.parse(choice.message.content);

  // Validate even though strict mode guarantees structure —
  // useful for catching schema drift between environments
  if (!validateInvoice(data)) {
    console.error("Schema validation errors:", validateInvoice.errors);
    throw new Error("Output failed schema validation");
  }

  return data;
}

const invoice = await extractInvoice(
  "Invoice #INV-881 from Nordic Parts AS. " +
  "3x Brake Pads at $28.50 each. Total: $85.50"
);

console.log(`${invoice.vendor_name} — Invoice ${invoice.invoice_number}`);
invoice.line_items.forEach(item =>
  console.log(`  ${item.description}: $${item.amount}`)
);

Wrapping up

Structured Outputs eliminates a whole category of production bugs: the ones where the model returns almost-right JSON that breaks your parser at 2 a.m. The workflow is straightforward: design your schema carefully (every nested object needs additionalProperties: false and a complete required array), set strict: true, and handle refusals and truncation explicitly. Once the schema is in place, you can reuse it across your stack (in the OpenAI call, in webhook validation, in your test fixtures) with libraries such as jsonschema (Python) or Ajv (Node.js). If you are starting a schema from scratch, the JSON Schema Generator is the fastest way to get a working baseline schema from a sample payload. The days of prompt-engineering your way to reliable JSON output are over; use the tool built for the job.
