Parsing XML in Python: ElementTree, lxml, and Real-World Patterns

Python's standard library ships with a solid XML parser, so no pip install is required. xml.etree.ElementTree handles the vast majority of real-world XML: RSS feeds, SOAP responses, configuration files, Android resource files, Maven POMs. You only need to reach for lxml when you need XSD schema validation, complex XPath, or truly huge files. Let's look at both, with real examples.

ElementTree Basics: Parsing from a String or File

The xml.etree.ElementTree module gives you two entry points: fromstring() for parsing XML strings, and parse() for reading directly from a file. Here is a practical example using an RSS feed structure:

python

import xml.etree.ElementTree as ET

rss_xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Engineering Blog</title>
    <link>https://blog.example.com</link>
    <description>Articles for developers</description>
    <item>
      <title>Understanding Database Indexes</title>
      <link>https://blog.example.com/db-indexes</link>
      <pubDate>Mon, 15 Jan 2024 09:00:00 GMT</pubDate>
      <category>Database</category>
    </item>
    <item>
      <title>REST API Design Patterns</title>
      <link>https://blog.example.com/rest-patterns</link>
      <pubDate>Wed, 17 Jan 2024 09:00:00 GMT</pubDate>
      <category>API</category>
    </item>
  </channel>
</rss>"""

# Parse from a string
root = ET.fromstring(rss_xml)

# Parse from a file (alternative)
# tree = ET.parse('feed.xml')
# root = tree.getroot()

print(root.tag)           # rss
print(root.attrib)        # {'version': '2.0'}

channel = root.find('channel')
print(channel.find('title').text)  # Engineering Blog
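
XML from external sources is not always well-formed, so it can help to wrap parse() in a small guard. The sketch below is one possible approach, not a required pattern: load_root is a hypothetical helper name, and the in-memory io.BytesIO objects stand in for real files.

```python
import io
import xml.etree.ElementTree as ET

def load_root(source):
    """Parse XML from a path or file-like object; return None if it is malformed."""
    try:
        return ET.parse(source).getroot()
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple pointing at the failure
        line, column = exc.position
        print(f"Malformed XML at line {line}, column {column}: {exc}")
        return None

# In-memory stand-ins for files on disk (assumption for the demo)
good = load_root(io.BytesIO(b"<rss version='2.0'><channel/></rss>"))
bad = load_root(io.BytesIO(b"<rss><channel></rss>"))  # mismatched closing tag

print(good.tag)  # rss
print(bad)       # None
```

Returning None keeps the caller simple here; in a larger program, re-raising with more context may be the better choice.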

find, findall, and findtext: Searching the Tree

These three methods are your main tools for extracting data. All of them accept a simple path expression (a limited XPath subset) to navigate the element tree:

python

import xml.etree.ElementTree as ET

root = ET.fromstring(rss_xml)
channel = root.find('channel')

# find(): returns the first matching element, or None
first_item = channel.find('item')
print(first_item.find('title').text)  # Understanding Database Indexes

# findall(): returns a list of all matching elements
items = channel.findall('item')
print(len(items))  # 2

for item in items:
    title = item.findtext('title')      # findtext() returns .text directly
    link = item.findtext('link')
    pub_date = item.findtext('pubDate')
    print(f"{title} — {pub_date}")

# Nested path with '/'
all_titles = channel.findall('item/title')
print([el.text for el in all_titles])
# ['Understanding Database Indexes', 'REST API Design Patterns']

# findtext() with a default value (avoids AttributeError on missing elements)
author = channel.findtext('item/author', default='Unknown Author')
print(author)  # Unknown Author
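
Beyond plain tag paths, the limited XPath subset in ElementTree also understands a few predicates, which often covers simple filtering without extra dependencies. The following is a minimal sketch; the id attributes on the items are an assumption added for illustration and are not part of the feed shown earlier.

```python
import xml.etree.ElementTree as ET

feed = ET.fromstring("""
<channel>
  <item id="a1"><title>Understanding Database Indexes</title><category>Database</category></item>
  <item id="a2"><title>REST API Design Patterns</title><category>API</category></item>
</channel>
""")

# [tag='text'] keeps elements that have a child whose text equals the value
db_titles = [el.text for el in feed.findall("item[category='Database']/title")]
print(db_titles)  # ['Understanding Database Indexes']

# [@attrib='value'] filters on an attribute (hypothetical 'id' attribute)
by_id = feed.find("item[@id='a2']")
print(by_id.findtext("title"))  # REST API Design Patterns

# [last()] selects the final match; numeric positions are 1-based
last_title = feed.findtext("item[last()]/title")
print(last_title)  # REST API Design Patterns
```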

Prefer findtext() with a default value. If you use find().text and the element doesn't exist, find() returns None and .text raises an AttributeError. findtext('tag', default='') handles missing elements safely, which is much more reliable when parsing XML from external sources.
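
To make the difference concrete, here is a small sketch comparing the two approaches on an item that has no author element. The sample item and variable names are assumptions for the demo. It also shows one caveat: an element that exists but is empty yields an empty string rather than the default.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item><title>REST API Design Patterns</title><link/></item>")

# Missing element: find() returns None, so reading .text raises AttributeError
try:
    author = item.find("author").text
except AttributeError:
    author = None

# findtext() returns the default instead of raising
safe_author = item.findtext("author", default="Unknown Author")

# Caveat: <link/> exists but has no text, so the result is '' and not the default
empty_link = item.findtext("link", default="no link")

print(author)            # None
print(safe_author)       # Unknown Author
print(repr(empty_link))  # ''
```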

Reading Attributes

python

import xml.etree.ElementTree as ET

xml_str = """<catalog>
  <product id="P001" featured="true">
    <name>Mechanical Keyboard</name>
    <price currency="USD">189.00</price>
  </product>
  <product id="P002" featured="false">
    <name>USB-C Hub</name>
    <price currency="USD">49.99</price>
  </product>
</catalog>"""

root = ET.fromstring(xml_str)

for product in root.findall('product'):
    product_id = product.get('id')           # get() reads an attribute
    featured = product.get('featured', 'false')  # with a default value
    name = product.findtext('name')
    price_el = product.find('price')
    price = float(price_el.text)
    currency = price_el.get('currency')

    print(f"{product_id}: {name} — {currency} {price} (featured: {featured})")
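
Attribute values always come back as strings, so any typing is up to the caller. The sketch below shows one way to convert them; using Decimal for the price and the stock attribute lookup are illustrative assumptions, not something the catalog format requires.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

product = ET.fromstring(
    '<product id="P001" featured="true">'
    '<name>Mechanical Keyboard</name>'
    '<price currency="USD">189.00</price>'
    '</product>'
)

# .attrib is a plain dict of strings
print(product.attrib)  # {'id': 'P001', 'featured': 'true'}

# Convert the string flag to a real bool
is_featured = product.get("featured", "false").lower() == "true"

# Decimal avoids float rounding for money (a design choice, not a requirement)
price = Decimal(product.findtext("price", default="0"))

# A missing attribute returns None when no default is given
# ('stock' is a hypothetical attribute that this element lacks)
in_stock = product.get("stock")

print(is_featured, price, in_stock)  # True 189.00 None
```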

Handling XML Namespaces

Namespaces are the part of XML parsing that makes most developers groan. In ElementTree, namespace URIs appear in curly braces in tag names: {http://...}tagname. Here is how to handle them cleanly:

python

import xml.etree.ElementTree as ET

soap_xml = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:inv="http://www.example.com/invoice">
  <soap:Body>
    <inv:GetInvoiceResponse>
      <inv:InvoiceId>INV-2024-0042</inv:InvoiceId>
      <inv:Amount currency="EUR">1250.00</inv:Amount>
      <inv:Status>Paid</inv:Status>
    </inv:GetInvoiceResponse>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(soap_xml)

# ElementTree expands namespace prefixes into URIs in curly braces
# You can define a namespace map for cleaner path lookups
ns = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'inv': 'http://www.example.com/invoice'
}

# Pass the namespace map to find/findall
invoice_id = root.find('.//inv:InvoiceId', ns).text
amount_el = root.find('.//inv:Amount', ns)
status = root.findtext('.//inv:Status', namespaces=ns)

print(invoice_id)                    # INV-2024-0042
print(amount_el.text)                # 1250.00
print(amount_el.get('currency'))     # EUR
print(status)                        # Paid
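
A namespace map is not the only option. The sketch below shows two alternatives that can be handy in quick scripts: spelling out the fully qualified {uri}tag name, and the {*} wildcard that ElementTree accepts from Python 3.8 onward. The small inv:Invoice document is an assumption for the demo, reusing the invoice namespace from the example above.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<inv:Invoice xmlns:inv="http://www.example.com/invoice">'
    '<inv:Status>Paid</inv:Status>'
    '</inv:Invoice>'
)

INV = "http://www.example.com/invoice"

# Tags are stored in expanded form, not with their prefix
print(doc.tag)  # {http://www.example.com/invoice}Invoice

# Option 1: fully qualified name (doubled braces escape them in the f-string)
status_a = doc.findtext(f"{{{INV}}}Status")

# Option 2: {*} matches the tag in any namespace (Python 3.8+)
status_b = doc.findtext("{*}Status")

# Strip the namespace to get the local name
local_name = doc.tag.rpartition("}")[2]

print(status_a, status_b, local_name)  # Paid Paid Invoice
```

The wildcard is convenient but also matches same-named tags from unrelated namespaces, so the explicit map is usually the safer default for production code.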

Building XML Programmatically

ElementTree also lets you build XML from scratch, which is useful when you need to create SOAP requests or generate XML output:

python

import xml.etree.ElementTree as ET

# Build an order document
order = ET.Element('order', id='ORD-9981', status='pending')

customer = ET.SubElement(order, 'customer')
ET.SubElement(customer, 'name').text = 'Jane Smith'
ET.SubElement(customer, 'email').text = '[email protected]'

items = ET.SubElement(order, 'items')
for product_id, name, qty, price in [
    ('P001', 'Mechanical Keyboard', 1, 189.00),
    ('P002', 'USB-C Hub', 2, 49.99),
]:
    item = ET.SubElement(items, 'item', productId=product_id, qty=str(qty))
    ET.SubElement(item, 'name').text = name
    ET.SubElement(item, 'price', currency='USD').text = str(price)

# Serialize to a string
ET.indent(order, space='  ')  # Python 3.9+: pretty-prints the tree in place
xml_output = ET.tostring(order, encoding='unicode', xml_declaration=True)
print(xml_output)
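
When the output goes to a file or over the network, writing bytes with an explicit encoding is often more convenient than a string. The following sketch uses ElementTree.write() with an in-memory buffer standing in for a real file, and parses the result back to show that special characters are escaped on the way out. The note element and its text are assumptions for the demo.

```python
import io
import xml.etree.ElementTree as ET

order = ET.Element("order", id="ORD-9981")
# Hypothetical element containing characters that must be escaped
ET.SubElement(order, "note").text = "Fragile & urgent <handle with care>"

# io.BytesIO stands in for open('order.xml', 'wb')
buffer = io.BytesIO()
ET.ElementTree(order).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()
print(xml_bytes.decode("utf-8"))

# Round-trip: the parser restores the original text
restored = ET.fromstring(xml_bytes)
print(restored.findtext("note"))  # Fragile & urgent <handle with care>
```

Setting .text and letting the serializer escape it is generally safer than assembling XML with string formatting.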

iterparse: Streaming Large XML Files

For large XML files (tens of MB or more), loading the whole document into memory with parse() is expensive. iterparse() streams the file (similar to SAX in its event-driven approach) and emits events as elements are encountered, letting you process and discard elements as you go:

python

import xml.etree.ElementTree as ET

def process_large_feed(filepath):
    """Process a large RSS feed without loading it all into memory."""
    articles = []

    for event, elem in ET.iterparse(filepath, events=('end',)):
        if elem.tag == 'item':
            articles.append({
                'title': elem.findtext('title', ''),
                'link': elem.findtext('link', ''),
                'pub_date': elem.findtext('pubDate', ''),
            })
            # Essential: clear the element after processing to free memory
            elem.clear()

        if len(articles) >= 1000:
            yield from articles
            articles.clear()

    yield from articles  # yield any remaining items

for article in process_large_feed('large_feed.xml'):
    print(article['title'])

Calling elem.clear() after processing each element is what keeps memory use low regardless of file size. Without it, ElementTree accumulates every parsed element in memory and the benefit of streaming is lost. Note that a cleared element is emptied but stays attached to its parent, so a very large feed still leaves a trail of empty item elements behind.
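
If those leftover empty elements matter, one option is to detach each processed element from its parent as well. The sketch below is one way to do that under stated assumptions: it tracks open elements with 'start' and 'end' events so the parent is always known. iter_items is a hypothetical helper, and the in-memory feed stands in for a large file.

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source, tag="item"):
    """Yield one dict per matching element, detaching each from its parent."""
    ancestors = []  # stack of currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            ancestors.append(elem)
            continue
        ancestors.pop()  # the element that just closed
        if elem.tag == tag:
            record = {
                "title": elem.findtext("title", ""),
                "link": elem.findtext("link", ""),
            }
            if ancestors:
                # Drop the finished element so the parent does not keep growing
                ancestors[-1].remove(elem)
            yield record

# Small in-memory feed standing in for a large file (assumption for the demo)
data = (
    b"<rss><channel><title>Feed</title>"
    b"<item><title>A</title><link>a</link></item>"
    b"<item><title>B</title><link>b</link></item>"
    b"</channel></rss>"
)
records = list(iter_items(io.BytesIO(data)))
print(records)
# [{'title': 'A', 'link': 'a'}, {'title': 'B', 'link': 'b'}]
```

The extra bookkeeping is only worth it for very large inputs; for files of moderate size, elem.clear() alone is usually enough.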

lxml: When You Need More Power

lxml is a fast, C-based XML library that extends the ElementTree API with full XPath 1.0 support, XSD schema validation, and XSLT transformations. Install it with pip install lxml:

python

from lxml import etree

# lxml uses the same API as ElementTree in most cases
# lxml rejects a str that carries an encoding declaration, so pass bytes here
root = etree.fromstring(rss_xml.encode())

# Full XPath 1.0, far more powerful than ElementTree's subset
items = root.xpath('//item[position() <= 2]/title/text()')
print(items)  # ['Understanding Database Indexes', 'REST API Design Patterns']

# XPath with predicates: get the items in a specific category
db_items = root.xpath('//item[category="Database"]/title/text()')
print(db_items)  # ['Understanding Database Indexes']

# XSD schema validation
xsd_doc = etree.parse('schema.xsd')
schema = etree.XMLSchema(xsd_doc)
xml_doc = etree.parse('data.xml')

if schema.validate(xml_doc):
    print("Valid!")
else:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")

Useful Tools for Working with XML

When building XML integrations in Python, these browser-based tools help with the data side: XML Formatter to make raw API responses readable, XML Validator to check well-formedness before you set up the parser, and XML to JSON when you would rather work with dicts than elements.

Conclusion

Python's xml.etree.ElementTree covers most real-world XML scenarios: use find() and findall() with a namespace map for SOAP and namespaced feeds, findtext() with default values to avoid AttributeError on missing elements, and iterparse() for large files that you can't load into memory all at once. Reach for lxml when you need XSD validation or the full XPath expression language. The standard library handles everything else without trouble.
