How to Parse XML in Python: ElementTree, lxml, and Practical Patterns

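Python's standard library ships with a solid XML parser, so no pip install is required. xml.etree.ElementTree handles most real-world XML: RSS feeds, SOAP responses, configuration files, Android resource files, and Maven POMs. Reach for lxml only when you need XSD schema validation, complex XPath, or are dealing with truly huge files. Let's walk through both with real examples.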

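ElementTree Basics: Parsing from a String or a File

The xml.etree.ElementTree module has two entry points: fromstring(), which parses an XML string, and parse(), which reads directly from a file. Here is a practical example using an RSS feed structure: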

python

import xml.etree.ElementTree as ET

rss_xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Engineering Blog</title>
    <link>https://blog.example.com</link>
    <description>Articles for developers</description>
    <item>
      <title>Understanding Database Indexes</title>
      <link>https://blog.example.com/db-indexes</link>
      <pubDate>Mon, 15 Jan 2024 09:00:00 GMT</pubDate>
      <category>Database</category>
    </item>
    <item>
      <title>REST API Design Patterns</title>
      <link>https://blog.example.com/rest-patterns</link>
      <pubDate>Wed, 17 Jan 2024 09:00:00 GMT</pubDate>
      <category>API</category>
    </item>
  </channel>
</rss>"""

# Parse from a string
root = ET.fromstring(rss_xml)

# Parse from a file (alternative)
# tree = ET.parse('feed.xml')
# root = tree.getroot()

print(root.tag)           # rss
print(root.attrib)        # {'version': '2.0'}

channel = root.find('channel')
print(channel.find('title').text)  # Engineering Blog
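
The file-based entry point can be sketched a little further. The example below is a minimal illustration, not part of the original walkthrough: the helper name load_feed_title and the temporary file names are hypothetical. It shows parse() reading from disk and ET.ParseError being raised when the input is not well-formed.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def load_feed_title(path):
    """Return the channel title of an RSS file, or None if the XML is malformed."""
    try:
        tree = ET.parse(str(path))
    except ET.ParseError as exc:
        # Raised for input that is not well-formed (e.g. mismatched tags)
        print(f"Could not parse {path}: {exc}")
        return None
    return tree.getroot().findtext('channel/title')


# Temporary files keep the sketch self-contained
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'feed.xml'
    good.write_text(
        '<rss version="2.0"><channel><title>Engineering Blog</title></channel></rss>',
        encoding='utf-8',
    )
    bad = Path(tmp) / 'broken.xml'
    bad.write_text('<rss><channel><title>Unclosed</channel></rss>', encoding='utf-8')

    good_title = load_feed_title(good)
    bad_title = load_feed_title(bad)

print(good_title)
print(bad_title)
```

Catching ParseError at the boundary is one reasonable choice when the file comes from an external source; whether to return None or re-raise depends on the caller.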

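find, findall, findtext: Searching the Tree

These three methods are the main tools for extracting data. All of them accept a simple path expression (a limited, XPath-like syntax) for navigating the element tree: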

python

import xml.etree.ElementTree as ET

# rss_xml is the RSS string defined in the first example
root = ET.fromstring(rss_xml)
channel = root.find('channel')

# find(): returns the first matching element, or None if there is none
first_item = channel.find('item')
print(first_item.find('title').text)  # Understanding Database Indexes

# findall(): returns a list of all matching elements
items = channel.findall('item')
print(len(items))  # 2

for item in items:
    title = item.findtext('title')      # findtext() returns the .text directly
    link = item.findtext('link')
    pub_date = item.findtext('pubDate')
    print(f"{title} — {pub_date}")

# Nested paths using '/'
all_titles = channel.findall('item/title')
print([el.text for el in all_titles])
# ['Understanding Database Indexes', 'REST API Design Patterns']

# findtext() with a default value (avoids AttributeError on missing elements)
author = channel.findtext('item/author', default='Unknown author')
print(author)  # Unknown author

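Prefer findtext() with a default value. If you use find().text and the element does not exist, find() returns None and accessing .text raises AttributeError. findtext('tag', default='') handles missing elements gracefully, which is much safer when parsing XML from external sources.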
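
The difference can be seen side by side in a short sketch. The sample item element here is made up for illustration; it has no author child and an empty summary child.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring(
    '<item><title>REST API Design Patterns</title><summary/></item>'
)

# find() returns None for a missing element, so .text raises AttributeError
try:
    author = item.find('author').text
except AttributeError:
    author = None

# findtext() returns the default instead of raising
safe_author = item.findtext('author', default='Unknown author')

# The default applies only when the element is missing; an element that
# exists but has no text yields an empty string
empty_summary = item.findtext('summary', default='n/a')

print(author, safe_author, repr(empty_summary))
```

Note the last case: the default does not cover elements that are present but empty, so an extra check may still be needed there.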

Reading Attributes

python

import xml.etree.ElementTree as ET

xml_str = """<catalog>
  <product id="P001" featured="true">
    <name>Mechanical Keyboard</name>
    <price currency="USD">189.00</price>
  </product>
  <product id="P002" featured="false">
    <name>USB-C Hub</name>
    <price currency="USD">49.99</price>
  </product>
</catalog>"""

root = ET.fromstring(xml_str)

for product in root.findall('product'):
    product_id = product.get('id')           # use get() for attributes
    featured = product.get('featured', 'false')  # with a default value
    name = product.findtext('name')
    price_el = product.find('price')
    price = float(price_el.text)
    currency = price_el.get('currency')

    print(f"{product_id}: {name} - {currency} {price} (featured: {featured})")
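
Attribute values always arrive as strings, so a small conversion step is often useful. The following is a sketch under assumptions: the helper parse_product is hypothetical, the 'USD' fallback currency is an arbitrary choice for the example, and Decimal is swapped in for float to avoid rounding surprises with prices.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

catalog_xml = """<catalog>
  <product id="P001" featured="true"><price currency="USD">189.00</price></product>
  <product id="P002"><price>49.99</price></product>
</catalog>"""


def parse_product(product):
    """Convert one <product> element into a dict of typed values."""
    price_el = product.find('price')
    return {
        'id': product.get('id'),
        # A missing attribute falls back to 'false' and therefore False
        'featured': product.get('featured', 'false') == 'true',
        'price': Decimal(price_el.text),
        # Assumed default for this sketch, not something the XML defines
        'currency': price_el.get('currency', 'USD'),
    }


products = [parse_product(p) for p in ET.fromstring(catalog_xml).findall('product')]
for p in products:
    print(p)
```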

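Handling XML Namespaces

Namespaces are the part of XML parsing that most developers dread. In ElementTree, the namespace URI appears in curly braces inside the tag name: {http://...}tagname. Here is how to handle them cleanly: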

python

import xml.etree.ElementTree as ET

soap_xml = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:inv="http://www.example.com/invoice">
  <soap:Body>
    <inv:GetInvoiceResponse>
      <inv:InvoiceId>INV-2024-0042</inv:InvoiceId>
      <inv:Amount currency="EUR">1250.00</inv:Amount>
      <inv:Status>Paid</inv:Status>
    </inv:GetInvoiceResponse>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(soap_xml)

# ElementTree expands each namespace prefix to its URI in curly braces
# Define a namespace map for cleaner XPath-style lookups
ns = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'inv': 'http://www.example.com/invoice'
}

# Use the namespace map with find/findall
invoice_id = root.find('.//inv:InvoiceId', ns).text
amount_el = root.find('.//inv:Amount', ns)
status = root.findtext('.//inv:Status', namespaces=ns)

print(invoice_id)                    # INV-2024-0042
print(amount_el.text)                # 1250.00
print(amount_el.get('currency'))     # EUR
print(status)                        # Paid
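
To make the curly-brace form concrete, the sketch below uses a trimmed-down invoice document (invented for this example) and looks up the same element in three ways: with the expanded {uri}tag name, with a prefix map, and with the {*} wildcard, which requires Python 3.8 or later. It also shows that an unqualified name does not match a namespaced element.

```python
import xml.etree.ElementTree as ET

invoice_xml = (
    '<inv:Invoice xmlns:inv="http://www.example.com/invoice">'
    '<inv:Status>Paid</inv:Status>'
    '</inv:Invoice>'
)
root = ET.fromstring(invoice_xml)

# The prefix is gone; the tag carries the full URI in braces
print(root.tag)  # {http://www.example.com/invoice}Invoice

INV = '{http://www.example.com/invoice}'
status_expanded = root.findtext(f'{INV}Status')
status_mapped = root.findtext(
    'inv:Status', namespaces={'inv': 'http://www.example.com/invoice'}
)
status_wildcard = root.findtext('{*}Status')  # any namespace, Python 3.8+

# A bare tag name does not match a namespaced element
status_unqualified = root.findtext('Status')

print(status_expanded, status_mapped, status_wildcard, status_unqualified)
```

The prefix in the map is local to your code; it does not have to match the prefix used in the document, only the URI does.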

Building XML Programmatically

ElementTree can also build XML from scratch, which is useful when constructing SOAP requests or generating XML output:

python

import xml.etree.ElementTree as ET

# Build an order document
order = ET.Element('order', id='ORD-9981', status='pending')

customer = ET.SubElement(order, 'customer')
ET.SubElement(customer, 'name').text = 'Jane Smith'
ET.SubElement(customer, 'email').text = '[email protected]'

items = ET.SubElement(order, 'items')
for product_id, name, qty, price in [
    ('P001', 'Mechanical Keyboard', 1, 189.00),
    ('P002', 'USB-C Hub', 2, 49.99),
]:
    item = ET.SubElement(items, 'item', productId=product_id, qty=str(qty))
    ET.SubElement(item, 'name').text = name
    ET.SubElement(item, 'price', currency='USD').text = str(price)

# Serialize to a string
ET.indent(order, space='  ')  # Python 3.9+: indents the tree in place
# The xml_declaration argument requires Python 3.8+
xml_output = ET.tostring(order, encoding='unicode', xml_declaration=True)
print(xml_output)
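
One detail worth seeing in practice is that text assigned to .text is escaped on output, so special characters do not need manual handling. The sketch below is an illustration with a made-up note element; it serializes to bytes through ElementTree.write() with an in-memory buffer standing in for a file, then parses the result back.

```python
import io
import xml.etree.ElementTree as ET

order = ET.Element('order', id='ORD-9981')
ET.SubElement(order, 'note').text = 'Fragile & urgent <handle with care>'

# Write to an in-memory buffer; a real file opened in 'wb' mode
# or a filename works the same way
buffer = io.BytesIO()
ET.ElementTree(order).write(buffer, encoding='utf-8', xml_declaration=True)
xml_bytes = buffer.getvalue()
print(xml_bytes.decode('utf-8'))

# Round trip: the escaped text comes back unchanged
reparsed = ET.fromstring(xml_bytes)
print(reparsed.findtext('note'))
```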

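iterparse: Streaming Large XML Files

For large XML files (tens of MB or more), loading the whole document into memory with parse() is expensive. iterparse() streams the file (similar in spirit to the event-driven SAX approach), emitting an event as each element is encountered so that you can discard elements as you process them: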

python

import xml.etree.ElementTree as ET

def process_large_feed(filepath):
    """Process a large RSS/Atom feed without loading all of it into memory."""
    articles = []

    for event, elem in ET.iterparse(filepath, events=('end',)):
        if elem.tag == 'item':
            articles.append({
                'title': elem.findtext('title', ''),
                'link': elem.findtext('link', ''),
                'pub_date': elem.findtext('pubDate', ''),
            })
            # Important: clear the element after processing to free memory
            elem.clear()

        if len(articles) >= 1000:
            yield from articles
            articles.clear()

    yield from articles  # yield whatever is left

for article in process_large_feed('large_feed.xml'):
    print(article['title'])

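Calling elem.clear() after processing each element is the key to keeping memory use low regardless of file size: it discards the element's children and text once you are done with them. Without it, ElementTree accumulates every parsed element in memory and the benefit of streaming is lost. Note that the cleared, now-empty elements still remain attached to their parent, so on very large files memory keeps growing slowly unless they are also removed from the parent.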
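
One way to also detach processed elements from their parent, using only the standard library, is to track the currently open elements through 'start' events. This is a sketch of that variant, not the only approach; the function name iter_items and the tiny in-memory feed are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag='item'):
    """Yield one dict per matching element, detaching each from its parent."""
    open_elements = []  # stack of elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            open_elements.append(elem)
            continue

        open_elements.pop()  # elem has just been closed
        if elem.tag == tag:
            yield {
                'title': elem.findtext('title', ''),
                'link': elem.findtext('link', ''),
            }
            if open_elements:
                # Remove the finished element from its parent so the
                # partially built tree does not keep a reference to it
                open_elements[-1].remove(elem)


sample = io.BytesIO(
    b'<rss><channel><title>Feed</title>'
    b'<item><title>A</title><link>https://blog.example.com/a</link></item>'
    b'<item><title>B</title><link>https://blog.example.com/b</link></item>'
    b'</channel></rss>'
)
results = list(iter_items(sample))
for article in results:
    print(article['title'], article['link'])
```

Yielding each item directly, instead of batching, keeps the generator simple; callers that want batches can group the results themselves.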

lxml: When You Need More Power

The lxml library is a fast, C-based XML library that extends the ElementTree API with full XPath 1.0 support, XSD schema validation, and XSLT transforms. Install it with pip install lxml:

python

from lxml import etree

# lxml uses the same API as ElementTree in most cases
# rss_xml is the RSS string defined in the first example.
# Pass bytes: lxml rejects str input that carries an XML encoding declaration
root = etree.fromstring(rss_xml.encode())

# Full XPath 1.0: far more powerful than ElementTree's subset
items = root.xpath('//item[position() <= 2]/title/text()')
print(items)  # ['Understanding Database Indexes', 'REST API Design Patterns']

# XPath with a predicate: get the items in a specific category
db_items = root.xpath('//item[category="Database"]/title/text()')
print(db_items)  # ['Understanding Database Indexes']

# XSD schema validation
xsd_doc = etree.parse('schema.xsd')
schema = etree.XMLSchema(xsd_doc)
xml_doc = etree.parse('data.xml')

if schema.validate(xml_doc):
    print("Valid!")
else:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
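
For comparison, it helps to know how far the standard library's XPath subset goes before lxml becomes necessary. The sketch below uses only xml.etree.ElementTree on a shortened copy of the feed: child-text predicates and simple positional predicates work, while expressions such as position() <= 2 or a trailing text() step are outside the subset and are the kind of query that calls for lxml.

```python
import xml.etree.ElementTree as ET

feed = ET.fromstring("""<rss><channel>
  <item><title>Understanding Database Indexes</title><category>Database</category></item>
  <item><title>REST API Design Patterns</title><category>API</category></item>
</channel></rss>""")

# Predicate on a child's text content
db_titles = [el.text for el in feed.findall(".//item[category='Database']/title")]

# Positional predicates: a 1-based index or last()
first_title = feed.findtext('.//item[1]/title')
last_title = feed.findtext('.//item[last()]/title')

print(db_titles)
print(first_title)
print(last_title)
```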

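Helpful Tools for Working with XML

When building XML integrations in Python, these browser tools help on the data side: an XML Formatter to make raw API responses readable, an XML Validator to check well-formedness before wiring up a parser, and XML to JSON for when you would rather work with dicts than elements.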

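Summary

Python's xml.etree.ElementTree covers most real-world XML scenarios: use a namespace map with find() and findall() for SOAP and namespaced feeds, use findtext() with a default value to avoid AttributeError on missing elements, and use iterparse() for large files that cannot be loaded into memory at once. Use lxml when you need XSD validation or the full XPath expression language. The standard library handles everything else without trouble.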
