Parsing XML in Python: ElementTree, lxml, and Practical Patterns

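Python's standard library ships with a capable XML parser, so no pip install is needed. xml.etree.ElementTree handles most real-world XML: RSS feeds, SOAP responses, configuration files, Android resource files, and Maven POMs. Reach for lxml only when you need XSD schema validation, complex XPath, or support for truly massive files. This article walks through both with practical examples.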

ElementTree Basics: Parsing from a String or a File

The xml.etree.ElementTree module provides two entry points: fromstring() for parsing an XML string and parse() for reading directly from a file. Here is a practical example using an RSS feed structure:

python

import xml.etree.ElementTree as ET

rss_xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Engineering Blog</title>
    <link>https://blog.example.com</link>
    <description>Articles for developers</description>
    <item>
      <title>Understanding Database Indexes</title>
      <link>https://blog.example.com/db-indexes</link>
      <pubDate>Mon, 15 Jan 2024 09:00:00 GMT</pubDate>
      <category>Database</category>
    </item>
    <item>
      <title>REST API Design Patterns</title>
      <link>https://blog.example.com/rest-patterns</link>
      <pubDate>Wed, 17 Jan 2024 09:00:00 GMT</pubDate>
      <category>API</category>
    </item>
  </channel>
</rss>"""

# Parse from a string
root = ET.fromstring(rss_xml)

# Parse from a file (alternative)
# tree = ET.parse('feed.xml')
# root = tree.getroot()

print(root.tag)           # rss
print(root.attrib)        # {'version': '2.0'}

channel = root.find('channel')
print(channel.find('title').text)  # Engineering Blog
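
XML from outside sources is not always well-formed, and both fromstring() and parse() raise ET.ParseError when it is not. The following is a minimal sketch of catching that error; parse_or_none is a hypothetical helper name chosen for illustration, not part of the standard library.

```python
import xml.etree.ElementTree as ET

def parse_or_none(xml_text):
    """Return the root element, or None if the text is not well-formed XML.

    parse_or_none is a hypothetical helper used only for this illustration.
    """
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple pointing at the problem
        line, column = exc.position
        print(f"Malformed XML at line {line}, column {column}: {exc}")
        return None

good = parse_or_none("<rss version='2.0'><channel/></rss>")
bad = parse_or_none("<rss><channel></rss>")  # mismatched closing tag
```

Returning None is only one option; depending on the application, re-raising with more context may be the better choice.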

find, findall, findtext: Searching the Tree

These three methods are the main tools for extracting data. Each accepts a simple path expression (a limited subset of XPath) for navigating the element tree:

python

import xml.etree.ElementTree as ET

root = ET.fromstring(rss_xml)
channel = root.find('channel')

# find() returns the first matching element, or None if there is no match
first_item = channel.find('item')
print(first_item.find('title').text)  # Understanding Database Indexes

# findall() returns a list of all matching elements
items = channel.findall('item')
print(len(items))  # 2

for item in items:
    title = item.findtext('title')      # findtext() returns the element's text directly
    link = item.findtext('link')
    pub_date = item.findtext('pubDate')
    print(f"{title} — {pub_date}")

# Nested path using '/'
all_titles = channel.findall('item/title')
print([el.text for el in all_titles])
# ['Understanding Database Indexes', 'REST API Design Patterns']

# findtext() with a default (avoids AttributeError when the element is missing)
author = channel.findtext('item/author', default='Unknown Author')
print(author)  # Unknown Author

Prefer findtext() with a default. With find().text, a missing element makes find() return None, and accessing .text on None raises AttributeError. findtext('tag', default='') handles missing elements gracefully, which is much safer when parsing XML from external sources.
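
The difference between the two lookups can be seen side by side in a short sketch. The item fragment and the summary element here are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: <author> is absent, <summary> is present but empty
item = ET.fromstring(
    "<item><title>REST API Design Patterns</title><summary/></item>"
)

# find() returns None for a missing element, so .text raises AttributeError
try:
    author = item.find('author').text
except AttributeError:
    author = None

# findtext() with a default covers the missing element in one call
safe_author = item.findtext('author', default='Unknown Author')

# An element that exists but has no text yields '' rather than the default
empty_summary = item.findtext('summary', default='n/a')

print(author, safe_author, repr(empty_summary))
```

Note the last case: the default only applies when the element is absent, so an empty element still needs its own check if an empty string is not acceptable.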

Reading Attributes

python

import xml.etree.ElementTree as ET

xml_str = """<catalog>
  <product id="P001" featured="true">
    <name>Mechanical Keyboard</name>
    <price currency="USD">189.00</price>
  </product>
  <product id="P002" featured="false">
    <name>USB-C Hub</name>
    <price currency="USD">49.99</price>
  </product>
</catalog>"""

root = ET.fromstring(xml_str)

for product in root.findall('product'):
    product_id = product.get('id')           # use get() for attributes
    featured = product.get('featured', 'false')  # with a default value
    name = product.findtext('name')
    price_el = product.find('price')
    price = float(price_el.text)
    currency = price_el.get('currency')

    print(f"{product_id}: {name} — {currency} {price} (featured: {featured})")
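
Attribute values and element text always come back as strings, so flags and prices need explicit conversion. The sketch below shows one way to do that, assuming a small dataclass as the target; Product and to_product are hypothetical names, and using Decimal for the price is a design choice for this example rather than something the original code requires.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class Product:  # hypothetical container for this illustration
    product_id: str
    name: str
    price: Decimal
    currency: str
    featured: bool

def to_product(el):
    """Convert a <product> element into a typed Product."""
    price_el = el.find('price')
    return Product(
        product_id=el.get('id'),
        name=el.findtext('name', default=''),
        price=Decimal(price_el.text),   # Decimal avoids float rounding on money
        currency=price_el.get('currency'),
        # get() returns a string, so compare against 'true' explicitly
        featured=el.get('featured', 'false') == 'true',
    )

catalog = ET.fromstring(
    '<catalog>'
    '<product id="P001" featured="true"><name>Mechanical Keyboard</name>'
    '<price currency="USD">189.00</price></product>'
    '<product id="P002"><name>USB-C Hub</name>'
    '<price currency="USD">49.99</price></product>'
    '</catalog>'
)
products = [to_product(el) for el in catalog.findall('product')]
```

The second product omits the featured attribute on purpose, so the default passed to get() decides the flag.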

Handling XML Namespaces

Namespaces are the part of XML parsing that trips up most developers. In ElementTree, the namespace URI appears in curly braces in front of the tag name: {http://...}tagname. Here is a clean way to handle it:

python

import xml.etree.ElementTree as ET

soap_xml = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:inv="http://www.example.com/invoice">
  <soap:Body>
    <inv:GetInvoiceResponse>
      <inv:InvoiceId>INV-2024-0042</inv:InvoiceId>
      <inv:Amount currency="EUR">1250.00</inv:Amount>
      <inv:Status>Paid</inv:Status>
    </inv:GetInvoiceResponse>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(soap_xml)

# ElementTree expands each namespace prefix into its URI wrapped in curly braces
# Defining a namespace map allows cleaner XPath-style lookups
ns = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'inv': 'http://www.example.com/invoice'
}

# Pass the namespace map to find/findall
invoice_id = root.find('.//inv:InvoiceId', ns).text
amount_el = root.find('.//inv:Amount', ns)
status = root.findtext('.//inv:Status', namespaces=ns)

print(invoice_id)                    # INV-2024-0042
print(amount_el.text)                # 1250.00
print(amount_el.get('currency'))     # EUR
print(status)                        # Paid
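
To make the curly-brace form concrete, here is a small sketch that looks up the same element three ways. The Invoice document is a cut-down, made-up variant of the SOAP payload above, and the {*} wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<inv:Invoice xmlns:inv="http://www.example.com/invoice">'
    '<inv:Status>Paid</inv:Status>'
    '</inv:Invoice>'
)

INV = 'http://www.example.com/invoice'

# The parsed tag carries the full URI, not the "inv:" prefix
print(doc.tag)  # {http://www.example.com/invoice}Invoice

# Equivalent lookups: explicit {uri}tag form vs. a prefix map
by_uri = doc.findtext(f'{{{INV}}}Status')
by_map = doc.findtext('inv:Status', namespaces={'inv': INV})

# Python 3.8+: {*} matches the tag name in any namespace
by_wildcard = doc.findtext('{*}Status')
```

The prefix in the map does not have to match the prefix used in the document; only the URI matters.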

Building XML Programmatically

ElementTree can also construct XML from scratch, which is useful when you need to build a SOAP request or generate XML output:

python

import xml.etree.ElementTree as ET

# Build an order document
order = ET.Element('order', id='ORD-9981', status='pending')

customer = ET.SubElement(order, 'customer')
ET.SubElement(customer, 'name').text = 'Jane Smith'
ET.SubElement(customer, 'email').text = '[email protected]'

items = ET.SubElement(order, 'items')
for product_id, name, qty, price in [
    ('P001', 'Mechanical Keyboard', 1, 189.00),
    ('P002', 'USB-C Hub', 2, 49.99),
]:
    item = ET.SubElement(items, 'item', productId=product_id, qty=str(qty))
    ET.SubElement(item, 'name').text = name
    ET.SubElement(item, 'price', currency='USD').text = str(price)

# Serialize to a string
ET.indent(order, space='  ')  # Python 3.9+: pretty-prints the tree in place
xml_output = ET.tostring(order, encoding='unicode', xml_declaration=True)
print(xml_output)
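
When the output goes to a file or over the network rather than to the console, ElementTree.write() produces encoded bytes with a declaration. A minimal sketch, assuming an in-memory BytesIO in place of a real file and a made-up note element to show escaping:

```python
import io
import xml.etree.ElementTree as ET

order = ET.Element('order', id='ORD-9981', status='pending')
# 'note' is a hypothetical element; the '&' is escaped automatically on output
ET.SubElement(order, 'note').text = 'Fragile & urgent'

# write() accepts a path or a binary file object; BytesIO stands in for a file
buffer = io.BytesIO()
ET.ElementTree(order).write(buffer, encoding='utf-8', xml_declaration=True)
payload = buffer.getvalue()

# Round trip: parse the bytes back and confirm the content survived
parsed = ET.fromstring(payload)
print(payload.decode('utf-8'))
```

Because escaping is handled by the serializer, text and attribute values should be assigned as plain strings rather than pre-escaped.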

iterparse: Streaming Large XML Files

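For large XML files (tens of megabytes or more), loading the whole document into memory with parse() is expensive. iterparse() streams the file instead, emitting events as elements are encountered (similar to event-driven SAX parsing), so you can process and discard elements as you go: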

python

import xml.etree.ElementTree as ET

def process_large_feed(filepath):
    """Process a large RSS feed without loading it all into memory."""
    articles = []

    for event, elem in ET.iterparse(filepath, events=('end',)):
        if elem.tag == 'item':
            articles.append({
                'title': elem.findtext('title', ''),
                'link': elem.findtext('link', ''),
                'pub_date': elem.findtext('pubDate', ''),
            })
            # Key step: clear the element after processing to free memory
            elem.clear()

        if len(articles) >= 1000:
            yield from articles
            articles.clear()

    yield from articles  # yield the remaining items

for article in process_large_feed('large_feed.xml'):
    print(article['title'])

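Calling elem.clear() after processing each element is what keeps memory usage under control as the file grows. Without it, ElementTree accumulates every parsed element in memory and the benefit of streaming is lost. Note that a cleared element stays attached to its parent as an empty shell, so memory still grows slightly with the number of items.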
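
One possible refinement is to detach each finished element from its parent instead of only clearing it. The standard library has no getparent(), so this sketch tracks open elements on a stack using both 'start' and 'end' events. iter_items is a hypothetical name, and the in-memory sample stands in for a large file on disk.

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source):
    """Yield one dict per <item>, detaching each element once it is consumed."""
    open_elements = []  # stack of elements whose end tag has not been seen yet

    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            open_elements.append(elem)
            continue

        open_elements.pop()  # elem has just been closed
        if elem.tag == 'item':
            yield {
                'title': elem.findtext('title', default=''),
                'link': elem.findtext('link', default=''),
            }
            # The top of the stack is now elem's parent; removing elem from it
            # lets the whole subtree be garbage collected
            if open_elements:
                open_elements[-1].remove(elem)

sample = io.BytesIO(
    b'<rss><channel><title>Blog</title>'
    b'<item><title>A</title><link>u1</link></item>'
    b'<item><title>B</title><link>u2</link></item>'
    b'</channel></rss>'
)
titles = [article['title'] for article in iter_items(sample)]
print(titles)
```

Yielding one item at a time also removes the need for an intermediate batch list, since the caller pulls items only as fast as it consumes them.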

lxml: When You Need More

lxml is a fast C-based XML library that extends the ElementTree API with full XPath 1.0 support, XSD schema validation, and XSLT transformations. Install it with pip install lxml:

python

from lxml import etree

# lxml uses the same API as ElementTree in most cases
root = etree.fromstring(rss_xml.encode())  # lxml rejects a str that carries an encoding declaration, so pass bytes

# Full XPath 1.0, far more powerful than ElementTree's subset
items = root.xpath('//item[position() <= 2]/title/text()')
print(items)  # ['Understanding Database Indexes', 'REST API Design Patterns']

# XPath with a predicate: select items in a specific category
db_items = root.xpath('//item[category="Database"]/title/text()')
print(db_items)  # ['Understanding Database Indexes']

# XSD schema validation
xsd_doc = etree.parse('schema.xsd')
schema = etree.XMLSchema(xsd_doc)
xml_doc = etree.parse('data.xml')

if schema.validate(xml_doc):
    print("Valid!")
else:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
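
Before adding the dependency, it may be worth checking whether ElementTree's XPath subset already covers the query. The sketch below runs the category filter with the standard library alone, on a shortened, made-up copy of the feed; functions such as text() and contains() are the point where lxml's xpath() becomes necessary.

```python
import xml.etree.ElementTree as ET

feed = ET.fromstring(
    '<rss><channel>'
    '<item><title>Understanding Database Indexes</title>'
    '<category>Database</category></item>'
    '<item><title>REST API Design Patterns</title>'
    '<category>API</category></item>'
    '</channel></rss>'
)

# Child-text predicate: items whose <category> text equals "Database"
db_titles = [
    el.text for el in feed.findall(".//item[category='Database']/title")
]

# Positional predicate: the last <item> under its parent
last_title = feed.findtext('.//item[last()]/title')

print(db_titles)
print(last_title)
```

The path returns elements rather than strings, which is why the text is read from each element afterwards.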

Useful Tools for Working with XML

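When building a Python XML integration, these browser tools help on the data side: an XML formatter to make raw API responses readable, an XML validator to check well-formedness before wiring up a parser, and XML to JSON when you would rather work with dictionaries than elements.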

Wrapping Up

Python's xml.etree.ElementTree handles most real-world XML scenarios: use find() and findall() with a namespace map for SOAP and namespaced feeds, use findtext() with a default to avoid AttributeError on missing elements, and use iterparse() for large files that cannot be loaded into memory at once. Switch to lxml when you need XSD validation or the full XPath expression language. Everything else is well within reach of the standard library.
