Parsing XML in Python — ElementTree, lxml, and Real-World Patterns

Python's standard library ships with a solid XML parser — no pip install required. xml.etree.ElementTree handles the vast majority of real-world XML: RSS feeds, SOAP responses, configuration files, Android resource files, Maven POMs. You only need to reach for lxml when you hit XSD schema validation, complex XPath, or truly massive files. Let's go through both, with real examples.

ElementTree Basics — Parsing from a String or File

The xml.etree.ElementTree module gives you two entry points: fromstring() for parsing XML strings, and parse() for reading directly from a file. Here's a practical example using an RSS feed structure:

python

import xml.etree.ElementTree as ET

rss_xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Engineering Blog</title>
    <link>https://blog.example.com</link>
    <description>Articles for developers</description>
    <item>
      <title>Understanding Database Indexes</title>
      <link>https://blog.example.com/db-indexes</link>
      <pubDate>Mon, 15 Jan 2024 09:00:00 GMT</pubDate>
      <category>Database</category>
    </item>
    <item>
      <title>REST API Design Patterns</title>
      <link>https://blog.example.com/rest-patterns</link>
      <pubDate>Wed, 17 Jan 2024 09:00:00 GMT</pubDate>
      <category>API</category>
    </item>
  </channel>
</rss>"""

# Parse from a string
root = ET.fromstring(rss_xml)

# Parse from a file (alternative)
# tree = ET.parse('feed.xml')
# root = tree.getroot()

print(root.tag)           # rss
print(root.attrib)        # {'version': '2.0'}

channel = root.find('channel')
print(channel.find('title').text)  # Engineering Blog

find, findall, and findtext — Searching the Tree

These three methods are your primary tools for extracting data. They all accept a simple path expression (like a limited XPath) to navigate the element tree:

python

import xml.etree.ElementTree as ET

root = ET.fromstring(rss_xml)
channel = root.find('channel')

# find() — returns the first matching element, or None
first_item = channel.find('item')
print(first_item.find('title').text)  # Understanding Database Indexes

# findall() — returns a list of all matching elements
items = channel.findall('item')
print(len(items))  # 2

for item in items:
    title = item.findtext('title')      # findtext() returns .text directly
    link = item.findtext('link')
    pub_date = item.findtext('pubDate')
    print(f"{title} — {pub_date}")

# Nested path with '/'
all_titles = channel.findall('item/title')
print([el.text for el in all_titles])
# ['Understanding Database Indexes', 'REST API Design Patterns']

# findtext() with a default (avoids AttributeError on missing elements)
author = channel.findtext('item/author', default='Unknown Author')
print(author)  # Unknown Author

Use findtext() with a default. If you use find().text and the element doesn't exist, find() returns None and .text raises an AttributeError. findtext('tag', default='') handles missing elements gracefully — much safer when parsing XML from external sources.

Reading Attributes

python

import xml.etree.ElementTree as ET

xml_str = """<catalog>
  <product id="P001" featured="true">
    <name>Mechanical Keyboard</name>
    <price currency="USD">189.00</price>
  </product>
  <product id="P002" featured="false">
    <name>USB-C Hub</name>
    <price currency="USD">49.99</price>
  </product>
</catalog>"""

root = ET.fromstring(xml_str)

for product in root.findall('product'):
    product_id = product.get('id')           # get() for attributes
    featured = product.get('featured', 'false')  # with default
    name = product.findtext('name')
    price_el = product.find('price')
    price = float(price_el.text)
    currency = price_el.get('currency')

    print(f"{product_id}: {name} — {currency} {price} (featured: {featured})")

Handling XML Namespaces

Namespaces are the part of XML parsing that makes most developers groan. In ElementTree, namespace URIs appear in curly braces in tag names: {http://...}tagname. Here's how to deal with them cleanly:

python

import xml.etree.ElementTree as ET

soap_xml = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:inv="http://www.example.com/invoice">
  <soap:Body>
    <inv:GetInvoiceResponse>
      <inv:InvoiceId>INV-2024-0042</inv:InvoiceId>
      <inv:Amount currency="EUR">1250.00</inv:Amount>
      <inv:Status>Paid</inv:Status>
    </inv:GetInvoiceResponse>
  </soap:Body>
</soap:Envelope>"""

root = ET.fromstring(soap_xml)

# ElementTree expands namespace prefixes to URIs in curly braces
# You can define a namespace map for cleaner XPath-style searches
ns = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'inv': 'http://www.example.com/invoice'
}

# Use the namespace map in find/findall
invoice_id = root.find('.//inv:InvoiceId', ns).text
amount_el = root.find('.//inv:Amount', ns)
status = root.findtext('.//inv:Status', namespaces=ns)

print(invoice_id)                    # INV-2024-0042
print(amount_el.text)                # 1250.00
print(amount_el.get('currency'))     # EUR
print(status)                        # Paid

Building XML Programmatically

ElementTree also lets you construct XML from scratch — useful when you need to build SOAP requests or generate XML output:

python

import xml.etree.ElementTree as ET

# Build an order document
order = ET.Element('order', id='ORD-9981', status='pending')

customer = ET.SubElement(order, 'customer')
ET.SubElement(customer, 'name').text = 'Jane Smith'
ET.SubElement(customer, 'email').text = '[email protected]'

items = ET.SubElement(order, 'items')
for product_id, name, qty, price in [
    ('P001', 'Mechanical Keyboard', 1, 189.00),
    ('P002', 'USB-C Hub', 2, 49.99),
]:
    item = ET.SubElement(items, 'item', productId=product_id, qty=str(qty))
    ET.SubElement(item, 'name').text = name
    ET.SubElement(item, 'price', currency='USD').text = str(price)

# Serialise to string
ET.indent(order, space='  ')  # Python 3.9+ — pretty-print in place
xml_output = ET.tostring(order, encoding='unicode', xml_declaration=True)
print(xml_output)

iterparse — Streaming Large XML Files

For large XML files (tens of MB or more), loading the whole document into memory with parse() is expensive. iterparse() streams the file (similar in spirit to the event-driven SAX approach) and fires events as elements are encountered, letting you process and discard elements as you go:

python

import xml.etree.ElementTree as ET

def process_large_feed(filepath):
    """Process a large RSS/Atom feed without loading it all into memory."""
    articles = []

    for event, elem in ET.iterparse(filepath, events=('end',)):
        if elem.tag == 'item':
            articles.append({
                'title': elem.findtext('title', ''),
                'link': elem.findtext('link', ''),
                'pub_date': elem.findtext('pubDate', ''),
            })
            # Critical: clear the element after processing to free memory
            elem.clear()

        if len(articles) >= 1000:
            yield from articles
            articles.clear()

    yield from articles  # yield any remaining

for article in process_large_feed('large_feed.xml'):
    print(article['title'])

The elem.clear() call after processing each element is the key to keeping memory usage flat regardless of file size. Without it, ElementTree accumulates all parsed elements in memory and you lose the benefit of streaming.

lxml — When You Need More Power

The lxml library is a fast C-based XML library that extends ElementTree's API with full XPath 1.0 support, XSD schema validation, and XSLT transforms. Install it with pip install lxml:

python

from lxml import etree

# lxml uses the same API as ElementTree in most cases
root = etree.fromstring(rss_xml.encode())  # lxml needs bytes, not str

# Full XPath 1.0 — much more powerful than ElementTree's subset
items = root.xpath('//item[position() <= 2]/title/text()')
print(items)  # ['Understanding Database Indexes', 'REST API Design Patterns']

# XPath with predicates — get items from a specific category
db_items = root.xpath('//item[category="Database"]/title/text()')
print(db_items)  # ['Understanding Database Indexes']

# XSD schema validation
xsd_doc = etree.parse('schema.xsd')
schema = etree.XMLSchema(xsd_doc)
xml_doc = etree.parse('data.xml')

if schema.validate(xml_doc):
    print("Valid!")
else:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")

Useful Tools for XML Work

When building Python XML integrations, these browser tools help with the data side: XML Formatter to readable-ify raw API responses, XML Validator to check well-formedness before wiring up your parser, and XML to JSON when you'd rather work with dicts instead of elements.

Wrapping Up

Python's xml.etree.ElementTree covers most real-world XML scenarios: use find() and findall() with a namespace map for SOAP and namespaced feeds, findtext() with defaults to avoid AttributeError on missing elements, and iterparse() for large files you can't load into memory all at once. Reach for lxml when you need XSD validation or the full XPath expression language. The standard library handles everything else just fine.

← All XML articles Browse all categories →