Python's standard library ships with a solid XML parser — no pip install required.
xml.etree.ElementTree handles the vast majority of real-world XML:
RSS feeds,
SOAP responses, configuration files, Android resource files, Maven POMs. You only need to
reach for lxml
when you hit XSD
schema validation, complex XPath, or truly massive files. Let's go through both, with real examples.
ElementTree Basics — Parsing from a String or File
The
xml.etree.ElementTree
module gives you two entry points: fromstring() for parsing XML strings, and
parse() for reading directly from a file. Here's a practical example using
an RSS feed structure:
import xml.etree.ElementTree as ET
rss_xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Engineering Blog</title>
<link>https://blog.example.com</link>
<description>Articles for developers</description>
<item>
<title>Understanding Database Indexes</title>
<link>https://blog.example.com/db-indexes</link>
<pubDate>Mon, 15 Jan 2024 09:00:00 GMT</pubDate>
<category>Database</category>
</item>
<item>
<title>REST API Design Patterns</title>
<link>https://blog.example.com/rest-patterns</link>
<pubDate>Wed, 17 Jan 2024 09:00:00 GMT</pubDate>
<category>API</category>
</item>
</channel>
</rss>"""
# Parse from a string
root = ET.fromstring(rss_xml)
# Parse from a file (alternative)
# tree = ET.parse('feed.xml')
# root = tree.getroot()
print(root.tag) # rss
print(root.attrib) # {'version': '2.0'}
channel = root.find('channel')
print(channel.find('title').text) # Engineering Blogfind, findall, and findtext — Searching the Tree
These three methods are your primary tools for extracting data. They all accept a simple path expression (like a limited XPath) to navigate the element tree:
import xml.etree.ElementTree as ET
root = ET.fromstring(rss_xml)
channel = root.find('channel')
# find() — returns the first matching element, or None
first_item = channel.find('item')
print(first_item.find('title').text) # Understanding Database Indexes
# findall() — returns a list of all matching elements
items = channel.findall('item')
print(len(items)) # 2
for item in items:
title = item.findtext('title') # findtext() returns .text directly
link = item.findtext('link')
pub_date = item.findtext('pubDate')
print(f"{title} — {pub_date}")
# Nested path with '/'
all_titles = channel.findall('item/title')
print([el.text for el in all_titles])
# ['Understanding Database Indexes', 'REST API Design Patterns']
# findtext() with a default (avoids AttributeError on missing elements)
author = channel.findtext('item/author', default='Unknown Author')
print(author) # Unknown Authorfindtext() with a default. If you use find().text
and the element doesn't exist, find() returns None and .text
raises an AttributeError. findtext('tag', default='') handles missing
elements gracefully — much safer when parsing XML from external sources.Reading Attributes
import xml.etree.ElementTree as ET
xml_str = """<catalog>
<product id="P001" featured="true">
<name>Mechanical Keyboard</name>
<price currency="USD">189.00</price>
</product>
<product id="P002" featured="false">
<name>USB-C Hub</name>
<price currency="USD">49.99</price>
</product>
</catalog>"""
root = ET.fromstring(xml_str)
for product in root.findall('product'):
product_id = product.get('id') # get() for attributes
featured = product.get('featured', 'false') # with default
name = product.findtext('name')
price_el = product.find('price')
price = float(price_el.text)
currency = price_el.get('currency')
print(f"{product_id}: {name} — {currency} {price} (featured: {featured})")Handling XML Namespaces
Namespaces
are the part of XML parsing that makes most developers groan. In ElementTree,
namespace URIs appear in curly braces in tag names: {http://...}tagname. Here's how
to deal with them cleanly:
import xml.etree.ElementTree as ET
soap_xml = """<?xml version="1.0"?>
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:inv="http://www.example.com/invoice">
<soap:Body>
<inv:GetInvoiceResponse>
<inv:InvoiceId>INV-2024-0042</inv:InvoiceId>
<inv:Amount currency="EUR">1250.00</inv:Amount>
<inv:Status>Paid</inv:Status>
</inv:GetInvoiceResponse>
</soap:Body>
</soap:Envelope>"""
root = ET.fromstring(soap_xml)
# ElementTree expands namespace prefixes to URIs in curly braces
# You can define a namespace map for cleaner XPath-style searches
ns = {
'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
'inv': 'http://www.example.com/invoice'
}
# Use the namespace map in find/findall
invoice_id = root.find('.//inv:InvoiceId', ns).text
amount_el = root.find('.//inv:Amount', ns)
status = root.findtext('.//inv:Status', namespaces=ns)
print(invoice_id) # INV-2024-0042
print(amount_el.text) # 1250.00
print(amount_el.get('currency')) # EUR
print(status) # PaidBuilding XML Programmatically
ElementTree also lets you construct XML from scratch — useful when you need to build SOAP requests or generate XML output:
import xml.etree.ElementTree as ET
# Build an order document
order = ET.Element('order', id='ORD-9981', status='pending')
customer = ET.SubElement(order, 'customer')
ET.SubElement(customer, 'name').text = 'Jane Smith'
ET.SubElement(customer, 'email').text = '[email protected]'
items = ET.SubElement(order, 'items')
for product_id, name, qty, price in [
('P001', 'Mechanical Keyboard', 1, 189.00),
('P002', 'USB-C Hub', 2, 49.99),
]:
item = ET.SubElement(items, 'item', productId=product_id, qty=str(qty))
ET.SubElement(item, 'name').text = name
ET.SubElement(item, 'price', currency='USD').text = str(price)
# Serialise to string
ET.indent(order, space=' ') # Python 3.9+ — pretty-print in place
xml_output = ET.tostring(order, encoding='unicode', xml_declaration=True)
print(xml_output)iterparse — Streaming Large XML Files
For large XML files (tens of MB or more), loading the whole document into memory with
parse() is expensive.
iterparse()
streams the file (similar in spirit to the event-driven SAX approach) and fires events as elements are
encountered, letting you process and discard elements as you go:
import xml.etree.ElementTree as ET
def process_large_feed(filepath):
"""Process a large RSS/Atom feed without loading it all into memory."""
articles = []
for event, elem in ET.iterparse(filepath, events=('end',)):
if elem.tag == 'item':
articles.append({
'title': elem.findtext('title', ''),
'link': elem.findtext('link', ''),
'pub_date': elem.findtext('pubDate', ''),
})
# Critical: clear the element after processing to free memory
elem.clear()
if len(articles) >= 1000:
yield from articles
articles.clear()
yield from articles # yield any remaining
for article in process_large_feed('large_feed.xml'):
print(article['title'])The elem.clear() call after processing each element is the key to keeping
memory usage flat regardless of file size. Without it, ElementTree accumulates all parsed elements
in memory and you lose the benefit of streaming.
lxml — When You Need More Power
The
lxml
library is a fast C-based XML library that extends ElementTree's API with full
XPath 1.0 support,
XSD schema validation, and XSLT transforms. Install it with pip install lxml:
from lxml import etree
# lxml uses the same API as ElementTree in most cases
root = etree.fromstring(rss_xml.encode()) # lxml needs bytes, not str
# Full XPath 1.0 — much more powerful than ElementTree's subset
items = root.xpath('//item[position() <= 2]/title/text()')
print(items) # ['Understanding Database Indexes', 'REST API Design Patterns']
# XPath with predicates — get items from a specific category
db_items = root.xpath('//item[category="Database"]/title/text()')
print(db_items) # ['Understanding Database Indexes']
# XSD schema validation
xsd_doc = etree.parse('schema.xsd')
schema = etree.XMLSchema(xsd_doc)
xml_doc = etree.parse('data.xml')
if schema.validate(xml_doc):
print("Valid!")
else:
for error in schema.error_log:
print(f"Line {error.line}: {error.message}")Useful Tools for XML Work
When building Python XML integrations, these browser tools help with the data side: XML Formatter to readable-ify raw API responses, XML Validator to check well-formedness before wiring up your parser, and XML to JSON when you'd rather work with dicts instead of elements.
Wrapping Up
Python's xml.etree.ElementTree covers most real-world XML scenarios:
use find() and findall() with a namespace map for SOAP and namespaced
feeds, findtext() with defaults to avoid AttributeError on missing
elements, and iterparse() for large files you can't load into memory all at once.
Reach for lxml when you need XSD validation or the full XPath expression language.
The standard library handles everything else just fine.