HTML to XML Converter
Paste loose HTML. Get well-formed XML back.
What this tool does
If you have ever had to feed a copy-pasted HTML snippet into an XSLT pipeline or a strict XML parser, you already know the pain. Browsers are famously forgiving — the HTML parser in the WHATWG spec will silently close your <li> tags, accept unquoted attributes, and shrug at a missing </p>. An XML parser will not. Feed it the same markup and you get a parse error on line 3. This tool takes real-world HTML — the kind you actually paste from a CMS or a legacy template — and turns it into XML that a standards-compliant parser will accept on the first try.
It is more than just closing tags. The converter handles the full set of things that make HTML loose and XML strict: void elements like <br>, <img>, and <hr> get self-closed as <br/>; boolean attributes like checked and disabled are expanded to checked="checked"; unquoted attribute values get wrapped in double quotes; tag names are lowercased for consistency; and the named HTML entities that XML does not know about (&nbsp;, &mdash;, &pound;, &times;, and so on) get converted to numeric character references like &#160; that every XML parser understands.
Comments come through intact. CDATA sections inside <script> and <style> are preserved. The output is pretty-printed so you can actually read it, and it is well-formed, so it will parse cleanly as an XML document or slot straight into an XHTML 1.0 workflow. If what you really want is XHTML Strict, this gets you most of the way there: you still need to add the doctype and namespace at the top yourself.
How to use it
Three steps. Works the same whether you paste a single paragraph or a whole page template.
Paste your HTML (or try the sample)
Drop your HTML into the left editor as-is. Unclosed tags, bare boolean attributes, unquoted values, void elements without self-close — all fine. Click Load Sample if you want to see a realistic, messy example first.
You do not need to hand-fix anything. The whole point of this tool is that it does the cleanup for you. Paste it exactly the way it came out of your CMS, Word doc, or legacy template.
Hit Convert
Click the green Convert button. The tool parses your HTML with a forgiving parser, then re-serialises it through an XML writer so every tag is closed, every attribute is quoted, and every entity is legal XML.
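If you are curious what "forgiving parser in, XML writer out" looks like in code, here is a minimal sketch using only the Python standard library. It is not this tool's actual implementation: the names HtmlToXml, html_to_xml, VOID, and IMPLIED_END are made up for illustration, the implied-close rules cover only <li> and <p>, and the sketch assumes a single root element and drops comments and doctypes.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Tiny, illustrative subset of the real rules: a new <li> closes an
# open <li>, and a new <p> closes an open <p>.
IMPLIED_END = {"li": {"li"}, "p": {"p"}}

class HtmlToXml(HTMLParser):
    """Tokenise loose HTML and rebuild it as an ElementTree."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into real
        # characters before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def _close_top(self):
        self.builder.end(self.open_tags.pop())

    def handle_starttag(self, tag, attrs):
        if self.open_tags and self.open_tags[-1] in IMPLIED_END.get(tag, ()):
            self._close_top()
        # A bare boolean attribute arrives with value None: expand it.
        xml_attrs = {n: (n if v is None else v) for n, v in attrs}
        self.builder.start(tag, xml_attrs)
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close everything that was left open inside this element.
        while self.open_tags:
            closed = self.open_tags[-1]
            self._close_top()
            if closed == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # text outside the root has nowhere to go
            self.builder.data(data)

    def finish(self):
        while self.open_tags:
            self._close_top()
        return self.builder.close()

def html_to_xml(html: str) -> str:
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.finish(), encoding="unicode")
```

Calling html_to_xml('<ul><li class=foo>one<li>two<br></ul>') returns markup that a strict XML parser accepts, with both list items closed and the attribute quoted.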
Copy the XML
The right panel fills with indented, well-formed XML. Copy it into your XSLT input, your DOM test fixture, your EPUB build, or wherever else you need markup that a strict parser will not choke on.
When this actually comes in handy
Feeding HTML into XSLT or an XML pipeline
You have an XSLT stylesheet that turns content into a PDF, a feed, or another format — but your input is HTML from a CMS, not XML. Convert first, transform second.
XHTML-strict validation
Legacy intranet that still demands XHTML 1.0 Strict? Paste the sloppy HTML your editor produced, copy out the XML, drop a doctype on top, and you are done.
Converting blog markup for EPUB / e-readers
EPUB is XHTML under the hood and will flat-out reject a missing </p>. Clean up a chapter’s worth of blog HTML in one paste before packaging.
Cleanup for archival systems
Feeding old HTML into an XML-based archive (DSpace, Fedora Commons, anything JATS-flavoured)? Strict schemas do not care that the browser rendered it fine — convert it first.
Common questions
What actually makes HTML "loose" and XML "strict"?
Three big things. First, HTML has void elements (<br>, <img>, <input>, <hr>, <meta>, <link>) that are not supposed to have a closing tag — XML requires either <br></br> or <br/>. Second, HTML lets you omit closing tags entirely for things like <li> and <p> because the parsing algorithm figures them out. Third, HTML accepts attributes without quotes (class=foo) and attributes without values (disabled). XML rejects all three.
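You can see the difference for yourself with any strict XML parser. The sketch below uses Python's built-in xml.etree.ElementTree; the helper name is_well_formed and the sample strings are ours, chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Unquoted attribute, unclosed <li> tags, and a bare <br>: fine in a browser.
loose = '<ul><li class=foo>one<li>two<br></ul>'
# The same content with every tag closed and every value quoted.
strict = '<ul><li class="foo">one</li><li>two<br/></li></ul>'

loose_ok = is_well_formed(loose)
strict_ok = is_well_formed(strict)
```

Here loose_ok comes back False and strict_ok comes back True, even though a browser would render both strings the same way.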
What happens to boolean attributes like `checked` and `disabled`?
They get expanded into the XHTML form. <input type=checkbox checked disabled> becomes <input type="checkbox" checked="checked" disabled="disabled"/>. Every attribute ends up with a name AND a quoted value, which is what XML requires — there is no such thing as an attribute without a value in XML.
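As a rough illustration of that expansion (not the tool's own code), Python's html.parser reports a bare boolean attribute with a value of None, which makes the rewrite a one-line check. The names StartTagCollector, rewrite_start_tags, and VOID_ELEMENTS are hypothetical, and the void-element list is deliberately partial.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr

VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}  # partial list

class StartTagCollector(HTMLParser):
    """Collect each start tag, rewritten in XML-safe form."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        rendered = [tag]
        for name, value in attrs:
            if value is None:
                value = name  # checked -> checked="checked"
            # quoteattr adds the quotes and escapes &, <, and >.
            rendered.append(f"{name}={quoteattr(value)}")
        closer = "/>" if tag in VOID_ELEMENTS else ">"
        self.tags.append("<" + " ".join(rendered) + closer)

def rewrite_start_tags(html: str) -> list[str]:
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

tags = rewrite_start_tags("<input type=checkbox checked disabled>")
```

With the input above, tags holds a single entry: <input type="checkbox" checked="checked" disabled="disabled"/>.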
How are HTML entities like `&nbsp;` and `&mdash;` handled?
XML only knows five named entities out of the box: &amp;, &lt;, &gt;, &quot;, &apos;. Everything else, such as &nbsp;, &mdash;, &pound;, and &copy;, is converted to its numeric form (&#160;, &#8212;, &#163;, &#169;) so any XML parser will accept them without a DTD.
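For a feel of how a named-to-numeric rewrite can work, here is a small sketch built on Python's html.entities table. The function name to_numeric_entities and the regular expression are our own assumptions, and the sketch leaves unknown names untouched rather than guessing.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric_entities(text: str) -> str:
    """Rewrite named HTML entities as numeric character references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # already legal XML, or not a known name
        return f"&#{name2codepoint[name]};"
    return ENTITY.sub(replace, text)

converted = to_numeric_entities("Price:&nbsp;&pound;5 &mdash; &copy; Acme &amp; Co")
```

After this runs, converted is "Price:&#160;&#163;5 &#8212; &#169; Acme &amp; Co": the four HTML-only names are numeric and &amp; is left alone.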
Do comments survive the round trip?
Yes. <!-- flagged for review --> goes in, the same thing comes out. Comments are valid in both HTML and XML with the same syntax, so they pass through untouched — handy if you use them for editorial notes or build-time markers.
Does it lowercase tag and attribute names?
Yes, by default. <DIV Class="Foo"> becomes <div class="Foo">. Tag and attribute names are lowercased to match the XHTML 1.0 convention. Attribute VALUES are left alone — we do not touch the content you put inside quotes, because that is your data.
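Many forgiving HTML parsers normalise case the same way. As one illustration, Python's html.parser lowercases tag and attribute names before your handler sees them and leaves values untouched. The class name NameRecorder below is hypothetical.

```python
from html.parser import HTMLParser

class NameRecorder(HTMLParser):
    """Record each start tag exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; values are as written.
        self.seen.append((tag, attrs))

recorder = NameRecorder()
recorder.feed('<DIV Class="Foo">')
recorder.close()
```

Here recorder.seen ends up as [('div', [('class', 'Foo')])], so the name is lowercased and the value keeps its capital F.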
When is this not enough — what cases does the tool NOT handle?
Arbitrary <script> content is preserved as text, but the tool does not try to rewrite JavaScript to be XML-safe. If your JS uses bare < or & characters you will need to wrap it in <![CDATA[...]]> yourself (we preserve existing CDATA). Document fragments are fine; we do not synthesise a <?xml ?> declaration or a doctype — add those yourself if your downstream consumer wants them.
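If you want to do the CDATA wrapping yourself before pasting, a helper along these lines is one way to go about it. This is a sketch under our own assumptions: wrap_in_cdata is a hypothetical name, and the function only guards against the two characters mentioned above plus a literal "]]>" inside the script.

```python
import xml.etree.ElementTree as ET

def wrap_in_cdata(script_text: str) -> str:
    """Wrap script text in CDATA when it contains < or &."""
    if "<![CDATA[" in script_text:
        return script_text  # already wrapped: leave existing CDATA alone
    if "<" not in script_text and "&" not in script_text:
        return script_text  # nothing an XML parser would object to
    # "]]>" would end the section early, so split it across two sections.
    safe = script_text.replace("]]>", "]]]]><![CDATA[>")
    return f"<![CDATA[{safe}]]>"

js = "if (a < b && c) { run(); }"
element = ET.fromstring(f"<script>{wrap_in_cdata(js)}</script>")
```

A strict parser now accepts the script element, and element.text is the original JavaScript, character for character.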
Other tools you may need
HTML to XML is one piece of the puzzle. These tools pair well with it: