Loading DTDs using DOM in Python
I use Python's xml.dom.minidom to process XML, but it doesn't load DTDs.
In fact, it's worse than that. Entities just disappear in the processed output. Take times.xml:
    <?xml version="1.0"?>
    <!DOCTYPE times SYSTEM "times.dtd">
    <maths>
      <mn>2</mn>
      <mo>&times;</mo>
      <mn>3</mn>
      <mo>=</mo>
      <mn>6</mn>
    </maths>
and then times.py
    import sys
    import xml.dom.minidom

    sys.stdout.write(xml.dom.minidom.parse("times.xml").toxml())
> python times.py
2 3 = 6
I find the Python documentation for xml.dom quite daunting. From what I can tell you should be able to configure the whole experience—the parser, the entity resolver, one lump or two...
Until I work out how to do all that, my interim solution is to preprocess with xmllint to expand entities before calling minidom. Here's times2.py:
    import os, subprocess, sys, xml.dom.minidom

    # Expand entities with xmllint, writing the result to an intermediate file.
    cmd_fo = open("times_expanded.xml", "w")
    fail = subprocess.call("xmllint --loaddtd --noent " + "times.xml",
                           shell=True, stdout=cmd_fo, stderr=sys.stderr,
                           close_fds=(os.name=="posix"), universal_newlines=True)
    cmd_fo.close()

    # Parse the expanded file with minidom, as before.
    sys.stdout.write(xml.dom.minidom.parse("times_expanded.xml").toxml())
which results in
> python times2.py
2 × 3 = 6
Not very elegant though, is it?
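A slightly tidier variant is possible, as a sketch only: capture xmllint's output directly and hand it to minidom, which avoids both the intermediate file and the shell. This assumes Python 3.7 or later and xmllint on the PATH; the function name expand_and_parse is made up for illustration.

```python
import subprocess
import xml.dom.minidom


def expand_and_parse(xml_path):
    """Hypothetical helper: expand entities with xmllint, then parse with minidom."""
    # Passing a list avoids shell=True; check=True raises if xmllint fails.
    result = subprocess.run(
        ["xmllint", "--loaddtd", "--noent", xml_path],
        check=True,
        capture_output=True,
    )
    # xmllint writes the expanded document to stdout as bytes,
    # which parseString accepts directly.
    return xml.dom.minidom.parseString(result.stdout)
```

Calling expand_and_parse("times.xml").toxml() would then take the place of the last line of times2.py. It still depends on xmllint being installed, so it is no more portable than the original.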
Good news, everyone: there's an alternative solution over at Stack Overflow, which is to use lxml instead of xml.dom.minidom. It's still not perfect, though: lxml doesn't come with the standard Python distribution, so I had to use my package manager to install python-lxml.
This time with a times3.py
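    import sys
    from lxml import etree

    # load_dtd=True makes lxml read times.dtd so the entities are defined.
    parser = etree.XMLParser(load_dtd=True)
    doc_DOM = etree.parse("times.xml", parser=parser)
    # encoding="unicode" returns text rather than bytes, which Python 3
    # needs for the concatenation with '\n'.
    sys.stdout.write(etree.tostring(doc_DOM, encoding="unicode") + '\n')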
> python times3.py
2 × 3 = 6
I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.
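As for configuring the parser and the entity resolver with the standard library alone, here is a minimal sketch of one way it might look. It assumes Python 3, and that times.dtd sits next to times.xml and defines the times entity (for instance as <!ENTITY times "&#215;">). The names LocalDTDResolver and parse_with_dtd are hypothetical. Passing a SAX parser to minidom makes it build the tree through pulldom, and the SAX feature for external general entities, which has been off by default since Python 3.7.1 for security reasons, has to be switched back on, so this is only sensible for trusted input.

```python
import os
import xml.dom.minidom
import xml.sax
from xml.sax.handler import EntityResolver, feature_external_ges


class LocalDTDResolver(EntityResolver):
    """Hypothetical helper: look up system ids such as "times.dtd" in one directory."""

    def __init__(self, dtd_dir):
        self.dtd_dir = os.path.abspath(dtd_dir)

    def resolveEntity(self, publicId, systemId):
        # Returning a path string lets xml.sax open the file itself.
        return os.path.join(self.dtd_dir, systemId)


def parse_with_dtd(xml_path):
    """Parse xml_path with minidom, loading the external DTD so entities expand."""
    parser = xml.sax.make_parser()
    # Disabled by default since Python 3.7.1; enable only for trusted documents.
    parser.setFeature(feature_external_ges, True)
    # Resolve the DTD relative to the XML file rather than the working directory.
    parser.setEntityResolver(LocalDTDResolver(os.path.dirname(xml_path) or "."))
    return xml.dom.minidom.parse(xml_path, parser)
```

With that in place, parse_with_dtd("times.xml").toxml() could stand in for the parse call in times.py. One difference to expect: a tree built this way may not carry the DOCTYPE node, so the serialised output can differ from the plain minidom version in that respect.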