Loading DTDs using DOM in Python

Posted on 15 Feb 2010

I use Python's xml.dom.minidom to process XML, but I'm a bit of a neophyte. I find it a really excellent approach for generating Fortran interfaces from our XML interface-specifications, but one thing's pretty inconvenient: entities don't get resolved, and we use a lot of entities.

In fact, it's worse than that. Entities just disappear in the processed DOM tree.

Given times.dtd


and times.xml

<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>

and then times.py

import sys, xml.dom.minidom
sys.stdout.write(xml.dom.minidom.parse("times.xml").toxml())

you get

> python times.py

  2
  
  3
  =
  6

Belgium! &times; has turned into nothing.
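The same effect can be reproduced without any files on disk. This is only a sketch: it assumes the document refers to the entity as &times; and it parses from a string, so no times.dtd needs to exist. The names source and mo are mine, purely for illustration.

```python
import xml.dom.minidom

# Same shape as times.xml, held in a string (the entity reference is assumed)
source = """<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
</maths>"""

doc = xml.dom.minidom.parseString(source)
mo = doc.getElementsByTagName("mo")[0]

# Inspect what is left of the element that held the entity reference
print(mo.hasChildNodes())
print(mo.toxml())
```

As far as I can tell, the parser treats the reference as a skipped entity because the DOCTYPE points at an external subset it never reads, and minidom keeps no record of it.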

I find the Python documentation for xml.dom quite daunting. From what I can tell, you should be able to configure the whole experience: the parser, the entity resolver, one lump or two...

Until I work out how to do all that, my interim solution is to preprocess the file with xmllint to expand the entities before calling minidom. Here's times2.py

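import subprocess, sys, xml.dom.minidom

# Expand the entities with xmllint and save the result in a scratch file
with open("times_expanded.xml", "w") as cmd_fo:
    fail = subprocess.call(["xmllint", "--loaddtd", "--noent", "times.xml"],
                           stdout=cmd_fo)

# A non-zero exit status means xmllint could not process the file
if fail:
    sys.exit("xmllint failed with exit status %d" % fail)

# The expanded copy contains no entity references, so minidom keeps everything
sys.stdout.write(xml.dom.minidom.parse("times_expanded.xml").toxml())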

which results in

> python times2.py

  2
  ×
  3
  =
  6

Not very elegant though, is it?

Good news, everyone: there's an alternative solution over at Stack Overflow, though it's still not perfect. Use lxml instead of xml.dom.minidom. Unfortunately lxml isn't part of the standard Python distribution, so I had to install python-lxml with my package manager.

This time with a times3.py

import sys
from lxml import etree

parser = etree.XMLParser(load_dtd=True)
doc_DOM = etree.parse("times.xml", parser=parser)
# encoding="unicode" makes tostring return text rather than bytes
sys.stdout.write(etree.tostring(doc_DOM, encoding="unicode") + '\n')

we get

> python times3.py

  2
  ×
  3
  =
  6

I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.
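For what it's worth, a standard-library-only route may also be possible: minidom does expand entities declared in the internal DTD subset, so the DTD text can be copied into the DOCTYPE before parsing. What follows is a rough sketch, not a tested replacement for xmllint. The function parse_with_inlined_dtd is a hypothetical helper of mine, and the one-line DTD is an assumed stand-in for times.dtd.

```python
import re
import xml.dom.minidom

def parse_with_inlined_dtd(xml_text, dtd_text):
    """Parse xml_text with minidom after copying dtd_text into the DOCTYPE."""
    # Rewrite <!DOCTYPE name SYSTEM "file.dtd"> as <!DOCTYPE name [ ...dtd... ]>
    pattern = r'<!DOCTYPE\s+(\S+)\s+SYSTEM\s+"[^"]*"\s*>'
    inlined = re.sub(
        pattern,
        lambda m: "<!DOCTYPE %s [\n%s\n]>" % (m.group(1), dtd_text),
        xml_text,
        count=1,
    )
    return xml.dom.minidom.parseString(inlined)

# Assumed DTD content: a single entity declaration for the multiplication sign
dtd_text = '<!ENTITY times "&#215;">'

xml_text = """<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
</maths>"""

doc = parse_with_inlined_dtd(xml_text, dtd_text)
mo_text = doc.getElementsByTagName("mo")[0].firstChild.data
print(repr(mo_text))
```

The catch is that the internal subset is stricter than an external DTD: conditional sections are not allowed there, and parameter-entity references cannot appear inside declarations. A DTD that relies on either would need xmllint or lxml after all.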