Looking for source preservation features in XML libs

Grzegorz Adam Hankiewicz

Dec 28, 2004, 5:12:19 AM

to Python mailing list

Hi.

I'm looking for two specific features in XML libraries. One is to be
able to tell which source file line a tag starts and ends. Say, tag
<para> is located on line 34 column 7, and the matching </para> three
lines later on column 56.

Another feature is to be able to save the processed XML code in a way
that unmodified tags preserve the original indentation. Or in the worst
case, all indentation is lost, but I can control to some degree the
outlook of the final XML output.

I have looked at xml.dom.minidom, elementtree and gnosis and haven't
found any such features. Are there libs providing these?

--
Please don't send me private copies of your public answers. Thanks.

and-g...@doxdesk.com

Dec 28, 2004, 6:22:17 AM

Grzegorz Adam Hankiewicz <gra...@titanium.sabren.com> wrote:

> I have looked at xml.dom.minidom, elementtree and gnosis and haven't
> found any such features. Are there libs providing these?

pxdom (http://www.doxdesk.com/software/py/pxdom.html) has some of this,
but I think it's still way off what you're envisaging.

> One is to be able to tell which source file line a tag starts
> and ends.

You can get the file and line/column where a node begins in pxdom using
the non-standard property Node.pxdomLocation, which returns a DOM Level
3 DOMLocator object, e.g.:

    uri = node.pxdomLocation.uri
    line = node.pxdomLocation.lineNumber
    col = node.pxdomLocation.columnNumber

There is no way to get the location of an Element's end-tag, however,
except by guessing from the positions of adjacent nodes, which is kind
of cheating and probably not reliable.

SAX processors can in theory use Locator information too, but AFAIK (?)
this isn't currently implemented.
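
As a point of comparison, here is a minimal sketch of recording start- and
end-tag positions through SAX. It assumes a present-day Python standard
library, where the expat-based xml.sax reader hands the handler a locator
through setDocumentLocator; the handler class, the helper function and the
sample document are made up for illustration. Lines count from 1 and
columns from 0, and each position is where the tag's opening "<" sits.

```python
import xml.sax


class TagPositionHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records where each tag begins."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.positions = []  # (event, tag name, line, column)

    def setDocumentLocator(self, locator):
        # Called by the reader before parsing starts.
        self.locator = locator

    def _record(self, event, name):
        self.positions.append(
            (event, name,
             self.locator.getLineNumber(),
             self.locator.getColumnNumber()))

    def startElement(self, name, attrs):
        self._record("start", name)

    def endElement(self, name):
        self._record("end", name)


def tag_positions(xml_bytes):
    """Parse a byte string and return the recorded tag positions."""
    handler = TagPositionHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.positions


SAMPLE = b"<doc>\n  <para>hi\n  </para>\n</doc>"

for event, name, line, col in tag_positions(SAMPLE):
    print(event, name, line, col)
```

This only reports positions; it does not build a tree, so attaching the
numbers to DOM or ElementTree nodes would need extra code.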

> Another feature is to be able to save the processed XML code in a way
> that unmodified tags preserve the original indentation.

Do you mean whitespace *inside* the start-tag? I don't know of any XML
processor that will do anything but ignore whitespace here; in XML
terms it is utterly insignificant and there is no place to store the
information in the infoset or DOM properties.

pxdom will preserve the *order* of the attributes, but even that is not
required by any XML standard.

> Or in the worst case, all indentation is lost, but I can control to
> some degree the outlook of the final XML output.

Both the DOM Level 3 LS feature format-pretty-print and PyXML's
PrettyPrint influence whitespace in content. However, if you do want
control of whitespace inside the tags themselves, I don't know of any
XML tools that will do it. You might have to write your own serializer,
or hack it into a DOM implementation of your choice.
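
To illustrate control over whitespace in content (not inside tags), here
is a minimal sketch using the standard library's xml.dom.minidom and its
toprettyxml method, which is a different tool from the DOM Level 3 LS and
PyXML options named above. The helper name and the sample documents are
made up, and the single-line rendering of text-only elements is assumed
from recent Python versions.

```python
from xml.dom import minidom


def reindent(xml_text, indent="  "):
    """Hypothetical helper: re-serialize with a chosen indent string."""
    document = minidom.parseString(xml_text)
    return document.toprettyxml(indent=indent, newl="\n")


# Two-space indentation (the default chosen for this helper).
print(reindent("<doc><para>text</para><para>more</para></doc>"))

# Tab indentation instead.
print(reindent("<doc><para>text</para></doc>", indent="\t"))
```

Whitespace-only text nodes already present in the input are kept and
written out again, so this works best on input that has no indentation
of its own.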
--
Andrew Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/

Fredrik Lundh

Dec 28, 2004, 10:57:18 AM

to pytho...@python.org

Grzegorz Adam Hankiewicz wrote:

> I'm looking for two specific features in XML libraries. One is to be
> able to tell which source file line a tag starts and ends. Say, tag
> <para> is located on line 34 column 7, and the matching </para> three
> lines later on column 56.
>
> Another feature is to be able to save the processed XML code in a way
> that unmodified tags preserve the original indentation. Or in the worst
> case, all indentation is lost, but I can control to some degree the
> outlook of the final XML output.
>
> I have looked at xml.dom.minidom, elementtree and gnosis and haven't
> found any such features. Are there libs providing these?

here's a custom parser that adds a "lineno" attribute to element nodes:

    from elementtree import XMLTreeBuilder

    class MyParser(XMLTreeBuilder.FancyTreeBuilder):
        def start(self, elem):
            elem.lineno = self.lineno

    def parse(file):
        # feed one line at a time, and keep track of the line number
        lineno = 1
        parser = MyParser()
        for line in open(file).readlines():
            parser.lineno = lineno
            parser.feed(line)
            lineno = lineno + 1
        return parser.close()

    for elem in parse("samples/simple.xml").getiterator():
        print elem.tag, elem.lineno

(the FancyTreeBuilder is somewhat broken in 1.2.1 through 1.2.3, at least
if you're using Python 2.3 or later. or in other words, use ElementTree 1.2
or 1.2.4 if you want this to work).

the standard elementtree writer may modify the tags, but it preserves all
whitespace around them; depending on what you mean by "indentation",
that may or may not be what you want. (but if you want to preserve all
whitespace in an XML document, you shouldn't run it through an XML
parser...)
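
to make that distinction concrete, here is a minimal sketch using the
standard library's xml.etree.ElementTree, which is assumed to behave like
the elementtree package discussed above in this respect; the helper name
and the sample documents are made up. whitespace between tags survives a
round trip as text and tail data, while spacing inside a start-tag is not
kept.

```python
import xml.etree.ElementTree as ET


def roundtrip(xml_text):
    """Parse and immediately re-serialize a document."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


# Indentation between tags is kept as text/tail data.
between = "<doc>\n    <para>hi</para>\n</doc>"
print(roundtrip(between) == between)

# Extra whitespace inside the start-tag is not part of the tree.
inside = '<doc><para   a="1"\n      b="2">hi</para></doc>'
print(roundtrip(inside))
```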

</F>