REALLY simple xml reader

Simon Pickles

Jan 27, 2008, 12:35:16 PM

to pytho...@python.org

Hi

Can anyone suggest a really simple XML reader for python? I just want to
be able to do something like this:

xmlDoc = xml.open("file.xml")
element = xmlDoc.GetElement("foo/bar")

... to read the value of:

<foo>
<bar>42</bar>
</foo>

Thanks

Simon

--
Linux user #458601 - http://counter.li.org.

Diez B. Roggisch

Jan 27, 2008, 12:50:06 PM

Simon Pickles wrote:

> Hi
>
> Can anyone suggest a really simple XML reader for python? I just want to
> be able to do something like this:
>
> xmlDoc = xml.open("file.xml")
> element = xmlDoc.GetElement("foo/bar")
>
> ... to read the value of:
>
> <foo>
> <bar>42</bar>
> </foo>

Since Python 2.5, the ElementTree module has been available in the
standard library. Before 2.5, you can of course install it separately.

Your code then would look like this:

import xml.etree.ElementTree as et

doc = """

<foo>
<bar>42</bar>
</foo>
"""

root = et.fromstring(doc)

for bar in root.findall("bar"):
    print bar.text
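
For readers on Python 3, the same idea can be sketched as below. This is only an illustration, assuming the small <foo>/<bar> document from the question; it uses findtext() to fetch a single value in addition to the findall() loop above.

```python
import xml.etree.ElementTree as ET

doc = """\
<foo>
  <bar>42</bar>
</foo>
"""

# fromstring() returns the root element, which is <foo> here.
root = ET.fromstring(doc)

# Paths are relative to the element they are called on, so "bar" means
# "a direct child of <foo> named bar".
value = root.findtext("bar")
all_values = [bar.text for bar in root.findall("bar")]

print(value)
```

Note that the text comes back as a string; converting it to a number is left to the caller.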

Diez

Mark Tolonen

Jan 27, 2008, 12:55:14 PM

"Simon Pickles" <sipi...@hotmail.com> wrote in message
news:mailman.1148.120145...@python.org...

>>> from xml.etree import ElementTree as ET
>>> tree=ET.parse('file.xml')
>>> tree.find('bar').text
'42'
>>>
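
A minimal Python 3 sketch of the file-based variant, wrapped in a function that is close to the API asked for in the original question. The helper name read_value is hypothetical, and the temporary file.xml is created only so that the example is self-contained.

```python
import os
import tempfile
from xml.etree import ElementTree as ET


def read_value(filename, path):
    """Return the text of the first element matching path, or None."""
    tree = ET.parse(filename)
    # The path is relative to the root element. For <foo><bar>..</bar></foo>
    # the path to use is therefore "bar", not "foo/bar".
    return tree.getroot().findtext(path)


# Write a throwaway file.xml so the sketch can run on its own.
with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "file.xml")
    with open(filename, "w", encoding="utf-8") as f:
        f.write("<foo><bar>42</bar></foo>")
    result = read_value(filename, "bar")
    missing = read_value(filename, "foo/bar")

print(result)
```

The second call returns None because the root element <foo> has no child that is itself named foo.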

--Mark

Navtej Singh

Jan 27, 2008, 1:40:40 PM

to Simon Pickles, pytho...@python.org

Check the implementation of the XMLNode class here:
http://hsivonen.iki.fi/group-feed/flickrapi.py

HTH
N

On Jan 27, 2008 11:05 PM, Simon Pickles <sipi...@hotmail.com> wrote:
> Hi
>
> Can anyone suggest a really simple XML reader for python? I just want to
> be able to do something like this:
>
> xmlDoc = xml.open("file.xml")
> element = xmlDoc.GetElement("foo/bar")
>
> ... to read the value of:
>
> <foo>
> <bar>42</bar>
> </foo>
>
>
> Thanks
>
> Simon
>
> --
> Linux user #458601 - http://counter.li.org.

Ricardo Aráoz

Jan 29, 2008, 7:33:14 AM

to Diez B. Roggisch, pytho...@python.org

What about :

doc = """
<moo>
<bar>99</bar>
</moo>
<foo>
<bar>42</bar>
</foo>
"""

Stefan Behnel

Jan 29, 2008, 6:41:14 AM

to rar...@bigfoot.com, Diez B. Roggisch

Ricardo Aráoz wrote:
> What about :
>
> doc = """
> <moo>
> <bar>99</bar>
> </moo>
> <foo>
> <bar>42</bar>
> </foo>
> """

That's not an XML document, so what about it?

Stefan

Ricardo Aráoz

Jan 29, 2008, 9:53:44 AM

to pytho...@python.org

> What about :
>
> doc = """
> <moo>
> <bar>99</bar>
> </moo>
> <foo>
> <bar>42</bar>
> </foo>
> """

That's not an XML document, so what about it?

Stefan

----------------------------------------------

Ok Stefan, I will pretend it was meant in good will.

I don't know zit about xml, but I might need to, and I am saving the
thread for when I need it. So I looked around and found some 'real'
XML document (see below). The question is, how to access <amount>s from
<debit>s (any category) but not <deposit>s.
My previous example was probably not stated properly. What I meant to
convey is two substructures (namespaces, or whatever you call them in
XML) that contain elements with the same name: <moo><bar> is not the
same as <foo><bar>, just as <debit><amount> is not the same as
<deposit><amount>.
The examples given by Diez and Mark, though useful, don't seem to
address the problem.
Thanks for your help.

doc = """
<?xml version="1.0"?>
<checkbook balance-start="2460.62">
<title>expenses: january 2002</title>

<debit category="clothes">
<amount>31.19</amount>
<date><year>2002</year><month>1</month><day>3</day></date>
<payto>Walking Store</payto>
<description>shoes</description>
</debit>

<deposit category="salary">
<amount>1549.58</amount>
<date><year>2002</year><month>1</month><day>7</day></date>
<payor>Bob's Bolts</payor>
</deposit>

<debit category="withdrawal">
<amount>40</amount>
<date><year>2002</year><month>1</month><day>8</day></date>
<description>pocket money</description>
</debit>

<debit category="medical" check="855">
<amount>188.20</amount>
<date><year>2002</year><month>1</month><day>8</day></date>
<payto>Boston Endodontics</payto>
<description>cavity</description>
</debit>

<debit category="supplies">
<amount>10.58</amount>
<date><year>2002</year><month>1</month><day>10</day></date>
<payto>Exxon Saugus</payto>
<description>gasoline</description>
</debit>

<debit category="car">
<amount>909.56</amount>
<date><year>2002</year><month>1</month><day>14</day></date>
<payto>Honda North</payto>
<description>car repairs</description>
</debit>

<debit category="food">
<amount>24.30</amount>
<date><year>2002</year><month>1</month><day>15</day></date>
<payto>Johnny Rockets</payto>
<description>lunch</description>
</debit>
</checkbook>
"""

Stefan Behnel

Jan 29, 2008, 9:17:55 AM

to rar...@bigfoot.com

Hi,

Ricardo Aráoz wrote:
> I don't know zit about xml, but I might need to, and I am saving the
> thread for when I need it. So I looked around and found some 'real'
> XML document (see below). The question is, how to access <amount>s from
> <debit>s (any category) but not <deposit>s.
>

> doc = """
> <?xml version="1.0"?>
> <checkbook balance-start="2460.62">
> <title>expenses: january 2002</title>
>
> <debit category="clothes">
> <amount>31.19</amount>
> <date><year>2002</year><month>1</month><day>3</day></date>
> <payto>Walking Store</payto>
> <description>shoes</description>
> </debit>
>
> <deposit category="salary">
> <amount>1549.58</amount>
> <date><year>2002</year><month>1</month><day>7</day></date>
> <payor>Bob's Bolts</payor>
> </deposit>

[...]
> </checkbook>
> """

Sure, no problem. Just use the XPath expression "//debit/amount", or maybe
"/checkbook/debit/amount", if you prefer. This is basically tree traversal,
so you can check the parents and the children as you see fit.
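
A hedged sketch of that path lookup in Python 3 with the standard library. The standard ElementTree implements only a subset of XPath, so the example uses the relative form ".//debit/amount"; a library with full XPath support, such as lxml, would be needed for absolute expressions like "//debit/amount". The document is a shortened version of the checkbook example.

```python
import xml.etree.ElementTree as ET

doc = """\
<?xml version="1.0"?>
<checkbook balance-start="2460.62">
  <debit category="clothes"><amount>31.19</amount></debit>
  <deposit category="salary"><amount>1549.58</amount></deposit>
  <debit category="withdrawal"><amount>40</amount></debit>
</checkbook>
"""

root = ET.fromstring(doc)

# ".//debit/amount" selects every <amount> that is a direct child of a
# <debit>, anywhere below the root. <deposit>/<amount> does not match.
debit_amounts = [a.text for a in root.findall(".//debit/amount")]

# Equivalent here, because the <debit> elements are children of the root.
same_amounts = [a.text for a in root.findall("debit/amount")]

print(debit_amounts)
```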

Stefan

Ivan Illarionov

Jan 30, 2008, 9:52:43 PM

>>> from xml.etree import ElementTree as et
>>> from decimal import Decimal
>>>
>>> root = et.parse('file/with/your.xml')
>>> debits = dict((debit.attrib['category'], Decimal(debit.find('amount').text)) for debit in root.findall('debit'))
>>>
>>> for cat, amount in debits.items():
...     print '%s: %s' % (cat, amount)
...
food: 24.30
car: 909.56
medical: 188.20
savings: 25
withdrawal: 40
supplies: 10.58
clothes: 31.19
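
The session above is Python 2. A rough Python 3 equivalent, assuming a shortened inline copy of the checkbook document instead of a file, might look like this. Keying the dict by category means two debits in the same category would overwrite each other, which is fine for this data but worth keeping in mind.

```python
from decimal import Decimal
from xml.etree import ElementTree as ET

doc = """\
<checkbook balance-start="2460.62">
  <debit category="clothes"><amount>31.19</amount></debit>
  <deposit category="salary"><amount>1549.58</amount></deposit>
  <debit category="withdrawal"><amount>40</amount></debit>
</checkbook>
"""

root = ET.fromstring(doc)

# Map each debit category to its amount; <deposit> elements are skipped
# because findall("debit") only matches children named "debit".
debits = {
    debit.attrib["category"]: Decimal(debit.find("amount").text)
    for debit in root.findall("debit")
}
total = sum(debits.values(), Decimal("0"))

for category, amount in debits.items():
    print("%s: %s" % (category, amount))
```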

Ricardo Aráoz

Jan 31, 2008, 6:38:51 AM

to Ivan Illarionov, pytho...@python.org

Thanks Ivan, it seems an elegant API, and easy to use.
I tried to play a little with it but unfortunately could not get it off
the ground. I kept getting
>>> root = et.fromstring(doc)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "E:\Python25\lib\xml\etree\ElementTree.py", line 963, in XML
    parser.feed(text)
  File "E:\Python25\lib\xml\etree\ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
ExpatError: XML or text declaration not at start of entity: line 2, column 0

But it's probably my lack of knowledge on the subject. Well, I guess
there is no free ride and I'll take a look at ElementTree as soon as I
have some spare time, looks promising.
One last question. Am I right to believe "debit" would be an "et" object
of the same class as "root"?
THX

Diez B. Roggisch

Jan 31, 2008, 6:00:43 AM

Ricardo Aráoz wrote:

That's a problem in your XML not being XML. Has nothing to do with
element-tree - as one sees from the error-message "ExpatError". If you
show it to us, we might see why.

> But it's probably my lack of knowledge on the subject. Well, I guess
> there is no free ride and I'll take a look at ElementTree as soon as I
> have some spare time, looks promising.
> One last question. Am I right to believe "debit" would be an "et" object
> of the same class as "root"?

Yes.
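
One nuance, sketched here for Python 3 under the assumption of a one-debit document: et.parse() returns an ElementTree wrapper rather than an element, so the "same class" statement holds between the root element and its descendants, not between the wrapper and an element.

```python
from io import StringIO
from xml.etree import ElementTree as ET

xml_text = (
    '<checkbook>'
    '<debit category="food"><amount>24.30</amount></debit>'
    '</checkbook>'
)

tree = ET.parse(StringIO(xml_text))  # ElementTree wrapper around the document
root = tree.getroot()                # the <checkbook> element
debit = root.find("debit")           # a child element

same_class = type(debit) is type(root)
tree_is_element = isinstance(tree, ET.Element)

print(same_class, tree_is_element)
```

With et.fromstring() there is no wrapper at all: the return value is the root element itself.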

Diez

Ricardo Aráoz

Jan 31, 2008, 7:39:16 AM

to Diez B. Roggisch, pytho...@python.org

Diez B. Roggisch wrote:
> Ricardo Aráoz wrote:

>> Thanks Ivan, it seems an elegant API, and easy to use.
>> I tried to play a little with it but unfortunately could not get it off
>> the ground. I kept getting
>> >>> root = et.fromstring(doc)
>> Traceback (most recent call last):
>>   File "<input>", line 1, in <module>
>>   File "E:\Python25\lib\xml\etree\ElementTree.py", line 963, in XML
>>     parser.feed(text)
>>   File "E:\Python25\lib\xml\etree\ElementTree.py", line 1245, in feed
>>     self._parser.Parse(data, 0)
>> ExpatError: XML or text declaration not at start of entity: line 2, column 0
>
> That's a problem in your XML not being XML. Has nothing to do with
> element-tree - as one sees from the error-message "ExpatError". If you
> show it to us, we might see why.
>

Sure,

I thought this was proper XML as it comes straight out from an O'Reilly
XML book.

Cheers

Diez B. Roggisch

Jan 31, 2008, 7:13:38 AM

Ricardo Aráoz wrote:

> Diez B. Roggisch wrote:
>> Ricardo Aráoz wrote:
>>> Thanks Ivan, it seems an elegant API, and easy to use.
>>> I tried to play a little with it but unfortunately could not get it off
>>> the ground. I kept getting
>>> >>> root = et.fromstring(doc)
>>> Traceback (most recent call last):
>>>   File "<input>", line 1, in <module>
>>>   File "E:\Python25\lib\xml\etree\ElementTree.py", line 963, in XML
>>>     parser.feed(text)
>>>   File "E:\Python25\lib\xml\etree\ElementTree.py", line 1245, in feed
>>>     self._parser.Parse(data, 0)
>>> ExpatError: XML or text declaration not at start of entity: line 2, column 0
>> That's a problem in your XML not being XML. Has nothing to do with
>> element-tree - as one sees from the error-message "ExpatError". If you
>> show it to us, we might see why.
>>
>
> Sure,
>
> doc = """
> <?xml version="1.0"?>

It's not allowed to have a newline before the <?xml ...>

Put it on the line above, and things will work.
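
A small Python 3 sketch of the failure and two ways around it, assuming a one-element document. In Python 3 the exception is exposed as ET.ParseError; the lstrip() call is just one possible workaround for strings you did not build yourself.

```python
import xml.etree.ElementTree as ET

# The triple-quoted string starts with a newline, so the declaration
# is on line 2 of the document.
bad = """
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>"""

# The backslash suppresses that first newline.
good = """\
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>"""

try:
    ET.fromstring(bad)
    bad_error = None
except ET.ParseError as exc:
    bad_error = str(exc)

root = ET.fromstring(good)

# Stripping leading whitespace also moves the declaration to the start.
root_from_stripped = ET.fromstring(bad.lstrip())

print(bad_error)
```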

Diez

Steve Holden

Jan 31, 2008, 7:41:57 AM

to pytho...@python.org

If you don't think that looks pretty enough just escape the first
newline in the string constant to have the parser ignore it:

doc = """\
<?xml version="1.0"?>

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Ben Finney

Jan 31, 2008, 8:40:01 AM

Steve Holden <st...@holdenweb.com> writes:

> Diez B. Roggisch wrote:
> > Ricardo Aráoz wrote:

> >> doc = """
> >> <?xml version="1.0"?>
> >
> > It's not allowed to have a newline before the <?xml ...>
> >
> > Put it on the line above, and things will work.
> >
> If you don't think that looks pretty enough just escape the first
> newline in the string constant to have the parser ignore it:

Quite apart from a human thinking it's pretty or not pretty, it's *not
valid XML* if the XML declaration isn't immediately at the start of
the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
parsers will (correctly) reject such a document.

> doc = """\
> <?xml version="1.0"?>

This is fine.

--
\ "True greatness is measured by how much freedom you give to |
`\ others, not by how much you can coerce others to do what you |
_o__) want." —Larry Wall |
Ben Finney

Steve Holden

Jan 31, 2008, 9:10:50 AM

to pytho...@python.org

Ben Finney wrote:
> Steve Holden <st...@holdenweb.com> writes:
>
>> Diez B. Roggisch wrote:
>>> Ricardo Aráoz wrote:
>>>> doc = """
>>>> <?xml version="1.0"?>
>>> It's not allowed to have a newline before the <?xml ...>
>>>
>>> Put it on the line above, and things will work.
>>>
>> If you don't think that looks pretty enough just escape the first
>> newline in the string constant to have the parser ignore it:
>
> Quite apart from a human thinking it's pretty or not pretty, it's *not
> valid XML* if the XML declaration isn't immediately at the start of
> the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
> parsers will (correctly) reject such a document.
>
>> doc = """\
>> <?xml version="1.0"?>
>
> This is fine.
>

Sure. The only difference in "prettiness" I was referring to was the
difference between

doc = """<?xml ...
<stuff on the left-hand margin>
...

and

doc = """\
<?xml ...
<stuff on the left-hand margin>
...

In other words, Python source-code prettiness.

Steven D'Aprano

Jan 31, 2008, 10:50:26 AM

On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:

> Quite apart from a human thinking it's pretty or not pretty, it's *not
> valid XML* if the XML declaration isn't immediately at the start of the
> document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
> parsers will (correctly) reject such a document.

You know, I'd really like to know what the designers were thinking when
they made this decision.

"You know Bob, XML just isn't hostile enough to anyone silly enough to
believe it's 'human-readable'. What can we do to make it more hostile?"

"Well Fred, how about making the XML declaration completely optional, so
you can leave it out and still be valid XML, but if you include it, you're
not allowed to precede it with semantically neutral whitespace?"

"I take my hat off to you."

This is legal XML:

"""<?xml version="1.0"?>
<greeting>Hello, world!</greeting>"""

and so is this:

"""
<greeting >Hello, world!</greeting >"""

but not this:

""" <?xml version="1.0"?>
<greeting>Hello, world!</greeting>"""

You can't get this sort of stuff except out of a committee.

--
Steven

Diez B. Roggisch

Jan 31, 2008, 11:35:39 AM

Steven D'Aprano wrote:

Do not forget to mention that trailing whitespace is evil as well!!!

And that for some reason some characters below \x20 aren't allowed in
XML even if it's legal utf-8 - for what reason I'd really like to know...
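
The control-character restriction can be observed with the standard library parser. This is a minimal Python 3 sketch; the helper name is_well_formed is made up for the illustration. Of the characters below U+0020, XML 1.0 allows only tab, line feed and carriage return.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


results = {
    "tab": is_well_formed("<a>\t</a>"),
    "line feed": is_well_formed("<a>\n</a>"),
    "U+0001": is_well_formed("<a>\x01</a>"),
    "U+0008": is_well_formed("<a>\x08</a>"),
}

for name, accepted in results.items():
    print(name, accepted)
```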

Diez

Stefan Behnel

Jan 31, 2008, 12:35:17 PM

to Steven D'Aprano

Hi,

Steven D'Aprano wrote:
> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>
>> Quite apart from a human thinking it's pretty or not pretty, it's *not
>> valid XML* if the XML declaration isn't immediately at the start of the
>> document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
>> parsers will (correctly) reject such a document.
>
> You know, I'd really like to know what the designers were thinking when
> they made this decision.

[had a good laugh here]

> This is legal XML:
>
> """<?xml version="1.0"?>
> <greeting>Hello, world!</greeting>"""
>
> and so is this:
>
> """
> <greeting >Hello, world!</greeting >"""
>
>
> but not this:
>
> """ <?xml version="1.0"?>
> <greeting>Hello, world!</greeting>"""

It's actually not that stupid. When you leave out the declaration, then the
XML is UTF-8 encoded (by spec), so normal ASCII whitespace doesn't matter.
It's just like the declaration had come *before* the whitespace, at the very
beginning of the byte stream.

But if you add a declaration, then the encoding can change for the whole
document (including the declaration!), so you have to give the parser a chance
to actually parse the declaration. How is it supposed to know that the
whitespace before the declaration *is* whitespace before it knows the encoding?

Stefan

Stefan Behnel

Jan 31, 2008, 1:05:25 PM

to Steven D'Aprano

Stefan Behnel wrote:
> Steven D'Aprano wrote:
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>
>>> Quite apart from a human thinking it's pretty or not pretty, it's *not
>>> valid XML* if the XML declaration isn't immediately at the start of the
>>> document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
>>> parsers will (correctly) reject such a document.
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
> [had a good laugh here]
>> This is legal XML:
>>
>> """<?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
>>
>> and so is this:
>>
>> """
>> <greeting >Hello, world!</greeting >"""
>>
>>
>> but not this:
>>
>> """ <?xml version="1.0"?>
>> <greeting>Hello, world!</greeting>"""
>
> It's actually not that stupid. When you leave out the declaration, then the
> XML is UTF-8 encoded (by spec), so normal ASCII whitespace doesn't matter.

Sorry, strip the "ASCII" here. From the XML spec POV, your example

"""
<greeting >Hello, world!</greeting >"""

is exactly equivalent to

"""<?xml version='1.0' encoding='utf-8'?>
<greeting >Hello, world!</greeting >"""

and whitespace between the declaration and the root element is allowed. It's
just not allowed *before* the declaration, which in your case was left out,
thus implying the default declaration.
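
The equivalence can be checked with a short Python 3 sketch. It assumes byte strings as input so that the encoding declaration is actually honoured by the parser; the second document also has a blank line between the declaration and the root element to show that whitespace is accepted there.

```python
import xml.etree.ElementTree as ET

# No declaration: leading whitespace is accepted.
without_decl = b"""
<greeting >Hello, world!</greeting >"""

# Explicit declaration at the very start, then whitespace, then the root.
with_decl = b"""<?xml version='1.0' encoding='utf-8'?>

<greeting >Hello, world!</greeting >"""

a = ET.fromstring(without_decl)
b = ET.fromstring(with_decl)

same = (a.tag, a.text) == (b.tag, b.text)
print(same)
```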

Stefan

Ben Finney

Jan 31, 2008, 3:51:56 PM

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>
> > Quite apart from a human thinking it's pretty or not pretty, it's *not
> > valid XML* if the XML declaration isn't immediately at the start of the
> > document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many XML
> > parsers will (correctly) reject such a document.
>
> You know, I'd really like to know what the designers were thinking when
> they made this decision.

Probably much the same that the designers of the Unix shebang ("#!")
or countless other "figure out whether the bitstream is a specific
type" were thinking:

It's better to be as precise as possible so that failure can be
unambiguous, than to have more-complex parsing rules that lead to
ambiguity in implementation.

Also, for XML documents, they were probably thinking that the
documents will be machine-generated most of the time. As far as I can
tell, they were right in that.

Given that, I think the choice of precise parsing rules that are
simple to implement correctly (even if the rules themselves are
necessarily complex) is a better one.

--
\ "Those who will not reason, are bigots, those who cannot, are |
`\ fools, and those who dare not, are slaves." —"Lord" George |
_o__) Gordon Noel Byron |
Ben Finney

Ricardo Aráoz

Jan 31, 2008, 6:55:54 PM

to Diez B. Roggisch, pytho...@python.org

Worked ok. Thanks a lot to you all, I'm not using it right now but I've
had a taste and now know where to look and how to start.
Cheers

Ivan Illarionov

Jan 31, 2008, 5:58:22 PM

> Also, for XML documents, they were probably thinking that the
> documents will be machine-generated most of the time. As far as I can
> tell, they were right in that.

If anybody has to deal with human-generated XML/HTML in Python it may
be better to use something like http://www.crummy.com/software/BeautifulSoup/

Bad XML markup is part of our life and there are great tools for this
use-case too.

Stefan Behnel

Feb 1, 2008, 1:00:54 AM

The good thing about 'bad XML' is that it's not XML, which is easy to tell.

Stefan

Steven D'Aprano

Feb 1, 2008, 8:44:50 PM

The same way it knows that "<?xml" is "<?xml" before it sees the
encoding. If the parser knows that the hex bytes

3c 3f 78 6d 6c

(or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free to
swap the byte order)

mean "<?xml"

then it can equally know that bytes

20 09 0a

are whitespace. According to the XML standard, what else could they be?
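
The byte values quoted above can be reproduced in Python 3 as a quick check. This is only a sketch using the built-in codecs; it shows the encoded forms of "<?xml" and of the three whitespace bytes, nothing about how any particular parser detects them.

```python
import codecs

prefix = "<?xml"

utf8_hex = prefix.encode("utf-8").hex()
utf16le_hex = prefix.encode("utf-16-le").hex()
utf16be_hex = prefix.encode("utf-16-be").hex()

# The three bytes named in the message: space, tab, line feed.
whitespace = bytes.fromhex("20 09 0a").decode("ascii")

# Byte order marks for the two UTF-16 byte orders.
bom_le = codecs.BOM_UTF16_LE.hex()
bom_be = codecs.BOM_UTF16_BE.hex()

print(utf8_hex, utf16le_hex, utf16be_hex)
```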

--
Steven

Stefan Behnel

Feb 2, 2008, 1:24:36 AM

to Steven D'Aprano

Steven D'Aprano wrote:
> The same way it knows that "<?xml" is "<?xml" before it sees the
> encoding. If the parser knows that the hex bytes
>
> 3c 3f 78 6d 6c
>
> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free to
> swap the byte order)
>
> mean "<?xml"
>
> then it can equally know that bytes
>
> 20 09 0a
>
> are whitespace. According to the XML standard, what else could they be?

So, what about all the other unicode whitespace characters? And what about
different encodings and byte orders that move the bytes around? Is it ok for a
byte stream to start with "00 20" or does it have to start with "20 00"? What
about "00 20 00 00" and "00 00 00 20"? Are you sure that means 0x20 encoded in
4 bytes, or is it actually the unicode character 0x2000? What complexity do
you want to put into the parser here?
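
To make the byte-order question concrete, here is a small Python 3 sketch that decodes the same bytes under different assumptions. It only illustrates that the interpretation depends on the encoding chosen; it takes no side on whether a parser could resolve this in practice.

```python
pair = bytes.fromhex("00 20")
as_utf16_be = pair.decode("utf-16-be")  # U+0020, an ordinary space
as_utf16_le = pair.decode("utf-16-le")  # U+2000, EN QUAD

quad = bytes.fromhex("00 00 00 20")
as_utf32_be = quad.decode("utf-32-be")       # one character: U+0020
as_two_utf16_units = quad.decode("utf-16-be")  # two characters: U+0000, U+0020

print(repr(as_utf16_be), repr(as_utf16_le))
print(repr(as_utf32_be), repr(as_two_utf16_units))
```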

"In the face of ambiguity, refuse the temptation to guess"

Stefan

Steven D'Aprano

Feb 2, 2008, 5:44:39 AM

On Sat, 02 Feb 2008 07:24:36 +0100, Stefan Behnel wrote:

> Steven D'Aprano wrote:
>> The same way it knows that "<?xml" is "<?xml" before it sees the
>> encoding. If the parser knows that the hex bytes
>>
>> 3c 3f 78 6d 6c
>>
>> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free
>> to swap the byte order)
>>
>> mean "<?xml"
>>
>> then it can equally know that bytes
>>
>> 20 09 0a
>>
>> are whitespace. According to the XML standard, what else could they be?
>
> So, what about all the other unicode whitespace characters?

What about them? They aren't part of the XML spec, which defines
whitespace as the code points #x20, #x9, #xD and #xA. (Okay, I forgot
carriage return. Oops.) You don't have to support arbitrary whitespace,
only those four characters.

> And what
> about different encodings and byte orders that move the bytes around?

What about them? The Byte Order Mark is optional in the case of UTF-8,
and compulsory in the case of UTF-16. I quote:

"Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors must be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."

So if your XML document is written in UTF-8, you don't need a BOM
(although you can use one if you wish) and if it is in UTF-16 you *must*
have one, even before the '<?xml'. If you don't, how will the parser
recognise the characters '<?xml', not to mention the characters
'encoding' and 'utf-16'?

> Is
> it ok for a byte stream to start with "00 20" or does it have to start
> with "20 00"?

If you're using UTF-16, the byte stream MUST start with the BOM, so no,
the above is illegal. If the BOM has already been seen, then it will tell
the XML parser which order is legal, depending on whether the BOM was FF
FE or FE FF.

If you're using UTF-8, the byte streams "00 20" and "20 00" would both be
illegal: in UTF-8, the null byte is the unicode code point #x0, which is
illegal in XML.

Support for any other encoding is entirely optional. A parser may choose
to support other encodings, or not, and deal with them appropriately. But
whatever encodings you support, the same issue comes up: if you can
recognise '<?xml' before seeing the encoding, why can't you recognise
whitespace?

> What about "00 20 00 00" and "00 00 00 20"? Are you sure
> that means 0x20 encoded in 4 bytes, or is it actually the unicode
> character 0x2000? What complexity do you want to put into the parser
> here?

I'm not putting any complexity into the parser that the XML standard
doesn't already demand. Perhaps you should read it yourself:

http://www.w3.org/TR/xml/

In particular, note that a parser must be prepared to accept leading
whitespace at the start of a document, and only reject it if it comes
across an XML declaration.

> "In the face of ambiguity, refuse the temptation to guess"

What ambiguity, and what guess?

My earlier question wasn't rhetorical. I asked "According to the XML
standard, what else could they [whitespace] be?". Just implying that they
are ambiguous doesn't actually make them ambiguous.

I don't believe there is an ambiguity at all. That's what makes the
prohibition on leading whitespace before the '<?xml' tag all the more
puzzling: there doesn't seem to be any good reason for it.

If I am wrong, then will somebody please put me out of my misery and tell
me what leading whitespace could be mistaken for, in what circumstances?

--
Steven

Steven D'Aprano

Feb 2, 2008, 7:39:19 AM

On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:

> Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
>
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>
>> > Quite apart from a human thinking it's pretty or not pretty, it's
>> > *not valid XML* if the XML declaration isn't immediately at the start
>> > of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>> > XML parsers will (correctly) reject such a document.
>>
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
>
> Probably much the same that the designers of the Unix shebang ("#!") or
> countless other "figure out whether the bitstream is a specific type"
> were thinking:

There's no real comparison with the shebang '#!'. It is important that
the shell can recognise a shebang with a single look-up for speed, and
the shell doesn't have to deal with the complexities of Unicode: if you
write your script in UTF-16, bash will complain that it can't execute the
binary file. The shell cares whether or not the first two bytes are 23
21. An XML parser doesn't care about bytes, it cares about tags.

It isn't good enough for an XML parser to grab the first five bytes of a
file and say "That's legal XML!" in the same way that the shell can look
at the first two bytes of a script and say "That's a shebang!". An XML
parser must actually *parse*, even to determine whether or not it is
looking at XML. Any such parser must be prepared to accept leading
whitespace at the beginning of a file, and only reject it once it reaches
an XML declaration tag, if any. When parsing a stream of bytes like this:

ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c

the parser doesn't know it is illegal until it has seen the fourteenth
byte. That's the worst of both worlds: you have to provisionally accept
whitespace just in case the XML declaration is missing, so you don't save
any complexity, but if the declaration is there, you reject a perfectly
fine document for an apparently arbitrary reason.
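
The byte stream above can be fed to the standard library parser to see the behaviour described. This is a Python 3 sketch; the rest of the document (the end of the declaration and a <greeting> element) is made up so that there is something complete to parse, and the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

# The fourteen bytes from the message: UTF-8 BOM, whitespace, then "<?xml".
stream = bytes.fromhex("ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c")
decoded = stream.decode("utf-8-sig")  # "utf-8-sig" drops the BOM

# Assumed remainder of the document, added only for the illustration.
rest = b' version="1.0"?><greeting>Hello, world!</greeting>'


def parses(data):
    """Return True if the parser accepts data, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


with_leading_whitespace = parses(stream + rest)
without_leading_whitespace = parses(bytes.fromhex("ef bb bf") + b"<?xml" + rest)

print(repr(decoded), with_leading_whitespace, without_leading_whitespace)
```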

> It's better to be as precise as possible so that failure can be
> unambiguous, than to have more-complex parsing rules that lead to
> ambiguity in implementation.

Precision and complexity are orthogonal attributes. "All valid documents
must begin with the sequence of bytes representing the first 8093 digits
of pi to the power of e in base 256" is very precise and completely
unambiguous. There's one and only one byte sequence that satisfies such a
requirement. But it is also very complex. On the other hand, "valid
documents must begin with a number" is not complex at all, but very
imprecise: what counts as a number? Is the word "one" a number?

A good example of how precision doesn't need to be the enemy of
flexibility and simplicity: Python's rule dealing with imports from
__future__ is precise. Any import from __future__ must be the first
executable line in a module:

(1) There's no ambiguity. The first executable line is well-defined in
the context of a Python program.

(2) The restriction is not arbitrary. There's a good technical reason for
it, the rule doesn't needlessly restrict what you can do.

(3) It is human-friendly: you can precede the import by a shebang line, a
doc string, any other bare strings (so long as they aren't assigned to a
name), comments and empty lines.
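
The __future__ rule can be checked with compile() in Python 3. This sketch only exercises the cases of a shebang line, a docstring, a comment and an empty line before the import; it does not test additional bare strings.

```python
ok_source = '''#!/usr/bin/env python
"""Module docstring."""
# a comment

from __future__ import division
x = 1 / 2
'''

bad_source = '''x = 1
from __future__ import division
'''

# Compiles without complaint: only a comment line, a docstring and an
# empty line precede the import.
ok_code = compile(ok_source, "<ok>", "exec")

# An ordinary statement before the import is rejected at compile time.
try:
    compile(bad_source, "<bad>", "exec")
    future_error = None
except SyntaxError as exc:
    future_error = exc.msg

print(future_error)
```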

--
Steven

Stefan Behnel

Feb 2, 2008, 9:48:28 AM

to Steven D'Aprano

Steven D'Aprano wrote:
> On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:
>
>> Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
>>
>>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>>
>>>> Quite apart from a human thinking it's pretty or not pretty, it's
>>>> *not valid XML* if the XML declaration isn't immediately at the start
>>>> of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>>>> XML parsers will (correctly) reject such a document.
>>> You know, I'd really like to know what the designers were thinking when
>>> they made this decision.
>> Probably much the same that the designers of the Unix shebang ("#!") or
>> countless other "figure out whether the bitstream is a specific type"
>> were thinking:
>
> There's no real comparison with the shebang '#!'. It is important that
> the shell can recognise a shebang with a single look-up for speed, and
> the shell doesn't have to deal with the complexities of Unicode: if you
> write your script in UTF-16, bash will complain that it can't execute the
> binary file. The shell cares whether or not the first two bytes are 23
> 21. An XML parser doesn't care about bytes, it cares about tags.

Or rather about unicode code points.

I actually think that you can compare the two. The shell can read the shebang,
recognise it, and continue reading up to the first newline to see what needs
to be done. That's one simple stream, no problem.

Same for the XML parser. It reads the stream and it will not have to look
back, even if the declaration requests a new encoding. Just like the shebang
has to be at the beginning, the declaration has to be there, too.

All I'm saying is that there is a point where you have to draw the line, and
the XML spec says that the XML declaration must be at the beginning of the
document, and that it may be followed by whitespace. I think that's clear and
simple.

I admit that it's questionable whether omitting the declaration should be
allowed, but since there is only one case where you are allowed to do
that, I'm somewhat fine with this special case.

Stefan