
encoding="utf8" ignored when parsing XML


Skip Montanaro

Dec 27, 2016, 10:06:04 AM
I am trying to parse some XML which doesn't specify an encoding (Python 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII data. No great surprise there, but I'm having trouble getting it to use another encoding. First, I tried specifying the encoding when opening the file:

f = io.open(fname, encoding="utf8")
root = xml.etree.ElementTree.parse(f).getroot()

but that had no effect. Then, when calling xml.etree.ElementTree.parse I included an XMLParser object:

parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

Same-o, same-o:

unicode error 'ascii' codec can't encode characters in position 1113-1116: ordinal not in range(128)

So, why does it continue to insist on using an ASCII codec? My locale's preferred encoding is:

>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'

which I presume is the official way to spell "ascii".
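A quick check with the codecs module seems to confirm the alias:

>>> import codecs
>>> codecs.lookup('ANSI_X3.4-1968').name
'ascii'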

The chardetect command (part of the chardet package) tells me it looks like utf8 with high confidence:

% chardetect < ~/tmp/trash
<stdin>: utf-8 with confidence 0.99
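The same check from Python, if that's more convenient -- a minimal sketch using chardet's detect() function (the path matches the shell example above):

import os
import chardet

# read the raw bytes and ask chardet for its best guess
with open(os.path.expanduser('~/tmp/trash'), 'rb') as f:
    print chardet.detect(f.read())
# prints something like {'confidence': 0.99, 'encoding': 'utf-8'}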

I took a look at the code, and tracked the encoding I specified all the way down to the creation of the expat parser. What am I missing?

Skip

Peter Otten

Dec 27, 2016, 10:26:40 AM
Skip Montanaro wrote:

> I am trying to parse some XML which doesn't specify an encoding (Python
> 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII
> data. No great surprise there, but I'm having trouble getting it to use
> another encoding. First, I tried specifying the encoding when opening the
> file:
>
> f = io.open(fname, encoding="utf8")
> root = xml.etree.ElementTree.parse(f).getroot()
>
> but that had no effect.

Isn't UTF-8 the default?

Try opening the file in binary mode then:

with io.open(fname, "rb") as f:
    root = xml.etree.ElementTree.parse(f).getroot()


Skip Montanaro

Dec 27, 2016, 10:47:49 AM
Peter> Isn't UTF-8 the default?

Apparently not. From my reading of the docs, io.open() uses whatever
locale.getpreferredencoding() returns. That's problematic when you
live in a country that thinks ASCII is everything. Personally, I think
UTF-8 should be the default, but that train's long left the station,
at least for Python 2.x.

> Try opening the file in binary mode then:
>
> with io.open(fname, "rb") as f:
>     root = xml.etree.ElementTree.parse(f).getroot()

Thanks, that worked. Would appreciate an explanation of why binary
mode was necessary. It would seem that since the file contents are
text, just in a non-ASCII encoding, specifying the encoding when
opening the file should do the trick.

Skip

Peter Otten

Dec 27, 2016, 11:11:17 AM
Skip Montanaro wrote:

> Peter> Isn't UTF-8 the default?
>
> Apparently not.

Sorry, I meant the default for XML.
My tentative explanation: if you open the file as text it will be
successfully decoded, i.e.

io.open(fname, encoding="UTF-8").read()

works, but to get back to the bytes that the XML parser needs, the
"preferred encoding", in your case ASCII, will be used.

Since there are non-ASCII characters, you get a UnicodeEncodeError.
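The round trip is easy to demonstrate directly; a minimal Python 2
sketch (any unicode string with a non-ASCII character triggers it):

>>> import xml.etree.ElementTree as ET
>>> ET.fromstring(u'<root>\xb5</root>')  # unicode in: Python 2 re-encodes with ASCII
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 6: ordinal not in range(128)
>>> ET.fromstring('<root>\xc2\xb5</root>').text  # UTF-8 bytes in: the parser decodes them itself
u'\xb5'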


Peter Otten

Dec 27, 2016, 11:20:08 AM
Peter Otten wrote:

> works, but to get back to the bytes that the XML parser needs, the
> "preferred encoding", in your case ASCII, will be used.

Correction: it's probably sys.getdefaultencoding() rather than
locale.getpreferredencoding(). So all systems with a sane configuration will
behave the same way as yours.
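A quick check on a stock Python 2 build (the second value mirrors
Skip's system and will vary with the environment):

>>> import sys, locale
>>> sys.getdefaultencoding()  # fixed at interpreter startup
'ascii'
>>> locale.getpreferredencoding()  # follows the locale
'ANSI_X3.4-1968'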

Steve D'Aprano

Dec 27, 2016, 8:40:45 PM
On Wed, 28 Dec 2016 02:05 am, Skip Montanaro wrote:

> I am trying to parse some XML which doesn't specify an encoding (Python
> 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII
> data. No great surprise there, but I'm having trouble getting it to use
> another encoding. First, I tried specifying the encoding when opening the
> file:
>
> f = io.open(fname, encoding="utf8")
> root = xml.etree.ElementTree.parse(f).getroot()

The documentation for ET.parse is pretty sparse

https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.parse


but we can infer that it should take bytes as argument, not Unicode, since
it does its own charset processing. (The optional parser argument takes an
encoding argument which defaults to UTF-8.)

So that means using the built-in open(), or io.open() in binary mode.

You open the file and read in bytes from disk, *decoding* those bytes (as
UTF-8) into a Unicode string. Then the ET parser tries to decode its input,
which it expects to be bytes; since it was handed a Unicode string instead, it
first *encodes* that back to bytes using the default encoding (namely ASCII),
and that's where it blows up.

This particular error is a Python2-ism, since Python2 tries hard to let you
mix byte strings and unicode strings together, hence it will try implicitly
encoding/decoding strings to try to get them to fit together. Python3 does
not do this.

You can easily simulate this error at the REPL:



py> u"µ".decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position
0: ordinal not in range(128)


The give-away is that you're intending to do a *decode* operation but get an
*encode* error. That tells you that Python2 is trying to be helpful :-)

(Remember: Unicode strings encode to bytes, and bytes decode back to
strings.)
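A two-line illustration with the same character:

py> u'\xb5'.encode('utf-8')
'\xc2\xb5'
py> '\xc2\xb5'.decode('utf-8')
u'\xb5'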


You're trying to read bytes from a file on disk and get Unicode strings out:

bytes in file --> XML parser --> Unicode

so that counts as a decoding operation. But you're getting an encoding
error -- that's the smoking gun that suggests a dubious Unicode->bytes
step, using the default encoding (ASCII):

bytes in file --> io.open().read() --> Unicode --> XML Parser --> encode to
bytes using ASCII --> decode back to Unicode using UTF-8

And that suggests that the fix is to open the file without any charset
processing, i.e. use the builtin open() instead of io.open().

bytes in file --> builtin open().read() --> bytes --> XML Parser --> Unicode


I think you can even skip the 'rb' mode part: the real problem is that you
must not feed a Unicode string to the XML parser.
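Putting that together, a sketch of the working shape (assuming the
file really is UTF-8 and simply lacks an XML declaration):

import xml.etree.ElementTree as ET

# hand the parser raw bytes and let it do the decoding itself;
# on Python 2 plain open() works too, 'rb' just makes the intent explicit
with open(fname, 'rb') as f:
    root = ET.parse(f).getroot()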



> but that had no effect. Then, when calling xml.etree.ElementTree.parse I
> included an XMLParser object:
>
> parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
> root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

That's the default, so there's no functional change here. Hence, the same
error.
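The encoding argument only earns its keep when the bytes are *not*
UTF-8 -- say, a Latin-1 file with no XML declaration. A sketch, with
"latin-1" purely as an illustration:

parser = xml.etree.ElementTree.XMLParser(encoding="latin-1")
with open(fname, 'rb') as f:
    root = xml.etree.ElementTree.parse(f, parser=parser).getroot()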



--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
