minidom and encoding problem

Ehab Teima

Jun 5, 2002, 7:14:14 PM

Hi,

I'm using Python 2.1. I wrote classes to create an XML document from
scratch. The code worked fine until I hit an encoding problem. The
classes read text and insert it as-is into the XML document using
createTextNode. This text had characters > 127, and I got this error:

    self._doc=xml.dom.minidom.parse(self._xml_filename)
  File "D:\Python21\lib\xml\dom\minidom.py", line 910, in parse
    return _doparse(pulldom.parse, args, kwargs)
  File "D:\Python21\lib\xml\dom\minidom.py", line 902, in _doparse
    toktype, rootNode = events.getEvent()
  File "D:\Python21\lib\xml\dom\pulldom.py", line 234, in getEvent
    self.parser.feed(buf)
  File "D:\Python21\lib\xml\sax\expatreader.py", line 92, in feed
    self._err_handler.fatalError(exc)
  File "D:\Python21\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:75:1: not well-formed

I know it's not possible to add an encoding attribute using writexml,
so the generated document only has <?xml version="1.0"?>. Is there any
way to get around this problem? I'd like at least to be able to parse
the document using the proper encoding, such as
encoding="ISO-8859-1". I'm only using minidom; any ideas how?

Another question:
Does anybody know how to get the root node of a document? If I know
the root node, I can add the proper header and then write the root
node using writexml.

Thanks,
Ehab

Martin v. Loewis

Jun 6, 2002, 3:14:37 AM

ehab_...@hotmail.com (Ehab Teima) writes:

> I'm using Python 2.1. I wrote classes to create an XML document from
> scratch. The code worked fine until I hit an encoding problem. The
> classes read text and insert it as-is into the XML document using
> createTextNode. This text had characters > 127, and I got this error.

This is a bug in your code. You must not insert byte strings into a DOM
tree; always use Unicode objects.

> I know it's not possible to add an encoding attribute using writexml,
> so the generated document only has <?xml version="1.0"?>. Is there any
> way to get around this problem?

Yes. Use Unicode strings when creating text nodes. When producing the
serialized document through .toxml, you will find that it produces a
Unicode string. Since (as you notice) the document has no encoding
declaration, you need to .encode("UTF-8") that string before saving it
into a file.
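
A minimal sketch of this advice, written for current Python 3 (where every str is already Unicode) rather than the Python 2.1 used in this thread. The element name "note" and the sample text are made up for illustration:

```python
from xml.dom import minidom

# Build a small document; "note" and the sample text are illustrative only.
doc = minidom.Document()
root = doc.createElement("note")
doc.appendChild(root)

# Text nodes receive Unicode text, never raw bytes.
root.appendChild(doc.createTextNode("\u2022 caf\u00e9"))

# toxml() without an encoding argument returns a Unicode string.
xml_text = doc.toxml()

# The declaration names no encoding, so save the text as UTF-8,
# which is what an XML parser assumes in that case.
xml_bytes = xml_text.encode("utf-8")

# Round trip: the saved bytes parse back to the same text.
reparsed = minidom.parseString(xml_bytes)
print(reparsed.documentElement.firstChild.data)
```

In Python 2 the same idea would use u"..." literals for the text passed to createTextNode.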

> Does anybody know how to get the root node of a document? If I know
> the root node, I can add the proper header and then write the root
> node using writexml.

The document element is available through .documentElement on the
Document.
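
As a rough illustration (current Python 3, with a made-up sample document), the root element can be fetched and serialized on its own, which leaves room for a hand-written header as the original question proposed:

```python
from xml.dom import minidom

# Sample document invented for this sketch.
doc = minidom.parseString("<report><item>one</item></report>")

# documentElement is the single top-level element of the Document.
root = doc.documentElement
print(root.tagName)

# Serializing only the root element omits the XML declaration,
# so a custom header can be written first.
header = '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
body = root.toxml()

# The bytes written out must really be in the encoding the header declares.
output = (header + body).encode("iso-8859-1")
```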

Regards,
Martin

Timo Linna

Jun 6, 2002, 9:34:45 AM

"Ehab Teima" <ehab_...@hotmail.com> wrote in message
news:17aafe08.02060...@posting.google.com...
> Hi,

>
> I know it's not possible to add an encoding attribute using writexml,
> so the generated document only has <?xml version="1.0"?>. Is there any
> way to get around this problem? I'd like at least to be able to parse
> the document using the proper encoding, such as
> encoding="ISO-8859-1". I'm only using minidom; any ideas how?

Thanks to Python, you could redefine minidom's writexml method:

import xml.dom.minidom

def encoded_writexml(self, writer, indent="", addindent="", newl=""):
    writer.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')  # My change
    for node in self.childNodes:
        node.writexml(writer, indent, addindent, newl)

xml.dom.minidom.Document.writexml = encoded_writexml
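
For readers on a current Python 3, Document.writexml accepts an encoding argument that adds the declaration, so the override above may not be needed there. The sketch below uses made-up sample data and an in-memory writer. Note that the argument only changes the declaration; the characters still have to be encoded to match when they are written out.

```python
import io
from xml.dom import minidom

# Sample document invented for this sketch.
doc = minidom.Document()
root = doc.createElement("note")
root.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(root)

# writexml() emits text; the encoding argument only affects the declaration.
buf = io.StringIO()
doc.writexml(buf, encoding="ISO-8859-1")
serialized = buf.getvalue()

# Encode the text with the same encoding that the declaration names.
data = serialized.encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
reparsed = minidom.parseString(data)
print(reparsed.documentElement.firstChild.data)
```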

-- timo

Ehab Teima

Jun 6, 2002, 6:19:53 PM

mar...@v.loewis.de (Martin v. Loewis) wrote in message news:<m37klcu...@mira.informatik.hu-berlin.de>...

> ehab_...@hotmail.com (Ehab Teima) writes:
>
> > I'm using Python 2.1. I wrote classes to create an XML document from
> > scratch. The code worked fine until I hit an encoding problem. The
> > classes read text and insert it as-is into the XML document using
> > createTextNode. This text had characters > 127, and I got this error.
>
> This is a bug in your code. You must not insert (byte) string in a DOM
> tree; always use Unicode objects.

I do not have control over the sent text. The issue started when some
bullets were copied from a Word document and pasted into a file, and
the whole file was passed to my classes. I could not find a way to
convert this text to UTF-8 or anything else. Is there a way to prevent
this from happening?

>
> > I know it's not possible to add an encoding attribute using writexml,
> > so the generated document only has <?xml version="1.0"?>. Is there any
> > way to get around this problem?
>
> Yes. Use Unicode strings when creating text nodes. When producing the
> serialized document through .toxml, you will find that it produces a
> Unicode string. Since (as you notice) the document has no encoding
> declaration, you need to .encode("UTF-8") that string before saving it
> into a file.

I tried to encode the string using different encodings but I could
not. Here is what I got when I tried .encode("UTF-8"):

UnicodeError: ASCII decoding error: ordinal not in range(128)

Fredrik Lundh

Jun 6, 2002, 6:57:31 PM

Ehab Teima wrote:

> > This is a bug in your code. You must not insert (byte) string in a DOM
> > tree; always use Unicode objects.
>
> I do not have control over the sent text. The issue started when some
> bullets were copied from a Word document and pasted into a file and
> the whole file was passed to my classes.

if you don't know what encoding the file is using, what
makes you think Python can figure it out?

> I tried to encode the string using different encodings but I could
> not.

the string is already encoded. you need to *decode* it.

> Here is what I got when I tried .encode("UTF-8"):
> UnicodeError: ASCII decoding error: ordinal not in range(128)

this means that you have non-ASCII characters in an
ASCII string. to convert this to a unicode string, use

u = s.decode(encoding)

where "encoding" is the source encoding (if you haven't
the slightest idea, try "iso-8859-1")
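
a small sketch of the decode step, in current Python 3 syntax where byte strings are written b"...". the sample bytes are invented, and the cp1252 line is only a guess about what Word on Windows might have produced:

```python
raw = b"caf\xe9"  # bytes as they might arrive from a file (made-up sample)

# Decoding with a codec that cannot represent the bytes fails loudly.
ascii_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("ascii failed:", exc.reason)

# Decoding with the actual source encoding yields Unicode text.
text = raw.decode("iso-8859-1")
print(text)

# Assumption: text pasted from Word on Windows is often cp1252,
# where byte 0x95 is the bullet character U+2022.
bullet = b"\x95".decode("cp1252")
```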

also see:

http://effbot.org/guides/unicode-objects.htm

</F>

Martin v. Loewis

Jun 7, 2002, 2:05:37 AM

ehab_...@hotmail.com (Ehab Teima) writes:

> > This is a bug in your code. You must not insert (byte) string in a DOM
> > tree; always use Unicode objects.
>
> I do not have control over the sent text.

[I assume that the "sent text" is also the one that you pass to
createTextNode].

Even if you don't have that control, you still need to know what
encoding it uses. If you don't know the encoding, you cannot put it
into XML documents.

> The issue started when some bullets were copied from a Word document
> and pasted into a file and the whole file was passed to my
> classes. I could not find a way to convert this text to UTF-8 or
> anything else.

You don't need to convert it to UTF-8, you need to convert it to
Unicode objects. You can use the unicode() builtin to do that.
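
One possible way to apply this at the boundary where the text arrives, sketched in current Python 3, where str(value, encoding) plays the role of the Python 2 unicode() builtin. The helper name to_text, the element name, and the sample bytes are all hypothetical, and the encoding passed in still has to be the real one:

```python
from xml.dom import minidom

def to_text(value, encoding):
    """Return Unicode text; `encoding` must be the real encoding of any bytes passed in."""
    if isinstance(value, bytes):
        return str(value, encoding)  # equivalent to value.decode(encoding)
    return value  # already Unicode text, leave it alone

doc = minidom.Document()
root = doc.createElement("note")
doc.appendChild(root)

incoming = b"na\xefve"  # made-up ISO-8859-1 sample bytes

# Convert once, on the way in, so the DOM only ever holds Unicode text.
root.appendChild(doc.createTextNode(to_text(incoming, "iso-8859-1")))
result = doc.toxml()
```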

> Is there a way to prevent this from happening?

What is "this", and why do you want to prevent it from happening?

Regards,
Martin