
minidom and encoding problem


Ehab Teima

Jun 5, 2002, 7:14:14 PM
Hi,

I'm using Python 2.1. I wrote classes to create xml document from
scratch. The code worked fine until I hit an encoding problem. The
classes can read text and insert it as is to xml document using
createTextNode. This text had characters > 127, and I got this error.

    self._doc=xml.dom.minidom.parse(self._xml_filename)
  File "D:\Python21\lib\xml\dom\minidom.py", line 910, in parse
    return _doparse(pulldom.parse, args, kwargs)
  File "D:\Python21\lib\xml\dom\minidom.py", line 902, in _doparse
    toktype, rootNode = events.getEvent()
  File "D:\Python21\lib\xml\dom\pulldom.py", line 234, in getEvent
    self.parser.feed(buf)
  File "D:\Python21\lib\xml\sax\expatreader.py", line 92, in feed
    self._err_handler.fatalError(exc)
  File "D:\Python21\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:75:1: not well-formed

I know it's not possible to add an encoding attribute using writexml,
so the generated document only has <?xml version="1.0"?>. Is there any
way to get around this problem? I'd like to be able at least to parse
the document while reading using the proper encoding, such as,
encoding="ISO-8859-1". I'm only using minidom, any ideas how?

Another question:
Does anybody know how to get the rootnode of a document? If I know
the root node, I can add the proper header and then write the root
node using writexml.

Thanks,
Ehab

Martin v. Loewis

Jun 6, 2002, 3:14:37 AM
ehab_...@hotmail.com (Ehab Teima) writes:

> I'm using Python 2.1. I wrote classes to create xml document from
> scratch. The code worked fine until I hit an encoding problem. The
> classes can read text and insert it as is to xml document using
> createTextNode. This text had characters > 127, and I got this error.

This is a bug in your code. You must not insert (byte) strings into a
DOM tree; always use Unicode objects.

> I know it's not possible to add an encoding attribute using writexml,
> so the generated document only has <?xml version="1.0"?>. Is there any
> way to get around this problem?

Yes. Use Unicode strings when creating text nodes. When producing the
serialized document through .toxml, you will find that it produces a
Unicode string. Since (as you notice) the document has no encoding
declaration, you need to .encode("UTF-8") that string before saving it
into a file.
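
A minimal sketch of this advice, written for current Python 3 rather than the Python 2.1 used in this thread (that is an assumption on the editor's part). In Python 3 every str is already Unicode, so the text node receives text and bytes only appear when the serialized document is encoded. The element name and sample text are made up for illustration.

```python
import xml.dom.minidom

# Build a document from scratch; the text node receives Unicode text, not bytes.
doc = xml.dom.minidom.Document()
root = doc.createElement("note")  # hypothetical element name
doc.appendChild(root)
root.appendChild(doc.createTextNode("caf\u00e9 \u2022 item"))

# toxml() without an encoding argument returns text whose XML declaration
# names no encoding, so UTF-8 is the matching choice when producing bytes.
xml_text = doc.toxml()
xml_bytes = xml_text.encode("utf-8")

# Round trip: a parser assumes UTF-8 when no encoding is declared.
reparsed = xml.dom.minidom.parseString(xml_bytes)
text_back = reparsed.documentElement.firstChild.data
```

The bytes in xml_bytes are what would be written to a file opened in binary mode.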

> Does anybody know how to get the rootnode of a document? If I know
> the root node, I can add the proper header and then write the root
> node using writexml.

The document element is available through .documentElement on the
Document.
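
As a rough sketch of the plan described in the question (custom header first, then the root node), again assuming current Python 3 and a made-up sample document:

```python
import io
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<root><child>text</child></root>")

# The document element is the single top-level element of the document.
root = doc.documentElement
root_name = root.tagName

# Write a hand-made header, then serialize only the root element after it.
out = io.StringIO()
out.write('<?xml version="1.0" encoding="ISO-8859-1"?>\n')
root.writexml(out)
serialized = out.getvalue()

# The declaration is only a label: the text still has to be encoded
# with the same encoding before it is written to a file.
data = serialized.encode("iso-8859-1")
```

If the text contains characters that ISO-8859-1 cannot represent, the final encode step raises UnicodeEncodeError, which is one reason UTF-8 is the safer default.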

Regards,
Martin

Timo Linna

Jun 6, 2002, 9:34:45 AM

"Ehab Teima" <ehab_...@hotmail.com> wrote in message
news:17aafe08.02060...@posting.google.com...
> Hi,

>
> I know it's not possible to add an encoding attribute using writexml,
> so the generated document only has <?xml version="1.0"?>. Is there any
> way to get around this problem? I'd like to be able at least to parse
> the document while reading using the proper encoding, such as,
> encoding="ISO-8859-1". I'm only using minidom, any ideas how?

Thanks to Python, you could redefine minidom's writexml-function:

import xml.dom.minidom

def encoded_writexml(self, writer, indent="", addindent="", newl=""):
    writer.write('<?xml version="1.0" encoding="ISO-8859-1" ?>\n')  # My change
    for node in self.childNodes:
        node.writexml(writer, indent, addindent, newl)

xml.dom.minidom.Document.writexml = encoded_writexml
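
For readers on current Python 3 (not the 2.1 discussed here), the patch may be unnecessary: toxml() there accepts an encoding argument that writes the declaration and returns bytes in that encoding. A small sketch with a made-up element name:

```python
import xml.dom.minidom

doc = xml.dom.minidom.Document()
root = doc.createElement("note")  # hypothetical element name
doc.appendChild(root)
root.appendChild(doc.createTextNode("caf\u00e9"))

# With an encoding argument, toxml() returns bytes and adds the
# matching encoding declaration to the XML header.
data = doc.toxml(encoding="ISO-8859-1")

# The declared encoding lets the parser read the non-ASCII byte back.
reparsed = xml.dom.minidom.parseString(data)
text_back = reparsed.documentElement.firstChild.data
```

Either way, the declaration alone does not encode anything: the bytes that reach the file must really be in the encoding the header names.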

-- timo

Ehab Teima

Jun 6, 2002, 6:19:53 PM
mar...@v.loewis.de (Martin v. Loewis) wrote in message news:<m37klcu...@mira.informatik.hu-berlin.de>...

> ehab_...@hotmail.com (Ehab Teima) writes:
>
> > I'm using Python 2.1. I wrote classes to create xml document from
> > scratch. The code worked fine until I hit an encoding problem. The
> > classes can read text and insert it as is to xml document using
> > createTextNode. This text had characters > 127, and I got this error.
>
> This is a bug in your code. You must not insert (byte) strings into a
> DOM tree; always use Unicode objects.

I do not have control over the sent text. The issue started when some
bullets were copied from a word document and pasted into a file and
the whole file was passed to my classes. I could not find a way to
convert this text to UTF-8 or anything else. Is there a way to prevent
this from happening?


>
> > I know it's not possible to add an encoding attribute using writexml,
> > so the generated document only has <?xml version="1.0"?>. Is there any
> > way to get around this problem?
>
> Yes. Use Unicode strings when creating text nodes. When producing the
> serialized document through .toxml, you will find that it produces a
> Unicode string. Since (as you notice) the document has no encoding
> declaration, you need to .encode("UTF-8") that string before saving it
> into a file.

I tried to encode the string using different encodings but I could
not. Here is what I got when I tried .encode("UTF-8"):

UnicodeError: ASCII decoding error: ordinal not in range(128)

Fredrik Lundh

Jun 6, 2002, 6:57:31 PM
Ehab Teima wrote:

> > This is a bug in your code. You must not insert (byte) strings into a
> > DOM tree; always use Unicode objects.
>
> I do not have control over the sent text. The issue started when some
> bullets were copied from a word document and pasted into a file and
> the whole file was passed to my classes.

if you don't know what encoding the file is using, what
makes you think Python can figure it out?

> I tried to encode the string using different encodings but I could
> not.

the string is already encoded. you need to *decode* it.

> Here is what I got when I tried .encode("UTF-8"):
> UnicodeError: ASCII decoding error: ordinal not in range(128)

this means that you have non-ASCII characters in an
ASCII string. to convert this to a unicode string, use

u = s.decode(encoding)

where "encoding" is the source encoding (if you haven't
the slightest idea, try "iso-8859-1")
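
A short illustration of decoding, written for Python 3, where the byte string is a bytes object. The sample bytes are an assumption: text pasted from Word on Windows is often in cp1252 (windows-1252), where a bullet is the single byte 0x95.

```python
raw = b"\x95 first item"  # hypothetical bytes as read from the pasted file

# Decoding needs the *source* encoding; the result is a Unicode string.
as_cp1252 = raw.decode("cp1252")      # 0x95 becomes U+2022, a bullet
as_latin1 = raw.decode("iso-8859-1")  # never fails, but 0x95 becomes U+0095

# Decoding as ASCII fails, which mirrors the error quoted above.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

iso-8859-1 maps every byte to some code point, so it always succeeds, but a wrong guess silently produces the wrong characters (here a control character instead of a bullet).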

also see:

http://effbot.org/guides/unicode-objects.htm

</F>


Martin v. Loewis

Jun 7, 2002, 2:05:37 AM
ehab_...@hotmail.com (Ehab Teima) writes:

> > This is a bug in your code. You must not insert (byte) strings into a
> > DOM tree; always use Unicode objects.
>
> I do not have control over the sent text.

[I assume that the "sent text" is also the one that you pass to
createTextNode].

Even if you don't have that control, you still need to know what
encoding it uses. If you don't know the encoding, you cannot put it
into XML documents.
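
To make the point concrete, here is a small sketch (Python 3, made-up sample bytes) of why the parser in the original traceback reported "not well-formed": the same bytes are rejected when no encoding is declared and accepted when the right one is.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# 0xE9 is e-acute in ISO-8859-1 but is not valid UTF-8 in this position.
body = b"<note>caf\xe9</note>"

# No encoding declared: the parser assumes UTF-8 and rejects the byte.
try:
    xml.dom.minidom.parseString(b'<?xml version="1.0"?>' + body)
    undeclared_ok = True
except ExpatError:
    undeclared_ok = False

# The same bytes parse once the correct encoding is declared.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
text = xml.dom.minidom.parseString(declared).documentElement.firstChild.data
```

The declaration only helps if it is true, which is why the encoding of the incoming text has to be known first.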

> The issue started when some bullets were copied from a word document
> and pasted into a file and the whole file was passed to my
> classes. I could not find a way to convert this text to UTF-8 or
> anything else.

You don't need to convert it to UTF-8, you need to convert it to
Unicode objects. You can use the unicode() builtin to do that.

> Is there a way to prevent this from happening?

What is "this", and why do you want to prevent it from happening?

Regards,
Martin
