Convert xml symbol notation

dumbkiwi

Apr 6, 2007, 6:39:01 PM
Hi,

I'm working on a script to download and parse a web page, and it
includes xml symbol notation, such as &#39; for the ' character. Does
anyone know of a pre-existing python script/lib to convert the xml
notation back to the actual symbol it represents?

Gabriel Genellina

Apr 7, 2007, 1:23:07 AM
dumbkiwi wrote:

Try the htmlentitydefs module.

--
Gabriel Genellina

dumbkiwi

Apr 7, 2007, 3:12:11 AM

Is that a standard module? I can't see it anywhere - googled it.


Gabriel Genellina

Apr 6, 2007, 11:03:19 PM
to pytho...@python.org
dumbkiwi wrote:

Sure! For quite a while, at least, since Python 1.5 (I can't go earlier
in time...)
http://svn.python.org/view/python/trunk/Lib/htmlentitydefs.py
Added Wed Sep 27 16:22:08 1995 UTC (11 years, 6 months ago) by guido

--
Gabriel Genellina


"Martin v. Löwis"

Apr 7, 2007, 4:47:50 AM
to Gabriel Genellina
>> I'm working on a script to download and parse a web page, and it
>> includes xml symbol notation, such as &#39; for the ' character. Does
>> anyone know of a pre-existing python script/lib to convert the xml
>> notation back to the actual symbol it represents?
>
> Try the htmlentitydefs module.

That won't help: this is a character reference, not an entity reference.
htmlentitydefs only contains the definitions of entities.
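
A minimal sketch of that distinction, assuming Python 3 (where htmlentitydefs
is named html.entities) and assuming the reference in question is the decimal
form &#39;:

```python
# Assumption: Python 3, where the old htmlentitydefs module is html.entities.
from html.entities import name2codepoint

# Entity references are named (&amp;), so the table maps a name to a code point.
ampersand = chr(name2codepoint["amp"])

# Character references are numeric (&#39;), so there is no name to look up.
has_numeric_key = "#39" in name2codepoint

# The number in a decimal character reference is the code point itself.
apostrophe = chr(int("39"))

print(repr(ampersand), has_numeric_key, repr(apostrophe))
```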

Regards,
Martin

"Martin v. Löwis"

Apr 7, 2007, 4:52:03 AM
to dumbkiwi

If you have this given in an XML file (rather than an HTML file which
is not well-formed XML), you could use an XML parser for the entire
file. This would automatically unescape character references. Likewise,
you can parse it with HTMLParser, which will invoke the handle_charref
method for these.
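
Both routes might look roughly like this. It is only a sketch, assuming
Python 3 (HTMLParser lives in html.parser there); the sample markup and the
CharrefCollector class are made up for illustration:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SAMPLE = "<p>it&#39;s</p>"  # hypothetical input, well-formed as XML

# Route 1: an XML parser resolves character references while parsing.
text_from_xml = ET.fromstring(SAMPLE).text


# Route 2: HTMLParser reports each character reference to handle_charref.
class CharrefCollector(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps handle_charref in play on Python 3.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name arrives without "&#" and ";", e.g. "39" or "x27".
        if name.lower().startswith("x"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.parts.append(chr(code))


collector = CharrefCollector()
collector.feed(SAMPLE)
collector.close()
text_from_html = "".join(collector.parts)

print(text_from_xml, text_from_html)
```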

If you just want to unescape references, you can use the code in

http://effbot.org/zone/re-sub.htm
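
A re.sub-based approach along those lines could be sketched as below. This is
not the code from that page, just an illustration under the assumption of
Python 3; unescape_refs is a hypothetical helper name:

```python
import re
from html.entities import name2codepoint


def unescape_refs(text):
    """Replace character and entity references; leave unknown ones untouched."""

    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":       # hexadecimal, e.g. &#x27;
                return chr(int(ref[2:], 16))
            if ref.startswith("#"):           # decimal, e.g. &#39;
                return chr(int(ref[1:]))
            return chr(name2codepoint[ref])   # named entity, e.g. &amp;
        except (KeyError, ValueError, OverflowError):
            return match.group(0)             # not recognised: keep as written

    return re.sub(r"&(#?\w+);", replace, text)


print(unescape_refs("it&#39;s a &amp; b"))
```

On Python 3.4 and later, html.unescape() in the standard library covers the
same ground without any custom code.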

HTH,
Martin

Gabriel Genellina

Apr 7, 2007, 4:43:10 PM
Martin v. Löwis wrote:

> >> I'm working on a script to download and parse a web page, and it
> >> includes xml symbol notation, such as &#39; for the ' character. Does
> >

> > Try the htmlentitydefs module.
>
> That won't help: this is a character reference, not an entity reference.
> htmlentitydefs only contains the definitions of entities.

Ouch! Sorry.

--
Gabriel Genellina
