Convert xml symbol notation

dumbkiwi

Apr 6, 2007, 6:39:01 PM

Hi,

I'm working on a script to download and parse a web page, and it
includes xml symbol notation, such as &#39; for the ' character. Does
anyone know of a pre-existing python script/lib to convert the xml
notation back to the actual symbol it represents?

Gabriel Genellina

Apr 7, 2007, 1:23:07 AM

dumbkiwi wrote:

Try the htmlentitydefs module.

--
Gabriel Genellina

dumbkiwi

Apr 7, 2007, 3:12:11 AM

Is that a standard module? I can't see it anywhere - googled it.

Gabriel Genellina

Apr 6, 2007, 11:03:19 PM

to pytho...@python.org

dumbkiwi wrote:

Sure! For quite a while, at least, since Python 1.5 (I can't go earlier
in time...)
http://svn.python.org/view/python/trunk/Lib/htmlentitydefs.py
Added Wed Sep 27 16:22:08 1995 UTC (11 years, 6 months ago) by guido

--
Gabriel Genellina

"Martin v. Löwis"

Apr 7, 2007, 4:47:50 AM

to Gabriel Genellina

>> I'm working on a script to download and parse a web page, and it
>> includes xml symbol notation, such as &#39; for the ' character. Does
>> anyone know of a pre-existing python script/lib to convert the xml
>> notation back to the actual symbol it represents?
>
> Try the htmlentitydefs module.

That won't help: this is a character reference, not an entity reference.
htmlentitydefs only contains the definitions of entities.
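To illustrate the distinction, here is a small sketch. It assumes Python 3, where htmlentitydefs is available as html.entities; the variable names are only for illustration. A named entity needs a lookup table, while a numeric character reference already carries its code point.

```python
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Entity reference: "&amp;" is resolved by looking up the name "amp".
amp = chr(name2codepoint["amp"])

# Character reference: "&#39;" contains the code point 39 directly,
# so no table lookup is involved.
apostrophe = chr(39)

# The table is keyed by entity names only, not by numeric references.
has_numeric_key = "#39" in name2codepoint
```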

Regards,
Martin

"Martin v. Löwis"

Apr 7, 2007, 4:52:03 AM

to dumbkiwi

If you have this given in an XML file (rather than an HTML file which
is not well-formed XML), you could use an XML parser for the entire
file. This would automatically unescape character references. Likewise,
you can parse it with HTMLParser, which will invoke the handle_charref
method for these.
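
A minimal sketch of both routes, assuming Python 3 (where HTMLParser lives in html.parser) and a made-up sample string. The class name CharrefCollector is hypothetical. Note that in Python 3 the parser converts character references itself unless convert_charrefs=False is passed, so handle_charref is only called with that setting.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

sample = "<p>It&#39;s here</p>"  # illustrative input, well-formed as XML too

# Route 1: an XML parser unescapes character references automatically.
xml_text = ET.fromstring(sample).text

# Route 2: HTMLParser reports each character reference to handle_charref.
class CharrefCollector(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps handle_charref in play on Python 3
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is the text between "&#" and ";", e.g. "39" or "x27"
        if name[0] in "xX":
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

collector = CharrefCollector()
collector.feed(sample)
collector.close()
html_text = "".join(collector.parts)
```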

If you just want to unescape references, you can use the code in

http://effbot.org/zone/re-sub.htm
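
The general idea behind that kind of helper can be sketched as follows. This is not the code from the linked page, only an approximation of the technique written for Python 3, using re.sub with a callback; the names unescape and fixup are chosen here for illustration. References that cannot be resolved are left untouched.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

def unescape(text):
    """Replace character and entity references in text with the characters they stand for."""
    def fixup(match):
        ref = match.group(0)
        if ref.startswith("&#"):
            # Numeric character reference: decimal (&#39;) or hex (&#x27;)
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        # Named entity reference such as &amp;
        try:
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown entity: leave as is
    return re.sub(r"&#?\w+;", fixup, text)

result = unescape("It&#39;s &amp; it&#x27;s &bogus;")
```

On Python 3.4 and later, the standard library function html.unescape covers the same ground and may be simpler if it is available.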

HTH,
Martin

Gabriel Genellina

Apr 7, 2007, 4:43:10 PM

Martin v. Löwis wrote:

> >> I'm working on a script to download and parse a web page, and it
> >> includes xml symbol notation, such as &#39; for the ' character. Does
> >

> > Try the htmlentitydefs module.
>
> That won't help: this is a character reference, not an entity reference.
> htmlentitydefs only contains the definitions of entities.

Ouch! Sorry.

--
Gabriel Genellina
