I'm working on a script to download and parse a web page, and it
includes XML symbol notation, such as &#39; for the ' character. Does
anyone know of a pre-existing Python script/lib to convert the XML
notation back to the actual symbol it represents?
Try the htmlentitydefs module.
--
Gabriel Genellina
Is that a standard module? I can't see it anywhere - googled it.
Sure! For quite a while, at least, since Python 1.5 (I can't go earlier
in time...)
http://svn.python.org/view/python/trunk/Lib/htmlentitydefs.py
Added Wed Sep 27 16:22:08 1995 UTC (11 years, 6 months ago) by guido
--
Gabriel Genellina
That won't help: this is a character reference, not an entity reference.
htmlentitydefs only contains the definitions of entities.
Regards,
Martin
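
To illustrate the distinction, here is a minimal sketch. It assumes
Python 3, where htmlentitydefs is available as html.entities; the
variable names are only for illustration.

```python
from html.entities import name2codepoint  # htmlentitydefs.name2codepoint in Python 2

# Named entities are stored as name -> code point, so &amp; can be looked up.
amp = chr(name2codepoint["amp"])

# A numeric character reference such as &#39; is not a name,
# so the table has no entry for it.
has_numeric = "#39" in name2codepoint
```

Here amp is "&" and has_numeric is False, which is why the table alone
does not decode numeric references.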
If you have this given in an XML file (rather than an HTML file which
is not well-formed XML), you could use an XML parser for the entire
file. This would automatically unescape character references. Likewise,
you can parse it with HTMLParser, which will invoke the handle_charref
method for these.
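
Both routes can be sketched roughly as follows. This assumes Python 3
(the class lives in html.parser there) and a made-up one-element
document; CharrefCollector is a hypothetical name, not part of any
library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser  # HTMLParser.HTMLParser in Python 2

SAMPLE = "<p>It&#39;s done</p>"  # made-up input for illustration

# Route 1: an XML parser resolves character references by itself.
xml_text = ET.fromstring(SAMPLE).text


# Route 2: HTMLParser passes each numeric reference to handle_charref.
class CharrefCollector(HTMLParser):
    def __init__(self):
        # convert_charrefs=False (Python 3.4+) stops the parser from
        # decoding references itself, so handle_charref gets called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "39" for &#39; and "x27" for &#x27;
        if name.lower().startswith("x"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.parts.append(chr(code))


collector = CharrefCollector()
collector.feed(SAMPLE)
collector.close()
html_text = "".join(collector.parts)
```

Both xml_text and html_text end up as the plain string with the
apostrophe restored. The XML route only works if the whole page is
well-formed; the HTMLParser route is more forgiving.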
If you just want to unescape references, you can use the code in
http://effbot.org/zone/re-sub.htm
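
For reference, a minimal sketch of that kind of re.sub-based approach.
This is not the code from that page, only an illustration written for
Python 3, and it handles both numeric references and the named entities
from html.entities.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs.name2codepoint in Python 2


def unescape(text):
    """Replace character and entity references in text with the characters they stand for."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return chr(int(ref[2:], 16))   # hexadecimal: &#x27;
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal: &#39;
            return chr(name2codepoint[ref])    # named: &amp;
        except (KeyError, ValueError, OverflowError):
            return match.group(0)              # leave unknown references as they are
    return re.sub(r"&(#?\w+);", replace, text)
```

For example, unescape("It&#39;s") gives "It's". On Python 3.4 and later
the standard library also provides html.unescape, which may be all you
need.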
HTH,
Martin
> >> I'm working on a script to download and parse a web page, and it
> >> includes xml symbol notation, such as &#39; for the ' character. Does
> >
> > Try the htmlentitydefs module.
>
> That won't help: this is a character reference, not an entity reference.
> htmlentitydefs only contains the definitions of entities.
Ouch! Sorry.
--
Gabriel Genellina