[web2py] html unescape - if anyone needs it


RobertVa

May 23, 2010, 2:41:43 PM
to web2py-users
Hi.

I found a function to unescape HTML data, which I believe would be well
worth putting into the framework itself.

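import re
from htmlentitydefs import name2codepoint

def replace_entities(match):
    # Turn one matched entity into its Unicode character.
    try:
        ent = match.group(1)
        if ent[0] == "#":
            # Numeric reference: hexadecimal (&#x..;) or decimal (&#..;)
            if ent[1] == 'x' or ent[1] == 'X':
                return unichr(int(ent[2:], 16))
            else:
                return unichr(int(ent[1:], 10))
        # Named entity such as &amp; or &eacute;
        return unichr(name2codepoint[ent])
    except Exception:
        # Unknown or malformed entity: leave the text as it was
        return match.group()

entity_re = re.compile(r'&(#?[A-Za-z0-9]+?);')

def html_unescape(data):
    return entity_re.sub(replace_entities, data)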


Thanks to the author:
http://blog.client9.com/2008/10/html-unescape-in-python.html
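
For anyone reading this on Python 3, here is a rough sketch of the same idea. It assumes the renamed standard-library pieces (html.entities instead of htmlentitydefs, chr instead of unichr) and narrows the exception handling, so it is a port under those assumptions rather than the original author's code. The standard library's html.unescape (Python 3.4+) covers the same ground and is shown for comparison; the sample string is made up for illustration.

```python
import re
from html import unescape  # standard library, Python 3.4+
from html.entities import name2codepoint

entity_re = re.compile(r'&(#?[A-Za-z0-9]+?);')

def replace_entities(match):
    """Return the character for one matched entity, or the match unchanged."""
    ent = match.group(1)
    try:
        if ent[0] == '#':
            # Numeric reference: hexadecimal (&#x..;) or decimal (&#..;)
            if ent[1] in 'xX':
                return chr(int(ent[2:], 16))
            return chr(int(ent[1:], 10))
        # Named entity such as &amp; or &eacute;
        return chr(name2codepoint[ent])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or malformed/out-of-range number: leave it alone
        return match.group()

def html_unescape(data):
    return entity_re.sub(replace_entities, data)

# Hypothetical sample input, for illustration only
sample = '&lt;b&gt;caf&eacute; &#169; &#xA9;&lt;/b&gt;'
print(html_unescape(sample))
print(unescape(sample))
```

In new Python 3 code, html.unescape is probably the simpler choice, since it also knows the full HTML5 entity table; the regex version is mainly useful when you want to control exactly which entities get replaced.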

Yarko Tymciurak

May 23, 2010, 2:59:54 PM
to web2py-users
Have you looked at the XML() helper? http://www.web2py.com/book/default/section/5/2?search=XML

RobertVa

May 23, 2010, 3:20:00 PM
to web2py-users
I did.

It has an xmlescape function, but the reverse function (unescape) is not
defined.

mdipierro

May 24, 2010, 4:25:54 PM
to web2py-users
I liked your suggestion and I used it to make
gluon.html.web2pyHTMLParser, take a look and let me know what you
think.

RobertVa

May 25, 2010, 1:58:51 PM
to web2py-users
This is very useful. I'm building a new aggregator, and this will come
in handy for scraping.
As I see it, this would be a sort of jQuery for HTML in
Python. :))))

mdipierro

May 25, 2010, 2:23:49 PM
to web2py-users
Yes. If you just do str(TAG(text)), this will unescape the text as you
suggest (but to UTF-8, not unicode).