
html escape sequences


Will McGugan

Mar 18, 2005, 6:06:52 AM
Hi,

I'd like to replace HTML escape sequences, like &nbsp; and &#39;, with
single characters. Is there a dictionary defined somewhere I can use to
replace these sequences?

Thanks,

Will McGugan

Leif K-Brooks

Mar 18, 2005, 6:46:20 AM
Will McGugan wrote:
> I'd like to replace HTML escape sequences, like &nbsp; and &#39;, with
> single characters. Is there a dictionary defined somewhere I can use to
> replace these sequences?

How about this?

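# Python 2 (htmlentitydefs and unichr became html.entities and chr in Python 3)
import re
from htmlentitydefs import name2codepoint

# Matches decimal references (&#39;) and named entities (&nbsp;)
_entity_re = re.compile(r'&(?:(#)(\d+)|([^;]+));')

def _repl_func(match):
    if match.group(1): # Numeric character reference
        return unichr(int(match.group(2)))
    else: # Named entity, looked up in the standard table
        return unichr(name2codepoint[match.group(3)])

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)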
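
For anyone reading this on Python 3, here is a minimal sketch of the same idea, not a definitive implementation. It assumes the Python 3 module names (html.entities instead of htmlentitydefs, chr instead of unichr). The hexadecimal branch and the fallback that leaves unknown entities untouched are additions for illustration and are not part of the code above.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Decimal (&#39;), hexadecimal (&#x27;) and named (&nbsp;) references.
# The hexadecimal branch is an addition, not in the code posted above.
_entity_re = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')

def _repl_func(match):
    dec, hexa, name = match.groups()
    try:
        if dec:
            return chr(int(dec))
        if hexa:
            return chr(int(hexa, 16))
        return chr(name2codepoint[name])
    except (KeyError, ValueError, OverflowError):
        # Assumption: unknown or out-of-range references are left as-is
        return match.group(0)

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)

print(repr(handle_html_entities("a&nbsp;b&#39;c")))
# The standard library also ships a ready-made helper for this:
print(repr(unescape("a&nbsp;b&#39;c")))
```

In practice html.unescape is probably the simpler choice on Python 3, since it also accepts some references written without the trailing semicolon; the hand-rolled version is mainly useful if you want control over what happens to unknown entities.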

Will McGugan

Mar 18, 2005, 6:53:27 AM

Many thanks!

Will McGugan
