Simple character translation problem

David Eppstein

Sep 19, 2001, 12:08:13 PM

I have: user input text, in Mac character set encoding
I want: ASCII with HTML-entities coding the accented characters.

E.g. "café" should become "caf&eacute;".
Is there code already in Python to do this easily?
--
David Eppstein UC Irvine Dept. of Information & Computer Science
epps...@ics.uci.edu http://www.ics.uci.edu/~eppstein/

Martin von Loewis

Sep 21, 2001, 11:08:29 AM

David Eppstein <epps...@ics.uci.edu> writes:

> I have: user input text, in Mac character set encoding
> I want: ASCII with HTML-entities coding the accented characters.
>
> E.g. "café" should become "caf&eacute;".
> Is there code already in Python to do this easily?

First, you should convert the string into a Unicode string, using the
proper codec. Then, there is an easy approach and a difficult one. The
easy one is to convert all non-ASCII characters (i.e. those with
ordinals > 127) into character entities, i.e. using the &#digits;
notation.
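
A rough sketch of the easy approach, assuming present-day Python 3 rather than the Python of this thread: the `mac_roman` codec decodes the Mac bytes, and the `xmlcharrefreplace` error handler (added to Python after this exchange) produces the &#digits; notation. The function name `mac_to_charrefs` is made up for illustration.

```python
def mac_to_charrefs(raw):
    """Decode Mac Roman bytes; return ASCII text with &#digits; references."""
    text = raw.decode("mac_roman")  # bytes -> Unicode string
    # Every character with ordinal > 127 becomes a decimal character reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(mac_to_charrefs(b"caf\x8e"))  # 0x8E is e-acute in Mac Roman; prints caf&#233;
```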

Or, you could try to use external entities where possible. For that,
please have a look at htmlentitydefs.entitydefs. Using that is not
straightforward: you have to invert the dictionary, and you have to
convert the keys into Unicode keys. For the keys that are
single-character strings (e.g. '\306'), you can use the Unicode
character with the same ordinal. For characters above 255, you have to
convert between the character entity and a Unicode character.
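
The inversion described above might look roughly like this in Python 3 (an assumption; the thread itself concerns the older `htmlentitydefs` module). `html.entities.name2codepoint` already maps entity names to Unicode ordinals, so the key-conversion step largely disappears. `to_named_entities` is a hypothetical helper name, and falling back to &#digits; for characters without a named entity is one possible choice, not the only one.

```python
from html.entities import name2codepoint

# Invert the name -> code point table so it can be looked up by character.
_codepoint_to_name = {cp: name for name, cp in name2codepoint.items()}


def to_named_entities(text):
    """Replace non-ASCII characters with named entities where one exists."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                      # plain ASCII passes through
        elif cp in _codepoint_to_name:
            out.append("&%s;" % _codepoint_to_name[cp])
        else:
            out.append("&#%d;" % cp)            # no named entity: numeric form
    return "".join(out)


print(to_named_entities("caf\u00e9"))  # prints caf&eacute;
```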

If you can come up with patches to htmlentitydefs that make use of
Unicode, please do so and submit them to sf.net/projects/python.

Regards,
Martin

Steffen Ries

Sep 22, 2001, 9:41:17 AM

Martin von Loewis <loe...@informatik.hu-berlin.de> writes:

> David Eppstein <epps...@ics.uci.edu> writes:
>
> > I have: user input text, in Mac character set encoding
> > I want: ASCII with HTML-entities coding the accented characters.
> >
> > E.g. "café" should become "caf&eacute;".
> > Is there code already in Python to do this easily?

...

> Or, you could try to use external entities where possible. For that,
> please have a look at htmlentitydefs.entitydefs. Using that is not
> straightforward: you have to invert the dictionary, and you have to
> convert the keys into Unicode keys. For the keys that are
> single-character strings (e.g. '\306'), you can use the Unicode
> character with the same ordinal. For characters above 255, you have to
> convert between the character entity and a Unicode character.

Ok, I'll bite:
--8<--
_u2html = {} # unicode to html mapping

def _make_u2html():
    from htmlentitydefs import entitydefs

    def c2u(c):
        if len(c) == 1:
            return unicode(c, 'latin1')
        if c.startswith('&#'):
            return unichr(int(c[2:-1]))

    for entity,val in entitydefs.items():
        _u2html[c2u(val)] = "&%s;" % entity

def htmlentityEncode(s):
    """
    convert unicode string s to ascii, replace non-ascii characters with
    html entitydef or "?"
    """

    if not _u2html:
        _make_u2html()

    l = [_u2html.get(c, c) for c in s]

    return ''.join(l).encode('ascii', 'replace')
--8<--

>>> htmlentityEncode(u"café")
'caf&eacute;'
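
For the original question (Mac-encoded input in, entity-coded ASCII out), a comparable end-to-end sketch in Python 3 could register a custom codec error handler. This is an alternative technique, not the code above: it assumes Python 3, the handler name "htmlentityreplace" and the function `mac_to_html` are invented for illustration, and unlike `htmlentityEncode` it leaves ASCII characters such as "&" and "<" untouched and uses &#digits; instead of "?" for characters without a named entity.

```python
import codecs
from html.entities import codepoint2name


def _named_entity_replace(err):
    """Error handler: turn unencodable characters into HTML entities."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    pieces = []
    for ch in err.object[err.start:err.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity; fall back to a numeric reference.
        pieces.append("&%s;" % name if name else "&#%d;" % ord(ch))
    return "".join(pieces), err.end


# "htmlentityreplace" is a hypothetical handler name chosen for this sketch.
codecs.register_error("htmlentityreplace", _named_entity_replace)


def mac_to_html(raw):
    """Decode Mac Roman bytes and return entity-coded ASCII text."""
    text = raw.decode("mac_roman")
    return text.encode("ascii", "htmlentityreplace").decode("ascii")


print(mac_to_html(b"caf\x8e"))  # prints caf&eacute;
```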

/steffen
--
steffe...@sympatico.ca <> Gravity is a myth -- the Earth sucks!