On Nov 11, 2009, at 9:31 PM, Denis Arnaud wrote:
> On Nov 8, 10:36 pm, Andrew Dalke <andrewda...@gmail.com> wrote:
>> Those should be
>> 6943439 ... I have no idea ...
>
> I guess that record 6943439 is encoded in UTF8. If so, it would
> correspond to:
> كاف الجاع
Since my terminal uses UTF-8 encoding, and since that output comes
directly from my Python program, printing the raw bytes also works:
>>> print '\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9'
كاف الجاع
>>>
I was about to do
unicode('\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9', "utf").encode("utf8")
when I realized that was rather pointless: decoding UTF-8 and
re-encoding it as UTF-8 just gives back the same bytes.
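A quick check of that round trip, as a minimal sketch. It uses bytes
literals and .decode() rather than unicode() so that it should run on
both Python 2.6+ and Python 3; the variable names are only for
illustration.

```python
# The UTF-8 bytes of record 6943439's name, as printed above.
raw = b"\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9"

# Decode to Unicode: 8 Arabic letters plus one space.
text = raw.decode("utf8")

# Re-encode to UTF-8: this reproduces the original byte string.
round_trip = text.encode("utf8")

print(round_trip == raw)
```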
In any case, the "I have no idea" means that I don't know what goes
in the ASCII field for that name. The others I could handle either by
eye, picking the closest-looking ASCII character, or by doing:
import unicodedata

# Get the raw UTF-8 bytes
name = field[ascii_name_column]
# Convert to Unicode
name = unicode(name, "utf8")
# Normalize (see the Unicode references for the details), then
# convert to ASCII. Short version: NFKD splits accented characters
# into a base character plus combining marks, and encoding with
# "ignore" then drops the combining marks.
name = unicodedata.normalize("NFKD", name).encode("ASCII", "ignore")
but with pure Arabic every letter is dropped, since none of them
decomposes to an ASCII base character. The result is " ", which is
just the space in the middle.
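To make the difference concrete, here is a minimal sketch comparing an
accented Latin name with the Arabic one. The helper name to_ascii and
the "Orléans" sample are made up for illustration and are not taken
from the data file. It is written with bytes literals so that it should
run on both Python 2.6+ and Python 3.

```python
import unicodedata

def to_ascii(raw_utf8):
    """Hypothetical helper: decode UTF-8 bytes, apply NFKD, drop non-ASCII."""
    text = raw_utf8.decode("utf8")
    # NFKD splits e.g. U+00E9 into "e" followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" silently discards every character that is not ASCII.
    return decomposed.encode("ascii", "ignore")

# Accented Latin: the combining mark is dropped and the base letter is kept.
latin = to_ascii(b"Orl\xc3\xa9ans")

# Pure Arabic: no letter has an ASCII base, so only the space is left.
arabic = to_ascii(
    b"\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9")

print(repr(latin))
print(repr(arabic))
```

So this trick only helps for scripts whose letters decompose to ASCII
base characters; names in other scripts need a real transliteration.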
Andrew
da...@dalkescientific.com