non-asciinames in cities1000.txt

Andrew Dalke

Nov 8, 2009, 4:36:17 PM
to GeoNames
The documentation at
http://download.geonames.org/export/dump/readme.txt

says
> The main 'geoname' table has the following fields :
...
> asciiname : name of geographical point in plain ascii characters, varchar(200)

I need the ASCII name because I want to match the official name of
"Doña Ana, NM" and the often-used name of "Dona Ana, NM" and figured
that using this field was best, because "Dona Ana" does not exist as
an alternate name.

(I could be wrong. The geonames server lets me search for "Dona Ana"
in the US and it finds Doña Ana, so I figured using asciiname was part
of the correct search strategy.)

Scanning through cities1000.txt I found 5 cities which contained non-
ASCII characters in that field:

106909 '\xe1\xb8\x90uba'
6943439 '\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9'
1567723 'Song C\xe1\xba\xa7u'
1586288 'C\xe1\xba\xa7n Gi\xe1\xbb\x9d'
1586296 'C\xe1\xba\xa7n Duoc'

Those should be

106909 likely Duba (see http://en.wikipedia.org/wiki/Duba )
6943439 ... I have no idea ...
1567723 Song Cau
1586288 Can Gio
1586296 Can Giuoc

I tried to edit the asciiname field by going to a record, but I saw no
way to do that. I assume it is a derived field and not something which
is externally editable.
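
For reference, the scan described above can be sketched in a few lines of Python 3. This is a minimal sketch, not the script used for this post: it assumes the dump is tab-separated UTF-8 with asciiname as the third field (index 2), following the field order in the readme, and the function name find_non_ascii_asciinames is made up for illustration.

```python
ASCIINAME_COLUMN = 2  # assumption: geonameid, name, asciiname, ... (readme field order)


def find_non_ascii_asciinames(lines):
    """Yield (geonameid, asciiname) for rows whose asciiname is not pure ASCII.

    `lines` is any iterable of already-decoded text lines, e.g. an open file.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= ASCIINAME_COLUMN:
            continue  # skip blank or malformed rows
        asciiname = fields[ASCIINAME_COLUMN]
        if any(ord(ch) > 127 for ch in asciiname):
            yield fields[0], asciiname


# Possible usage against the real dump (path is an assumption):
#
#     with open("cities1000.txt", encoding="utf-8") as dump:
#         for geonameid, asciiname in find_non_ascii_asciinames(dump):
#             print(geonameid, ascii(asciiname))
```

Printing with ascii() keeps the output readable on terminals that cannot display the characters.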

Best regards,

-- Andrew Dalke <da...@dalkescientific.com>

Denis Arnaud

Nov 11, 2009, 3:31:18 PM
to GeoNames
Hello,

On Nov 8, 10:36 pm, Andrew Dalke <andrewda...@gmail.com> wrote:
> Scanning through cities1000.txt I found 5 cities which contained non-
> ASCII characters in that field:
>
> 6943439 '\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9'
>
> Those should be
> 6943439  ... I have no idea ...

I guess that record 6943439 is encoded in UTF8. If so, it would
correspond to:
كاف الجاع

In case you want to decode it yourself, you can use the following
small Python program: http://codepad.org/QZGa9xan
(for those interested, the C++ is also easy to write)
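
For readers without access to the codepad link, the decoding step can be sketched as follows in Python 3. This is not the linked program, only an assumed equivalent: it treats the escaped bytes from record 6943439 as UTF-8 and lists the resulting code points.

```python
import unicodedata

# The asciiname bytes reported for record 6943439, copied from the post above.
raw = b"\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9"

# Interpret the bytes as UTF-8; this raises UnicodeDecodeError if they are not.
decoded = raw.decode("utf-8")
print(decoded)

# Show each character's code point and Unicode name.
for ch in decoded:
    print("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
```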

Best Regards

Denis

Andrew Dalke

Nov 11, 2009, 4:10:23 PM
to geon...@googlegroups.com
Hi Denis!

On Nov 11, 2009, at 9:31 PM, Denis Arnaud wrote:
> On Nov 8, 10:36 pm, Andrew Dalke <andrewda...@gmail.com> wrote:
>> Those should be
>> 6943439 ... I have no idea ...
>
> I guess that record 6943439 is encoded in UTF8. If so, it would
> correspond to:
> كاف الجاع


Since I have a terminal with UTF-8 encoding, and since that output
comes direct from my Python program, this also works.

>>> print '\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9'
كاف الجاع
>>>

I was about to do

unicode('\xd9\x83\xd8\xa7\xd9\x81 \xd8\xa7\xd9\x84\xd8\xac\xd8\xa7\xd8\xb9', "utf").encode("utf8")

when I realized that was rather pointless.

In any case, the "I have no idea" means that I don't know what goes
in the ASCII field for that name. The others I could handle by eye,
picking the closest looking ASCII character or by doing:
import unicodedata

# Get the raw UTF-8 bytes
name = field[ascii_name_column]
# Convert to Unicode
name = unicode(name, "utf8")
# Normalize in a way that you'll have to consult
# the Unicode references for, then convert to ASCII.
# Short version: decompose accented characters into base
# characters plus combining marks, then drop the combining marks.
name = unicodedata.normalize("NFKD", name).encode("ASCII", "ignore")

but with pure Arabic this ended up as " ", which is the space in the
middle.
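
The behaviour described above can be reproduced with a small Python 3 sketch. The helper name strip_diacritics is hypothetical, and the sample names are taken from this thread; the approach is the same NFKD-then-ASCII step, written for Python 3 strings.

```python
import unicodedata


def strip_diacritics(name):
    """Decompose with NFKD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for sample in ["Do\u00f1a Ana", "C\u1ea7n Gi\u1edd", "\u1e10uba",
               "\u0643\u0627\u0641 \u0627\u0644\u062c\u0627\u0639"]:
    # repr() makes a result consisting only of a space visible.
    print(ascii(sample), "->", repr(strip_diacritics(sample)))
```

Latin names with accents reduce to their base letters, while the Arabic letters have no ASCII decomposition and are dropped, leaving only the space. Letters that carry no decomposition, such as the Vietnamese "đ", are also dropped rather than mapped, so the result is worth checking by eye.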


Andrew
da...@dalkescientific.com


Denis Arnaud

Nov 12, 2009, 6:25:35 AM
to GeoNames
On Nov 11, 10:10 pm, Andrew Dalke <da...@dalkescientific.com> wrote:
>
> In any case, the "I have no idea" means that I don't know what goes  
> in the ASCII field for that name. The others I could handle by eye,  
> picking the closest looking ASCII character or by doing:
>          # Get the raw UTF-8 bytes
>          name = field[ascii_name_column]
>          # Convert to Unicode
>          name = unicode(name, "utf8")
>          # Normalize in a way that you'll have to consult
>          # the Unicode references for, then convert to ASCII.
>          # short version: decompose accented characters into base characters
>          # plus combining marks, then remove the combining marks.
>          name = unicodedata.normalize("NFKD", name).encode("ASCII",  "ignore")
>

I would additionally transliterate the string:
http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines
[There are also some demos on the ICU project site. For instance, for
the transliteration: http://demo.icu-project.org/icu-bin/translit (and
to check the language identifier: http://unicode.org/cldr/utility/languageid.jsp)]

With the "Any->Latin" transliterator, it yields "kạf ạljạʿ".
You can then go on to normalise the string.
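
The "go on normalising" step can be sketched with the Python 3 standard library alone. The transliteration itself is not performed here; the input is the "Any->Latin" result quoted above, written with escapes, and the code points chosen for it (U+1EA1 for the dotted a, U+02BF for the final half ring) are an assumption about what the transliterator emits. The helper name fold_transliterated is hypothetical.

```python
import unicodedata


def fold_transliterated(text):
    """Return (ascii_text, leftovers) for an already transliterated string.

    Combining marks are removed after NFKD decomposition; any remaining
    non-ASCII characters are reported separately instead of silently lost.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    leftovers = sorted({ch for ch in without_marks if ord(ch) > 127})
    ascii_text = "".join(ch for ch in without_marks if ord(ch) < 128)
    return ascii_text, leftovers


# Assumed code points for the transliterated form quoted in the post.
transliterated = "k\u1ea1f \u1ea1lj\u1ea1\u02bf"
ascii_text, leftovers = fold_transliterated(transliterated)
print(repr(ascii_text), [ascii(ch) for ch in leftovers])
```

Reporting the leftovers makes it visible that the trailing half ring has no ASCII counterpart, so whether to drop it or map it to something like an apostrophe remains a decision for whoever fills in the field.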

Hope it helps

Denis
