escaped unicode characters as aliases

Andrew Dalke

Nov 7, 2009, 6:09:29 PM
to GeoNames
The following records from alternateNames.txt contain escaped Unicode
characters in the alternate name field:

2296413 788578 mk Lubani&#353ta 1
58902 113646 fa Tabr&#299z 1
2181093 2540850 ar Warzāz&#257t
2181103 6533373 fa Meydān-e Em&#257m Khomein&#299

of which these

Tabr&#299z
Warzaz&#257t
Warzāz&#257t

are also listed in cities1000.txt as alternatenames.
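
A check like the one behind the list above can be sketched in Python. This is only a minimal sketch, not the script actually used: it assumes alternateNames.txt is tab-separated UTF-8 with the alternate name in the fourth column, and the helper name find_escaped_names and the control row are made up for illustration.

```python
import re

# Matches a decimal or hexadecimal numeric character reference such as
# "&#353;" or "&#x161;". The trailing semicolon is optional because the
# samples in this post appear without one.
CHARREF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);?")


def find_escaped_names(lines):
    """Yield (alternateNameId, geonameid, language, name) for suspect rows.

    Assumes tab-separated rows with the alternate name in the fourth column.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and CHARREF.search(fields[3]):
            yield tuple(fields[:4])


sample = [
    "2296413\t788578\tmk\tLubani&#353ta\t1\n",
    "58902\t113646\tfa\tTabr&#299z\t1\n",
    "1\t2\ten\tPlain Name\t\n",  # made-up control row, not from the data
]
hits = list(find_escaped_names(sample))
for hit in hits:
    print(hit)
```

To scan the real file, an open file object (for example open("alternateNames.txt", encoding="utf-8")) can be passed in place of the sample list.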

-- Andrew Dalke <da...@dalkescientific.com>

Marc Wick

Nov 8, 2009, 1:34:44 AM
to geon...@googlegroups.com
Please don't hesitate to correct it:
http://www.geonames.org/manual.html

Marc

Andrew Dalke

Nov 8, 2009, 10:40:50 AM
to GeoNames
On Nov 8, 7:34 am, Marc Wick <m...@geonames.org> wrote:
> Please don't hesitate to correct it: http://www.geonames.org/manual.html

I had not known that was an option, as I've been working with
the data files directly.

I tried to do that and found the first difficulty was finding the
records those fields referred to. All I had was:

2296413 788578 mk Lubani&#353ta 1
58902 113646 fa Tabr&#299z 1
2181093 2540850 ar Warzāz&#257t
2181103 6533373 fa Meydān-e Em&#257m Khomein&#299

Neither the main geonames.org page nor the advanced search page allows
lookup by geonameid; both treat 788578 as some sort of postal code.
I could not find a match in cities1000.txt for that geonameid (though
I could have searched in allCountries.txt), and searching for
"Lubani&#353ta" returned some place in Mexico, which made no sense.
(The language code is mk, for Macedonian.)

I used Google and found out that records are accessible through URLs
like
http://www.geonames.org/6533373

I used that to go to each record, and in every case the alternate name
string is displayed correctly.

This suggests some sort of Unicode encoding problem on import or
export, likely one that has since been fixed but has left remnants
somewhere in the system.
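
If the remnants are HTML numeric character references, as the samples suggest, they can be decoded back to the intended characters. The following is a minimal sketch using Python's standard html.unescape, which accepts numeric references with or without the trailing semicolon. That the stored strings are HTML references at all is an assumption drawn from their appearance, not something confirmed about the GeoNames pipeline.

```python
import html

# Alternate names exactly as they appear in the data file excerpt.
samples = [
    "Lubani&#353ta",
    "Tabr&#299z",
    "Warzāz&#257t",
    "Meydān-e Em&#257m Khomein&#299",
]

# html.unescape turns each numeric reference into the code point it names
# and leaves characters that are already plain Unicode untouched.
decoded = [html.unescape(s) for s in samples]

for raw, fixed in zip(samples, decoded):
    print(f"{raw!r} -> {fixed!r}")
```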

I do not think I am able to edit and fix these four cases.

Best regards,

-- Andrew Dalke <da...@dalkescientific.com>

Marc Wick

Nov 8, 2009, 10:51:35 AM
to geon...@googlegroups.com
It is an html encoding and it will therefore display correctly on an
html page.
All that needed to be done was to click on the edit link and save again.
I did it for you.

Best

Marc

Andrew Dalke

Nov 8, 2009, 2:10:40 PM
to GeoNames
On Nov 8, 4:51 pm, Marc Wick <m...@geonames.org> wrote:
> It is an html encoding and it will therefore display correctly on an
> html page.

But it shouldn't, unless the source data is supposed to be HTML
encoded - which it isn't.

Take for example geonameid 6619831, which is for the Victoria & Albert
Museum. Go to that page and view source and you'll see an error in the
HTML:

<meta name="description" content="Victoria & Albert Museum England
Kensington and Chelsea, United Kingdom, museum" />

This is a bug. The '&' should be &amp; and you can see that the title
field is properly escaped:

<title>Victoria &amp; Albert Museum, United Kingdom</title>

Potentially this opens the geonames server up to various sorts of
well-known attacks.
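
For comparison, escaping a value before it goes into an attribute might look like the sketch below. It uses Python's standard html.escape purely for illustration; the helper name meta_description and the hostile input are hypothetical, and nothing here says how the geonames pages are actually generated.

```python
import html


def meta_description(text):
    """Build a meta description tag with the content attribute escaped."""
    # quote=True also escapes double and single quotes, which matters
    # inside an attribute value.
    escaped = html.escape(text, quote=True)
    return '<meta name="description" content="{}" />'.format(escaped)


tag = meta_description("Victoria & Albert Museum")
# Hypothetical hostile input that tries to close the attribute early.
hostile = meta_description('"><script>')

print(tag)
print(hostile)
```

With the escaping in place, the quote and angle brackets in the hostile value stay inside the attribute as text instead of ending it.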

I tried to test this by adding a place named '"> testing' at the
stadium in Boden. That gives me the error message

error while saving:
Cannot parse ' country:SE names:" names:testing fcode:STDM( fc:S )':
Lexical error at line 1, column 54. Encountered: after : "\"
names:testing fcode:STDM( fc:S )"

That implies that special characters aren't being properly escaped
before going into the database. If I used two double quotes, it worked,
that is

""> testing

I'm disconcerted about that. Perhaps there is also a possible
injection attack on the database? I also can't figure out where the
record went so I can delete it. (Or more properly, use the correct
name.) Please feel free to delete the record if it was created, or for
that matter wipe my account if this was too improper.
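
On the database side, the usual defence is to pass user-supplied names as bound parameters instead of splicing them into a query string. The following is a minimal sketch with Python's sqlite3 module. The table and column names are hypothetical, and nothing in this thread says the server uses SQLite, or even SQL, at the step that produced the error above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, for illustration only.
conn.execute("CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT, fcode TEXT)")


def add_place(conn, name, fcode):
    # The "?" placeholders let the driver handle quoting, so quotes in
    # the name are stored as data and never parsed as query syntax.
    conn.execute("INSERT INTO place (name, fcode) VALUES (?, ?)", (name, fcode))


add_place(conn, '"> testing', "STDM")
stored = conn.execute(
    "SELECT name FROM place WHERE fcode = ?", ("STDM",)
).fetchone()[0]
print(stored)
```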


> All that needed to be done was to click on the edit link and save again.
> I did it for you.

Thanks. But re-saving shouldn't have been necessary if data
conversion were done correctly all the way through, which is why I
didn't consider doing it.

-- Andrew Dalke <da...@dalkescientific.com>