Geonames ID/city name inconsistency

Athanasios Anastasiou

Mar 9, 2023, 6:18:50 AM
to ROR Technical Forum
Hello

I have recently made the transition from GRID to ROR, but upon ingesting the V1.20 dataset I am getting constraint validation errors, at least for a "Washington, D.C." entry.

The problem is with `[].addresses[].geonames_city[].id` 4140963, whose city name is given as "Washington" for some entries and as "Washington, D.C." for others.

My constraint fails because I have a "UNIQUE" constraint on the city "name". This was set up way back when the schema was created using the GRID database, under the assumption that a geonames id should correspond to exactly one unambiguous label.

Unfortunately, this "error" is replicated over a large number of entries which can be reviewed using this simple script: https://gist.github.com/aanastasiou/75ea710b15e1bf9359858a6597262454
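
A minimal sketch of this kind of check is given below. It is not the script from the gist. It assumes the ROR dump has already been loaded as a list of record dicts, and that each address carries a `geonames_city` object with an `id` and a label stored under `city`; the function name and the `name_key` parameter are hypothetical and should be adjusted to the schema actually in use.

```python
from collections import Counter, defaultdict


def find_label_discrepancies(records, name_key="city"):
    """Return {geonames_id: {label: count}} for ids seen with more than one label.

    `records` is assumed to be the parsed ROR dump (a list of dicts).
    `name_key` is an assumption about where the city label lives.
    """
    labels_per_id = defaultdict(Counter)
    for record in records:
        for address in record.get("addresses", []):
            # geonames_city may be missing or null in some records
            geo = address.get("geonames_city") or {}
            geo_id = geo.get("id")
            label = geo.get(name_key)
            if geo_id is not None and label is not None:
                labels_per_id[geo_id][label] += 1
    # Keep only the ids that appear with two or more distinct labels
    return {
        geo_id: dict(counts)
        for geo_id, counts in labels_per_id.items()
        if len(counts) > 1
    }
```

The dump itself could be read with `json.load` before calling the function; any id returned here would violate a "UNIQUE" constraint on the label.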

What I would like to ask is:

1. Are "synonyms" supposed to be allowed in this case? If that is the case, I will have to modify the schema.
2. If synonyms are not allowed, how do we go about handling this? I can add a local preprocessing step to apply the "Washington, D.C." correction, but there might be other misspellings. Is this something that could be corrected at the source perhaps?
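
The local preprocessing step mentioned in point 2 could look roughly like the sketch below. It assumes a hand-maintained mapping from geonames id to the single label to keep, and the same record layout as the dataset (`addresses[].geonames_city` with `id` and a label under `city`). The function name, the mapping, and the `name_key` parameter are all hypothetical, and which label to treat as canonical is left to whoever maintains the mapping.

```python
import copy


def normalise_city_labels(records, canonical, name_key="city"):
    """Return a copy of `records` with one label per geonames id.

    `canonical` maps geonames id -> label to keep (hand-maintained).
    `name_key` is an assumption about where the city label lives.
    """
    fixed = copy.deepcopy(records)  # leave the loaded dump untouched
    for record in fixed:
        for address in record.get("addresses", []):
            geo = address.get("geonames_city") or {}
            geo_id = geo.get("id")
            if geo_id in canonical:
                geo[name_key] = canonical[geo_id]
    return fixed


# Illustrative mapping only: one chosen label for the id discussed above.
CANONICAL_LABELS = {4140963: "Washington"}
```

Ids that are not in the mapping are passed through unchanged, so the mapping only needs to list the ids that actually show more than one label.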

All the best
Athanasios Anastasiou

Liz Krznarich

Mar 9, 2023, 4:10:16 PM
to ROR Technical Forum, athana...@gmail.com
Hi Athanasios,

Thanks for raising this. The name for Geonames ID 4140963 changed from Washington D.C. to Washington in the Geonames record in April 2020 (https://www.geonames.org/4140963/washington.html). This means that records created or last updated after that date have Washington in addresses[0].geonames_city.name, while those created or updated before that date have Washington D.C.

Due to ongoing issues with the Geonames API (which we use for creating, updating and validating location information during our release process), we don't currently synchronize location information with Geonames across the entire ROR dataset on each release, so you may find variability in cases like this where the name was changed in Geonames. We do plan to migrate from using the Geonames API to the data dump, which will allow us to regularly synchronize place names with Geonames across the entire dataset. I would expect that migration to happen later this year (GitHub issue https://github.com/ror-community/ror-roadmap/issues/150).

Cheers,
Liz

---
Liz Krznarich
Technical Lead
kerz-NAR-itch | she/her | US central time (GMT-6)

Athanasios Anastasiou

Mar 10, 2023, 4:33:56 AM
to Liz Krznarich, ROR Technical Forum
Hello Liz

Thank you very much for such a quick response.

"Why they changed it? I can't say,
People liked it better this way."

I am a little sorry that I did not probe this error more deeply before contacting ROR. I am impressed that, since ~2018 and the days of GRID, I have only now come across this eventuality.

Is ror-roadmap/Issues a better place for reporting these bugs than this mailing list?

Finally, regarding the error itself: I note from the issue that "This API regularly returns different results for the same request...". Does this mean that even if I wrote short local scripts to pre-process the JSON file before deploying it, they might be invalid by the next release?

Liz Krznarich

Mar 10, 2023, 2:07:23 PM
to ROR Technical Forum, athana...@gmail.com, ROR Technical Forum, Liz Krznarich

Hi Athanasios,

We've noticed name changes to at least 1 Geonames ID just about every time we do a release, which is more than I'd expect, but then again, that's why place IDs exist!

I'm not sure how often GRID synced its Geonames information or what their exact process was, so it's possible this issue did not exist in their dataset. The final GRID release in Sept 2021 still lists the name of 4140963 as Washington D.C., so I'm not sure that they synced data for existing Geonames IDs at all (maybe they just ingested new records as needed). At any rate, as mentioned, we'll solve this problem later this year as part of a larger package of work to update our validation tools.

As for the Geonames API issues, you may or may not run into the same problems. What we've seen is that, for some time (as in days or weeks, not minutes) after a change is made to a given ID, API requests to that ID will sometimes return the previous name and sometimes the current name. I suspect it could be related to caching in edge nodes of a CDN service by Geonames, but we've not received answers to our support requests about this so I can't say for sure.

For bug reports, if it's clear that a particular behavior is a bug, please do open an issue in ror-roadmap (https://github.com/ror-community/ror-roadmap/issues). However, it's also fine to start a thread here, especially if you're not sure whether a bug exists.

Cheers,
Liz

Arthur Smith

Mar 10, 2023, 3:30:41 PM
to ror-...@ror.org
At some point GRID decided to change "Czech Republic" to "Czechia" in their "country" field; there may have been other country name changes like that, but I noticed that one. However, it was changed for all records at once, so at least it was consistent.

  Arthur
--
You received this message because you are subscribed to the Google Groups "ROR Technical Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ror-tech+u...@ror.org.
To view this discussion on the web visit https://groups.google.com/a/ror.org/d/msgid/ror-tech/d7941ea4-10d3-4880-aae5-385a0e1337abn%40ror.org.


Athanasios Anastasiou

Mar 14, 2023, 7:08:29 AM
to Arthur Smith, Liz Krznarich, ror-...@ror.org
Hello both

Thanks for the information. Washington was just the tip of the iceberg here :/
In the attached, the first number is the geonames_city.id and after the comma we get N entries of the form (geonames_city.city:count).

The idea here was to use the counts to automate the editing, but this rule does not seem to be robust.
(For example: assume that the "old" label has more samples in the dataset, then replace the low-count city label with the high-count one.)
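
The count-based rule can be sketched as follows, mainly to make its weak spots visible. The input is assumed to be a dict of the form {geonames id: {label: count}}, mirroring the layout of the attached file; the function name is hypothetical. Ties are set aside for manual review rather than resolved, and a majority label is not guaranteed to be the one Geonames currently uses, which is one reason the rule is not robust.

```python
from collections import Counter


def majority_labels(discrepancies):
    """Pick the most frequent label per geonames id.

    `discrepancies` is assumed to look like {geonames_id: {label: count}}.
    Returns (canonical, ambiguous): a dict id -> majority label, and a list
    of ids whose two top labels are tied and need a manual decision.
    """
    canonical = {}
    ambiguous = []
    for geo_id, counts in discrepancies.items():
        ranked = Counter(counts).most_common()  # sorted by count, descending
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            ambiguous.append(geo_id)  # tie: the counts cannot decide
        else:
            canonical[geo_id] = ranked[0][0]
    return canonical, ambiguous
```

The resulting `canonical` dict can then be reviewed by hand before it is applied to the dataset.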

There are quite a few entries which, for the moment, I am correcting "manually" as a preprocessing step to ingesting the ROR dataset.
I am simply editing the txt file to leave one entry next to the ID and I am re-using a cached index of where these changes should be applied.

Next stop: Country names! (Really not looking forward to it...)

All the best
Athanasios Anastasiou


v1.20-2023-02-28-ror-data_geonames_city_discrepancies.txt