Query advice for local ElasticSearch index with GeoNames data

Rainer Simon

May 20, 2021, 3:44:00 AM

to GeoNames

Dear list,

I'm working on a project that requires geocoding several tens of thousands of placenames. Rather than hammering the GeoNames API with requests, I've built a local ElasticSearch index from the available dump files.

I've tried different query strategies to build a system that returns the most "plausible" (whatever that means...) place as the top hit. But no matter what I do, it's never as good as the response returned from the GeoNames API. (I guess that's a compliment to GeoNames' system design :-)

Just running a "trivial" ElasticSearch fulltext query doesn't work, that's for sure: rivers might rank higher than cities of the same name; a search for 'Paris' would rank 'New Paris, OH' higher than the French capital; places with multi-word names generally rank badly. ('Las Vegas' is a nice example: the city in Nevada would rank waaay behind places that only include 'Las Vegas' somewhere inside their name.)

Can anyone advise on a good query strategy that might get me at least close to the quality of the API? What I'm using at the moment is the following:

1. I require a match for the search phrase in the name OR alternatename fields (with a small boost on name field matches)

2. I boost places with feature codes PPLC (capital -> highest boost), PPLA (1st order admin), PPLA2 (2nd order admin -> lowest boost)

3. I add an additional weight based on the population number
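
For readers who want something concrete, the three steps above could be sketched as an ElasticSearch `function_score` query roughly like the one below. This is only an illustration, not the query actually used in this thread: the field names (`name`, `alternatenames`, `feature_code`, `population`), the boost and weight values, and the helper name `build_query` are all assumptions that would need to be adapted to the actual index mapping.

```python
def build_query(text):
    """Build a search request body for a place-name lookup (illustrative only)."""
    # Step 1: require a phrase match in name OR alternatenames,
    # with a small boost on the name field. Field names are assumed.
    text_match = {
        "bool": {
            "should": [
                {"match_phrase": {"name": {"query": text, "boost": 2.0}}},
                {"match_phrase": {"alternatenames": {"query": text}}},
            ],
            "minimum_should_match": 1,
        }
    }

    functions = [
        # Baseline so that places with no boost and zero population
        # do not end up with a combined function score of 0.
        {"weight": 1.0},
        # Step 2: boost by feature code; the weights are arbitrary guesses.
        {"filter": {"term": {"feature_code": "PPLC"}}, "weight": 3.0},
        {"filter": {"term": {"feature_code": "PPLA"}}, "weight": 2.0},
        {"filter": {"term": {"feature_code": "PPLA2"}}, "weight": 1.5},
        # Step 3: extra weight from population, dampened with log10(1 + x).
        {
            "field_value_factor": {
                "field": "population",
                "modifier": "log1p",
                "missing": 0,
            }
        },
    ]

    return {
        "query": {
            "function_score": {
                "query": text_match,
                "functions": functions,
                "score_mode": "sum",       # add the function values together
                "boost_mode": "multiply",  # then multiply with the text score
            }
        }
    }
```

The returned dictionary can be sent as the JSON body of a `_search` request. Whether `sum` and `multiply` are the right combination modes depends on the data, so they are worth experimenting with.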

This seems to get me reasonable results most of the time. But still nowhere near the original API, which almost always seems to return the most "expected" results.

If anyone has thoughts, I'd be super-grateful for ideas about better query design, or insights into the inner workings of the API.

Cheers,

Rainer

Denis Arnaud

May 21, 2021, 3:43:37 AM

to GeoNames

That would be nice, indeed, to get some hints on how the API works :)

As I had a similar need, I built an open source library to return the most natural match for a given request: https://github.com/trep/opentrep/

That library is used behind the scenes for https://www.transport-search.org/ and https://www2.transport-search.org/ (the code for those sites is itself available as part of the OpenTREP project).

In OpenTREP, the key factor for relevancy of search results is the application of the PageRank algorithm. The graph used for the PageRank algorithm is a proxy of all the transport connections (e.g., flights, trains) between points, when available. In practice, only the transport-related points, and corresponding cities, can be "page-ranked" that way. This already makes it possible to page-rank around 20,000 points and associated cities.

I find the PageRank algorithm to be extremely powerful when you want relevancy of search results, and very easy and quick to apply. The only catch is that we need a graph linking the geographical points: the more the points are linked together, the more important they are, and the more likely it is that you want to see them first when you search for them.
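
To illustrate the idea, here is a minimal power-iteration sketch of PageRank on a tiny, made-up connection graph. It is not the OpenTREP implementation; the function name `pagerank`, the damping factor, the iteration count, and the four-node toy graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each node to the list of nodes it connects to.
    Returns a dict mapping node -> rank, with ranks summing to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its rank to every target.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share

        rank = new_rank

    return rank


# Made-up toy graph: one well-connected hub and three other points.
toy_links = {
    "HUB": ["A", "B", "C"],
    "A": ["HUB", "B"],
    "B": ["HUB", "A"],
    "C": ["HUB"],
}
ranks = pagerank(toy_links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the hub receives links from every other node, so it comes first in `ordered`, while the point reachable only from the hub comes last. A rank computed this way could then be used as an extra weight on search results, in the same way as population.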

The data used by OpenTREP are Open Travel Data (OPTD): https://github.com/opentraveldata/opentraveldata/ . For the geographical points, GeoNames is one of the main sources. The page-ranked transport-related points are available as a CSV file: https://github.com/opentraveldata/opentraveldata/blob/master/opentraveldata/ref_airport_pageranked.csv

As all of these projects are open source, contributions are welcome :)

Kind regards

Denis
