I'm working on a project that requires geocoding tens of thousands of placenames. Rather than hammering the GeoNames API with requests, I've built a local ElasticSearch index from the available dump files.
I've tried different query strategies to build a system that returns the most "plausible" (whatever that means...) place as the top hit. But no matter what I do, it's never as good as the response returned from the GeoNames API. (I guess that's a compliment to GeoNames' system design :-)
Just running a "trivial" ElasticSearch fulltext query doesn't work, that's for sure: rivers might rank higher than cities of the same name; a search for 'Paris' would rank 'New Paris, OH' higher than the French capital; multi-word place names generally rank badly. ('Las Vegas' is a nice example: the city in Nevada would rank waaay behind places that only include 'Las Vegas' somewhere inside their name.)
Can anyone advise on a good query strategy that might get me at least close to the quality of the API? What I'm using at the moment is the following:
1. I require a match for the search phrase in the name OR alternatename fields (with a small boost on name field matches)
2. I boost places with feature codes PPLC (capital -> highest boost), PPLA (1st order admin), PPLA2 (2nd order admin -> lowest boost)
3. I add an additional weight based on the population number
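
To make the three steps concrete, here is a rough sketch of how they could be written as a single `function_score` query body. This is not my exact query, just an illustration: the field names `feature_code` and `population`, the use of `match_phrase`, and all boost and weight values are assumptions, and `feature_code` is assumed to be mapped as a keyword field. The helper name `build_place_query` is made up for this example.

```python
import json


def build_place_query(text, name_boost=2.0):
    """Build an Elasticsearch query body for a placename lookup.

    Field names and all numeric weights are illustrative assumptions.
    """
    return {
        "query": {
            "function_score": {
                # Step 1: phrase must match name OR alternatename,
                # with a small boost for matches on the name field.
                "query": {
                    "bool": {
                        "should": [
                            {"match_phrase": {"name": {"query": text, "boost": name_boost}}},
                            {"match_phrase": {"alternatename": {"query": text}}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
                "functions": [
                    # Step 2: graded boosts by feature code (values are guesses).
                    {"filter": {"term": {"feature_code": "PPLC"}}, "weight": 3.0},
                    {"filter": {"term": {"feature_code": "PPLA"}}, "weight": 2.0},
                    {"filter": {"term": {"feature_code": "PPLA2"}}, "weight": 1.5},
                    # Step 3: population as a dampened weight, log10(1 + population).
                    {
                        "field_value_factor": {
                            "field": "population",
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    },
                ],
                # Add up the function values, then add that sum to the text score,
                # so places with population 0 are not scored down to zero.
                "score_mode": "sum",
                "boost_mode": "sum",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_place_query("Las Vegas"), indent=2))
```

The resulting dict can be passed as the request body of a `_search` call. Using `"boost_mode": "sum"` is a deliberate choice in this sketch: with `"multiply"`, a place with population 0 and no matching feature code would end up with a score of zero.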
This seems to get me reasonable results most of the time. But still nowhere near the original API, which almost always seems to return the most "expected" results.
If anyone has thoughts, I'd be super-grateful for ideas about better query design, or insights into the inner workings of the API.