Place matching in record search queries

Justin York

Apr 21, 2016, 12:25:24 PM

to root...@googlegroups.com

I'm not a search expert. I wouldn't be able to do anything fancy without ElasticSearch. So the idea of handling a search query that includes places is daunting. How do FamilySearch and Ancestry handle places in their record search queries?

When starting a new search at Ancestry they prompt you to select a normalized place.

If you select a normalized value, the results page gives you an advanced place filter with these matching options: Broad; Country; State and adjacent states; State; County and adjacent counties; County; and Exact.

When set to Broad there are 1,103,301,743 results.

When set to State there are 444,576,313 results.

When set to Exact there are 1,751,043 results.

Clearly that filter is working. And it's fast too.

Are all events in records normalized and inserted into a geospatial index? Is there any string matching of places going on?

If I don't select a normalized value when searching then I don't get the fancy place filter. I just get the options of Broad and Exact. When searching for "Amsterdam, Montomgery, NY" I get 669,259,292 Broad matches and 1,751,043 Exact matches. I'm not sure why the Broad matches are different but the Exact matches are the same as above. It seems they are attempting to normalize the input query even when the user doesn't select a normalized value.

I remember hearing once that FamilySearch normalizes place values on the fly when answering search queries. Is it true? I don't understand how that could possibly be true unless they're referring solely to the user input.

How would you design a search system that matches against millions of records? Is it possible to do without normalizing the places in the records?

Luther Tychonievich

Apr 21, 2016, 2:44:01 PM

to root...@googlegroups.com

I believe they use a place authority, essentially a big database of place name equivalences. FamilySearch has an API to theirs, but it is marked as deprecated.

https://familysearch.org/developers/docs/api/resources

Beyond that I don't know anything about the fancy guts of the search engine...
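A minimal sketch of what a place authority lookup might look like, assuming a tiny in-memory table. The place IDs, the variant lists, and the `fold` and `normalize_place` helpers are all hypothetical; this is not the FamilySearch API or data.

```python
import re

# Hypothetical authority table: each normalized place ID maps to the
# name variants that should resolve to it. Real authorities are far larger.
PLACE_VARIANTS = {
    "se-goteborg": ["Göteborg", "Gothenburg", "Goteborg"],
    "us-ny-montgomery-amsterdam": [
        "Amsterdam, Montgomery, New York",
        "Amsterdam, Montgomery, NY",
    ],
}


def fold(name: str) -> str:
    """Casefold and collapse punctuation/whitespace so lookups are forgiving."""
    return re.sub(r"[\W_]+", " ", name.casefold()).strip()


# Invert the table once, at load time: folded variant -> place ID.
LOOKUP = {
    fold(variant): place_id
    for place_id, variants in PLACE_VARIANTS.items()
    for variant in variants
}


def normalize_place(text: str):
    """Return the place ID for a free-text place, or None if it is unknown."""
    return LOOKUP.get(fold(text))


print(normalize_place("GOTHENBURG"))
print(normalize_place("amsterdam  montgomery ny"))
print(normalize_place("Goatburg"))
```

The same function could be applied to user input at query time and to record values at index time, so both sides end up compared by place ID rather than by string. Names that are not in the table, such as a phonetic misspelling, simply fail to resolve.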

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben Brumfield

Apr 21, 2016, 2:52:59 PM

to root...@googlegroups.com

For FreeREG2, we normalized church place names and hand-geolocated them all using GENUKI and other sources. We make two passes for "nearby" searches, first requiring a user to pick a place in their search form input, then fetching nearby places which contain records (hooray for MongoDB's 2dsphere index type!), then passing the lot of place IDs into the actual record query. (Each place also has a list of alternate place names for places with both English and Welsh names which may differ radically, but those toponyms are still normalized.)

That worked very well for a limited set of places (50,000ish, though I think fewer than half have records) and a record set of around 35 million.
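The two-pass "nearby" search described above can be sketched roughly as follows. This is a plain-Python stand-in, not the FreeREG2 code: the haversine loop takes the place of a geospatial index query such as one against MongoDB's 2dsphere index, and the place IDs, coordinates, and records are invented for illustration.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical data: place ID -> (latitude, longitude).
PLACES = {
    "p1": (53.00, -1.00),
    "p2": (53.05, -1.00),  # roughly 5.6 km north of p1
    "p3": (54.00, -1.00),  # roughly 111 km north of p1
    "p4": (53.01, -1.00),  # close to p1 but holds no records
}
RECORDS = [
    {"id": 1, "place_id": "p1", "surname": "Smith"},
    {"id": 2, "place_id": "p2", "surname": "Smith"},
    {"id": 3, "place_id": "p3", "surname": "Smith"},
    {"id": 4, "place_id": "p2", "surname": "Jones"},
]


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearby_place_ids(center_id, radius_km):
    """Pass 1: places within the radius that actually contain records."""
    places_with_records = {r["place_id"] for r in RECORDS}
    center = PLACES[center_id]
    return {
        pid
        for pid, coords in PLACES.items()
        if pid in places_with_records and haversine_km(center, coords) <= radius_km
    }


def search(surname, center_id, radius_km):
    """Pass 2: an ordinary record query restricted to the place IDs from pass 1."""
    place_ids = nearby_place_ids(center_id, radius_km)
    return [r["id"] for r in RECORDS if r["surname"] == surname and r["place_id"] in place_ids]


print(search("Smith", "p1", 10))
```

The record query never touches coordinates; it only sees a list of place IDs, which is why this works with any record store that can filter on an ID list.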

Our next placename-related dataset is dealing with place of birth on census records. This is an entirely different proposition, since birthplaces are unrestricted, and (worse) may have been recorded incorrectly by the enumerator filling out the form. Wish us luck!

Ben

Justin York

Apr 21, 2016, 2:55:56 PM

to root...@googlegroups.com

Ben, what do you mean by unrestricted? Do you mean that they point to a jurisdiction as opposed to a city or town?

Ben Brumfield

Apr 21, 2016, 3:40:21 PM

to root...@googlegroups.com

What I mean by that is that parish registers have a restricted inventory of places they came from -- parishes in the UK. There aren't that many of them, and it's possible to compile a database of them in a mostly-straightforward manner. Even better, they're associated with a record by virtue of metadata -- where the register book was from -- rather than data written within an entry.

By contrast, someone's birthplace may be anywhere on the globe. They're data -- one record's birthplace may have nothing to do with the next record's birthplace entry. And the datum itself may have been spoken aloud by an illiterate and written by someone unfamiliar with the location:
"Where's your wife from?"
"Some place in Sweden -- Goatburg, I think."

The chances that the entry will contain either Göteborg or Gothenburg in such a case are pretty small.

Ben

todd.d....@gmail.com

Apr 25, 2016, 6:03:43 PM

to Rootsdev

I'd like to throw Mapzen's open data place authority/gazetteer into the mix: https://whosonfirst.mapzen.com/

It powers Pelias/Mapzen Search, and it's a super-collection of various datasets (GeoNames, Wikidata, Natural Earth, etc.) licensed as CC0 (public domain) data. Check out this data spelunker to get a sense of the current data, which is improving daily (you can help too!): https://whosonfirst.mapzen.com/spelunker/

–Tod

Tod Robbins

Digital Asset Manager, MLIS

todrobbins.com | @todrobbins

Tom Morris

Apr 25, 2016, 6:59:23 PM

to root...@googlegroups.com

I'm not sure I totally understand the question, but let me offer a few general observations.

- Search magic happens at both indexing and query time in ElasticSearch. If you wanted to support "sounds like" queries, you'd compute Soundex codes at index time for the document collection and build a Soundex index, then compute the codes for the query as well and look them up in that index. If you wanted to handle misspellings, you might use an n-gram index.

- You want to precompute as much as possible, so you'd only use a geospatial index as a last resort (plus it can't really do "neighboring counties" accurately anyway). Your gazetteer (or other geo database) is going to know what the adjacent counties are, so the query can be expanded from a single county to a list of counties, exactly as if the user had put that list of counties in their query.

- If you had to do a geospatial search for some reason, geospatial indexes typically use pyramidal quadtrees to quickly narrow down to a set of candidates, which can then be filtered.
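The "sounds like" observation above can be sketched in plain Python: compute Soundex codes when documents are indexed, then compute them again for the query and look them up. This is only an illustration of the idea, not how ElasticSearch implements it; the `build_index` and `sounds_like` helpers and the sample documents are hypothetical, and non-ASCII letters are simply skipped as a simplification.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeat across a vowel is kept
    return (code + "000")[:4]


def build_index(docs):
    """Index time: map each token's Soundex code to the documents containing it."""
    index = defaultdict(set)
    for doc_id, place in docs.items():
        for token in place.replace(",", " ").split():
            code = soundex(token)
            if code:
                index[code].add(doc_id)
    return index


def sounds_like(index, query):
    """Query time: encode the query the same way and intersect the postings."""
    hits = None
    for token in query.replace(",", " ").split():
        code = soundex(token)
        if not code:
            continue
        ids = index.get(code, set())
        hits = set(ids) if hits is None else hits & ids
    return hits or set()


DOCS = {1: "Göteborg, Sweden", 2: "Goatburg, Sweden", 3: "Gothenburg, Sweden", 4: "Amsterdam, New York"}
INDEX = build_index(DOCS)
print(sounds_like(INDEX, "Goatburg"))
```

With this data, "Goatburg" shares a code with "Göteborg" but not with "Gothenburg", which shows both the appeal and the limits of a purely phonetic index; a synonym table would still be needed to connect the Swedish and English names.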

ElasticSearch is amazingly powerful, but it has lots of knobs and dials that can be tweaked. It can take a while to get to grips with all it offers and how to best achieve the particular queries you want to service.
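The query expansion from the second observation above might look something like this, assuming the gazetteer already stores adjacency. The county IDs, the `ADJACENT` table, and the records are hypothetical placeholders.

```python
# Hypothetical gazetteer fragment: county ID -> IDs of adjacent counties.
ADJACENT = {
    "county-a": {"county-b", "county-c"},
    "county-b": {"county-a"},
    "county-c": {"county-a", "county-d"},
    "county-d": {"county-c"},
}
RECORDS = [
    {"id": 1, "surname": "York", "county": "county-a"},
    {"id": 2, "surname": "York", "county": "county-b"},
    {"id": 3, "surname": "York", "county": "county-d"},
    {"id": 4, "surname": "Morris", "county": "county-c"},
]


def expand_county(county_id, include_adjacent):
    """Turn one county into the list the user could have typed by hand."""
    expanded = {county_id}
    if include_adjacent:
        expanded |= ADJACENT.get(county_id, set())
    return expanded


def search(records, surname, county_id, include_adjacent=False):
    """Plain filter on a set of county IDs; no geospatial index involved."""
    counties = expand_county(county_id, include_adjacent)
    return [r["id"] for r in records if r["surname"] == surname and r["county"] in counties]


print(search(RECORDS, "York", "county-a"))
print(search(RECORDS, "York", "county-a", include_adjacent=True))
```

Because adjacency is precomputed, the "County and adjacent counties" option costs one dictionary lookup before the query runs, and the search engine only ever sees an ordinary list of terms.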

Tom

Tom Morris

Apr 25, 2016, 7:05:32 PM

to root...@googlegroups.com

On Mon, Apr 25, 2016 at 6:03 PM, todd.d....@gmail.com <todd.d....@gmail.com> wrote:

> I'd like to throw Mapzen's open data place authority/gazetteer into the mix: https://whosonfirst.mapzen.com/

Cool. One typical problem with trying to use gazetteers for genealogical work is that they tend to be modern, or, if not modern, fixed at a single instant in time.

I couldn't tell from the amount of time I was willing to commit to it whether or not Mapzen's gazetteer includes historical places and/or names. Anyone know?

Tom

todd.d....@gmail.com

Apr 25, 2016, 10:31:42 PM

to Rootsdev

Tom,

Who's On First (WOF) does provide for historical places via https://github.com/whosonfirst/whosonfirst-dates, and it's a matter of getting the data formatted into the dataset. I think it's one of the most promising projects related to open geodata.

Tom Morris

Apr 25, 2016, 11:10:15 PM

to root...@googlegroups.com

Thanks, Tod. Sounds like they've at least started thinking about it. It might be because they're geo-heads that they've covered the "part of" relations first, but they also need to do "formed from," "divided into," etc. to be maximally useful in this space. Someone needs to be able to do a web version of http://goldbug.com/animap/ or a county record (e.g. probate) finder if we want to be all roots-y about it.
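As a rough illustration of why "formed from" relations matter for a record finder, here is a small sketch that walks those links back in time. The `Jurisdiction` type, the IDs, and the dates are invented; this is not the Who's On First data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Jurisdiction:
    id: str
    start: int                # first year the jurisdiction existed
    end: Optional[int]        # last year it existed, or None if still current
    formed_from: tuple = ()   # IDs of the jurisdictions it was carved out of


# Hypothetical history: "old" was divided in 1800 into "east" and "west".
GAZETTEER = {
    j.id: j
    for j in [
        Jurisdiction("old", 1700, 1799),
        Jurisdiction("east", 1800, None, formed_from=("old",)),
        Jurisdiction("west", 1800, None, formed_from=("old",)),
    ]
}


def jurisdictions_for(place_id, year):
    """Follow "formed from" links back until reaching jurisdictions that existed in `year`."""
    found, seen, stack = set(), set(), [place_id]
    while stack:
        j = GAZETTEER[stack.pop()]
        if j.id in seen:
            continue
        seen.add(j.id)
        if j.start <= year and (j.end is None or year <= j.end):
            found.add(j.id)
        elif year < j.start:
            stack.extend(j.formed_from)  # look at the predecessors instead
    return found


print(jurisdictions_for("east", 1850))
print(jurisdictions_for("east", 1750))
```

A probate finder built on this would search "east" for an 1850 event but "old" for a 1750 event in the same location. "Divided into" is the same relation read in the other direction, so it could be derived from the `formed_from` links rather than stored separately.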

Tom

Justin York

Apr 26, 2016, 11:01:54 AM

to root...@googlegroups.com

Thanks Tom, that's great advice. I had a really hard time trying to figure out what I wanted to ask.

Handling historical names (synonyms) in the search query is an interesting issue. It seems that you need one of three things: 1) a historical place database with many synonyms, 2) normalized data, or 3) users who are willing to search on every historical name themselves.
