Storing Genealogy Data


Gary Stanley

Jan 16, 2016, 2:06:11 PM
to rootsdev
I'm at something of a development crossroads and could do with some advice from fellow developers who work with genealogical data.

I'm working on a project to log records of Romany Gipsies in the UK & USA, primarily between 1600 and 1900. This includes posting birth, baptism, marriage, death, burial, census and migration (ship manifest) records, as well as newspaper articles, online and making them searchable. I'm an old-school SQL user, so I set up a relational database using MySQL to achieve this; the results are currently viewable at my website, www.romaheritage.co.uk

I've thrown in a bit of Google Maps integration to plot appropriate locations and churches on a map, depending on the details of the specific record or newspaper article.

The problem I have is this. A relational database doesn't quite feel right for this project and I'm feeling somewhat constrained in what I can do. What I have now is quite basic as far as the database schema goes. Initially I looked at trying to map GEDCOM X to a relational database ... but there would just be far too many joins, and this looks unfeasible. Next I looked at JSON as a file structure for holding records, but for searching and analyzing data this didn't seem to be the best way to achieve what I wanted either. Then recently I came across an article about FamilySearch exploring MongoDB, a NoSQL data store that I had never heard of (I know, I'm behind the times!)

Is anyone able to help me or recommend the best avenue to do what I want to do? I know Ben who works for FreeREG frequents this group from time to time, and as they hold the type of data I want to work with, I'm hoping he or someone may be able to assist!

Gary Stanley

Jan 16, 2016, 2:20:34 PM
to rootsdev
I should have added that the Google Maps integration isn't supposed to be "just for show"; it was intended for two purposes. Firstly, so readers could see instantly where a particular location was in relation to the record they are viewing. Secondly, because Romany Gipsies travelled literally all over the place and are quite difficult to track (not just location-wise, but as family units too), the hope was that as more and more records are added to the system, patterns or clusters would appear in certain areas that may indicate or help deduce inter-family relationships. Although we are relatively small at the moment, this is already proving its worth as a "detective" resource ...
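
One rough way to look for such clusters, sketched in Python under heavy assumptions: the records, field names, and the "two or more shared places" threshold below are all hypothetical, and this is not how the site itself works. It groups records by place and reports surname pairs that turn up together in more than one place.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; field names are assumptions for illustration.
RECORDS = [
    {"surname": "Smith", "place": "Epping", "year": 1823},
    {"surname": "Lee", "place": "Epping", "year": 1824},
    {"surname": "Smith", "place": "Norwood", "year": 1851},
    {"surname": "Lee", "place": "Norwood", "year": 1851},
    {"surname": "Cooper", "place": "Norwood", "year": 1861},
    {"surname": "Cooper", "place": "Yarmouth", "year": 1870},
]


def shared_places(records):
    """Map each pair of surnames to the set of places where both appear."""
    surnames_at = defaultdict(set)
    for rec in records:
        surnames_at[rec["place"]].add(rec["surname"])
    shared = defaultdict(set)
    for place, surnames in surnames_at.items():
        # Sorting keeps each pair in one canonical order.
        for pair in combinations(sorted(surnames), 2):
            shared[pair].add(place)
    return shared


def likely_links(records, min_places=2):
    """Keep only the pairs seen together in at least min_places places."""
    return {
        pair: sorted(places)
        for pair, places in shared_places(records).items()
        if len(places) >= min_places
    }
```

Calling `likely_links(RECORDS)` on the sample data keeps only the Lee/Smith pair, which shares two places. A real version would also want to weigh dates and distances, which this sketch ignores.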

Thomas Wetmore

Jan 16, 2016, 2:26:01 PM
to root...@googlegroups.com
Gary,

The possibilities are endless. I agree that relational databases are not the best fit for genealogical data. That being said, nearly all genealogical systems use them!! Which is why filling out data forms in many genealogical programs can feel so much like forcing square pegs into round holes. Both RootsMagic and Gramps, two popular systems, use relational databases. The others probably do too.

MongoDB is a great option in my opinion. So are graph databases such as neo4j. There are a number of others that have their proponents. MongoDB and neo4j are the only two I have looked at seriously, and either one would do a bang up job as the backing store for a genealogical system.

Twenty five years ago I developed a genealogical database in which all the records were Gedcom structures. I allow any tags so that users can design their own schemas, introducing new conventions (e.g., new event types, new attribute types), on the fly. Note that for all practical purposes Gedcom syntax is the same as JSON syntax is the same as XML syntax, so anything you can do with one of them you can do with all of them.

What allows this to work quite well with the database is a programmable subsystem with a lot of features.

The issue is the quality of indexing you can get. MongoDB and neo4j have good indexing built in. For my old system I had to add my own indexing subsystem to support the program itself and the programmable subsystem. It’s really not hard to write a custom index for a genealogical system by simply coming up with three different indexes: one for names, one for dates and one for places. Really only a little bit of hack software is needed for them.
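
A minimal sketch of the three-index idea in Python, assuming flat records with hypothetical `name`, `date` and `place` fields. It illustrates the general technique (inverted indexes intersected at query time), not the indexing code of any system mentioned in this thread.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
RECORDS = [
    {"id": "r1", "name": "John Smith", "date": "1823-05-04", "place": "Epping, Essex"},
    {"id": "r2", "name": "Mary Smith", "date": "1823-11-20", "place": "Epping, Essex"},
    {"id": "r3", "name": "John Lee", "date": "1851-03-30", "place": "Norwood, Surrey"},
]


def tokens(text):
    """Lower-case a string and split it into alphanumeric words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def build_indexes(records):
    """Build three inverted indexes: name words, years, and place words."""
    names, years, places = defaultdict(set), defaultdict(set), defaultdict(set)
    for rec in records:
        for word in tokens(rec["name"]):
            names[word].add(rec["id"])
        years[int(rec["date"][:4])].add(rec["id"])
        for word in tokens(rec["place"]):
            places[word].add(rec["id"])
    return names, years, places


def search(indexes, name=None, year_range=None, place=None):
    """Return the ids that satisfy every criterion supplied."""
    names, years, places = indexes
    candidate_sets = []
    if name:
        candidate_sets.extend(names.get(w, set()) for w in tokens(name))
    if year_range:
        first, last = year_range
        in_range = set()
        for year in range(first, last + 1):
            in_range |= years.get(year, set())
        candidate_sets.append(in_range)
    if place:
        candidate_sets.extend(places.get(w, set()) for w in tokens(place))
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)
```

For example, `search(build_indexes(RECORDS), name="smith", place="epping")` matches records `r1` and `r2`. Real genealogical indexes would also need name variants (Soundex or similar) and fuzzy dates, which are left out here.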

Tom Wetmore

Gary Stanley

Jan 16, 2016, 2:34:12 PM
to rootsdev

Thanks for the quick reply Tom ... yes I mentioned MongoDB because it's something I've been leaning towards, but was a bit afraid to take the jump .. probably because a NoSQL system is alien to me. I've been stuck in the SQL relational world for a bit too long and have been used to setting limits for the data (the square holes for the round pegs, as you put it so well). Having greater freedom scares me somewhat .. as does getting my head around its workings!!!

Doug Blank

Jan 16, 2016, 4:24:56 PM
to root...@googlegroups.com
On Sat, Jan 16, 2016 at 2:25 PM, Thomas Wetmore <tt...@verizon.net> wrote:
> Gary,
>
> The possibilities are endless. I agree that relational databases are not the best fit for genealogical data. That being said, nearly all genealogical systems use them!! Which is why filling out data forms in many genealogical programs can feel so much like forcing square pegs into round holes. Both RootsMagic and Gramps, two popular systems, use relational databases. The others probably do too.

Actually, Gramps uses a hierarchical data representation, and a data store that has been around. The data on disk is pickled Python objects, with just a few indexes (using BSDDB). 

However, in Gramps 5.0, due out this year, the backend has been abstracted away from the original design. This now allows any database backend. I have written a few now: DB-API (generic SQL), Django (pure relational), and an in-memory version using Python dictionaries (i.e., hash tables).

Which is better depends on what you want to do, and how you want to do it:

If you want to do many ad hoc queries on any field, or get large chunks of data from the database in one access, then SQL is the way to go. If you would rather keep the data as a blob, but use many fast accesses to a local disk, then perhaps something less relational would do. Some of the more modern NoSQL databases (like MongoDB and CouchDB) might be a cool technology to try.

The pure-relational backend was super fast for web-based ad hoc queries showing page views of data (e.g., a select/join), but it was dog-slow using our API, which was designed for blobs (hierarchical data). The hierarchical backend is a pain when attempting to do something that looks like a query. (I've since abandoned the pure-relational Django backend, and am writing a new version of Gramps for the web based on pure Python and Tornado.)
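
To make the trade-off concrete, here is a small sketch using only Python's standard library. It assumes SQLite as a stand-in for both styles (Gramps itself used BSDDB for the blob store), and the table names, field names and sample people are hypothetical; none of this is Gramps code.

```python
import pickle
import sqlite3

# Hypothetical person records; field names are assumptions for illustration.
PEOPLE = [
    {"id": 1, "surname": "Smith", "birth_year": 1823,
     "events": [{"type": "baptism", "place": "Epping"}]},
    {"id": 2, "surname": "Lee", "birth_year": 1851,
     "events": [{"type": "census", "place": "Norwood"}]},
    {"id": 3, "surname": "Smith", "birth_year": 1860, "events": []},
]

conn = sqlite3.connect(":memory:")

# Relational style: each queryable field gets its own column.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT, birth_year INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(p["id"], p["surname"], p["birth_year"]) for p in PEOPLE],
)

# Blob style: the whole object is pickled and stored under its key.
conn.execute("CREATE TABLE person_blob (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany(
    "INSERT INTO person_blob VALUES (?, ?)",
    [(p["id"], pickle.dumps(p)) for p in PEOPLE],
)


def smiths_before_1850_sql():
    """Ad hoc query: the database engine does the filtering."""
    rows = conn.execute(
        "SELECT id FROM person WHERE surname = ? AND birth_year < ? ORDER BY id",
        ("Smith", 1850),
    )
    return [row[0] for row in rows]


def smiths_before_1850_blob():
    """Same question against blobs: every object is loaded and inspected."""
    matches = []
    for (data,) in conn.execute("SELECT data FROM person_blob ORDER BY id"):
        person = pickle.loads(data)
        if person["surname"] == "Smith" and person["birth_year"] < 1850:
            matches.append(person["id"])
    return matches


def load_person(person_id):
    """Keyed access: one lookup returns the complete nested object."""
    row = conn.execute(
        "SELECT data FROM person_blob WHERE id = ?", (person_id,)
    ).fetchone()
    return pickle.loads(row[0])
```

Both query functions give the same answer, but the blob version has to unpickle every row to get it, while `load_person` returns a whole person, events included, in a single keyed read with no joins.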

 

> MongoDB is a great option in my opinion. So are graph databases such as neo4j. There are a number of others that have their proponents. MongoDB and neo4j are the only two I have looked at seriously, and either one would do a bang up job as the backing store for a genealogical system.
>
> Twenty five years ago I developed a genealogical database in which all the records were Gedcom structures. I allow any tags so that users can design their own schemas, introducing new conventions (e.g., new event types, new attribute types), on the fly. Note that for all practical purposes Gedcom syntax is the same as JSON syntax is the same as XML syntax, so anything you can do with one of them you can do with all of them.

Yep. Gramps has import/export for GEDCOM, JSON, XML, CSV, and others.


> What allows this to work quite well with the database is a programmable subsystem with a lot of features.
>
> The issue is the quality of indexing you can get. MongoDB and neo4j have good indexing built in. For my old system I had to add my own indexing subsystem to support the program itself and the programmable subsystem. It’s really not hard to write a custom index for a genealogical system by simply coming up with three different indexes: one for names, one for dates and one for places. Really only a little bit of hack software is needed for them.

If you wanted to try MongoDB or CouchDB with Gramps, that would be very powerful, and would get you a host of features. I'd be glad to discuss further.

-Doug

Ryan Heaton

Jan 17, 2016, 1:05:42 AM
to root...@googlegroups.com
What an awesome project!

I have a few points I can offer. You're welcome to take them for what they're worth.

GEDCOM X was never designed to be mapped to a relational database. It was designed for exchanging data between systems regardless of the implementation details of those systems. I'd suggest waiting on using GEDCOM X until you think about making an API to your database available for other products to consume. RootsMagic, Legacy, AQ, Mac Family Tree, Ancestry, FindMyPast, MyHeritage, and a ton of other products all know how to read a GEDCOM X RS API. So the idea is that if your database supports a GEDCOM X RS API, these products should already be able to use it. I'm actually giving a RootsTech presentation on this in a few weeks.

(I also find it mildly surprising how often the idea of mapping GEDCOM X to a relational database gets brought up. Does this happen with all data exchange specifications? I wonder how often people bring up mapping HTML to a relational database...)

I'm not sure what article you read, but I can confirm that FamilySearch is moving to a NoSQL database implementation. We actually decided on Cassandra, but I'm sure MongoDB would be a great choice, too. And we did decide to base our JSON schema on GEDCOM X JSON. We're also doing some experimental research into graph databases like Neo4j and Banyan. If we decided to use a graph database, it would probably be used as an optimization mechanism (for queries and the like) and not as the canonical data store.

Good luck! Keep up the great work!

-Ryan

Gary Stanley

Jan 17, 2016, 6:23:48 AM
to rootsdev
Thank you for all the advice guys, I knew this would be the place to ask questions!
I know GEDCOM X was never conceived to be mapped to a relational database, but when I was working on my own database schema, I found that a lot of what I needed paralleled its specifications. I'd also seen a few people say it would never work as a relational database, but I guess I just had to discover that for myself.
As for FamilySearch, it wasn't an article I read (I'm sorry I gave that impression); it was actually a video presentation by one of the guys at FamilySearch (A Billion Person Family Tree, by Randall Wilson) from 2013, where he spoke about evaluating MongoDB and its performance. He isn't the world's greatest speaker, but that presentation opened my eyes somewhat.

I'm still not sure which way to go with this. I could stick with SQL/a relational database and attempt to develop it more, but NoSQL intrigues me ... although it's going to be a steep learning curve for me personally to go that route. Not that I don't like a challenge.

Oh what to do .. I don't know!

Tom Wetmore

Jan 18, 2016, 10:35:48 AM
to rootsdev
>> Gary,
>>
>> The possibilities are endless. I agree that relational databases are not the best fit for genealogical data. That being said, nearly all genealogical systems use them!! Which is why filling out data forms in many genealogical programs can feel so much like forcing square pegs into round holes. Both RootsMagic and Gramps, two popular systems, use relational databases. The others probably do too.
>
> Actually, Gramps uses a hierarchical data representation, and a data store that has been around. The data on disk is pickled Python objects, with just a few indexes (using BSDDB).

Doug,

My apologies for making a wrong assumption.

Tom Wetmore

Ben Brumfield

Jan 22, 2016, 10:07:55 AM
to rootsdev
On Saturday, January 16, 2016 at 1:34:12 PM UTC-6, Gary Stanley wrote:

> Thanks for the quick reply Tom ... yes I mentioned MongoDB because it's something I've been leaning towards, but was a bit afraid to take the jump .. probably because a NoSQL system is alien to me. I've been stuck in the SQL relational world for a bit too long and have been used to setting limits for the data (the square holes for the round pegs, as you put it so well). Having greater freedom scares me somewhat .. as does getting my head around its workings!!!

We use MongoDB for MyopicVicar (the code that powers http://FreeREG2.freereg.org.uk/ ), and have been fairly pleased.  That said, there are some caveats I'd like to mention from our experience:

FreeREG is a record database, not a family tree database. As such, the data is concentrated in a single kind of record (a single table, in RDBMS terms): 30+ million parish register entries, around 50 thousand places, a few hundred transcribers, and a few dozen counties. If the data you're trying to represent is similarly focused on a single, complex, heterogeneous kind of record, NoSQL is a good fit. If it is more evenly distributed across entities, and only makes sense when those entities are joined to each other, NoSQL will bite you.
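
The distinction can be sketched with plain Python dictionaries standing in for documents. This is not MongoDB or MyopicVicar code; the records, field names and the `find` helper are hypothetical. Heterogeneous records sit happily in one collection, but any question that spans two collections has to be joined by hand in application code.

```python
# Hypothetical documents: one collection, several shapes of record.
RECORDS = [
    {"_id": 1, "type": "baptism", "surname": "Smith", "year": 1784,
     "place_id": "p1", "father": "John"},
    {"_id": 2, "type": "burial", "surname": "Smith", "year": 1801,
     "place_id": "p2", "age": 17},
    {"_id": 3, "type": "marriage", "surname": "Lee", "year": 1790,
     "place_id": "p1", "spouse_surname": "Smith"},
]

# A second, much smaller collection that the records point into.
PLACES = {
    "p1": {"name": "Epping", "county": "Essex"},
    "p2": {"name": "Norwood", "county": "Surrey"},
}


def find(records, **criteria):
    """Return documents whose fields equal every criterion given.

    A document that lacks a field simply does not match it, so records
    of different shapes can live side by side.
    """
    return [
        doc for doc in records
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def find_in_county(records, places, county):
    """Application-side join: resolve place ids first, then filter records."""
    place_ids = {pid for pid, place in places.items() if place["county"] == county}
    return [doc for doc in records if doc["place_id"] in place_ids]
```

Single-collection questions such as `find(RECORDS, surname="Smith")` stay simple. `find_in_county` is the kind of code that multiplies once the data only makes sense across several joined entities.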

We've been very pleased with MongoDB's index support.  Its replication was astonishingly easy to set up and surprisingly quick.  We did run into kinks trying to do bulk updates on a 200GB database, when those updated records needed to be sent to the other members of the replication set that required .  (This may have been fixed in more recent versions of MongoDB).  When running in a contentious environment, we also ran into failures on the part of the query optimizer to choose the proper index based on the query shape, and were forced to rely on manually specifying hints.  (This also may have been fixed in more recent versions.)

Finally, in order to use some popular modules (RefineryCMS in our case), we ended up installing MySQL as well.  This may be inescapable, since pure-NoSQL applications seem not to last very long in the wild once they become generalized.  (The search engine part of our code is pure NoSQL, but once we needed to move content management onto the same platform we were stuck running two DBMS.)  Some people in the same boat have argued that Postgres's HStore or JSON can accomplish much of the NoSQL functions while still giving you an RDBMS.  It's not clear to me whether this is so, and I'd want to look really hard at indexes before committing.

Ben
http://manuscripttranscription.blogspot.com/


Magnus Sälgö

Aug 9, 2016, 12:22:56 PM
to rootsdev
I posted the following elsewhere in this group:

I would recommend looking into RDF and SPARQL and the Wikidata platform. It's a way to define relations that looks excellent for genealogy and much more...

I have written a little bit about Wikidata at WikiTree see wikitree.com/wiki/Space:Wikidata  

Regards
Magnus Sälgö
Stockholm, Sweden

Ben Laurie

Aug 11, 2016, 8:04:55 PM
to root...@googlegroups.com
On 16 January 2016 at 19:25, Thomas Wetmore <tt...@verizon.net> wrote:
> Twenty five years ago I developed a genealogical database in which all the records were Gedcom structures. I allow any tags so that users can design their own schemas, introducing new conventions (e.g., new event types, new attribute types), on the fly. Note that for all practical purposes Gedcom syntax is the same as JSON syntax is the same as XML syntax, so anything you can do with one of them you can do with all of them.

Surely the important point is that GEDCOM allows you to reference
other entities? JSON doesn't. XML probably does, but I am confident it is
not human-readable once you've done it.

Thomas Wetmore

Aug 11, 2016, 9:19:02 PM
to root...@googlegroups.com
Sorry, I don’t understand what you mean, which, I suppose, is my polite way of saying that you are wrong. It is trivial to refer to other entities with JSON, just as it is with XML. And many would argue that GEDCOM, JSON and XML are equally readable (or unreadable as the case may be). The three are equally expressive and differ only in some details of syntactic sugar.

In GEDCOM every object has an id, and one object can refer to other objects via those ids. JSON can be used to refer to genealogical objects in the same way. As can XML. As can umpty million other formats.
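
A short Python sketch of the same idea: it reads a tiny GEDCOM fragment, keeps the `@...@` pointers as id references, and round-trips the result through JSON. The parser handles only level-0 and level-1 lines, and the `{"ref": ...}` shape is a convention invented for this sketch, not part of JSON, GEDCOM X, or any standard.

```python
import json

# A tiny GEDCOM fragment using standard tags (INDI, FAM, FAMS, FAMC, CHIL).
GEDCOM = """\
0 @I1@ INDI
1 NAME John /Smith/
1 FAMS @F1@
0 @I2@ INDI
1 NAME Mary /Lee/
1 FAMS @F1@
0 @I3@ INDI
1 NAME Ann /Smith/
1 FAMC @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
"""


def gedcom_to_records(text):
    """Convert level-0/level-1 GEDCOM lines to a dict of records keyed by id."""
    records = {}
    current = None
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if parts[0] == "0":
            # Only records with an @id@ are kept; HEAD, TRLR etc. are skipped.
            if len(parts) == 3 and parts[1].startswith("@"):
                current = {"type": parts[2]}
                records[parts[1].strip("@")] = current
            else:
                current = None
        elif parts[0] == "1" and current is not None and len(parts) >= 2:
            value = parts[2] if len(parts) > 2 else ""
            # Pointer values such as @F1@ become explicit id references.
            if len(value) > 2 and value.startswith("@") and value.endswith("@"):
                value = {"ref": value.strip("@")}
            current.setdefault(parts[1], []).append(value)
    return records


def children_of(records, person_id):
    """Follow FAMS references to families, then CHIL references to children."""
    names = []
    for fam_ref in records[person_id].get("FAMS", []):
        family = records[fam_ref["ref"]]
        for child_ref in family.get("CHIL", []):
            names.extend(records[child_ref["ref"]]["NAME"])
    return names


records = gedcom_to_records(GEDCOM)
# The references survive serialisation to JSON text and back.
round_tripped = json.loads(json.dumps(records, indent=2))
```

After the round trip, `children_of(round_tripped, "I1")` still finds Ann Smith, because the references are ordinary data that the application resolves by id.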


Tom Wetmore

Ben Laurie

Aug 12, 2016, 3:55:21 AM
to root...@googlegroups.com
On 12 August 2016 at 02:19, Thomas Wetmore <tt...@verizon.net> wrote:
> Sorry, I don’t understand what you mean, which I suppose, is my polite way of saying that you are wrong. It is trivial to refer to other entities with JSON,

My point is that there is nothing in the standard that says how to do
this, unlike GEDCOM. Yes, of course, you can invent your own way of
doing it, but then you have a data structure that is more than JSON
can represent in a standardised way.


Thomas Wetmore

Aug 12, 2016, 5:39:20 AM
to root...@googlegroups.com

> On Aug 12, 2016, at 3:55 AM, Ben Laurie <b...@links.org> wrote:
> My point is that there is nothing in the standard that says how to do
> this, unlike GEDCOM. Yes, of course, you can invent your own way of
> doing it, but then you have a data structure that is more than JSON
> can represent in a standardized way.

GEDCOM is both a syntactic standard AND a semantic standard. JSON and XML are syntactic standards. So GEDCOM has built into its semantic standard how to refer to other objects. JSON and XML, as syntactic standards, allow their users to define both how to represent objects and how to represent references between them in pretty much any fashion a user might wish. Is this what you are trying to say?

However, the underlying point I was making back in January is that there is no difference in the fitness of GEDCOM, JSON or XML (or other formats) to represent genealogical data. Your statement that JSON does not allow an entity to refer to another entity is simply untrue, and irrelevant to that point. I don’t understand what other point you might be making, but what you have said has no bearing on the ability or usefulness of JSON or XML to hold representations of genealogical data.

The GEDCOM-X standard is generally thought of as an abstract data model. However, if you wish to put GEDCOM-X data into an external file format, the LDS has defined both a JSON external format and an XML external format for that data. So there are two direct counterpoints to your argument there.

Your statement that “then you have a data structure that is more than JSON can represent in a standardized way,” is meaningless and indicates a lack of understanding of the purpose of JSON. When you define how to use JSON to represent genealogical data, you have to specify how you are going to do it, which includes how you are going to have children refer to parents, spouses to spouses, and so on. But this is the whole raison d’être of JSON in the first place. It gives designers the flexibility of defining how to represent data the way that seems best for their applications. Your statement negates the whole purpose of JSON.

Tom Wetmore