Gedcom Paths, ChatGPT, and 2 Models for Genealogical Records


Thomas Wetmore

Oct 12, 2024, 10:43:15 PM
to root...@googlegroups.com
Given that a genealogical database stores records in Gedcom format (there are at least three programs that do), the idea of a "Gedcom path" is useful. Say the expression "INDI->NAME" means the first 1 NAME line in a person, or "INDI->BIRT*->DATE" means all the 2 DATE lines in all the BIRT events in a person, or "FAM->CHIL*" means the list of CHIL lines in a family, or "INDI->ANY*->DATE*" means all the 2 DATE lines wherever they are in a person.
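
To make the idea concrete, here is a rough sketch in C of what such a path lookup can look like over a tree of Gedcom nodes. The GNode shape (tag, value, first child, next sibling) mirrors the usual one-node-per-Gedcom-line representation; the names (findPath, the "ANY" wildcard) are illustrative, and the sketch visits every match rather than distinguishing the starred forms -- it is not the ChatGPT-generated code.

#include <stdio.h>
#include <string.h>

typedef struct GNode {
    char *tag;              /* e.g. "INDI", "BIRT", "DATE" */
    char *value;            /* text after the tag, may be NULL */
    struct GNode *child;    /* first line one level deeper */
    struct GNode *sibling;  /* next line at the same level */
} GNode;

/* Visit every node reached by following the tags in path[0..len-1]
   below the given node; "ANY" matches any tag. */
static void findPath(GNode *node, char **path, int len, void (*visit)(GNode *)) {
    if (len == 0) { visit(node); return; }
    for (GNode *n = node->child; n; n = n->sibling)
        if (strcmp(path[0], "ANY") == 0 || strcmp(path[0], n->tag) == 0)
            findPath(n, path + 1, len - 1, visit);
}

static void printValue(GNode *node) {
    printf("%s %s\n", node->tag, node->value ? node->value : "");
}

int main(void) {
    /* Hand-built stand-in for a parsed INDI record. */
    GNode date = { "DATE", "1 JAN 1850", NULL, NULL };
    GNode birt = { "BIRT", NULL, &date, NULL };
    GNode indi = { "INDI", NULL, &birt, NULL };
    char *path[] = { "BIRT", "DATE" };     /* INDI->BIRT*->DATE */
    findPath(&indi, path, 2, printValue);  /* prints: DATE 1 JAN 1850 */
    return 0;
}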

Such a facility has many uses, including validating a Gedcom file against a particular standard, or quickly extracting information from anywhere in a record. I asked ChatGPT to help implement this feature and, IMHO, got high quality software that is useful as is, though I'm modifying it so it fits nicely with other code.

Most genealogical software has specific structures for objects like persons, families, sources, and so on. These structures don't have to be complicated, but they are subject to creeping featurism. Software that uses Gedcom as its internal format doesn't need custom structures (it does still need custom functions). The Gedcom representation of a record is a tree structure of nodes, each node built from a line of Gedcom. If you want the birth of a person, there isn't a field in a data structure holding it; you traverse the INDI tree to extract it. The "field" holding the date is there, of course, just located down in the tree instead of in a field of a custom data type. I've written genealogical software both ways, but I wouldn't claim to know which is better. I can say I prefer coding with the Gedcom tree model.

Anybody else have thoughts about these two ways to implement genealogical data structures? Are there others?

Tom Wetmore

lkes...@lkessler.com

Oct 13, 2024, 6:22:20 PM
to root...@googlegroups.com
Tom,

For what it's worth, I have this Hybrid structure that I use.

I have an Indi table, a Fam table and a Structure table for everything else. This makes it easy to include a few commonly referenced fields like Name, Sex, Birthdate, Deathdate, MarriageDate, Structure Title, etc. so they don't have to be looked up every time.

Every event/fact, i.e. a Level 1 line with all its sublines, is stored as a separate fact under its Indi, Fam or Struct record. So 1 BIRT and all its sublines, which include the date, place, notes, etc., are stored as one fact under the INDI.

Two exceptions:

1. FAMS, FAMC, CHIL, HUSB, WIFE links are stored with any sublines they may have in two tables, one for Parent-child links and one for Family-spouse links.

2. Whenever a line with a link is encountered, a structure with the XREF ID and any sublines is created, and if it doesn't already exist in the structure table, it is added. Each structure keeps a list of all the facts that reference it, and each fact records every structure it references. Places are treated as links. A link within a link is handled recursively.

Example of #2

0 @I11@ INDI
1 BIRT
2 DATE 1855
3 SOUR @S22@
4 PAGE Volume 2; Page 43
4 NOTE @N33@
2 PLAC Boston, MA, USA
1 DEAT
2 DATE 1920
2 NOTE @N33@
...
0 @N33@ NOTE Some
1 CONC notes here
...
0 @S22@ SOUR
1 TITL Library Book

For the INDI @I11@ record, this will create these 2 facts (where " / " is the line separator):
BIRT / 2 DATE 1855 / 3 STRU-1 / 2 STRU-2
DEAT / 2 DATE 1920 / 2 STRU-0

It will create these structures:
STRU-0 = NOTE @N33@
STRU-1 = SOUR @S22@ / 1 PAGE Volume 2; Page 43 / 2 STRU-0
STRU-2 = PLAC Boston, MA, USA

And then when it later gets to the 0 NOTE record, since it is the same as STRU-0, it will reuse STRU-0 and add this fact to it:
NOTE Some notes here

And then when it gets to the 0 SOUR record, since it is different from the existing structures (STRU-1 includes a citation), it will create the structure:
STRU-3 = SOUR @S22@
and add the fact to it:
TITL Library Book
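
As a rough sketch of the "create it once, reuse it" step (not Behold's actual code -- the canonical-string idea, the names, and the linear lookup below are all illustrative assumptions):

#include <stdio.h>
#include <string.h>

#define MAX_STRUCTS 1000

static const char *structs[MAX_STRUCTS];  /* canonical text of each structure */
static int nstructs = 0;

/* Return the number of the structure with this canonical text, adding it to
   the table if it is not there yet. (The sketch keeps the caller's string;
   a real table would copy it and use a hash or database index, not a scan.) */
static int internStructure(const char *canonical) {
    for (int i = 0; i < nstructs; i++)
        if (strcmp(structs[i], canonical) == 0)
            return i;                     /* already known: reuse it */
    if (nstructs == MAX_STRUCTS) return -1;
    structs[nstructs] = canonical;        /* new: becomes STRU-n */
    return nstructs++;
}

int main(void) {
    /* " / " is the line separator, as in the example above; sublines are
       interned before the structures that refer to them. */
    int n1 = internStructure("NOTE @N33@");                                       /* STRU-0 */
    int s1 = internStructure("SOUR @S22@ / 1 PAGE Volume 2; Page 43 / 2 STRU-0"); /* STRU-1 */
    int p1 = internStructure("PLAC Boston, MA, USA");                             /* STRU-2 */
    int n2 = internStructure("NOTE @N33@");  /* the 2 NOTE under DEAT: reuses STRU-0 */
    int s2 = internStructure("SOUR @S22@");  /* the 0 SOUR record: different, new STRU-3 */
    printf("STRU-%d STRU-%d STRU-%d STRU-%d STRU-%d\n", n1, s1, p1, n2, s2); /* 0 1 2 0 3 */
    return 0;
}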

Not sure if I explained that well enough to make sense, but it works very nicely for me.

Louis

Thomas Wetmore

Oct 14, 2024, 6:37:45 AM
to root...@googlegroups.com
Louis,

Thanks. It sounds like a lot of thought went into it. What do you use as the database? It seems like you've designed the database structure for speed and easy access to information, separating things in different ways to suit the operations you anticipate.

In contrast, I've been having odd thoughts about genealogical databases. How important are conventional databases, with their indexes and their tuning, in our time? How important is a customized structure for performance? An example:

LifeLines uses a persistent BTree for its database. It indexes persons by name. Each name has a code, and all persons with matching names are indexed under that code. When the user enters a name the index retrieves the matching persons. It is very fast.

In DeadEnds the database is in RAM, so access times are orders of magnitude faster. I reproduced the LifeLines name index in a hash table. Name search is instantaneous. I'm wondering whether a name index is needed at all. Given a name, why not iterate over every single person, looking for matches? Have we reached the point along Moore's law where we can store "normal sized" databases in memory and get rid of a layer of indexes? It makes the coding much simpler.

The records in LifeLines and DeadEnds are GNode trees, where each node is a Gedcom line and each tree is a Gedcom record. That's it. There are no indexes (other than the name index I'm thinking of getting rid of). The database is a couple of hash tables that map Gedcom keys to GNode trees. All genealogical operations are done by traversing the trees and doing things along the way. With processors now running multiple cores at over 3 GHz per core, it might be time for a rethink. "Where we're going, we don't need roads."
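
To make the "no name index" idea concrete, here is a rough C sketch; the GNode shape and the helper names are illustrative, not the actual DeadEnds code. The database reduces to a table of record keys and GNode roots, and a name search just walks every person:

#include <stdio.h>
#include <string.h>

typedef struct GNode {
    char *tag, *value;
    struct GNode *child, *sibling;
} GNode;

typedef struct { char *key; GNode *root; } Record;

static Record persons[100000];   /* stand-in for the key -> record hash table */
static int npersons = 0;

/* Visit every person whose 1 NAME line contains the search string. */
static void searchByName(const char *pattern, void (*visit)(Record *)) {
    for (int i = 0; i < npersons; i++)
        for (GNode *n = persons[i].root->child; n; n = n->sibling)
            if (strcmp(n->tag, "NAME") == 0 && n->value && strstr(n->value, pattern)) {
                visit(&persons[i]);
                break;
            }
}

static void show(Record *rec) { printf("match: %s\n", rec->key); }

int main(void) {
    GNode name = { "NAME", "Thomas /Wetmore/", NULL, NULL };
    GNode indi = { "INDI", NULL, &name, NULL };
    persons[npersons++] = (Record){ "I1", &indi };
    searchByName("Wetmore", show);   /* prints: match: I1 */
    return 0;
}

A linear scan like this over a few hundred thousand in-RAM records takes on the order of milliseconds, which is why the question above isn't crazy.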

Tom Wetmore

lkes...@lkessler.com

Oct 14, 2024, 8:03:53 PM
to root...@googlegroups.com

Tom,

Like you, I don't use a database, but just use in-memory data structures to hold my data.

I wrote the first version of my program over 30 years ago (Wow! It's been that long!) in Fortran and stored everything in arrays. I converted that to Turbo Pascal and (again like you) wrote my own B*-tree to hold my data. Both Fortran and Pascal have optimizing compilers and produce very fast code. I used Fortran in the 1970s for the chess program I developed, and in doing so I learned proper optimization techniques.

Turbo Pascal turned into Borland Delphi (an object-oriented version of Pascal with a tremendous IDE that had a great debugger). Embarcadero bought Delphi from Borland and they continue to develop and market it (now up to Version 12.2).

Behold has always been just a GEDCOM reader that produces a report from the data. It has always been fast because of its internal in-RAM data structures, but I could never make it quite as fast for very large files (millions of people) as GENViewer by MudCreek Software. Their secret trick for speed was memory-mapping the GEDCOM file, which gave them direct access to it without even physically reading the whole thing. They made a pass and indexed what they needed. By the way, GENViewer is still available for free at http://www.mudcreeksoftware.com/
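
For a sense of the general trick (a rough POSIX sketch in C, purely illustrative -- GENViewer is a Windows program and its code is not public): map the file into memory and make one pass over it, recording where records start instead of copying them.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    const char *text = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (text == MAP_FAILED) return 1;

    /* One indexing pass: every level-0 line starts a record. A real index
       would store each offset i; the sketch just counts them. */
    long records = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (text[i] == '0' && (i == 0 || text[i - 1] == '\n'))
            records++;
    printf("%ld level-0 records\n", records);

    munmap((void *)text, st.st_size);
    close(fd);
    return 0;
}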


About 10 years ago, I was working to turn Behold into a full-fledged genealogy editor. I knew then I'd need a database to do that. I considered SQLite, which can hold databases in RAM, but overall it was too clunky and wasn't fast enough for me. So I decided to ultimately go with Embarcadero's InterBase. It took several years to convert my internal data structures into database-compatible tables, and the result was the description I gave earlier. I got rid of my B*-trees (sad to see them go) and replaced them internally, for now, with Delphi's generic structures such as TList and TDictionary, many of which are hash-table based and so were even faster than my B*-trees.

But then what happened was a change to genealogy itself. About 8 years ago, I chose MyHeritage as the place where I'd keep and update my primary genealogy file. Their databases of billions of records, automatic record matching, and matches to other trees were a game changer. I knew I could never implement that myself in Behold. So even though I was ready to, I never did convert Behold to use InterBase.

But I'm still continuing to develop Behold as a GEDCOM reader. I can download my GEDCOM any time I want from MyHeritage and use Behold to view and make better use of the information.

With regards to your questions:

I would use a database, not a BTree or other data structures, for a genealogy data editing program. You really can't get much faster or simpler than an indexed database that is in RAM.

I would never iterate. A hash table will save you time. And that is just what a database index is.

Re: "making coding simpler" – Well, simply doing database calls is the simplest. If writing the code in your language of choice is difficult or gives complex routines, you could look for a higher-level language with object-oriented extensions to write it in. But then, you and I are old horses, and new tricks don't go well with us.

There's my 2 cents worth.

Louis


Thomas Wetmore

Oct 15, 2024, 11:12:32 AM
to root...@googlegroups.com


> On Oct 14, 2024, at 8:03 PM, <lkes...@lkessler.com> wrote:
>
> Tom,
> Like you, I don’t use a database, but just use in-memory data structures to hold my data.
> I wrote the first version of my program over 30 years ago (Wow!, it’s been that long!) in Fortran and stored everything in arrays. I converted that to Turbo Pascal and (again like you), wrote my own B*-tree to hold my data. Both Fortran and Pascal are optimized compilers and produce very fast code. I used Fortran in the 1970’s for the chess program I developed and in doing so, I learned proper optimization techniques.

I wrote my first genealogical software in 1987. It was an application generator that output troff files used to typeset family data sheets. It worked nicely, but the spec files had to include all the genealogical data for the family, requiring me to copy and paste data into several spec files. To avoid this I wrote a program to hold my genealogical data in one place and added a programming subsystem to generate the troff (or any other kind of) text output. That was version one of LifeLines.

> Turbo Pascal turned into Borland Delphi (an object-oriented version of Pascal with a tremendous IDE that had a great debugger). Embarcadero bought Delphi from Borland and they continue to develop and market it (now up to Version 12.2).
> Behold has always been just a GEDCOM reader that produces a report from the data. It has always been fast because of the internal in-RAM data structures, but I could never make it quite as fast for very large files (millions of people) as GENViewer by MudCreek Software. Their secret trick for speed was using memory mapping for the GEDCOM file and they had direct access to it without even physically reading the whole thing. They made a pass and indexed what they needed. By the way, GENViewer is still available for free at http://www.mudcreeksoftware.com/


Thanks for the Mud Creek link. You probably know about Pleiades Software with their GenMerge program. They have been around a long time doing very interesting stuff with Gedcom.

> About 10 years ago, I was working to turn Behold into a full-fledged genealogy editor. I knew then I’d need a database to do that. I considered SQLite which had the ability to have in-RAM databases, but it overall was too clunky and wasn’t fast enough for me. So I decided to ultimately go with Embarcadero’s InterBase. It took several years to convert my internal data structures into database-compatible tables, and the result was the description I gave earlier. I got rid of my B*-trees (sad to see them go) and replaced them internally for now with Delphi’s generic structures, such as TList and TDictionary, many of which are hash-table based so were even faster than my B*-tree.

LifeLines uses a custom BTree database and my own HashTable, Set, and sorted and unsorted List types. In DeadEnds I got rid of the database and use in-RAM hash tables. I plan to move DeadEnds from C to Swift, where 1) the container data types are built in, 2) dynamic memory management is built in, and 3) Unicode is built in.

> But then what happened was a change to genealogy itself. About 8 years ago, I chose MyHeritage to be where I’d keep and update my primary genealogy file. Their databases of billions of records and automatic record matching and matches to other trees was a game changer. I knew I could never implement that myself in Behold. So even though I was ready to, I never did convert Behold to use InterBase

Similarly, I shifted over to an Ancestry.com tree for my master "database". My DeadEnds database and Ancestry tree are out of sync, and I don't like that. You can generate a Gedcom file from an Ancestry tree, but 1) it is lossy, and 2) there are bugs in the generated Gedcom, forcing me to hand edit some FAMS and FAMC links. One of my goals for DeadEnds at present is to reach the point where it can merge my master LifeLines Gedcom file and my Ancestry.com Gedcom file.

> But I’m still continuing to develop Behold as a GEDCOM reader. I can download my GEDCOM any time I want from MyHeritage and use Behold to view and make use of the information better. With regards to your questions:
> I would use a database, not a BTree or other data structures for a genealogy data editing program. You really can’t get much faster or simpler than an indexed database that is in RAM. I would never iterate. A hash table will save you time. And that is just what a database index is. Re: “making coding simpler” – Well, simply doing database calls is the simplest. If writing the code in you language of choice is difficult or gives complex routines, you could look for a higher level language with object-oriented extensions to write it in. But then, you and I are old horses, and new tricks don’t go well with us. There’s my 2 cents worth.

Thanks for your thoughts on indexing. I'm sure I will leave the indexing in, but we are reaching a point with technology where it is not crazy to consider doing without.

Have you ever indexed on dates? Say you want to find all the persons whose birth dates are in a range, or whose death dates are, or where you have a range for both births and deaths. If you haven't done it, how would you? I've designed date indexes in the past but have yet to implement one.

> Louis

Really enjoyed reading about your experiences. Thanks.

Tom Wetmore





lkes...@lkessler.com

Oct 17, 2024, 12:50:20 PM
to root...@googlegroups.com
Tom asked:
> Have you ever indexed on dates? Say you want to find all the persons whose birth dates are in a range, or their deaths are, or you have a range for both births and deaths. If you haven't done it, how would you. I've designed date indexes in the past but have yet to implement one.

I have developed my own structure for dates and written about it here: https://www.beholdgenealogy.com/blog/?p=1640 which also explains how to sort dates.

With regards to indexing dates, I never had a need for it. But it would be tricky to index because you can't use exact lookups or hash tables. My initial thought would be to store the dates sorted in a binary tree or BTree, look for the first value >= the minimum date in your range, and then look for the last value <= the largest date in your range. The dates from that first value to the last value will have at least some values falling in your desired range.
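
As a rough sketch of that range lookup, assuming each date has already been reduced to a sortable integer such as YYYYMMDD (the names below are illustrative, and a sorted array with a binary-search lower bound stands in for the binary tree or BTree):

#include <stdio.h>
#include <stdlib.h>

typedef struct { long date; int recordKey; } DateEntry;

static int cmpDate(const void *a, const void *b) {
    long da = ((const DateEntry *)a)->date, db = ((const DateEntry *)b)->date;
    return (da > db) - (da < db);
}

/* First position whose date is >= key (binary-search lower bound). */
static int lowerBound(const DateEntry *v, int n, long key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v[mid].date < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

int main(void) {
    DateEntry idx[] = {
        { 19200301, 3 }, { 18550000, 1 }, { 18991231, 2 }, { 19050615, 4 },
    };
    int n = (int)(sizeof idx / sizeof idx[0]);
    qsort(idx, n, sizeof idx[0], cmpDate);

    /* Everyone with a date in 1890..1910: start at the first entry >= the low
       end of the range and stop once the dates pass the high end. */
    for (int i = lowerBound(idx, n, 18900000); i < n && idx[i].date <= 19101231; i++)
        printf("record %d, date %ld\n", idx[i].recordKey, idx[i].date);
    return 0;
}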

Louis



Thomas Wetmore

Oct 17, 2024, 3:06:18 PM
to root...@googlegroups.com
> On Oct 17, 2024, at 12:50 PM, <lkes...@lkessler.com> wrote:
>
> I have developed my own structure for dates and written about it here: https://www.beholdgenealogy.com/blog/?p=1640 which also explains how to sort dates.

Louis, Thanks for the link. Looking forward to reading it. I am still up in the air on implementing dates. I did a complete implementation for Lifelines, and another for another project, and lately I've carefully read the Gedcom standards. Even though I eschew official standards where possible, that's not a good idea in this case. Double dates and date ranges and partially known dates -- ugh.

> With regards to indexing dates, I never had a need for it. But it would be tricky to index because you can't use exact lookups or hashtables. My initial thoughts would be to store the dates sorted in a binary tree or Btree and look for the first value >= the minimum date in your range and then look for the last value <= the largest date in your range. The dates from that first value to the last value will have at least some values falling in your desired range.
>
> Louis

For indexing dates I thought an approach similar to my name indexing might work. First find the years (forget months and days) of every date in every record. Then create a hash table where the keys are the years and the elements are the sets of keys of all records that have that year in any of their dates. To look up a date, 1) get the year, 2) retrieve the set of possible records, then 3) inspect each record for an event and date that matches the search parameters.
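
A bare-bones sketch of that scheme, with a plain array indexed by year standing in for the hash table and integer keys standing in for Gedcom record keys (all names illustrative, not DeadEnds code):

#include <stdio.h>

#define MIN_YEAR 1000
#define MAX_YEAR 2100
#define MAX_PER_YEAR 64   /* kept tiny for the sketch */

static int bucket[MAX_YEAR - MIN_YEAR + 1][MAX_PER_YEAR];
static int count[MAX_YEAR - MIN_YEAR + 1];

/* Step 1: while loading, note every year that appears anywhere in a record. */
static void indexYear(int year, int recordKey) {
    if (year < MIN_YEAR || year > MAX_YEAR) return;
    int y = year - MIN_YEAR;
    if (count[y] < MAX_PER_YEAR) bucket[y][count[y]++] = recordKey;
}

/* Steps 2 and 3: fetch the candidate records for a year; the caller then
   inspects each one for an event and date matching the full search. */
static void candidatesForYear(int year, void (*inspect)(int recordKey)) {
    if (year < MIN_YEAR || year > MAX_YEAR) return;
    int y = year - MIN_YEAR;
    for (int i = 0; i < count[y]; i++) inspect(bucket[y][i]);
}

static void show(int key) { printf("candidate record %d\n", key); }

int main(void) {
    indexYear(1855, 11);   /* @I11@ above has a birth dated 1855 */
    indexYear(1920, 11);   /* and a death dated 1920 */
    indexYear(1920, 12);
    candidatesForYear(1920, show);   /* prints candidates 11 and 12 */
    return 0;
}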

You can imagine using larger indexes or more indexes. For example, go down to the month level, or separate deaths, births, marriages, etc. I wouldn't complexify the simple approach unless performance required it, and I can't imagine that it would. Will let you know what I learn once I have experimented.

Best, Tom Wetmore

