Partition and "Score" a Gedcom File with Memoization

3 views
Skip to first unread message

Thomas Wetmore

unread,
Oct 8, 2024, 11:06:54 PM10/8/24
to root...@googlegroups.com
I mentioned my intention to write a program to 1) Read a Gedcom file; 2) Break the records into closed partitions (closed sets of persons related to each other in any way); and 3) Computing the number of ancestors and descendants of all persons.

Running the program on my master Gedcom file (15,950 persons, 5,357 families) results in these times:

1. Read and parse file into memory and create all records: 67 ms
2. Validate all FAMC, FAMS, HUSB, WIFE, and CHIL links: 86 ms
3. Partition the records into closed sets (1,557 in this case): 212 ms
4. Find the numbers of ancestors and descendants of all persons: 16 ms
TOTAL 381 ms

I was surprised to see how small 4) is, 16 ms. Think about computing the number of ancestors of each person in a set. There will be many shared ancestors (and descendants). In a "dumb" implementation the recursions going up through ancestors and down through descendants will flow through the same persons 100s if not 1000s of times. This problem begs for a "memoization" solution, where after a number is computed, it is cached and never computed again. It worked very well here. I expected this number to be the highest of the four, not the smallest, showing how poor we humans are at predicting where performance problems will happen.

Another surprise is how long step 3) took, 212 ms to partition the 15,950 persons and 5,357 families into 1,557 partitions. You may wonder why there are so many partitions, but that's another story. There is one big partition that holds most of the records and is my main "conclusion" tree. I will analyze that code, and the code for step 2) to try to squeeze more milliseconds out of it. But it's not necessary. This is good performance. The program ran on an Apple Mac Studio with an M1 MAX chip, three generations out of date.

Tom Wetmore


Reply all
Reply to author
Forward
0 new messages