Partitioning Gedcom Records and CHATGPT comments.

Thomas Wetmore

Oct 4, 2024, 9:20:39 PM
to root...@googlegroups.com
Quick followup on recent messages.

Today I wrote a C program that reads a Gedcom file and partitions its records into closed sets of interconnected records. It uses my DeadEnds library. It was working more quickly than expected. I was about to embark on more detailed testing, when I thought it might be fun to pass it by ChatGPT first. I sent it the code of the main program, but nothing from the library. If I had gotten this code out of the blue, with someone asking me to give advice on it, it would have taken me some real time to do so. ChatGPT's response took two seconds at most. In retrospect I should have asked ChatGPT to write it for me. Below is what I sent and what ChatGPT responded:

WHAT I SENT
------------------------------------------------------------

A gedcom file holds genealogical data. The following functions read a gedcom file and then, using person-to-family links and family-to-person links, partition the records in the file into closed partitions, where each person in a partition is somehow connected through parent/child connections to every other person in the partition. I wrote this today and it seems to be working correctly, but I haven't checked it very far. Would you care to comment? Do you see anything wrong?

//
// main.c
// Partition
//
// Created by Thomas Wetmore on 4 October 2024.
// Last changed on 4 October 2024.

#include <stdio.h>
#include <stdlib.h>   // exit, getenv
#include <unistd.h>   // getopt, optarg
#include "import.h"
#include "gnodelist.h"
#include "utils.h"
#include "stringtable.h"
#include "generatekey.h"
#include "writenode.h"
#include "file.h"

static void getArguments(int, char**, String*);
static void getEnvironment(String*);
static void usage(void);
static void goAway(ErrorLog*);
static void doPartition(GNodeList*);
static List* createPartition(GNode*, GNodeList*);
static void createIndexOfRecords(GNodeList*);
static bool debugging = true;

// main is the main program of the Partition batch program.
int main(int argc, char** argv) {
    String gedcomFile = null;
    String searchPath = null;
    printf("%s: Partition begin.\n", getMillisecondsString());
    // Get the Gedcom file to process.
    getArguments(argc, argv, &gedcomFile);
    getEnvironment(&searchPath);
    gedcomFile = resolveFile(gedcomFile, searchPath);
    if (debugging) printf("Resolved file: %s\n", gedcomFile);
    // Read/parse the Gedcom file and get the list of all roots.
    File* file = openFile(gedcomFile, "r");
    ErrorLog* log = createErrorLog();
    GNodeList* roots = getGNodeTreesFromFile(file, log);
    printf("Partition: %s: read gedcom file.\n", getMillisecondsString());
    if (lengthList(log) > 0) goAway(log);
    closeFile(file);
    // Validate record keys read from the Gedcom file.
    checkKeysAndReferences(roots, file->name, log);
    printf("Partition: %s: validated keys.\n", getMillisecondsString());
    if (lengthList(log)) goAway(log);
    // Instead of creating a full database we only need a single RecordIndex.
    createIndexOfRecords(roots);
    // Call the function that creates the partitions.
    doPartition(roots);
}

// getArguments gets the Gedcom file name from the command line. It is mandatory.
static void getArguments(int argc, char* argv[], String* gedcomFile) {
    int ch;
    while ((ch = getopt(argc, argv, "g:")) != -1) {
        switch (ch) {
        case 'g':
            *gedcomFile = strsave(optarg);
            break;
        case '?':
        default:
            usage();
            exit(1);
        }
    }
    if (!*gedcomFile) {
        usage();
        exit(1);
    }
}

// createIndexOfRecords creates a RecordIndex from a GNodeList. The RecordIndex is a file local
// static variable, recordIndex.
static RecordIndex* recordIndex = null;
static void createIndexOfRecords(GNodeList* list) {
    recordIndex = createRecordIndex();
    FORLIST(list, el)
        GNode* root = ((GNodeListEl*) el)->node;
        String key = root->key;
        if (key) {
            int line = ((GNodeListEl*) el)->line;
            addToRecordIndex(recordIndex, key, root, line);
        }
    ENDLIST
}

static StringSet* visited = null;
static void doPartition(GNodeList* gnodes) {
    visited = createStringSet(); // Visited GNode keys.
    List* partitions = createList(null, null, null, false);
    // We want to put both person and family records in a partition.
    // A partition must be closed.
    FORLIST(gnodes, el)
        GNode* root = ((GNodeListEl*) el)->node;
        String key = root->key;
        if (key && !isInSet(visited, key)) {
            printf("Visiting %s\n", root->key);
            appendToList(partitions, createPartition(root, gnodes));
        }
    ENDLIST
    printf("THERE ARE %d PARTITIONS\n", lengthList(partitions));
    int count = 1;
    FORLIST(partitions, element)
        List* partition = (List*) element;
        printf("Partition %d has %d records\n", count++, lengthList(partition));
    ENDLIST
}

// createPartition creates a partition by finding the closure of the Gedcom graph containing
// the argument GNode.
static List* createPartition(GNode* root, GNodeList* gnodes) {
    // partition is the list of GNodes that form the partition created by this function.
    List* partition = createList(null, null, null, false);
    // queue is the list of unprocessed root nodes/records in the closure.
    List* queue = createList(null, null, null, false);
    // Add the original node as the first item in the queue of unprocessed nodes/records.
    prependToList(queue, root);
    // Remove and process nodes from the queue.
    while (lengthList(queue) > 0) {
        // Remove the next node from the queue to process.
        GNode* root = getAndRemoveLastListElement(queue);
        String key = root->key;
        if (!key) continue; // The header record does not have a key.
        // If the key is in the visited set the node has already been processed.
        if (isInSet(visited, key)) continue;
        addToSet(visited, key); // Add key to set of processed keys, and
        appendToList(partition, root); // add the node/record to the partition.
        // Process the tree. If a person, find the FAMS and FAMC links.
        if (recordType(root) == GRPerson) {
            for (GNode* child = root->child; child; child = child->sibling) {
                String tag = child->tag;
                if (eqstr(tag, "FAMS") || eqstr(tag, "FAMC")) {
                    String value = child->value;
                    GNode* node = searchRecordIndex(recordIndex, value);
                    if (!node) {
                        printf("ERROR ONE\n");
                        continue;
                    }
                    prependToList(queue, node);
                }
            }
        } else if (recordType(root) == GRFamily) {
            for (GNode* child = root->child; child; child = child->sibling) {
                String tag = child->tag;
                if (eqstr(tag, "HUSB") || eqstr(tag, "WIFE") || eqstr(tag, "CHIL")) {
                    String value = child->value;
                    GNode* node = searchRecordIndex(recordIndex, value);
                    if (!node) {
                        printf("ERROR TWO\n");
                        continue;
                    }
                    prependToList(queue, node);
                }
            }
        }
    }
    deleteList(queue);
    return partition;
}

// getEnvironment checks for the DE_GEDCOM_PATH env variable.
static void getEnvironment(String* gedcomPath) {
    *gedcomPath = getenv("DE_GEDCOM_PATH");
    if (!*gedcomPath) *gedcomPath = ".";
}

// usage prints the Partition usage message.
static void usage(void) {
    fprintf(stderr, "usage: Partition -g gedcomfile\n");
}

static void goAway(ErrorLog* log) {
    printf("Partition: cancelled due to errors\n");
    showErrorLog(log);
    exit(1);
}


--------------------------
WHAT CHATGPT RESPONDED
------------------------

Your code looks clean and logical, and from a high-level perspective, the structure seems solid for handling the partitioning of genealogical records into connected partitions. Here are some observations and suggestions:

1. Error Handling

• You’re checking for errors well when opening the GEDCOM file (openFile) and validating the keys. However, there are a couple of places where more robust error handling could help:
• In createPartition: You print "ERROR ONE" and "ERROR TWO" if searchRecordIndex fails to find a node. Instead of printing generic error messages, it would be more helpful to include additional context, such as the key or node value that triggered the error.
• It might also be useful to log the errors into the ErrorLog you created earlier for consistency.

2. Memory Management

• You’re creating lists dynamically using createList, but I don’t see any explicit deallocation of memory for these lists, except for the queue in createPartition. If these lists are allocated dynamically, make sure that memory is freed properly when no longer needed to avoid memory leaks, especially with partitions and visited.

3. Optimization

• Processing the queue in createPartition: Right now, you’re using prependToList to add nodes to the queue and getAndRemoveLastListElement to retrieve and process them. This approach mimics a stack, where the most recently added node is processed first. If you want to traverse the tree in breadth-first fashion (which may be a more natural fit for some genealogical structures), you can append elements to the end of the queue and remove them from the front.
• Set Operations: The visited set is key to ensuring you don’t reprocess nodes. If the StringSet you are using is implemented efficiently (like a hash table), this is good. If not, consider optimizing the underlying data structure for better lookup and insertion times.

4. Debugging Flag

• The use of debugging is a good idea, but I see that you’re mixing printf debug statements and error messages. It might be useful to control both debug logs and error outputs through a more refined logging system that can print to different destinations (stdout vs stderr, for example), and toggle between different levels of verbosity.

5. Handling of Edge Cases

• Does your program handle edge cases such as:
• GEDCOM files with circular references between families and individuals?
• GEDCOM records that might have missing or incomplete links?
• Orphan records (e.g., individuals or families without any connections)?

If these cases are unlikely in your input data, it’s not a major concern. But if they are, you may want to include additional checks or logging to handle them gracefully.

6. Improving Partitioning Logic

• The partitioning approach using parent-child links seems correct, but just to confirm: Have you thought about non-parental relationships (like spouse or siblings) that could also affect the “connectedness” of a group? It seems like you’re already handling HUSB, WIFE, and CHIL tags, so this looks on track, but it’s something to keep in mind depending on your GEDCOM file’s complexity.

Overall, the logic is clear and seems well-structured. Once you test it with a broader range of GEDCOM files, especially with edge cases, I think it will perform as expected. If any specific test cases fail, feel free to share more details and I’d be happy to dive deeper.

----------------------------------
Tom Wetmore
-------------------------------------




Enno Borgsteede

Oct 5, 2024, 11:13:19 AM
to root...@googlegroups.com
Hello Tom,

Have you ever asked ChatGPT to help optimize some of your source code?

I'm asking, because we have an algorithm in Gramps that can calculate
and display paths between people, which is more than 100 times slower
than a similar algorithm in Geneweb. I have a tree with more than
600,000 people, which includes a Karel de Grote GEDCOM that I once found
on the web, and Geneweb can compute and display a path between any two
people in that tree in 1 or 2 seconds. I see that on Geneanet, and also
when I run Geneweb here.

Algorithms like these are quite trivial, in the sense that you start
with one person, and then follow direct connections until you reach the
other one, and store some sort of path string while doing that. And if
my phone can do that in a reasonable time, close to real time, I see no
reason why Gramps shouldn't be able to do the same. And if an algorithm
is designed so that it never processes the same person twice, its
processing time can be quite predictable, and increase more or less
linearly with the size of the tree, or even better.

The performance of our algorithm seems to be nothing like that, and the
duration of the calculation looks more like it's exponential.

The source code is more than 3000 lines, so I doubt whether ChatGPT is
able to process that, but I may be able to find the main loop, and feed
that to the AI.

Do you or anyone else know of other AIs that may be able to do this? I often
use Visual Studio Code, but have never tried how that works with AI.

Regards,

Enno


Thomas Wetmore

Oct 5, 2024, 11:50:20 AM
to root...@googlegroups.com

On Oct 5, 2024, at 11:13 AM, Enno Borgsteede <enno...@gmail.com> wrote:

Hello Tom,

Have you ever asked ChatGPT to help optimize some of your source code?

Enno,

I haven't asked, but on occasion it makes suggestions.


I'm asking, because we have an algorithm in Gramps that can calculate and display paths between people, which is more than 100 times slower than a similar algorithm in Geneweb. I have a tree with more than 600,000 people, which includes a Karel de Grote GEDCOM that I once found on the web, and Geneweb can compute and display a path between any two people in that tree in 1 or 2 seconds. I see that on Geneanet, and also when I run Geneweb here.

I wrote a LifeLines script (called cousins) that does this. Scripts are interpreted, which slows things down, but for my database, much smaller than your 600,000 example, it runs fast. When it runs it prints a dot for every 100 persons considered, and the dots stream across the terminal very rapidly.

Algorithms like these are quite trivial, in the sense that you start with one person, and then follow direct connections until you reach the other one, and store some sort of path string while doing that. And if my phone can do that in a reasonable time, close to real time, I see no reason why Gramps shouldn't be able to do the same. And if an algorithm is designed so that it never processes the same person twice, its processing time can be quite predictable, and increase more or less linearly with the size of the tree, or even better.

In the LifeLines script language there is an efficient data type to hold sets of persons. Most operations you would imagine are implemented. For example you can ask for the ancestor set of a set, which returns all the ancestors of all the persons in the original. Other operations handle children, parents, descendants, spouses. There are the normal union, intersection and difference operations. With this datatype it is straightforward to write algorithms of substantial complexity. All the operations are order n or order n log n.
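
For anyone who doesn't know LifeLines, here is a rough Python sketch of the kind of set algebra I mean; the Person class and its parents list are hypothetical stand-ins, not the LifeLines API:

class Person:
    def __init__(self, key, parents=None):
        self.key = key
        self.parents = parents or []  # list of Person objects

def ancestor_set(persons):
    # Return the set of all ancestors of all persons in the input set.
    result = set()
    queue = list(persons)
    while queue:
        person = queue.pop()
        for parent in person.parents:
            if parent not in result:
                result.add(parent)
                queue.append(parent)
    return result

# Ordinary Python sets then give union, intersection and difference:
# common = ancestor_set({p1}) & ancestor_set({p2})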


The performance of our algorithm seems to be nothing like that, and the duration of the calculation looks more like it's exponential.

The source code is more than 3000 lines, so I doubt whether ChatGPT is able to process that, but I may be able to find the main loop, and feed that to the AI.

I would give it a try. So far ChatGPT has handled everything I've given it. It once told me I could send it a zip file of my full code base and it would comment on it for me. By the way, the cousins script is just 267 lines of code. It finds the closest relationship between any two persons (there can be many, as you know) and prints out that relationship in a way that shows all the steps.


Do you or anyone else know of other AIs that may be able to do this? I often use Visual Studio Code, but have never tried how that works with AI.

I use both Visual Studio Code and Xcode for development. Their AI extensions can give good suggestions as I'm typing, but I've never tried to use them for deeper things. I'm so old now that I really feel like a fossil with this newfangled stuff.

Regards,

Enno

Tom Wetmore


Enno Borgsteede

Oct 5, 2024, 3:32:20 PM
to root...@googlegroups.com
Op 05-10-2024 om 17:50 schreef Thomas Wetmore:
I would give it a try. So far ChatGPT has handled everything I've given it. It once told me I could send it a zip file of my full code base and it would comment on it for me. By the way, the cousins script is just 267 lines of code. It finds the closest relationship between any two persons (there can be many, as you know) and prints out that relationship in a way that shows all the steps.

I'll have another look at our code, but finding cousins is easy, and our Deep Connections Gramplet goes further than that, just like Geneanet does for these random persons:

I hope that this picture will make it, but if it does not, I can show a similar path from Gramps, as text.

Enno


Thomas Wetmore

Oct 5, 2024, 4:13:26 PM
to root...@googlegroups.com
On Oct 5, 2024, at 3:32 PM, Enno Borgsteede <enno...@gmail.com> wrote:
Op 05-10-2024 om 17:50 schreef Thomas Wetmore:
I would give it a try. So far ChatGPT has handled everything I've given it. It once told me I could send it a zip file of my full code base and it would comment on it for me. By the way, the cousins script is just 267 lines of code. It finds the closest relationship between any two persons (there can be many, as you know) and prints out that relationship in a way that shows all the steps.

I'll have another look at our code, but finding cousins is easy, and our Deep Connections Gramplet goes further than that, just like Geneanet does for these random persons:

<1s60IWyMsuyc2w6H.png>


Enno,

The program is called cousins, but it finds the relationship between any two persons who have any connection at all. It can go up and down child/parent chains any number of times, it can follow spouse and sibling links, it can shift over to in-laws. If there is any connection between the two persons, using any combination of FAMS, FAMC, HUSB, WIFE, and CHIL links, it will find it quickly.

Tom Wetmore

Enno Borgsteede

Oct 5, 2024, 4:52:27 PM
to root...@googlegroups.com
Op 05-10-2024 om 22:13 schreef Thomas Wetmore:
> The program is called cousins, but it finds the relationship between
> any two persons who have any connection at all. It can go up and down
> child/parent chains any number of times, it can follow spouse and
> sibling links, it can shift over to in-laws. If there is any
> connection between the two persons, using any combination of FAMS,
> FAMC, HUSB, WIFE, and CHIL links, it will find it quickly.

OK, thanks. That suggests that it's time for some serious refactoring, or
a rewrite.

Enno


Thomas Wetmore

Oct 5, 2024, 5:02:40 PM
to root...@googlegroups.com
Enno,

Picture came through fine. I just checked the program to see if I was telling you the truth (it is over 30 years old). I lied. The "cousins" program cannot find any connection, only biological ones. However the "relate" program, which is very similar, can find ANY relationship. It determines if two persons are connected and if they are shows the connection details. It keeps "breadcrumbs." It is surprisingly short, about 90 lines of code, and is attached. The while loop in the relate procedure is the core; grok that and you've got it. I said I used the personset data structure to write this, but no, the program uses two hash tables and a list.

Tom Wetmore

-------------------------------- relate -- a LifeLines script/program -------------------

/*
relate - Finds a shortest path between two persons in a LifeLines
database.
by Tom Wetmore (t...@petrel.att.com)
Inspired by Jim Eggert's relation program
Version 1, 08 September 1993
*/

proc main ()
{
    getindimsg(from, "Please identify starting person.")
    getindimsg(to, "Please identify ending person.")
    if (and(from, to)) {
        print("Computing relationship between:\n ")
        print(name(from)) print(" and ")
        print(name(to)) print(".\n\nThis may take awhile -- ")
        print("each dot is a person.\n")

        set(fkey, save(key(from)))
        set(tkey, save(key(to)))
        call relate(tkey, fkey)
    } else {
        print("We're ready when you are.")
    }
}

global(links)
global(rels)
global(klist)

proc relate (fkey, tkey)
{
    table(links) /* table of links back one person */
    table(rels) /* table of relationships back one person */
    list(klist) /* list of persons not linked back to */

    insert(links, fkey, fkey)
    insert(rels, fkey, ".")
    enqueue(klist, fkey)
    set(again, 1)

    while (and(again, not(empty(klist)))) {
        set(key, dequeue(klist))
        set(indi, indi(key))
        call include(key, father(indi), ", father of")
        call include(key, mother(indi), ", mother of")
        families(indi, fam, spouse, num1) {
            children(fam, child, num2) {
                call include(key, child, ", child of")
            }
            if (spouse) {
                call include(key, spouse, ", spouse of")
            }
        }
        if (fam, parents(indi)) {
            children(fam, child, num2) {
                call include(key, child, ", sibling of")
            }
        }
        if (key, lookup(links, tkey)) {
            call foundpath(tkey)
            set(again, 0)
        }
    }
    if (again) {
        print("They are not related to one another.")
    }
}

proc include (key, indi, tag)
{
    if (and(indi, not(lookup(links, key(indi))))) {
        print(".")
        set(new, save(key(indi)))
        insert(links, new, key)
        insert(rels, new, tag)
        enqueue(klist, new)
    }
}

proc foundpath (key)
{
    print("\n\nA relationship between them was found:\n\n")
    set(again, 1)
    while (again) {
        print(" ")
        print(name(indi(key)))
        print(lookup(rels, key))
        print("\n")
        set(new, lookup(links, key))
        if (eq(0, strcmp(key, new))) {
            set(again, 0)
        } else {
            set(key, new)
        }
    }
}


Enno Borgsteede

Oct 6, 2024, 11:07:34 AM
to root...@googlegroups.com

Hello Tom,

Thank you. I knew that the code could be made simpler, and ChatGPT also suggested that it may process the same persons multiple times, and some extra logging proves that this is indeed the case, although I haven't figured out the best solution yet, to make it fit in Gramps.

ChatGPT gave me a handful of comments, including this:

1. Optimize the Queue Operations

  • Current Issue: The self.queue is being used as a list, and you are using pop(0), which removes the first element. In Python, popping from the front of a list is an O(n) operation because it requires shifting all elements.
  • Solution: Use a deque (from the collections module), which provides an O(1) time complexity for both append and pop operations at both ends.
and

2. Reduce Duplicate Calculations and Cache More Data

  • Current Issue: You're likely recalculating relationships between the same people multiple times.
  • Solution: Use a more comprehensive caching mechanism. You already have self.cache, but consider expanding it to store intermediate results like relationships between people, so that any repeated queries can be avoided entirely.
and another one:

3. Optimize Relationship Search Algorithm

If self.get_relatives or relationship_calc.get_one_relationship performs complex operations, see if those methods themselves can be optimized. Specifically:

  • If they query the database, check for indexing or ways to limit the number of queries.
  • If these methods have repetitive calculations, caching or memoization might be helpful.
And these are all quite to the point.
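
To make the first suggestion concrete, here is a minimal sketch of the deque change; the string keys are just hypothetical person handles, not real Gramps objects:

from collections import deque

# List as a queue: pop(0) shifts every remaining element, so a breadth-first
# search over n people pays O(n) per dequeue, O(n^2) overall.
queue = ["I1", "I2", "I3"]
person = queue.pop(0)

# deque as a queue: append() and popleft() are both O(1).
queue = deque(["I1", "I2", "I3"])
person = queue.popleft()
queue.append("I4")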

Until now, my experience with ChatGPT was not very good, because it didn't seem to have specific knowledge of GEDCOM, nor about specific buildings in my neighborhood, or other things that I see as facts.

Some time ago, I asked it to generate code to remove lines with specific tags from a GEDCOM file, and that time, it produced code that removed all lines that had that tag somewhere in that line, and didn't seem to know that the tag is the second token in a line, and is preceded by a number.

Things may be better now.

Regards,

Enno


Thomas Wetmore

Oct 6, 2024, 11:40:58 AM
to root...@googlegroups.com
On Oct 6, 2024, at 11:07 AM, Enno Borgsteede <enno...@gmail.com> wrote:
>
> Hello Tom,
> Thank you. I knew that the code could be made simpler, and ChatGPT also suggested that it may process the same persons multiple times, and some extra logging proves that this is indeed the case, although I haven't figured out the best solution yet, to make it fit in Gramps

Enno,

You gave me the idea of asking ChatGPT to comment on the "relate" script. I was amazed by its response, generated in a second or two. I'll attach it. It "understands" the algorithm better than I do! When I first played with ChatGPT the week after it was released, I told my wife that this was the beginning of the biggest revolution that will ever happen to humankind. In the interim year or so I have seen no reason to change that opinion. Fortunately I'm old and won't be around many more years; I'm not sure I'd want to live in the world this revolution is bringing.

Tom Wetmore

---------------- ChatGPT's response when I asked for comments on the "relate" LifeLines script shared yesterday ------------

This LifeLines script is quite impressive in its simplicity and efficiency, especially given the task it performs—finding the closest relationship between two individuals in a genealogy database. Here are a few observations about its structure and functionality:

1. Breadth-First Search (BFS) for Relationships:
• The script essentially implements a breadth-first search (BFS) to find the shortest path (i.e., closest relationship) between two individuals. BFS is an optimal choice for this kind of problem because it guarantees that the first time the target person is found, the path is the shortest possible.
• By maintaining a queue (klist) and progressively exploring the relationships (parents, children, siblings, spouses), the program efficiently explores the family tree.

2. Use of Tables for Tracking:
• Two tables (links and rels) are used to store the chain of connections (backtracking information) and the type of relationship for each person encountered. This reduces redundancy and keeps track of the paths already explored, ensuring no cycles are followed and no person is revisited unnecessarily.
• This is crucial for genealogical datasets that can have looping relationships (e.g., cousins marrying), which the program efficiently handles.

3. Short and Elegant Code:
• The concise nature of the code shows the power of the LifeLines scripting language in abstracting complex genealogical operations. This is a clear advantage over other programs where similar functionality might require 1000s of lines due to a lack of such high-level abstractions.
• For example, the way relationships are encapsulated in include() and foundpath() allows the code to remain focused and modular, each function handling one specific task.

4. Heuristic Efficiency:
• The use of a progress indicator (dots printed for each person examined) shows the script is designed with the user’s experience in mind. The inclusion of print(".") gives feedback during potentially long computations without slowing the program down excessively.
• The check for whether the target person (tkey) has already been reached (lookup(links, tkey)) ensures that unnecessary computations are stopped as soon as the shortest relationship is found.

5. Handling Complex Family Structures:
• The script considers various family structures, such as parents, children, spouses, and siblings, making it versatile in finding relationships in complex genealogical trees.

Potential Improvement:

• Optimization for Larger Databases: While BFS is optimal for finding the shortest path, for very large genealogical databases, memory usage could potentially become an issue with a large queue (klist). Introducing techniques to limit search depth or prune branches that are unlikely to yield results (such as distant relatives many generations apart) might improve performance in huge datasets.

Overall, this is a highly efficient and elegant solution to the problem. The LifeLines scripting language clearly allows powerful genealogical queries with relatively little code. Would you like to explore optimizations or alternative algorithms for larger data sets, or is the current solution meeting your needs well?

Enno Borgsteede

Oct 6, 2024, 6:17:49 PM
to root...@googlegroups.com
Hello Tom,

When I compare your code with what we have for Gramps, it's clear that
the algorithm is almost the same, but your code is shorter, and simpler.
And that difference is important, and one of the weak points of ChatGPT,
which did tell me that our code might process the same persons multiple
times, but did not detect where our code should be modified to add
breadcrumbs, even though that's quite obvious, because we have the
equivalent of your include procedure, where we add relatives to our
processing queue.

In your code, you only queue relatives when they're not in the links
table, and ChatGPT does not see that I should do the same in our code,
which only checks against the breadcrumbs after popping that relative
from the queue. And that's a true waste of time, that's quite easy to
see, for a human.

ChatGPT suggests optimizations that make the code longer, and that's
where it fails. It does not see that the code can be simplified, like we
can.

Regards,

Enno

Thomas Wetmore

Oct 7, 2024, 5:58:47 AM
to root...@googlegroups.com
On Oct 6, 2024, at 6:17 PM, Enno Borgsteede <enno...@gmail.com> wrote:
>
> Hello Tom,
>
> When I compare your code with what we have for Gramps, it's clear that the algorithm is almost the same, but your code is shorter, and simpler. And that difference is important, and one of the weak points of ChatGPT, which did tell me that our code might process the same persons multiple times, but did not detect where our code should be modified to add breadcrumbs, even though that's quite obvious, because we have the equivalent of your include procedure, where we add relatives to our processing queue.

The LifeLines language is customized to handle genealogical data and write genealogical reports. I'm sure the LL code is shorter than Gramps's because of its large number of genealogical operations. If I add up the lines of C code behind the LL interpreter and the "built-ins" that implement the operations the story is different. The LL language API does, IMHO, hit a sweet spot for implementing genealogical algorithms.
>
> In your code, you only queue relatives when they're not in the links table, and ChatGPT does not see that I should do the same in our code, which only checks against the breadcrumbs after popping that relative from the queue. And that's a true waste of time, that's quite easy to see, for a human.
>
> ChatGPT suggests optimizations that make the code longer, and that's where it fails. It does not see that the code can be simplified, like we can.

As much as I rave about ChatGPT, it does not bat 100%.

Tom Wetmore
>
> Regards,
>
> Enno

Enno Borgsteede

Oct 7, 2024, 10:19:47 AM
to root...@googlegroups.com

Hello Tom,

The LifeLines language is customized to handle genealogical data and write genealogical reports. I'm sure the LL code is shorter than Gramps's because of its large number of genealogical operations. If I add up the lines of C code behind the LL interpreter and the "built-ins" that implement the operations the story is different. The LL language API does, IMHO, hit a sweet spot for implementing genealogical algorithms.
I understand that, but I also know that anyone can write compact code by using proper functions and procedures, and it's also a matter of style. I'm inclined to write long texts, because I find it hard to remember parameters, but Gramps and Python have enough means to write compact code. It's just not there in this thing.

As much as I rave about ChatGPT, it does not bat 100%.

Hehe, I know. I commented, because what I see here confirms my earlier experience which suggests that it's quite superficial. It is able to find simple ways for improvement, like replacing a list with a queue, or caching some data, but it does not see the main problem, which is in the missing breadcrumbs. And that part is quite easy to see too, once you know where to look. And that seems to be a natural consequence of the fact that it's a language model, not true AI. It may look impressive, and sometimes it is, but people were also impressed by Eliza.

https://theconversation.com/ai-tools-produce-dazzling-results-but-do-they-really-have-intelligence-223311

There are many more articles about this, like this one:

https://link.springer.com/article/10.1007/s10676-024-09775-5

And at the same time, I found that it can produce very nice code, when I just give it the actual challenge, to which it replied that we can use a Breadth-First Search (BFS) and gave me a piece of code that's as compact as yours:

from collections import deque

def kortste_pad(start, doel):
    # If the start and the goal are the same
    if start == doel:
        return [start.naam]
    
    # BFS setup
    queue = deque([(start, [start.naam])])
    bezocht = set([start])
    
    while queue:
        huidige_persoon, pad = queue.popleft()

        # Search through the children, parents and spouses
        for verwant in huidige_persoon.kinderen + huidige_persoon.ouders + huidige_persoon.echtgenoten:
            if verwant not in bezocht:
                if verwant == doel:
                    return pad + [verwant.naam]
                
                queue.append((verwant, pad + [verwant.naam]))
                bezocht.add(verwant)

    return None  # No path found
The names are in Dutch, but I bet that you get the idea. I left out the test class and test code, but it gave me that too, and it has a queue for the people that need to be processed, and a set for the breadcrumbs, so it's about the most efficient that I can get.

The only problem for me is that Gramps policies don't allow AI generated code, so I need to keep this secret somehow.

I hereby admit that I can't do this much better myself.

Cheers,

Enno


Wayne Pearson

Oct 7, 2024, 11:17:50 AM
to root...@googlegroups.com
On Mon, Oct 7, 2024 at 8:19 AM Enno Borgsteede <enno...@gmail.com> wrote:

The only problem for me is that Gramps policies don't allow AI generated code, so I need to keep this secret somehow.

I hereby admit that I can't do this much better myself.

Treat the code like any other code you might find, whether in a textbook, Stack Overflow or any discussion forum - use it as a starting point, and write your own version. No one should use AI generated code as-is -- it should be reviewed, as you adapt it to your needs, to ensure that it really is doing what the AI is so confidently claiming it does.

At that point, you're not using AI-generated code, you're using code adapted from AI, which after all, was adapted from its training data, and thus is no less (or more) valid than the code that user2368429 on Stackexchange posted.

--
  Wayne

Enno Borgsteede

Oct 7, 2024, 11:40:57 AM
to root...@googlegroups.com
Op 07-10-2024 om 17:17 schreef Wayne Pearson:
> Treat the code like any other code you might find, whether in a
> textbook, Stack Overflow or any discussion forum - use it as a
> starting point, and write your own version. No one should use AI
> generated code as-is -- it should be reviewed, as you adapt it to your
> needs, to ensure that it really is doing what the AI is so confidently
> claiming it does.

Oh, yes, I know that, by experience.

I once let ChatGPT 'write' code to generate _UID tags with a checksum,
as used in PAF, and the code looked great, and the UIDs too, but a
friend told me that his program replaced them with other ones, implying
that that program did not accept them, and created its own.

This proves that ChatGPT just 'wrote' code to create some sort of
checksum, with the right number of digits, but not the one used by PAF,
and defined by the LDS. And once I found that definition, with example
code, written by Gordon Clarke, I got it right in a few minutes.

For further reading, check this:

https://www.scientificamerican.com/article/chatgpt-isnt-hallucinating-its-bullshitting/

And that's an article that I can heartily agree with. Hallucinating
would mean that it would consistently 'write' the wrong code, or give
the wrong answers, but that's not the case. It simply does not know what
it's doing, hence the BS in the title.

Regards,

Enno


John Cardinal

Oct 7, 2024, 1:11:19 PM
to root...@googlegroups.com
Enno,

I meant to reply earlier but didn't get to it until now.

I have several programs that calculate relationships. They follow this basic approach:

1. Build a hash table collection (key is ID, value is #generations) of person 1's direct ancestors.

2. Do a depth-first search of person 2's direct ancestors. Look up each of person 2's ancestors in the hash table from step 1, and if found, and if #generations is lower than any prior match, update the "closest common ancestor". For any hit, stop searching that subtree.

After the search from step 2, "closest common ancestor" will be unset or will be set to the closest common ancestor, notwithstanding ties. The method could be tweaked to find all relationships if necessary.
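
In rough Python terms the approach looks like this; get_parents is a hypothetical accessor, not my actual C# code:

def ancestor_generations(person, get_parents):
    # Map each ancestor of person to its generation distance (parents = 1).
    table = {}
    frontier = [(person, 0)]
    while frontier:
        current, gen = frontier.pop()
        for parent in get_parents(current):
            if parent not in table or table[parent] > gen + 1:
                table[parent] = gen + 1
                frontier.append((parent, gen + 1))
    return table

def closest_common_ancestor(p1, p2, get_parents):
    # Return (ancestor, generations from p1, generations from p2), or None.
    table = ancestor_generations(p1, get_parents)
    table[p1] = 0  # let p1 itself count as a common ancestor
    best = None
    stack = [(p2, 0)]
    seen = set()
    while stack:
        current, gen = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in table:
            if best is None or table[current] < best[1]:
                best = (current, table[current], gen)
            continue  # a hit: stop searching this subtree
        for parent in get_parents(current):
            stack.append((parent, gen + 1))
    return best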

The speed of this algorithm is not directly affected by the total number of people in the GEDCOM file. The performance is determined by the number of ancestors of the two people involved. Most people don't have a lot of known ancestors. Someone with a maximum depth of 20 generations of ancestors, and thus more than a million *possible* direct line ancestors, will have far fewer known ancestors.

I just did a test using a C# implementation. The code loaded the ancestors collection (673 people) in 0ms. Of course, it wasn't zero, but the elapsed time was less than the .NET "StopWatch" timer resolution. Finding the relationship between person 1 and 2 also took 0ms. The measurements do not include the time to open/load the genealogy data.

Here are some examples from a test harness I use for the relationship calculator.

Example 1: siblings

37:45.390 I Loading relationship template strings from file
37:45.409 I Loaded 5776 relationship templates
Build Person 1 ancestor table: 673 people in 0ms
Find relationship: 0ms
Person 2 is brother of Person 1

Example 2: distant cousins

41:56.363 I Loading relationship template strings from file
41:56.374 I Loaded 5776 relationship templates
Build Person 1 ancestor table: 673 people in 0ms
Find relationship: 0ms
Person 2 is 3rd cousin 3 times removed of Person 1

The total elapsed time--about 20ms--is mostly spent doing the setup work necessary to write the message that provides the text version of the relationship ('brother of', '3rd cousin 3 times removed of'). The program uses template files for internationalization. In a real application, the templates would be loaded only once.

John


Thomas Wetmore

Oct 7, 2024, 2:07:04 PM
to root...@googlegroups.com
John,

The problem Enno and I are concerned with is finding the closest relationship between two persons via ANY path, including spouses, siblings, children, etc. That is, we do not require the two persons to have a common ancestor or be biologically related in any way. For example, finding the relationship between you and the children of your sister-in-law's husband's parents.

My version, and I assume Enno's, begins with a queue that has the first person. Then add the parents, children, siblings and spouses of the person to the queue. Then iterate the queue: take a person off the queue, see if it's the target, and if not, add all its parents, children, siblings, spouses to the queue, but of course, never add anyone to the queue that you've already taken off. This does a breadth first search through the graph of persons that "surrounds" the first person, going further and further away from that person, until you run into the second person or empty the queue. If there is any path between the two persons this will find the shortest one. If there is no path it must eventually quit when there are no more people to look at.

Tom Wetmore

> On Oct 7, 2024, at 1:11 PM, John Cardinal <jfcar...@gmail.com> wrote:
>
> Enno,
>
> I meant to reply earlier but didn't get to it until now.
>
> I have several programs that calculate relationships. They follow this basic approach:
>
> 1. Build a hash table collection (key is ID, value is #generations) of person 1's direct ancestors.
>
> 2. Do a depth-first search of person 2's direct ancestors. Look up each of person 2's ancestors in the hash table from step 1, and if found, and if #generations is lower than any prior match, update the "closest common ancestor". For any hit, stop searching that subtree.
> ...

John Cardinal

Oct 7, 2024, 2:43:10 PM
to root...@googlegroups.com
Tom,

Thanks for the clarification. Though I didn't describe it, you can easily extend my method to find a relationship via a spouse, but it does not handle social relationships like "sister-in-law's husband's parents".

John

Thomas Wetmore

Oct 7, 2024, 2:50:53 PM
to root...@googlegroups.com
John,

If you extend your system to include spouses AND siblings I believe you can get it all. In my example one spouse link goes to your wife, a sibling to your sister-in-law, a spouse to her husband and then a parent: two spouse links, a sibling link and a parent link, a path distance of only four. You'd find that path as fast as you'd find a first cousin, two parent links and two child links.

Tom Wetmore

Enno Borgsteede

Oct 7, 2024, 2:56:22 PM
to root...@googlegroups.com
Op 07-10-2024 om 20:06 schreef Thomas Wetmore:
> My version, and I assume Enno's, begins with a queue that has the first person. Then add the parents, children, siblings and spouses of the person to the queue. Then iterate the queue: take a person off the queue, see if it's the target, and if not, add all its parents, children, siblings, spouses to the queue, but of course, never add anyone to the queue that you've already taken off. This does a breadth first search through the graph of persons that "surrounds" the first person, going further and further away from that person, until you run into the second person or empty the queue. If there is any path between the two persons this will find the shortest one. If there is no path it must eventually quit when there are no more people to look at.

That's exactly what our code tries to do too, with the exception that it
has a problem with the breadcrumbs, which is quite hard to find, because
it's a bit of a mess. And maybe it's not that hard, but fellow
developers seem to get scared by its messy looks, and me too.

The code is on GitHub for all to see at

https://github.com/gramps-project/addons-source/blob/maintenance/gramps52/DeepConnectionsGramplet/DeepConnectionsGramplet.py

and the algorithm's core is in main, where you can find some sort of a
cache that was probably meant to store breadcrumbs. And although it
stores some, it seems to have little effect. And that's where I hoped that
ChatGPT might find something, but it didn't.

In our case, the get_relatives function processes associations too, and
person links in attached notes, but removing those features does not
speed things up, so the problem really seems to be in the breadcrumbs
implementation.

If you look at the code, starting at line 211, you can see the queue,
and a cache, and even some fragment that tried to check that cache, but
was commented out. Reactivating that doesn't work, so I either need to
delve deeper, or replace it with something else.

Note that this is all open source, so you can fork the repo and try
things at home. You need more than one repo though, because this one's
just for the add-ons.

Regards,

Enno


Thomas Wetmore

Oct 7, 2024, 5:28:51 PM
to root...@googlegroups.com

On Oct 7, 2024, at 2:56 PM, Enno Borgsteede <enno...@gmail.com> wrote:

That's exactly what our code tries to do too, with the exception that it has a problem with the breadcrumbs, which is quite hard to find, because it's a bit of a mess. And maybe it's not that hard, but fellow developers seem to get scared by its messy looks, and me too.

To clarify what "breadcrumbs" means (is it the same word in Dutch? In English the meaning comes from German fairy tales).

Finding a path between two persons is fine, but you have to display that path. Clearly there are links going from the first person through the graph to the target person. You have to record those links somehow in order to reconstruct the path. An interesting question is how?

Tom Wetmore


Enno Borgsteede

Oct 8, 2024, 9:35:24 AM
to root...@googlegroups.com
Hello Tom,

> To clarify what "breadcrumbs" means (is it the same word in Dutch? In
> English the meaning comes from German fairy tales).

Same here. The Dutch word is 'broodkruimels', and I know it from the
fairy tale of Klein Duimpje, or Little Thumb, which is the literal
translation of that expression. And although I thought that it was
German, its true origins seem to be French:

https://en.wikipedia.org/wiki/Hop-o%27-My-Thumb

The article also mentions Hansel and Gretel (Hans en Grietje), and
that's German.

>
> Finding a path between two persons is fine, but you have to display
> that path. Clearly there are links going from the first person through
> the graph to the target person. You have to record those links somehow
> in order to reconstruct the path. An interesting question is how?

Well, there are several answers, in our existing code, in yours, and in
the fresh code that I got from ChatGPT. And now that I think that I've
added the proper breadcrumbs in Gramps, and find that it's still slow,
I'm beginning to think that the problem may be in some other parts of
our code, and maybe in the way we store paths. And that's a bit harder
to figure out, but I can try Visual Code for that, because it works
quite well for Python debugging, in Linux.

And in the meantime, I also found another interesting piece in the main
code of Gramps:

https://github.com/gramps-project/gramps/blob/master/gramps/plugins/tool/notrelated.py

This tool was designed to find all people who are not related to the
selected person, and its main procedure, called findRelatedPeople, walks
the tree in a similar manner as in the other code, with breadcrumbs, but
without storing paths, and it's quite fast. It could also be adapted for
partitioning, by repeating the process for the first unrelated person,
and repeating that, until all clusters are found.
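
In rough Python, that adaptation would look something like this,
where get_relatives is a hypothetical callback returning the parents,
children, spouses and siblings of a person:

from collections import deque

def partition_people(all_people, get_relatives):
    # Split the tree into connected clusters with repeated breadth-first searches.
    visited = set()
    clusters = []
    for start in all_people:
        if start in visited:
            continue
        cluster = []
        queue = deque([start])
        visited.add(start)
        while queue:
            person = queue.popleft()
            cluster.append(person)
            for relative in get_relatives(person):
                if relative not in visited:
                    visited.add(relative)
                    queue.append(relative)
        clusters.append(cluster)
    return clusters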

It looks like your code stores the path in a more efficient way than in
our code, which seems to store an ever longer path for each person
that's further away from the start. And that can be quite a drag. I see
that in the fresh code from ChatGPT too, where it's not a problem in a
small test tree, but can be in the big ones.

Enno


Thomas Wetmore

Oct 8, 2024, 10:58:08 AM
to root...@googlegroups.com
Enno,

I know you understand my breadcrumb approach, but if anyone else is interested here is a quick summary. It uses two hash tables. The keys used in each are the persons' keys, which are in fact their Gedcom cross-references.

The first hash table maps each key to the key of the previous person in the path. The second hash table maps each key to the relationship that that person has with the previous person in the path. There is only one place in the overall algorithm where these tables are added to, which is when a never seen before person is added to the queue.

Once the final person is found the first hash table is used to construct the path by backing up through it, and the second is used to get the relationship at each step. 
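
In rough Python terms, with links and rels as plain dicts keyed by person key, the backing-up step is just:

def build_path(target_key, links, rels):
    # Walk the breadcrumb tables backward from the target to the start person.
    steps = []
    key = target_key
    while True:
        steps.append((key, rels[key]))  # person plus relationship to the previous one
        previous = links[key]
        if previous == key:  # the start person was seeded to link to itself
            break
        key = previous
    return steps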

Tom Wetmore

Enno Borgsteede

Oct 8, 2024, 12:11:42 PM
to root...@googlegroups.com
I know, and this one's just for fun. It's the program that you posted
here, translated to Python:

> # relate - Finds a shortest path between two persons in a LifeLines
> # database.
> # by Tom Wetmore (t...@petrel.att.com)
> # Inspired by Jim Eggert's relation program
> # Version 1, 08 September 1993
>
> from collections import deque
>
> def main():
>     from_person = input("Please identify starting person: ")
>     to_person = input("Please identify ending person: ")
>
>     if from_person and to_person:
>         print(f"Computing relationship between:\n  {from_person} and {to_person}.\n")
>         print("This may take a while -- each dot is a person.\n")
>
>         fkey = save_key(from_person)
>         tkey = save_key(to_person)
>
>         relate(fkey, tkey)
>     else:
>         print("We're ready when you are.")
>
> # Global variables
> links = {}
> rels = {}
> klist = deque()
>
> def relate(fkey, tkey):
>     global links, rels, klist
>
>     links[fkey] = fkey
>     rels[fkey] = "."
>     klist.append(fkey)
>
>     found = False
>
>     while not found and klist:
>         key = klist.popleft()
>         indi = get_individual(key)
>         include(key, get_father(indi), ", father of")
>         include(key, get_mother(indi), ", mother of")
>
>         for fam in get_families(indi):
>             for child in get_children(fam):
>                 include(key, child, ", child of")
>             spouse = get_spouse(fam)
>             if spouse:
>                 include(key, spouse, ", spouse of")
>
>         for fam in get_parents_families(indi):
>             for sibling in get_children(fam):
>                 include(key, sibling, ", sibling of")
>
>         if key == tkey or tkey in links:
>             foundpath(tkey)
>             found = True
>
>     if not found:
>         print("They are not related to one another.")
>
> def include(key, indi, tag):
>     if indi and save_key(indi) not in links:
>         print(".")
>         new_key = save_key(indi)
>         links[new_key] = key
>         rels[new_key] = tag
>         klist.append(new_key)
>
> def foundpath(key):
>     print("\n\nA relationship between them was found:\n\n")
>     while True:
>         print(f"  {get_name(get_individual(key))} {rels[key]}")
>         new_key = links[key]
>         if new_key == key:
>             break
>         key = new_key
>
> # Helper functions to simulate database interactions
> def save_key(person):
>     # Simulate saving or retrieving a unique key for a person
>     return person
>
> def get_individual(key):
>     # Simulate retrieving an individual by key
>     return key
>
> def get_father(indi):
>     # Simulate retrieving father of an individual
>     return None
>
> def get_mother(indi):
>     # Simulate retrieving mother of an individual
>     return None
>
> def get_families(indi):
>     # Simulate retrieving the families of an individual
>     return []
>
> def get_children(fam):
>     # Simulate retrieving children from a family
>     return []
>
> def get_spouse(fam):
>     # Simulate retrieving spouse from a family
>     return None
>
> def get_parents_families(indi):
>     # Simulate retrieving parent families of an individual
>     return []
>
> def get_name(indi):
>     # Simulate retrieving the name of an individual
>     return indi

I like this, because it works without even knowing the LifeLines language.


Enno Borgsteede

Oct 8, 2024, 12:16:30 PM
to root...@googlegroups.com

Note that I did not test what I just sent, but just showed what looks like good Python. It came with these comments:

Key Changes and Considerations:

  1. Input and Output: I replaced getindimsg with Python’s input() function and simplified the print statements.
  2. Table/Lists: In Python, dictionaries (links, rels) are used for the mappings, and a deque for the queue (klist).
  3. Database simulation: Functions like get_father(), get_families(), save_key() are placeholders simulating how the data is accessed.
  4. Control flow: The while loops and conditionals from the original code have been adapted using Python's idiomatic control structures.

This version assumes you have some data source or database that these helper functions (get_father(), etc.) can interact with. You can replace these with actual data retrieval logic as needed.


Thomas Wetmore

Oct 8, 2024, 2:29:59 PM
to root...@googlegroups.com
Enno,

Fun seeing that! I've been programming since '66, but haven't written a line of Python. Fortran from '66 to '78, and C from '78 'til now, with lots of Algol, Pascal, C++, Objective C, and Swift in the mix. I read a Python book after applying for a job in bioinformatics, but it didn't happen.

Tom Wetmore