Assigning Keys to Gedcom Records


Thomas Wetmore

Jun 10, 2024, 7:49:49 PM
to root...@googlegroups.com
I wonder if there is any interest in this.

In Gedcom files Person and Family (INDI and FAM) and the other record types are identified with keys. (Gedcom specs call records structures, and call keys cross reference identifiers).

Gedcom keys start and end with @-signs, and between them are digits, capital (Latin) letters, and underscores in any order. There are no requirements on length.

Programs that generate Gedcom files must generate and assign keys to the records. There are a few patterns that show up. For example, person records are sometimes keyed in order as @I1@, @I2@, and so on.
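
As a minimal sketch of the key syntax described above (an illustration only; the names `KEY_PATTERN` and `is_valid_key` are made up for this example and come from no Gedcom library), the rule can be written as a regular expression:

```python
import re

# Key syntax as stated in this post: "@", then one or more digits,
# capital Latin letters, or underscores in any order, then "@".
# No length limit is enforced, matching the statement above.
KEY_PATTERN = re.compile(r"@[A-Z0-9_]+@")


def is_valid_key(text: str) -> bool:
    """Return True if text is a syntactically valid key under the rule above."""
    return KEY_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["@I1@", "@F_22@", "@i1@", "I1", "@@"]:
        print(candidate, is_valid_key(candidate))
```

Under this reading, @I1@ and @I2@ pass, while a key with lowercase letters or without the surrounding @-signs does not.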

Keys are "owned" by Gedcom. Users are not supposed to worry about them, know about them, use them or think about them. But in some situations users like to memorize keys of "key" persons so they can go straight to them.

Questions.

Do you care how keys are assigned? If keys are sequential and assigned in the order that records are added, they will bounce around based on the vagaries of your research.

If you could come up with a scheme to assign key identifiers to records, would you want to do that? For instance would you want persons who are closely related to have similar keys?

And if you wanted to do that what kind of scheme would you want to use?

Tom Wetmore


Jason Wyckoff

Jun 11, 2024, 8:00:54 AM
to root...@googlegroups.com

> Do you care how keys are assigned? If keys are sequential and assigned in the order that records are added, they will bounce around based on the vagaries of your research.

Assuming keys are immutable... IMHO, no. Keys should be arbitrary identifiers that do not contain meaning. Though I've used integers for my Primary Keys (PKs) for years, I have embraced GUIDs recently to break me from attempting to extract meaning from the keys.



> If you could come up with a scheme to assign key identifiers to records, would you want to do that? For instance would you want persons who are closely related to have similar keys?

The issue with putting meaning or logic into keys is what happens when information changes.

What if you find out that someone really isn't related? What happens when you find another sibling in the family? What happens to the order that you put into the primary key?

There are numbering systems available, but I believe that should be more decorative and illustrative instead of being a primary key. See https://familytreemagazine.com/organization/genealogy-numbering-systems/

> And if you wanted to do that what kind of scheme would you want to use?
>
> Tom Wetmore


--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rootsdev/95B417F9-5D43-4244-A84E-0334829422F8%40gmail.com.

Marshall Lake

Jun 11, 2024, 8:21:43 AM
to Digest recipients

A couple of caveats ...

I'm on travel currently and not totally in a "developer mode".

As a developer, my experience with Gedcoms is pretty limited.


The way that the Gedcom is designed, I don't see how controlling the way
the key identifiers are assigned would make development or usage easier or
better.

Perhaps I'm missing something and/or need further explanation as to what
is trying to be accomplished.



> In Gedcom files Person and Family (INDI and FAM) and the other record types are identified with keys. (Gedcom specs call records structures, and call keys cross
> reference identifiers).
>  
> Gedcom keys start and end with @-signs, and between them are digits, capital (Latin) letters, and underscores in any order. There are no requirements on length.
>  
> Programs that generate Gedcom files must generate and assign keys to the records. There are a few patterns that show up. For example, person records are sometimes keyed
> in order as @I1@, @I2@, and so on.
>  
> Keys are "owned" by Gedcom. Users are not supposed to worry about them, know about them, use them or think about them. But in some situations users like to memorize
> keys of "key" persons so they can go straight to them.
>  
> Questions.
>  
> Do you care how keys are assigned? If keys are sequential and assigned in the order that records are added, they will bounce around based on the vagaries of your
> research.
>  
> If you could come up with a scheme to assign key identifiers to records, would you want to do that? For instance would you want persons who are closely related to have
> similar keys?
>  
> And if you wanted to do that what kind of scheme would you want to use?

--
Marshall Lake -- marsha...@gmail.com -- http://www.mlake.net

paul...@gmail.com

Jun 11, 2024, 9:25:33 AM
to rootsdev
> "in some situations users like to memorize keys of "key" persons so they can go straight to them"

You don't say which software you use to access those records. Certainly, decent desktop software such as Family Historian offers many different ways to "bookmark" persons of interest.

Thinking in terms of relational databases, what is effectively the Primary Key is usually *intended* to be an arbitrary value, as trying to create one from one or more "table columns" is fraught with danger if any of those values ever change.

My advice is "never go there".

Thomas Wetmore

Jun 11, 2024, 11:41:48 AM
to root...@googlegroups.com

On Jun 10, 2024, at 9:05 PM, Jason Wyckoff <ja...@jasonwyckoff.com> wrote:


> > Do you care how keys are assigned? If keys are sequential and assigned in the order that records are added, they will bounce around based on the vagaries of your research.

> Assuming keys are immutable... IMHO, no. Keys should be arbitrary identifiers that do not contain meaning. Though I've used integers for my Primary Keys (PKs) for years, I have embraced GUIDs recently to break me from attempting to extract meaning from the keys.

I agree that they are arbitrary and carry no meaning. I'm the author of a genealogical program that keeps all records in Gedcom format. The database is persistent, so once a record is in, its key never changes. I have users who have memorized the keys of a few of their "key" (ha ha) relatives and don't want them to change. These users want keys to be honored. This goes against the Gedcom specs. If users want their own keys there is the REFN tag.


> > If you could come up with a scheme to assign key identifiers to records, would you want to do that? For instance would you want persons who are closely related to have similar keys?

> The issue with putting meaning or logic into keys is what happens when information changes.

Exactly. I'm writing a new system that doesn't use a persistent database. It reads Gedcom files into an in-RAM "database". If changes are made, the database is written out to a new Gedcom file. For now I honor the keys in the files, but I plan to change this and reassign random keys to records on import. Then, if multiple Gedcom files are used to build the database, where key clashes are highly likely, there will be no problems.

Without something clever, new random keys will be assigned on every import. I don't think that's an issue. On my M1-based Mac, my 20,000-record database imports in a few milliseconds. This includes parsing the Gedcom, validating it, generating keys, creating the database, indexing the names, and a few other operations. With speeds like that, rekeying the records on every import has a vanishingly small impact on performance.
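
Rekeying on import, as described above, could look roughly like the following. This is a sketch under stated assumptions, not the code of the system being written: it treats the file as a list of already-stripped Gedcom lines, treats any line whose whole value is a key as a pointer, and uses made-up names (`random_key`, `rekey`) and an arbitrary key length of 8.

```python
import re
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits
# A level-0 line that starts a keyed record, e.g. "0 @I1@ INDI".
RECORD_LINE = re.compile(r"^0 (@[A-Z0-9_]+@) (\w+)")
# A line whose entire value is a key, e.g. "1 FAMS @F1@" (assumed to be a pointer).
POINTER_LINE = re.compile(r"^(\d+ \w+ )(@[A-Z0-9_]+@)$")


def random_key(used, length=8):
    """Draw random keys until one is found that is not already in use."""
    while True:
        key = "@" + "".join(secrets.choice(ALPHABET) for _ in range(length)) + "@"
        if key not in used:
            used.add(key)
            return key


def rekey(lines):
    """Return (new_lines, mapping) with every record key replaced by a random one."""
    used = set()
    mapping = {}
    # First pass: choose a new key for every record defined in the file.
    for line in lines:
        match = RECORD_LINE.match(line)
        if match:
            mapping[match.group(1)] = random_key(used)
    # Second pass: rewrite record headers and the pointers that refer to them.
    out = []
    for line in lines:
        match = RECORD_LINE.match(line)
        if match:
            old = match.group(1)
            out.append("0 " + mapping[old] + line[2 + len(old):])
            continue
        match = POINTER_LINE.match(line)
        if match and match.group(2) in mapping:
            out.append(match.group(1) + mapping[match.group(2)])
            continue
        out.append(line)
    return out, mapping
```

Because each imported file gets its own mapping, two files that both contain @I1@ end up with different keys in the merged database. Keys embedded in free text, such as inside a note, are not touched by this sketch.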

My first program comes with a built-in, complete programming language that allows users to do anything at all with the data. Initially intended for report generation, programs now do a wide array of things. The feature allows users to put the keys of persons and families into reports. This probably fostered the memorization users did. I conclude that reports should never mention internal keys.

> What if you find out that someone really isn't related? What happens when you find another sibling in the family? What happens to the order that you put into the primary key?

> There are numbering systems available, but I believe that should be more decorative and illustrative instead of being a primary key. See https://familytreemagazine.com/organization/genealogy-numbering-systems/

Using keys like Henry numbers or register numbers is a non-starter. They are after-the-fact keys that are generated only in the context of a limited report output.

Thomas Wetmore

Jun 11, 2024, 11:49:02 AM
to root...@googlegroups.com


> On Jun 11, 2024, at 8:21 AM, Marshall Lake <marsha...@gmail.com> wrote:
>
>
> A couple of caveats ...
>
> I'm on travel currently and not totally in a "developer mode".
>
> As a developer, my experience with Gedcoms is pretty limited.
>
>
> The way that the Gedcom is designed, I don't see how controlling the way the key identifiers are assigned would make development or usage easier or better.
>
> Perhaps I'm missing something and/or need further explanation as to what is trying to be accomplished.
>
I don't think you are missing anything. Gedcom keys should be invisible to users. But my program, mentioned briefly in the last message, presents records to users in pure Gedcom form because that's how they are kept in the database. Users modify the database by directly editing Gedcom records. They have gotten used to seeing the keys, and they memorize a few of them, and they get peeved if any change. I'm trying to find the best way to wean them away from this behavior. I contributed to this mess by allowing users to search by key values as well as by names. But they can also search by REFN value, which is the way they should do it if they don't want to type in a name.

Tom Wetmore

Doug Henderson

Jun 11, 2024, 12:42:10 PM
to root...@googlegroups.com
Tom,

The consumer of a GEDCOM cannot make any assumptions about the record cross-reference ID values. It should validate that all ID values used within records are defined in the GEDCOM.

The producer of a GEDCOM must make sure that all referenced ID values are defined in the GEDCOM.
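
A minimal sketch of that check, useful on both the consumer and the producer side. It assumes the file is available as a list of stripped lines and that a pointer is any line whose entire value has key syntax; the function name `undefined_references` is made up for this example.

```python
import re

# A level-0 line that defines a record ID, e.g. "0 @I1@ INDI".
RECORD_LINE = re.compile(r"^0 (@[A-Z0-9_]+@) \w+")
# A lower-level line whose whole value is an ID, e.g. "1 FAMC @F9@".
POINTER_LINE = re.compile(r"^[1-9]\d* \w+ (@[A-Z0-9_]+@)$")


def undefined_references(lines):
    """Return (line_number, id) pairs for IDs that are referenced but never defined."""
    defined = set()
    referenced = []
    for number, line in enumerate(lines, start=1):
        match = RECORD_LINE.match(line)
        if match:
            defined.add(match.group(1))
            continue
        match = POINTER_LINE.match(line)
        if match:
            referenced.append((number, match.group(1)))
    # Filter only after the whole file is read, so forward references are fine.
    return [(number, key) for number, key in referenced if key not in defined]
```

An empty result means every referenced ID is defined somewhere in the file.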

There is the possibility for ambiguous references to record IDs where IDs are inserted into records where such usage is not mandated by the GEDCOM standards. For example, a record ID can be inserted into a NOTE.TEXT, by editing that text in an application that directly works with GEDCOM records, such as your LifeLines program, or some of my own programs which provide similar functionality. I have never tried to "correctly" handle these cases, nor do I have a reliable definition of what "correctly" means in the presence of applications which do not handle these cases at all.

When I need to generate record IDs, I generally use an id that has two parts: The first is the minimal unique prefix based on the record type, with the second being a sequence (starting at 1) for each prefix, in the order of records in the GEDCOM file.
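
One way to read that scheme, sketched in Python. The interpretation of "minimal unique prefix" (the shortest leading part of a record tag that no other record tag in the file shares) and both function names are assumptions made for this example.

```python
def minimal_prefixes(tags):
    """Map each record tag to its shortest prefix not shared with another tag."""
    unique_tags = sorted(set(tags))
    prefixes = {}
    for tag in unique_tags:
        candidate = tag
        for length in range(1, len(tag) + 1):
            candidate = tag[:length]
            if not any(other != tag and other.startswith(candidate)
                       for other in unique_tags):
                break
        # If one tag is a prefix of another, the whole tag is used as the prefix.
        prefixes[tag] = candidate
    return prefixes


def assign_sequential_keys(record_tags):
    """Assign @<prefix><n>@ IDs, counting from 1 per prefix, in file order."""
    prefixes = minimal_prefixes(record_tags)
    counters = {}
    keys = []
    for tag in record_tags:
        prefix = prefixes[tag]
        counters[prefix] = counters.get(prefix, 0) + 1
        keys.append(f"@{prefix}{counters[prefix]}@")
    return keys
```

For a file whose records are, in order, INDI, INDI, FAM, SOUR, SUBM, INDI, this yields @I1@, @I2@, @F1@, @SO1@, @SU1@, @I3@: SOUR and SUBM both start with S, so each needs a second letter.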

I think that having rules for generating IDs is generally not useful outside of the context of one or a tightly related group of applications. All GEDCOM consumers must never make assumptions about the value of an ID unless there was some way to guarantee that the producer of the GEDCOM was following some predefined rules. I suppose that the consumer could verify that the producer used the same IDs as the consumer would create when it produces an unchanged GEDCOM, prior to making any changes to the imported GEDCOM.

I export GEDCOM files from FamilySearch, MyHeritage, Ancestry, FindMyPast, and WikiTree web sites. I have Ancestral Quest, RootsMagic, Legacy, GEDKeeper2 installed on my laptop, all of which import and export GEDCOM files. And I have written numerous programs in Go and Python which import and export GEDCOM files, and manipulate the data or generate reports.

Having a regular system for generating GEDCOM record IDs is very helpful for testing and debugging GEDCOM output files and is very useful to the developer, but publishing and supporting some scheme is just extra work for little benefit.

Doug





--
Doug Henderson, Calgary, Alberta, Canada - from gmail.com

Thomas Wetmore

Jun 11, 2024, 4:34:17 PM
to root...@googlegroups.com
Doug,

Thanks. I agree. LifeLines generates sequential keys, e.g., @I1@, @I2@, ..., for persons, etc., matching your approach. In LL users cannot modify any keys or the values of FAMS, FAMC, HUSB, WIFE, CHIL nodes, so they can't mess up the lineage-linking part of the database. But as you say, they can add keys as values to other nodes. LL has or had a feature where it traverses Gedcom trees for values that are syntactically keys to make sure they point to real things.

I'm doing a redesign now (after 30+ years) and will probably go with random keys and use the Go language. LL makes it convenient for users to try to memorize the keys of "key" persons, which, in hindsight, does more harm than good. Users should add REFN tags if they want shortcuts.

Tom W.

Stephen Woodbridge

Jun 11, 2024, 8:38:53 PM
to root...@googlegroups.com
Hi Thomas,

I created a system for rendering my GEDCOM as a website by loading it into tables in a database. I added additional tables for managing a large collection of family photos with a couple of tables like:

photos
--------
photo_id
description

indi_photos
---------------
indi_id
photo_id

Obviously, indi_id renumbering/changing can't be done or it breaks this. On the file system, all the photos were stored in a directory and named <photo_id>.jpg, which made it easy to compute the img tag URL.

One reason for this was that the GEDCOM standard for multimedia was not well defined or supported by most programs at the time.
It would be nice if this were better supported or if there were an immutable UUID assigned to each INDI. The UUID probably needs to be prefixed by the program assigning it so merging data from various programs does not lead to collisions.
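
A tiny sketch of such an identifier, assuming a random (version 4) UUID and a short program name joined with a colon. The function name, the separator, and the example prefix "MYAPP" are all made up for illustration.

```python
import uuid


def new_record_uid(program_prefix: str) -> str:
    """Return an identifier of the form '<program>:<random UUID>'."""
    return f"{program_prefix}:{uuid.uuid4()}"


if __name__ == "__main__":
    # Assigned once when the INDI is created, then stored and never regenerated.
    print(new_record_uid("MYAPP"))
```

The value would be stored alongside the record, not used as the GEDCOM key itself, since it contains characters a key cannot hold.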

Just some thoughts,
  -Steve W

Bean

Jun 12, 2024, 10:07:15 AM
to rootsdev
From the perspective of an engineer of code used by customers:
You cite GEDCOM as saying that users are not supposed to worry about keys or know about them, so anything other than as-good-as-random keys works against that intent. In the most ideal world as a user, the program knows what I want right when I want it, guides me to it, and I'm done. In other words, a good search button or "Here's a list of all related information you need" should suffice. If the IDs are helpful enough to make the search easier without being exposed directly to me, then as a user I am content.

From the perspective of an engineer who knows that loading IDs with useful information can make searches faster:
Some databases store by ID and randomly generate one if you do not supply it. CouchDB is one example. There are indexing tools and plugins which can leverage cleverly crafted IDs for use during search. So whatever you use, consider being structured if you want to take advantage of these. Otherwise you are talking about maintaining some sort of data structure with partial duplication in order to take advantage of these later, and if the structure is built on the fly, the associated costs might annul the benefits.

From the perspective of an engineer who knows that they might be helpful to users, especially on smaller files or where it is "me-centric":
I'd say that if you are going to use vanity/customized IDs rather than GUIDs/UUIDs, then first you have to accept that they are arbitrarily selected by the user and verified to be unique before being accepted (like a Twitter handle), perhaps defaulting to something containing first-name+last-name, or "depth-X" where X is the minimum step distance from me, plus an incrementor if not already unique. Of course, as more information becomes known or as data is updated, these would have to be updated, which throws off the option to "memorize" and reference keys directly as a user.

In this case, though, it seems like having a vanity ID on top of the general one makes more sense. And even if not unique, normal functionalities would not break. In cases where duplicates occur and a search occurs, you could display pertinent differentiating information adjacent to each vanity ID such as vanityID+dateOfEntry, but use a different coloration for date of entry and have a tooltip explaining why the vanityID receives extra decoration.

Thomas Wetmore

Jun 12, 2024, 12:26:54 PM
to root...@googlegroups.com

Bean,

Thanks for your comments. I agree.

Here's a quick review of Gedcom's official view on id's.

• Each record has a key (in Gedcom parlance: cross-reference identifier). They are "owned" by Gedcom. Users aren't supposed to use them. However if programs return the key in search results, users see them and learn about them. And in programs like LifeLines, where users edit their Gedcom records directly they can't help but see them (they can't change them, but they see them). Though they are not supposed to, users get possessive about the keys and can get peeved if the software changes them. These keys are the database id's.

Gedcom has three tags for other kinds of ID's; from the Gedcom 7 specs:

• REFN is a user-generated identifier for a structure.
• UID is a globally-unique identifier for a structure.
• EXID is an identifier maintained by an external authority that applies to the subject of the structure.

REFN is the proper way for a user to add their own identifier to a Gedcom record. REFN's are "owned" by the user. LifeLines supports them and makes sure that a REFN id appears in only one record. (A single record can have multiple REFN values, but each value can be found in only one record.) In LifeLines a user can search for records by key, by the value of any 1 NAME line, or by REFN values. It was maybe a mistake to allow users to search by key.
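
The uniqueness rule for REFN values can be sketched as a small index. This is an illustration of the rule only, not how LifeLines implements it; the class name `RefnIndex` and its methods are invented for the example.

```python
class RefnIndex:
    """Maps each REFN value to the single record key that owns it."""

    def __init__(self):
        self._owner = {}

    def add(self, refn: str, record_key: str) -> None:
        """Register a REFN value; reject it if another record already uses it."""
        owner = self._owner.get(refn)
        if owner is not None and owner != record_key:
            raise ValueError(f"REFN {refn!r} already belongs to {owner}")
        self._owner[refn] = record_key

    def lookup(self, refn: str):
        """Return the record key for a REFN value, or None if it is unknown."""
        return self._owner.get(refn)


if __name__ == "__main__":
    index = RefnIndex()
    index.add("TTW", "@I1@")      # one record may hold several REFN values
    index.add("AUTHOR", "@I1@")
    print(index.lookup("TTW"))
```

Because the search goes through the REFN value, the record key behind it can change on every import without the user noticing.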

When people think of id's for genealogical records they sometimes think about Register numbers or Henry numbers, that is, id's that encode relationships. It doesn't take much thought to realize that these are non-starters for id's of any kind. They get horribly complex and have to change en masse when persons are added or deleted.

Tom Wetmore

Przemek Więch

Jun 17, 2024, 4:50:46 AM
to root...@googlegroups.com
Hi,

Here are a few ideas from the perspective of a user and a developer.

1. As a user I often use IDs to navigate to a certain individual. The combination of first and last name can repeat multiple times and sometimes I don't even have the first and last name of an individual.

2. It is important that IDs do not change. Tools such as Topola Viewer use the IDs to create permalinks to specific individuals, e.g. https://pewu.github.io/topola-viewer/#/view?url=https%3A%2F%2Fchronoplexsoftware.com%2Fmyfamilytree%2Fsamples%2FThe%2520Kennedy%2520Family.gdz&indi=I0 (notice &indi=I0).

3. In my workflow I sometimes merge 2 GEDCOM files into one. This is a bit problematic in terms of assigning IDs because both files may have conflicting ids. My hacky solution is to convert one set of ids from I123 to Ix123 where x is a letter identifying the other file. The goal is to make the conversion deterministic.

4. A more robust solution would be using UUIDs or something akin to URIs like https://example.com/family.ged#I123. Since the GEDCOM key doesn't allow special characters, one could put the URI in the REFN field and set the record key as the md5 checksum of the URI. The checksum would be deterministic and globally unique (with a good enough probability). The URI idea is probably a bit too complicated though.
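
Points 3 and 4 can be sketched as follows. Both functions are illustrations with made-up names. The first assumes keys of the shape letters-then-digits and uppercases the file letter so the result fits the key syntax given earlier in the thread. The second uppercases the md5 hex digest for the same reason and relies on the statement that keys have no length limit.

```python
import hashlib
import re

# Keys shaped like @I123@: a run of letters followed by a run of digits.
KEY_PARTS = re.compile(r"@([A-Z_]+)(\d+)@")


def tag_key(key: str, file_letter: str) -> str:
    """Point 3: turn @I123@ into @IX123@, where X identifies the second file."""
    match = KEY_PARTS.fullmatch(key)
    if match is None:
        raise ValueError(f"unexpected key shape: {key!r}")
    prefix, number = match.groups()
    return f"@{prefix}{file_letter.upper()}{number}@"


def key_from_uri(uri: str) -> str:
    """Point 4: derive a deterministic key from a URI via its md5 checksum."""
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest().upper()
    return f"@{digest}@"


if __name__ == "__main__":
    print(tag_key("@I123@", "b"))
    print(key_from_uri("https://example.com/family.ged#I123"))
```

Both conversions are deterministic, so merging the same two files again produces the same keys. The letter scheme can still clash if the first file already uses keys such as @IB123@; the checksum scheme avoids that at the cost of long, unmemorable keys.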

Best,
Przemek


Richard Light

Jun 17, 2024, 12:29:36 PM
to root...@googlegroups.com

Hi,

I think that the design of GEDCOM doesn't lend itself to overloading the record key with additional meaning. As Przemek points out, the key only has meaning in the scope of the file within which it sits. As soon as you try to merge two or more GEDCOM files, problems start to crop up.

I have tended to use the WWW field to hold persistent URLs which identify a particular source:

For individuals, as against sources of information, there is of course the Wikitree site https://www.wikitree.com/ which assigns a unique, persistent URL to each person.

Richard

Thomas Wetmore

Jun 17, 2024, 1:19:29 PM
to root...@googlegroups.com
Przemek,

Thanks. The records in a LifeLines database are pure Gedcom. One can search for persons by Gedcom key, with or without the @-signs, by REFN value, or by name. The program uses person keys of the form @Innn@, and persons can also be searched for by just nnn. Name searching is flexible, e.g. my name can be searched as 't/wtmr'. This finds the list of people with names that match, which the user chooses from.

LifeLines users often do memorize the keys of a few key people, and don't like it if they change. Nevertheless, the keys "belong" to Gedcom and can theoretically change at any time. In my databases I add REFN's to a few key people usually using their initials, which are then permanent ids.

Tom Wetmore