Embedding Semantics in Gedcom Keys

Thomas Wetmore

Sep 15, 2024, 8:56:52 PM

to root...@googlegroups.com

I have brought up the topic of Gedcom keys (cross references) before. Many programs use keys like @I1@ and @F4@, using a letter to indicate record type and an integer to sequence through the record types. Keys consist of @-signs surrounding letters, digits and underscores. There can be a max of 20 characters between the @-signs. The keys "belong" to Gedcom; the user should have no control over them.
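As a minimal sketch, the key shape described above can be checked with a regular expression. It takes the description at face value (letters, digits and underscores, at most 20 characters between the @-signs); the exact grammar depends on the GEDCOM version, and the names `is_valid_key` and `split_key` are made up for illustration.

```python
import re

# Shape described in the post: "@", then 1 to 20 letters, digits or
# underscores, then "@". This is an assumption, not a full GEDCOM grammar.
KEY_PATTERN = re.compile(r"@[A-Za-z0-9_]{1,20}@")


def is_valid_key(key: str) -> bool:
    """Return True if the key matches the shape described in the post."""
    return KEY_PATTERN.fullmatch(key) is not None


def split_key(key: str) -> tuple[str, int]:
    """Split a letter-plus-integer key such as @I1@ into ('I', 1)."""
    match = re.fullmatch(r"@([A-Za-z])([0-9]+)@", key)
    if match is None:
        raise ValueError(f"not a letter-plus-integer key: {key!r}")
    return match.group(1), int(match.group(2))
```

Here `split_key("@F4@")` gives the record-type letter and the sequence number separately, which is the only information the common @I1@ / @F4@ pattern carries.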

Though my LifeLines program uses the single letter/sequential number pattern, I've never liked it. Usually the "key" persons in a database will be the persons created earliest, so will have the lowest integers, and users get used to those keys and want to search using them. In my current program I can use a random key generator.
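For illustration only, a random key generator could look something like the sketch below. This is not the generator in my program; the alphabet, the default length of 6 and the function name `random_key` are all assumptions.

```python
import random
import string
from typing import Optional, Set

# Hypothetical alphabet: uppercase letters and digits only.
ALPHABET = string.ascii_uppercase + string.digits


def random_key(existing: Set[str], length: int = 6,
               rng: Optional[random.Random] = None) -> str:
    """Return a new random key not in `existing`, and record it there."""
    rng = rng or random.Random()
    while True:
        body = "".join(rng.choice(ALPHABET) for _ in range(length))
        key = f"@{body}@"
        if key not in existing:  # retry on the rare collision
            existing.add(key)
            return key
```

Keys produced this way carry no record type and no creation order, so nobody gets attached to the low numbers.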

I'm wondering whether there might be other useful ways to assign key values. For instance can we somehow define the "importance" of a person in a database and assign them a key based on that?

What would "importance" mean? I think many databases are created to search for the genealogy of a small set of persons. These persons are important. Would a user have to specify the "important" persons, or could software define them?

If you have 20 characters to play with in assigning keys you can imagine all sorts of weird schemes to encode information.

Could you define the "most connected" person in the database/Gedcom file and start the keys with that person? How would you define such a person? If you could get the counts of all ancestors and descendants for the persons in your database, would that be a way to score them and define importance? If this makes sense, how would you convert these counts into keys?

I'm going to run an experiment on my own database to compute these "connectedness scores" to see what they look like.
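A rough sketch of one way such a score could be computed: count each person's distinct ancestors plus distinct descendants, then hand out sequential keys in descending score order. The input format (a dict from each person's key to the keys of that person's parents) and both function names are assumptions made for the example.

```python
from collections import defaultdict


def connectedness_scores(parents: dict[str, list[str]]) -> dict[str, int]:
    """Score each person as (distinct ancestors) + (distinct descendants)."""
    children = defaultdict(list)
    people = set(parents)
    for child, child_parents in parents.items():
        for parent in child_parents:
            children[parent].append(child)
            people.add(parent)

    def reachable(start, edges):
        # Depth-first walk; the set keeps persons reached by several
        # paths (pedigree collapse) from being counted twice.
        seen = set()
        stack = list(edges.get(start, []))
        while stack:
            person = stack.pop()
            if person not in seen:
                seen.add(person)
                stack.extend(edges.get(person, []))
        return seen

    return {
        person: len(reachable(person, parents)) + len(reachable(person, children))
        for person in people
    }


def keys_by_score(scores: dict[str, int]) -> dict[str, str]:
    """Map old keys to new @I<n>@ keys, highest score first."""
    ranked = sorted(scores, key=lambda p: (-scores[p], p))  # ties: old key
    return {person: f"@I{rank}@" for rank, person in enumerate(ranked, start=1)}
```

Ties are broken by the old key only to make the result repeatable; any other tie-break would do.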

Are there any thoughts out there on whether this issue is an issue at all? Maybe I shouldn't be worrying about such a silly thing.

Tom Wetmore

Luther Tychonievich

Sep 16, 2024, 7:28:01 PM

to root...@googlegroups.com

It sounds like you are talking about d'Aboville numbers or a similar system (https://en.wikipedia.org/wiki/Genealogical_numbering_systems), though those only label INDI, not other records like FAM, SOUR, REPO, etc., and don't handle ALIA. I can see how there could be a different system that handles all of those, though I wonder if it could be simple.

I don't understand what benefit would be gained by using different identifiers. As you note, the xref_ids are entirely internal, so unless changing them provides some performance benefit, I'm not sure what value any particular scheme would provide. I personally use just sequential integers, no letters at all, because it requires less logic to pick the ID of the next new record and ensure uniqueness. If you use them like a sorting key when showing a list of results to a query or the like, I'd think it would make more sense to let the user pick the anchor person and compute the order on the fly. If you have a meaningful scheme, I'd think it would make more sense to put it in a REFN instead of an xref_id so that the meaning won't be lost by other applications during import.

FYI, the 20-character limit only applies in GEDCOM versions 5.4, 5.5, and 5.5.1. Versions 5.3 and before and 7.0 do not limit the xref_id length. 7.0 also added character set limitations that were not in 5.5.1 or before, because of reports that many applications were limited in the characters they could parse (contrary to the 5.5.1 specification), resulting in import errors if characters other than ASCII alphanumerics were used. If sharing your files with other programs is a priority, I'd recommend following 5.5.1's "tag" production instead of its "pointer_string" production when generating xref_id values.

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rootsdev/E8BF9A26-1E08-4253-8DC3-64A8933F6C24%40gmail.com.

Tom Morris

Sep 16, 2024, 7:35:33 PM

to root...@googlegroups.com

Hi Tom,

On Sun, Sep 15, 2024 at 8:56 PM Thomas Wetmore <ttwet...@gmail.com> wrote:

...

I'm wondering whether there might be other useful ways to assign key values. For instance can we somehow define the "importance" of a person in a database and assign them a key based on that?

...

Are there any thoughts out there on whether this issue is an issue at all? Maybe I shouldn't be worrying about such a silly thing.

From a usability point of view, I think what attracts users to these values is that: a) they are stable and b) they are (easily) memorable. Often, when entering a traditional family into a program, they tend to follow Ahnentafel patterns (me = 1, parents = 2 & 3, grandparents = 4, 5, 6, 7), so they're not only memorable, but the IDs have built-in semantics.
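The built-in semantics of Ahnentafel numbers are simple arithmetic: the father of person n is 2n and the mother is 2n + 1. A small sketch (the function names are only for illustration):

```python
def father(n: int) -> int:
    """Ahnentafel number of the father of person n."""
    return 2 * n


def mother(n: int) -> int:
    """Ahnentafel number of the mother of person n."""
    return 2 * n + 1


def child(n: int) -> int:
    """Ahnentafel number of the child through whom person n is an ancestor."""
    if n < 2:
        raise ValueError("person 1 is the root and has no child in the chart")
    return n // 2


def generation(n: int) -> int:
    """Generations above the root: 0 for person 1, 1 for parents, and so on."""
    return n.bit_length() - 1
```

So the numbers encode relationship to one root person only, which is part of why they stop being meaningful once a database grows beyond a single pedigree.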

I think overloading semantics on an identifier is almost certainly a recipe for trouble. Much better to keep different aspects separate. A search shortcut or dialog option to bring up the "central" person for a database, or the progenitor, or the most recent descendant, or whatever common criteria researchers want, would be much better than trying to bake this into an identifier.

Best,

Tom

Thomas Wetmore

Sep 16, 2024, 9:29:21 PM

to root...@googlegroups.com

Thanks to Luther and Tom for responding.

I think the idea of embedding semantics in Gedcom keys (cross references) is probably a non-starter. As Luther points out, Gedcom7 removes restrictions on the length of keys. (I thought the Gedcom7 standard had forgotten to put the restriction in!) So with Gedcom7 you could go crazy with the put-semantics-in-the-keys idea. You could make each key describe the relationship between its person and someone from a small set of "central" persons. It is a small set because not all persons in a database have to be related to every other person. A database holds partitions, where every person in a partition is related to every other person in that partition, but not related to anyone in any other partition. Pick one person from each partition as its "central" person, and then make the key of every person in the partition the relationship between that person and the central person. Crazy but doable. The central person could be selected using the idea I suggested in the last email, by picking the person with the most ancestors and descendants in the partition. I'm not suggesting this is a good idea, mind you, just something that is possible.
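A sketch of the partition step, under two assumptions: "related" is taken to mean connected through any chain of parent-child links, and each person already has a score such as an ancestor-plus-descendant count. The input format and the names `partitions` and `central_persons` are invented for the example.

```python
from collections import defaultdict


def partitions(parents: dict[str, list[str]]) -> list[set[str]]:
    """Group persons into sets connected by parent-child links."""
    neighbors = defaultdict(set)
    for child, child_parents in parents.items():
        neighbors[child]  # make sure persons with no links still appear
        for parent in child_parents:
            neighbors[child].add(parent)
            neighbors[parent].add(child)

    seen: set[str] = set()
    result = []
    for person in sorted(neighbors):
        if person in seen:
            continue
        component: set[str] = set()
        stack = [person]
        while stack:
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(neighbors[current] - component)
        seen |= component
        result.append(component)
    return result


def central_persons(parts: list[set[str]], scores: dict[str, int]) -> list[str]:
    """Pick the highest-scoring person in each partition (ties: lowest key)."""
    return [min(part, key=lambda p: (-scores.get(p, 0), p)) for part in parts]
```

Encoding each person's relationship to the chosen central person as a key is the part I would call crazy, and it is not attempted here.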

But what is behind my wondering about this? It boils down to sorting and iterating. How do you want the persons in your database to be sorted, listed and iterated? Of course there can be many kinds of lists of persons needed in many contexts.

Obviously you will want to sort by name, and that's probably the most important. One might think (or just old timers like me who started programming in the 60s) that sorting a large database of persons by name might require lots of processing power, and it kind of does if the records aren't somehow pre-sorted by name, but with modern processors you would have to have a massive database before the sorting time would be noticeable. My database of 15,000 persons sorts by name in a couple of milliseconds. Sorting by name requires comparing names, and for arbitrary names using Gedcom rules this comparison operation is non-trivial. But still it's all done in a couple of milliseconds.

Are there other ways to sort a list of records? By key is one obvious way. As another example, say your program needs to iterate through every person in your database, and order does not matter. No need to have persons in name or key order. What order would you use? Most (all?) databases have a "natural" order. If you have SQL as backing store you would likely iterate through the "person table" in its "index" order. In my current in-RAM database, records are kept in hash tables that map keys to Gedcom "trees". The hash table is ordinary, so it has buckets and the buckets have entries, each entry being a map from a key to a Gedcom root "node". Natural order in this case is iterating through the buckets and then their entries. Because this is based on a hash function it is pseudo-random.
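To make the "natural order" of a bucketed hash table concrete, here is a toy sketch. It is not my actual implementation: the class name `KeyTable`, the bucket count and the byte-sum hash function are all assumptions chosen to keep the example small.

```python
class KeyTable:
    """Toy chained hash table mapping Gedcom keys to root nodes."""

    def __init__(self, bucket_count: int = 4):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key: str) -> int:
        # Toy hash for illustration: sum of the key's bytes modulo bucket count.
        return sum(key.encode("utf-8")) % len(self.buckets)

    def put(self, key: str, root) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, root)  # replace an existing entry
                return
        bucket.append((key, root))

    def __iter__(self):
        # Natural order: every bucket in turn, then the entries inside it.
        for bucket in self.buckets:
            yield from bucket


table = KeyTable()
for key in ["@I1@", "@I2@", "@I3@", "@I4@"]:
    table.put(key, root=None)  # a real table would store the record's root node

natural_order = [key for key, _ in table]
```

With this toy hash, `natural_order` comes out as @I3@, @I4@, @I1@, @I2@, which is neither key order nor insertion order.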

My LifeLines program has a programming subsystem. One of its built-in datatypes is a Person Sequence (INDISEQ in the reference manual). Programs can build up these sequences in many ways using built-in operations. For example, you can get the sequence of all spouses of persons in a sequence. You can get the sequence of all siblings, or all children, or all parents or all ancestors or all descendants of the persons in a sequence. You can union, intersect and difference sequences. These sequences can be sorted by name, by key, or by user-assigned properties. Allowing a user-defined property shows I've been worrying about sorting persons for a long time.
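As a rough illustration of the set operations and sort modes, here is a sketch in Python. It is not the LifeLines INDISEQ implementation; the class and function names are invented, and the name comparison is far simpler than real Gedcom name rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    key: str   # Gedcom key, e.g. "@I1@"
    name: str  # Gedcom-style name, e.g. "Given /Surname/"


class PersonSequence:
    """An ordered, duplicate-free collection of persons (hypothetical)."""

    def __init__(self, persons=()):
        # A dict keyed by record key keeps first-seen order and drops duplicates.
        self._items = {p.key: p for p in persons}

    def __iter__(self):
        return iter(self._items.values())

    def keys(self):
        return list(self._items)

    def union(self, other):
        return PersonSequence([*self, *other])

    def intersect(self, other):
        return PersonSequence(p for p in self if p.key in other._items)

    def difference(self, other):
        return PersonSequence(p for p in self if p.key not in other._items)

    def sorted_by(self, sort_key):
        return PersonSequence(sorted(self, key=sort_key))


def name_sort_key(person):
    """Surname first, then given names; a simplification of Gedcom rules."""
    given, _, rest = person.name.partition("/")
    surname = rest.partition("/")[0]
    return (surname.strip().lower(), given.strip().lower())


def key_sort_key(person):
    """Sort @I2@ before @I10@ by comparing the numeric part as an integer."""
    body = person.key.strip("@")
    prefix = body.rstrip("0123456789")
    digits = body[len(prefix):]
    return (prefix, int(digits) if digits else 0)
```

A user-assigned property fits the same shape: pass `sorted_by` any function that maps a person to the value the user attached.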

Tom Wetmore

paul...@gmail.com

Sep 16, 2024, 11:44:49 PM

to root...@googlegroups.com

Tom, isn't one problem that the list of potential sort sequences is effectively infinite? That catering for even a dozen or so such key elements becomes arduous?
I value the facilities in Family Historian that allow us to create and save a wide range of queries to satisfy (at least most of) the examples you give. Selection criteria, columns and sort orders are user definable. Any pre-saved query can be modified on the fly, if needed. The results are more-or-less dynamic (database updates reflected by at worst a re-query), without any need for special keys.
Maybe I don't understand your ambition clearly enough.
Paul

Thomas Wetmore

Sep 17, 2024, 6:02:47 AM

to root...@googlegroups.com

Paul,

Thanks. Excellent points. Persons should be sortable by name, key, birth date, birth place, and so on. Using an SQL database with good table design these should be automatic. My databases are not SQL so sorting is a bit different. I plan to extend my PersonSequence data type to support all these sorting modes.

I guess my main question boils down to this: if you want to get a list of your persons, and you have no sorting criteria in mind, how should that list be ordered? Should that ordering reflect any "natural order" imposed by the database, or should it have anything to do with the records' key values? Honestly I think this is a rather unimportant question. If you don't care, anything will do. If you care, you sort it.

Tom Wetmore
