GEDCOM Records vs. Lines


Thomas Wetmore

Aug 23, 2024, 9:21:42 PM
to root...@googlegroups.com
Steam rising. This has bugged me for years, and the latest GEDCOM 5.5.5 standard is as guilty as the previous ones.

The GEDCOM standard just can't get straight what a GEDCOM "record" is. Any logical person "knows" intuitively that a GEDCOM record (as it exists in a GEDCOM file) is a 0 level line and all the lines that follow it, forming a hierarchy, until the next 0 level line where the next record begins. The latest standard is schizophrenic about this.
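The common-sense definition above is easy to state in code. This is a minimal Python sketch, assuming every non-blank line begins with its level number; the function name `split_records` and the sample data are made up for illustration and are not taken from any version of the standard.

```python
from typing import Iterable, List


def split_records(lines: Iterable[str]) -> List[List[str]]:
    """Group GEDCOM lines into records; each record opens at a level-0 line."""
    records: List[List[str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        level = int(line.split(maxsplit=1)[0])  # the leading level number
        if level == 0:
            records.append([line])  # a level-0 line starts a new record
        elif records:
            records[-1].append(line)  # deeper lines belong to the open record
        else:
            raise ValueError("GEDCOM text must begin with a level-0 line")
    return records


SAMPLE = """\
0 HEAD
1 CHAR UTF-8
0 @I1@ INDI
1 NAME William /Smith/
1 BIRT
2 DATE 1 JAN 1800
0 TRLR
"""

records = split_records(SAMPLE.splitlines())
print(len(records))   # 3 records
print(records[1][0])  # 0 @I1@ INDI
```

Seven lines, three records: under this reading a "line" and a "record" are plainly different things.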

In the introductory material the word "record" is used to mean "line." There is no question about this. A subrecord is simply a line (though they don't use the word "line" even here, still calling it a "record") one level deeper than another "record." This is VERY confusing. In fact you have to read pretty far into the document before you finally and reluctantly realize that you have to suspend common sense as you suddenly think, "They CAN'T mean that ... but by God, they do."

I've been reading GEDCOM standards for over 35 years and I can assure you that the standards have always been this confusing.

But then, later, in the chapters that describe the grammar, the other schizophrenic personality wakes up and suddenly the term "GEDCOM line" is used to mean exactly what it should mean, a single line of GEDCOM text, and the term "GEDCOM record" is used to mean exactly what common sense says it must mean. So in a sense GEDCOM 5.5.5 has gotten it right, as long as you only read the right parts.

It is clear that the two parts of the document were written by two individuals with different vocabularies about what GEDCOM is.

This seems very odd to me. Tamura Jones, a very, very particular editor, who is very careful and thorough and who tries to straighten everything out, is the overall editor of GEDCOM 5.5.5; I can't understand how he has not discovered this and straightened it out.

Steam successfully blown off.

Tom Wetmore

paul...@gmail.com

Aug 23, 2024, 11:25:24 PM
to rootsdev
Tom, your post on records & lines woke me up, dammit. Going back to persona(e)...

I know you don't like the term, but old-fashioned "alias" has unwanted connotations.

In my view, a (genealogy) persona is the exact (well, best guess of the) form in which an individual is referred to in a given document being cited. Moreover it is a well-defined attribute of a *citation* (not the person/individual).

That should always be quoted (repeated) as part of citation text, so there is an "electronic copy", not just implied/hidden by some link to an image.

It seems to me reasonable to allow for multiple versions of a reading so, just as we may show alternatives in citation text, a citation's persona attribute is multi-valued.

Now consider Latin in e.g. early church records such as Gulielmus for William. This is usually (and should be) transcribed as is, but ought to be paired with a translation to English.

This suggests we need a way of also marking persona "attributes" with a language "attribute". [I'm not going to go down the rabbit hole of grammatical inflections, even though they generate a lot of nonsensical readings.]

Then consider Arabic, Urdu, Hindi, Thai, Chinese, or any of thousands of scripts that anglophones pretend don't exist. In our Unicode environment there is no reason to exclude data entry in the appropriate script (sometimes more than one for a given language).

Therefore a persona needs not only a language attribute but also a script name. And, for good measure, a marker to indicate a Romanized or other type of transliteration from one persona to another.

Ah, that means persona needs to be an entity (record) in its own right. Now we're getting somewhere.

Paul White

Richard Light

Aug 24, 2024, 2:02:40 AM
to root...@googlegroups.com

Thomas,

Following up on your steam-letting, I did a search for GEDCOM versions. I see that there is also a version 7 (https://gedcom.io/specifications/FamilySearchGEDCOMv7.html).

This isn't an area I know about, so I would welcome advice. When generating GEDCOM as a download from the updated FreeBMD web site (work in progress ...), which version of GEDCOM should I conform to? I see that at present I am producing 5.5.1 output.

My instincts are to go for a version which is Unicode-based (i.e. 5.5.5 or 7). However, there is the danger that too 'modern' an encoding might limit the usefulness of our GEDCOM, if the various applications/web sites which might consume it haven't kept up to date with the development of the specification.

I would welcome your thoughts.

Best wishes,

Richard Light

--

Richard Light
richard...@gmail.com
@richardofsussex

Enno Borgsteede

Aug 24, 2024, 8:00:23 AM
to root...@googlegroups.com
Hello Richard,

> My instincts are to go for a version which is Unicode-based (i.e.
> 5.5.5 or 7). However, there is the danger that too 'modern' an
> encoding might limit the usefulness of our GEDCOM, if the various
> applications/web sites which might consume it haven't kept up to date
> with the development of the specification.

AFAIK, no-one really uses Unicode, so I suggest that you stick with
GEDCOM 5.5.1 and export as UTF-8. 5.5.5 is not a real standard, and 7 is
not interesting enough to upgrade to for the big guys. I see no good
reason to support it in Gramps either.

Regards,

Enno


Thomas Wetmore

Aug 24, 2024, 10:33:46 AM
to root...@googlegroups.com


> On Aug 23, 2024, at 11:25 PM, paul...@gmail.com <paul...@gmail.com> wrote:
>
> Tom, your post on records & lines woke me up, dammit. Going back to persona(e)...
>
> I know you don't like the term, but old-fashioned "alias" has unwanted connotations.

Paul,

I don't mind the term "persona" and use it a lot.
>
> In my view, a (genealogy) persona is the exact (well, best guess of the) form in which an individual is referred to in a given document being cited. Moreover it is a well-defined attribute of a *citation* (not the person/individual).

I agree. You say it well.
>
> That should always be quoted (repeated) as part of citation text, so there is "electronic copy", not just implied/hidden by some link to an image.
>
> It seems to me reasonable to allow for multiple versions of a reading so, just as we may show alternatives in citation text, a citation's persona attribute is multi-valued.
>
> Now consider Latin in e.g. early church records such as Gulielmus for William. This is usually (and should be) transcribed as is, but ought to be paired with a translation to English.
>
> This suggests we need a way of also marking persona "attributes" with a language "attribute". [I'm not going to go down the rabbit hole of grammatical inflections, even though they generate a lot of nonsensical readings.]
>
> Then consider Arabic, Urdu, Hindi, Thai, Chinese, or any of thousands of scripts that anglophones pretend don't exist. In our Unicode environment there is no reason to exclude data entry in the appropriate script (sometimes more than one for a given language).
>
> Therefore a persona needs not only a language attribute but also a script name. And, for good measure, a marker to indicate a Romanized or other type of transliteration from one persona to another.
>
> Ah, that means persona needs to be an entity (record) in its own right. Now we're getting somewhere.

In principle I agree with all you say about names. My assumption is that most genealogists do most of their work in their native language so would want their software to default to their native script, but have a way of "escaping" to other scripts. Gedcom 5.5.5 embraces Unicode, and also has a number of name properties to be used for language and romanization purposes. I would look at the capabilities in 5.5.5 before adding something new.

Your thoughts on how personas can be used in genealogical software seem similar to mine. Most software does not support having both personas and individuals in the same database. But the LDS's FamilySearch tree uses the idea very well. Essentially a person in the FamilySearch tree is a cluster of personas. FamilySearch's goal is to get their users to eventually cluster their billions of records (personas!) into a single ancestral tree for all of humanity. Of course this sets up "reclustering wars" where multiple genealogists interested in the same families spend hours undoing clustering operations made by others, and reclustering to their preferred arrangements.

I have experimented with keeping both personas and persons in my own database. I use LifeLines, a program I wrote in the early 1990s. Every record in a LL database is purely and simply a GEDCOM record. I create personas by taking data directly out of items of evidence. And then I merge personas (smallish GEDCOM records) into persons (largish GEDCOM records) when I believe that is the right thing to do. But there isn't a good user interface for doing this, so I haven't made it work in a way to be proud of.

In an Ancestry.com tree you also build up your persons by clustering your own data with Ancestry's vast collection of evidence records. Every "shaking leaf" is a persona waiting for you to check out. The big difference between Ancestry.com and FamilySearch is that every person's Ancestry tree is unique, while FamilySearch is striving to build a single tree of all of us.

Best,

Tom W.
>
> Paul White

Thomas Wetmore

Aug 24, 2024, 10:55:35 AM
to root...@googlegroups.com


> On Aug 24, 2024, at 2:02 AM, Richard Light <richard...@gmail.com> wrote:
>
> Thomas,
>
> Following up on your steam-letting, I did a search for GEDCOM versions. I see that there is also a version 7 (https://gedcom.io/specifications/FamilySearchGEDCOMv7.html).
>
> This isn't an area I know about, so I would welcome advice. When generating GEDCOM as a download from the updated FreeBMD web site (work in progress ...), which version of GEDCOM should I conform to? I see that at present I am producing 5.5.1 output.
>
> My instincts are to go for a version which is Unicode-based (i.e. 5.5.5 or 7). However, there is the danger that too 'modern' an encoding might limit the usefulness of our GEDCOM, if the various applications/web sites which might consume it haven't kept up to date with the development of the specification.
>
> I would welcome your thoughts.
>
> Best wishes,
>
> Richard Light


Richard,

Some ideas, for what they are worth:

5.5.1 is the "real" version that most software pretends to support.
5.5.5 is the LDS's attempt to clean up 5.5.1; they want it adopted.
7.x.x was a pipe dream many years ago that is not taken seriously (I would love to hear if others believe differently).

Adoption of UNICODE is, IMHO, CRITICAL to the future of GEDCOM, and genealogical software as a whole, so I believe that moving to 5.5.5 is important. But this is not a big deal. UTF-8 is just ASCII with a clever way of encoding the non-ASCII characters as multi-byte groups. Almost all support software, e.g., text editors, word processors, user interface packages, has been handling UNICODE for years. I'm very old school and still use vi for editing my genealogical records. The vi editors have understood UNICODE for a long time.
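The "ASCII plus multi-byte groups" point can be checked in a few lines. This is a small Python sketch using only the built-in `str.encode`; the helper name `utf8_hex` is invented for the example.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return text.encode("utf-8").hex(" ")


# Plain ASCII text is byte-for-byte identical in ASCII and in UTF-8.
gedcom_line = "0 @I1@ INDI"
print(gedcom_line.encode("utf-8") == gedcom_line.encode("ascii"))  # True

# Non-ASCII characters become groups of two or more bytes.
for ch in ["A", "\u00e9", "\u4e2d"]:  # A, e-acute, a CJK character
    print(repr(ch), utf8_hex(ch))
# 'A' 41
# 'é' c3 a9
# '中' e4 b8 ad
```

So a GEDCOM file that happens to contain only ASCII is already valid UTF-8, which is one reason the move is less disruptive than it might sound.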

Also on this subject I believe that most genealogical software supports UNICODE, because it basically has to in order to be relevant. For most genealogical software GEDCOM is "just" a way to import and export data. Most programs are going to import GEDCOM, and if there are UTF-8 (or UTF-16) encodings in there they will be read just fine. And on export also.

Corollary: ANSEL is horrible; no-one knows what it is; there is not even a good definition of its character set, which would be abysmally inadequate even if it were well defined; it should be stricken forever. It was, is, and will always be a bad idea.

Tom Wetmore

Thomas Wetmore

Aug 24, 2024, 11:01:38 AM
to root...@googlegroups.com
Normally I agree with everything Enno says. But I would say that everyone uses Unicode, whether they know it or not. Exporting 5.5.1 as Unicode is probably perfectly okay, because Unicode is probably one of the approved character sets supported by 5.5.1. If that is true there is no immediacy in moving to 5.5.5.

Tom Wetmore

John Cardinal

Aug 24, 2024, 11:38:09 AM
to root...@googlegroups.com
UTF-8 is the most common way of encoding the Unicode character set. For Latin-based characters and some other character subsets, UTF-8 is a very efficient way to encode Unicode text. It's not as efficient for other scripts, but it's not terrible for those.

GEDCOM 7.x supports UTF-8 only. AFAIK, FamilySearch never released a standard named "GEDCOM 5.5.5" so I don't know what people are referring to with that. GEDCOM 5.5.1 supports UTF-8. GEDCOM 5.5 does *not* support UTF-8, but many programs that claim to write GEDCOM 5.5 files offer UTF-8 as an option, and many programs will read a GEDCOM 5.5 file even if it uses the non-standard (for 5.5) UTF-8 encoding.
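For anyone who wants to check what an existing file declares, here is a rough Python sketch that reads the version and character set from the header record. It assumes the 5.5.x header layout, where `CHAR` is a level-1 line under `HEAD` and `VERS` sits under `GEDC`; the function name `header_info` and the sample file are invented for the example, and a real reader would also have to deal with byte order marks and with files that declare one encoding but use another.

```python
def header_info(gedcom_text: str) -> dict:
    """Return the GEDC.VERS and CHAR values found in the HEAD record, if any."""
    info = {"version": None, "charset": None}
    in_head = False
    parent = None  # tag of the most recent level-1 line
    for line in gedcom_text.splitlines():
        parts = line.strip().split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0:
            if in_head:
                break  # the header record has ended
            in_head = tag == "HEAD"
            continue
        if not in_head:
            continue
        if level == 1:
            parent = tag
            if tag == "CHAR":
                info["charset"] = value
        elif level == 2 and parent == "GEDC" and tag == "VERS":
            info["version"] = value
    return info


SAMPLE = """\
0 HEAD
1 SOUR EXAMPLE
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
0 @I1@ INDI
1 NAME William /Smith/
0 TRLR
"""

print(header_info(SAMPLE))  # {'version': '5.5.1', 'charset': 'UTF-8'}
```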

You can see the specs for several GEDCOM versions here: https://gedcom.io/specs/

ANSEL is defunct; it has been withdrawn or deprecated. Modern development runtime systems do not support it, so developers must jump through some hoops to support it. No modern programs should write ANSEL.

John Cardinal

Thomas Wetmore

Aug 24, 2024, 1:53:56 PM
to root...@googlegroups.com
John,

Use www.gedcom.org to see the GEDCOM 5.5.5 documents. You are correct that it does not come from FamilySearch.

Tom Wetmore


On Aug 24, 2024, at 11:38 AM, John Cardinal <jfcar...@gmail.com> wrote:

GEDCOM 7.x supports UTF-8 only. AFAIK, FamilySearch never released a standard named "GEDCOM 5.5.5" so I don't know what people are referring to with that. GEDCOM 5.5.1 supports UTF-8. GEDCOM 5.5 does *not* support UTF-8, but many programs that claim to write GEDCOM 5.5 files offer UTF-8 as an option, and many programs will read a GEDCOM 5.5 file even if it uses the non-standard (for 5.5) UTF-8 encoding.

You can see the specs for several GEDCOM versions here: https://gedcom.io/specs/

Good birding,

Tom Wetmore, http://bartonstreet.com/tom/birds
Newburyport, Mass.
Think globally, bird locally.



paul...@gmail.com

Aug 24, 2024, 2:24:30 PM
to root...@googlegroups.com
Anyone of a certain age that had to deal with non-ASCII characters will remember the horrors of code sets.
Unicode is indispensable. What sucks is rubbish data entry tools, and Microsoft Word is no shining light of good practice.

Only in recent weeks I discovered a "convention" used by one 17th century scribe, representing a doubled consonant by overbar above a single.
No idea how widespread that usage was, but it may help to explain frequent occurrence of "Johana" transcriptions when expecting "Johanna" - the overbar ignored or just dismissed as "noise".

How to enter "n" with overbar? It seems there is no precomposed Unicode character for it, so I needed a combining diacritic.
My point is that Microsoft Word's "Insert Symbol" lacks any way to search by description, or display groups of related or similar characters.
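Outside Word, the overbar case can be handled with a base letter followed by a combining mark. This Python sketch uses the standard `unicodedata` module; choosing U+0304 COMBINING MACRON rather than U+0305 COMBINING OVERLINE is an assumption about which mark best matches the scribe's stroke, and "Johana" is simply the example from this post.

```python
import unicodedata

# Look a character up by its official name, i.e. search by description.
macron = unicodedata.lookup("COMBINING MACRON")  # U+0304

n_bar = "n" + macron          # base letter followed by the combining mark
name = "Joha" + n_bar + "a"   # the scribe's doubled-consonant spelling

print(name)                                   # renders as Johan̄a in a capable font
print(len(name))                              # 7 code points for 6 visible letters
print([unicodedata.name(c) for c in n_bar])   # the two code points involved

# NFC normalization leaves the pair alone because no single
# precomposed "n with macron" character exists to replace it.
print(unicodedata.normalize("NFC", n_bar) == n_bar)  # True
```

A transcription tool that keeps the mark preserves the evidence; one that strips it silently turns "Johanna" into "Johana".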

While on the subject, I'm depressed by the widespread jingoism displayed by those who refuse to accommodate "foreign" characters.
Laziness may play a part, but the strong impression is that many truly believe the world must adapt to a bare Latin character set.
A rude awakening will happen if the Irish, Welsh or Chinese take over the world.

-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Saturday, August 24, 2024 4:01 PM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

paul...@gmail.com

Aug 24, 2024, 3:17:47 PM
to root...@googlegroups.com
Hi Tom

Sorry if it sounds dismissive, but the virtues and/or convenience of Ancestry and FamilySearch don't cut any ice with me.
In particular, Ancestry's GEDCOM compatibility is abysmal, and your efforts developing LifeLines suggest you're equally unimpressed.

Family Historian (FH) has possibly unmatched GEDCOM compatibility, plus invaluable (legal) extensions, and keeps my data local.
Further, it offers display, searching and query options that others have never bothered to include.
These make a huge difference for managing large data sets, for example with One Name-type studies.

Unfortunately, though, FH makes no attempt to accommodate the genealogist's distinction between person and persona.
That is presumably in part for lack of demand, but mainly (in my opinion) because GEDCOM does not support this explicitly.

Merging personae into persons is surely not the answer: destroying (or at least masking) evidence and assessment, and obviating (or at least hindering) "undo" operations.
Any future not-a-GEDCOM should build the distinction into its core structure, as well as supporting many other kinds of person-person relationship.
It won't be easy or happen in my lifetime.

A religious obsession with "lineage" explains the original GEDCOM data model and why no amount of hacking will ever fix it.
And, most often overlooked, it has only patchy support for "family history".

Kind regards
Paul


Enno Borgsteede

Aug 24, 2024, 5:43:55 PM
to root...@googlegroups.com
Op 24-08-2024 om 17:01 schreef Thomas Wetmore:
> Normally I agree with everything Enno says. But I would say that everyone uses Unicode, whether they know it or not. Exporting 5.5.1 as Unicode is probably perfectly okay, because Unicode is probably one of the approved character sets supported by 5.5.1. If that is true there is no immediacy in moving to 5.5.5.

You're mostly right about the 1st part, and I realized that a few
minutes after I sent that mail. And I wrote that, because I associated
Unicode with configuration files used by the original Microsoft Train
Simulator, which was made in Japan. Those files used 16 bit characters,
as defined in Unicode 1.0, which were quite a nuisance when you wanted
to hack the sim by changing them. Normal text searches failed on those,
and you couldn't edit them with notepad, only with wordpad. And that is
20 years ago now.

Today, most programs use Unicode, and the 16 bit wchar works quite well
in Europe and the Americas, and is used in the internal character
representation of a lot of programming languages, like C++ and Java.
It's probably also faster in databases, but I have no data about those.
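The search problem with 16-bit files is easy to reproduce. This is a short Python sketch, with the string "Train Simulator" chosen only as an illustration; it also shows why fixed 16-bit units are not quite the whole story, since characters outside the Basic Multilingual Plane need two units.

```python
text = "Train Simulator"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# A byte-oriented search for ASCII text works on UTF-8 ...
print(b"Train" in utf8)    # True
# ... but fails on UTF-16, where each ASCII letter is followed by a zero byte.
print(b"Train" in utf16)   # False
print(len(utf8), len(utf16))  # 15 30

# A character beyond U+FFFF takes two 16-bit units (a surrogate pair).
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)  # 2
```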

At the file level however, 16 bit Unicode is rare, and many modern
programs simply export UTF-8, which is the most efficient encoding for
Unicode, and supports the full Unicode code space, as it exists today.
Programs like Gramps and RootsMagic don't give the user a choice, and
always export UTF-8, and that's what Tamura recommends too, so we're safe:

https://www.tamurajones.net/GEDCOMCharacterEncodings.xhtml

And according to that, a program like Brother's Keeper still uses legacy
8 bit IBM PC code pages on export.

PAF can export in all 4 formats listed in this article, and Ancestral
Quest supports 3.


Thomas Wetmore

Aug 24, 2024, 8:54:15 PM
to root...@googlegroups.com


> On Aug 24, 2024, at 3:17 PM, <paul...@gmail.com> <paul...@gmail.com> wrote:
>
> Hi Tom
>
> Sorry if it sounds dismissive, but the virtues and/or convenience of Ancestry and FamilySearch don't cut any ice with me.
> In particular, Ancestry's GEDCOM compatibility is abysmal, and your efforts developing LifeLines suggests you're equally unimpressed.

I have found value in both FamilySearch and Ancestry. I agree that Ancestry's support of GEDCOM is poor. Whenever I export something from Ancestry to GEDCOM I have to post process the file. Their extraction software is buggy.
>
> Family Historian (FH) has possibly unmatched GEDCOM compatibility, plus invaluable (legal) extensions, and keeps my data local.
> Further, it offers display, searching and query options that others have never bothered to include.
> These make a huge difference for managing large data sets, for example with One Name-type studies.

Now I want to read about FH.
>
> Unfortunately, though, FH makes no attempt to accommodate the genealogist's distinction between person and persona.
> That is presumably in part for lack of demand, but mainly (in my opinion) because GEDCOM does not support this explicitly.
>
> Merging personae into persons is surely not the answer: destroying (or at least masking) evidence and assessment, and obviating (or at least hindering) "undo" operations.
> Any future not-a-GEDCOM should build the distinction into its core structure, as well as supporting many other kinds of person-person relationship.
> It won't be easy or happen in my lifetime.

Personas must remain intact as they get clustered into the different person groupings. I was software architect at a company that did exactly this (it was not a genealogical application). We had many tens of millions of personas automatically extracted from the internet, and we wrote clustering algorithms to apply many metrics to group personas into persons. The personas remained intact. We also wrote software to "summarize" a person by coming up with a "biography" that gleaned information from the personas. We had some persons based on clusters of several thousand personas, though typically far fewer. This is the approach that makes sense to me.
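A toy version of that arrangement, to make the idea concrete. This is only a sketch in Python: the class names, fields, and the naive "earliest birth year" rule are all assumptions made for illustration and say nothing about how the system described above was actually built.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Persona:
    """One individual as named in one item of evidence; never modified."""
    persona_id: str
    name: str
    birth_year: Optional[int] = None


@dataclass
class Person:
    """A conclusion: a cluster of persona ids, not a merged copy of them."""
    person_id: str
    persona_ids: Set[str] = field(default_factory=set)


def summarize(person: Person, personas: Dict[str, Persona]) -> dict:
    """Glean a small 'biography' from the clustered personas."""
    members = [personas[pid] for pid in sorted(person.persona_ids)]
    years = [p.birth_year for p in members if p.birth_year is not None]
    return {
        "names": sorted({p.name for p in members}),
        "birth_year": min(years) if years else None,  # naive choice
    }


personas = {
    "p1": Persona("p1", "Gulielmus Smith", 1700),
    "p2": Persona("p2", "William Smith", 1701),
    "p3": Persona("p3", "Wm. Smyth"),
}
william = Person("person-1", {"p1", "p2", "p3"})
before = summarize(william, personas)

# Undoing a clustering decision only touches the cluster, not the evidence.
william.persona_ids.discard("p3")
after = summarize(william, personas)
print(before)
print(after)
```

Because the person only holds references, reclustering is cheap and nothing in the evidence layer is ever lost.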
>
> A religious obsession with "lineage" explains the original GEDCOM data model and why no amount of hacking will ever fix it.
> And, most often overlooked, it has only patchy support for "family history".

I'm not this critical. As a syntactic standard GEDCOM is as general purpose as XML or JSON or any semantic web language. They are isomorphic. When you apply the "lineage-linking" restrictions to it, i.e., to get GEDCOM 5.5.1, etc., you lose that flexibility. In the LifeLines program a goal was to use GEDCOM syntactically, only checking that the most basic of lineage-linking restrictions are met (HUSB really does point to a person, that kind of thing). Other than that any arrangement of tags and values to any depth was allowable. LifeLines also allows user defined record types. It would "be easy" (yeah, right) to let the INDI records be the persona records and create a new record type to represent a person. This new record type could be implemented in many ways, but it would have to contain links to all the INDI records making it up. I would probably structure person records just like INDI records, but choose the best versions of all the facts found in the personas. Yeah, there are problems when personas are added or removed, as the summary person might have to change.

You don't even need a new record type. GEDCOM INDIs could be used for both personas and persons. A convention to tell them apart would be easy to establish.

Best,

Tom Wetmore


>
> Kind regards
> Paul

Richard Light

Aug 25, 2024, 6:49:13 AM
to root...@googlegroups.com
On 24/08/2024 20:17, paul...@gmail.com wrote:
> Unfortunately, though, FH makes no attempt to accommodate the genealogist's distinction between person and persona.
> That is presumably in part for lack of demand, but mainly (in my opinion) because GEDCOM does not support this explicitly.
>
> Merging personae into persons is surely not the answer: destroying (or at least masking) evidence and assessment, and obviating (or at least hindering) "undo" operations.
> Any future not-a-GEDCOM should build the distinction into its core structure, as well as supporting many other kinds of person-person relationship.

Well I wasn't expecting this when I asked an innocent question about which version of GEDCOM to publish ...

I looked at the 5.5.5 and 7 specs, and was pleased that there was movement in the direction of a sensible use of Unicode, for example. However, if there is a view that the model is not fit for some purposes, specifically for recording personae, I would say that is an argument for starting something fresh.

As it happens, the data published by Free UK Genealogy is all of the "persona" type (a.k.a. "evidence", a.k.a. "attestation"). Our largest resource - https://www.freebmd.org.uk - consists of transcriptions of scans of an index (hand-crafted) to a Register of birth, marriage and death certificates. In other words, it's a pretty indirect piece of evidence, with plenty of scope for mistakes to creep into the data, and a general lack of precision (dates only accurate to the quarter or month; places only specified as Registration Districts). It's clearly useful, but equally clearly it requires the sort of additional interpretation and theory-forming that this thread has been discussing.

> It won't be easy or happen in my lifetime.

Well, let's see. One achievable goal might be to create a machine-processible data format for expressing the exact nature of "personae". For example, for FreeBMD we would want to say that we have a scan (and point to it), transcribed by person X (and possibly person Y), which asserts that in the GRO Register for births, Registration District R, quarter Q, volume V, page P there is a certificate which asserts that a person with the name F[orename(s)] S[urname] was born. Where our processing of the data flags up that the data might be suspect (wrong Volume for the District; Page number outside the expected range) this could be included as a caveat. All of this information could be delivered as a single JSON or XML payload, giving the researcher everything we ourselves know about that entry, and the means to double-check it for themselves by going back to the scan.
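To make the idea concrete, this is roughly what such a payload might look like, built and round-tripped in Python. Every field name here is hypothetical, the URL is a placeholder, and the single-letter values simply echo the X, R, Q, V, P, F and S used above; none of this is an existing FreeBMD format.

```python
import json

entry = {
    "scan": {"url": "https://example.org/scans/placeholder.jpg"},  # placeholder
    "transcribers": ["X", "Y"],
    "register": {
        "type": "birth",
        "district": "R",
        "quarter": "Q",
        "volume": "V",
        "page": "P",
    },
    "asserted_name": {"forenames": "F", "surname": "S"},
    "caveats": ["page number outside the expected range for this volume"],
}

# Serialize for delivery, then parse it back as a consumer would.
payload = json.dumps(entry, indent=2, sort_keys=True)
restored = json.loads(payload)

print(payload)
print(restored == entry)  # True: nothing is lost in the round trip
```

The caveats list is the part that a GEDCOM export has no obvious home for, which is much of the argument for a separate evidence format.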

If we built this, would anyone come (i.e. use it)? One advantage of delivering it as JSON or XML is that you wouldn't have to write a parser to interpret it; conversely you would need JSON- or XML-aware software tools to work with it.

Apart from the GRO Index, our other projects relate to parish registers (baptisms, marriages and burials) and census data. So between them, I think, they cover a worthwhile subset of the sorts of persona data we might want to encode.

Interestingly ("we are not alone") the Pelagios project (https://pelagios.org/activities/) is interested in working on pretty much the same problem in its People activity.

> A religious obsession with "lineage" explains the original GEDCOM data model and why no amount of hacking will ever fix it.
> And, most often overlooked, it has only patchy support for "family history".

I'm not suggesting tackling the bigger issue of properly supporting "family history". However, if we can make progress on a relatively simple framework for recording primary and secondary evidence, that might point a way forward for this too.

Maybe we just leave GEDCOM behind ...

Richard

paul...@gmail.com

Aug 25, 2024, 7:19:32 AM
to root...@googlegroups.com
Hi Tom, thanks for that.

Of course, I'm complaining about FamilySearch & Ancestry's tree building/search/query interface, not their data collections (though I have major issues with source referencing in both).

Re. FH there is a free trial, reasonable price, and amazing support. But be prepared for long haul learning. Email me for screenshots.

Love that example of clustering and "bio".

I should explain myself better about "lineage". It's more to do with their focus on that to the exclusion of other relationships (more in a moment).

The basic GEDCOM INDI & FAM record types are broadly satisfactory. User-defined extension tags are a good way to accommodate extra detail. However, the prohibition on extending core record types and sub-tags is limiting (to put it mildly).

An example is FH's sensible "fake" Place record type. Most useful, as far as it goes, but ultimately crippled because core events like baptism & census are not allowed to have a pointer to that place instead of their attribute/property *tag*.

The other sorely missed opportunity is to promote events from attribute to record type. Even the most obvious example of a census shows what's wrong here: an INDI should be able to point to a census record AND the pointer (relationship) should have attributes that qualify the relationship ("wife", age, birthplace, occupation, other text).

The same applies to other GEDCOM events so that INDIs can "participate" in various capacities, e.g. a baptism (principal, mother, sponsor, even officiant if that floats your boat).

Then we would also have the opportunity for "global" events such as a war, battle or train crash, with diverse participants (who could potentially span multiple user "trees").
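A sketch of what "events as records, with qualified participation" could look like, written in Python rather than in any GEDCOM dialect. The class names, role strings and ids are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Participation:
    """The link from an individual to an event, with its own attributes."""
    indi_id: str
    role: str
    details: Dict[str, str] = field(default_factory=dict)


@dataclass
class Event:
    """An event as a record in its own right, shared by all participants."""
    event_id: str
    kind: str
    date: str
    participants: List[Participation] = field(default_factory=list)


def events_for(indi_id: str, events: List[Event]) -> List[Tuple[str, str]]:
    """List (event id, role) pairs for every event an individual took part in."""
    return [
        (e.event_id, p.role)
        for e in events
        for p in e.participants
        if p.indi_id == indi_id
    ]


census = Event("E1", "census", "1851", [
    Participation("I1", "head", {"age": "40", "occupation": "weaver"}),
    Participation("I2", "wife", {"age": "38"}),
])
baptism = Event("E2", "baptism", "1852", [
    Participation("I3", "principal"),
    Participation("I2", "mother"),
    Participation("I1", "sponsor"),
])

print(events_for("I2", [census, baptism]))  # [('E1', 'wife'), ('E2', 'mother')]
```

The census is stored once, and each household member's age or occupation hangs off the link rather than being copied into every INDI.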

Normalisation and referential integrity are not exactly GEDCOM's strong points, are they?

But of all GEDCOM's structural deficiencies, perhaps the worst is the citation, which should be a record type in its own right with links to both source and the applicable event(s)-attribute(s). Enough said.

Anyway, coming back to personae, these should live in an "evidence layer" that must never be destroyed. I think of "persons" as a construction (conceivably even a fiction), a projection down to a fluid layer where all the argument takes place.

The missing link is a collection of persona-person links that (like census pointer attributes above) qualify that relationship - in this case our assessment of the evidence that person is represented by those personae. AND there is nothing to exclude the possibility of one persona being linked to more than one person.

Further, for each collection of P-P links there should be provision for an overall narrative to argue (and qualify) the conclusions.
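One possible shape for that link layer, sketched in Python. The names `PersonaLink` and `Conclusion`, the assessment vocabulary, and the sample ids are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PersonaLink:
    """A qualified claim that a persona represents a person."""
    persona_id: str
    person_id: str
    assessment: str      # e.g. "probable", "possible"
    reasoning: str = ""


@dataclass
class Conclusion:
    """All the links argued for one person, plus the overall narrative."""
    person_id: str
    narrative: str
    links: List[PersonaLink] = field(default_factory=list)


def persons_for(persona_id: str, conclusions: List[Conclusion]) -> List[str]:
    """Every person a persona has been linked to; there may be more than one."""
    return sorted(
        c.person_id
        for c in conclusions
        if any(link.persona_id == persona_id for link in c.links)
    )


a = Conclusion("person-A", "Baptism and burial agree on the father's name.", [
    PersonaLink("p7", "person-A", "probable", "same parish, same father"),
])
b = Conclusion("person-B", "A cousin of the same name cannot be ruled out.", [
    PersonaLink("p7", "person-B", "possible", "age fits, parish differs"),
])

print(persons_for("p7", [a, b]))  # ['person-A', 'person-B']
```

Nothing here touches the persona itself; the argument lives entirely in the links and the narrative.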

None of the above would be compulsory for those happy with a simple life, but they unlock the potential for genealogy (no, I'm not an expert), not just "tree building".

Finally, there is endless untapped opportunity with relationship links - between INDI-INDI, INDI-other, even other-other. For sure you get the picture.

My main bugbear is lack of a reflexive association between INDIs to codify likely identity, for which we can currently use only manually paired ASSO tags for a miserable workaround without narrative. There is a universe of application, even to replace or generalise standard family relationships.

Some other examples are friendships, apprenticeship-employment, host-boarder, sub-events (birth registration as "part of" birth, licence/banns/marriage/divorce as part of "marriage relationship", campaign/battle in a war).

You might also start to believe that GEDCOM-type events are just sub-classes of a prototype, best described by templates for specific cases. Don't want to bang on interminably, clogging up this discussion, so will leave it there.

Happy days
Paul

-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Sunday, August 25, 2024 1:54 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

I have found value in both FamilySearch and Ancestry. I agree that Ancestry's support of GEDCOM is poor. Whenever I export something from Ancestry to GEDCOM I have to post process the file. Their extraction software is buggy.

Now I want to read about FH.

Personas must remain intact...


paul...@gmail.com

Aug 25, 2024, 9:00:38 AM
to root...@googlegroups.com

Hi Richard, so glad you reacted - and with a lot of insight!

 

Last thing first, yes, a complete GEDCOM replacement is what I'd aim for. And to the sceptic I'd also argue that its structural generality would allow for seamless import from old "standards", with software potential for guided or customised migration to newer structures for those that want to go down that road.

 

And to be clear, I'm not in any way endorsing certain well-known non-GEDCOM initiatives that in my view miss the point entirely. But more to the point, you've awakened another angry bee in my bonnet: data resources, collections, and records.

 

The "clearly useful" and widely used FreeBMD could be massively more useful if only... Suppose each original source record (a handwritten or printed line, or manual addition had its own unique source record ID. Suppose each transcription therefrom had its own unique transcription ID.

 

Then the collection of source IDs would constitute (something like) the full set of people born, died, or married in England & Wales since mid-1837. Ditto for GRO Online. Ditto for census records, etc, etc.

 

Those IDs *should* be carried through to all re-publishers, so we always know the provenance and never get confused if something seems "new". And the transcription IDs with their source ID association would ensure we never get suckered into believing that different readings are unrelated personae.

 

FreeXXX.org agreed there was merit in the idea but, as you would expect, did not envisage any retrospective coding - given their voluntary funding.

 

We have the same situation for *all* resources. It goes right back to every repository's initial cataloguing, through page annotation, filming/scanning, transcriptions, and indexing. Not a problem (in principle) for new systems, but a hopeless prospect for any retrospective recoding.

 

Then there are the data collections (re)published by the likes of Ancestry and FindMyPast, just to name a couple, where it's a case of "never mind the quality, feel the width". The most aggravating failure is the proportion of records that lack an original source reference (even if that is well known).

 

And re-publisher collections should tag every record with the collection and record ID it is derived from.

 

In short, every source record and stage of processing/publishing should be tagged with a unique ID. And every repository, collection and publisher should be represented in a public catalogue with successor links where appropriate.
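
A minimal sketch of that ID chain, for anyone who wants to see it concretely. Everything here is an assumption made for illustration: the `Item` class, the `derived_from` field, the holder labels, and the use of random UUIDs are not taken from FreeBMD, GRO, or any publisher's actual system.

```python
from dataclasses import dataclass
from typing import Dict, Optional
import uuid


@dataclass(frozen=True)
class Item:
    """One source record, transcription, or republished record (hypothetical model)."""
    item_id: str                         # unique ID, minted once and never reused
    kind: str                            # "source", "transcription" or "republication"
    holder: str                          # repository, project or publisher holding it
    derived_from: Optional[str] = None   # ID of the item this one was made from


def new_item(kind: str, holder: str, derived_from: Optional[str] = None) -> Item:
    """Mint a new item with a fresh random ID."""
    return Item(str(uuid.uuid4()), kind, holder, derived_from)


def origin(item_id: str, catalogue: Dict[str, Item]) -> Item:
    """Follow derived_from links back to the original source record."""
    item = catalogue[item_id]
    while item.derived_from is not None:
        item = catalogue[item.derived_from]
    return item


# Illustrative data only: one index line, two readings of it, one republished copy.
source = new_item("source", "original index page")
reading_a = new_item("transcription", "transcription project", source.item_id)
reading_b = new_item("transcription", "transcription project", source.item_id)
copy = new_item("republication", "some re-publisher", reading_a.item_id)

catalogue = {i.item_id: i for i in (source, reading_a, reading_b, copy)}

# A republished copy and a second reading both resolve to the same source line,
# so they should not be mistaken for two unrelated personae.
same_source = origin(copy.item_id, catalogue) == origin(reading_b.item_id, catalogue)
print(same_source)
```

The point of the design is only that each stage carries the ID of the stage before it, so provenance can be recovered by walking the links rather than by comparing names and dates.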

 

Yep, I've probably missed plenty of details.

 

Paul

 

 

From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Richard Light
Sent: Sunday, August 25, 2024 11:49 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

 

Well I wasn't expecting this when I asked an innocent question about which version of GEDCOM to publish ...

paul...@gmail.com

unread,
Aug 25, 2024, 11:07:25 AM8/25/24
to rootsdev
Replying to self re: GEDCOM replacement, Personae, Relationships, and (now) DNA.

Please excuse my primitive understanding of DNA testing, but...

Starting with a real-life example, a very white man (me) in Britain has a DNA match with a mixed-race woman in North America. We are remote cousins, to the point that a common ancestor cannot be identified at present - and possibly never will.

DNA matches via Ancestry, and related public trees, "confirm" "cousins" already present in (or added to) those trees. In theory, that pair of lineages and common ancestor could be incorrect if either line of descent contained errors. However some ball-park common ancestor is hardly in dispute.

Now consider unconfirmed matches that could outnumber the confirmed by a factor of perhaps thousands (exact number not relevant), such as in the example above. At present there is no way to capture that data in structured form within GEDCOM. What do we know?

1. There was once a real man-woman couple (that we could represent by one unnamed male persona and one female), in *some* kind of relationship, from whom we both descended one way or another.
2. Our degree of relatedness, by autosomal testing, is measured in xxx centiMorgans.
3. That was arrived at via an unknown (to us) protocol/version in our respective cases, based on sample submission and testing at certain dates.
4. According to Ancestry, each of us has deep(-ish) ancestry, to varying precision, in certain geographical regions.

What could we usefully do with that data, beyond file it away for a rainy day?

Well, in the first place it's "big data" obtained somehow from "big data" and referenced to more "big data". So our prospects for independently generating leads are limited.

My first wish would be for Ancestry, with the cousin's consent, to give me a summary of common regional clustering - a kind of heat map of the intersecting regions. The second wish would be a way of recording the cousin and our common heritage in my own tree.

GEDCOM defines parentage only at the first level. I'd like an "ancestor/descendant of" kind of relationship to enter a pair of ancestral personae with my cousin as another descendant. That couple could then be qualified as likely originating in some region(s) at some approximate date. And if Y-DNA results from at least one male side were also known then a fair stab at surname could be added, once that kind of testing is more widespread.
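
As a rough sketch of what such an "ancestor/descendant of" link might look like as data: the class names, fields, and the centimorgan figure below are all hypothetical placeholders, not part of GEDCOM or of any Ancestry export.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Persona:
    label: str
    sex: Optional[str] = None
    region_hint: Optional[str] = None     # e.g. taken from a regional estimate
    approx_period: Optional[str] = None   # free text; precision is unknown


@dataclass
class DescentLink:
    """'Ancestor of' link spanning an unknown number of generations."""
    ancestor: Persona
    descendant: Persona
    generations: Optional[int] = None     # None means "not known"


@dataclass
class DnaMatch:
    a: Persona
    b: Persona
    shared_cm: float                      # measured relatedness in centimorgans
    test_type: str = "autosomal"
    protocol: Optional[str] = None        # testing protocol/version, if ever disclosed


def ancestors_of(person: Persona, links: List[DescentLink]) -> Set[Persona]:
    """All personae recorded as ancestors of the given person."""
    return {link.ancestor for link in links if link.descendant == person}


me = Persona("me")
cousin = Persona("DNA cousin")
unknown_m = Persona("unknown male ancestor", sex="M")
unknown_f = Persona("unknown female ancestor", sex="F")

# The unnamed couple is linked to both descendants without any intermediate generations.
links = [DescentLink(a, d) for a in (unknown_m, unknown_f) for d in (me, cousin)]

# Placeholder value only; the real figure would come from the test report.
match = DnaMatch(me, cousin, shared_cm=0.0)

shared = ancestors_of(me, links) & ancestors_of(cousin, links)
print(sorted(p.label for p in shared))
```

Region and period can then be attached to the two unnamed personae as they become better understood, without having to invent the generations in between.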

I kinda promised myself to shut up, but there you go.

Paul

Andrew Hatchett

unread,
Aug 25, 2024, 1:03:20 PM8/25/24
to rootsdev
Wanna bet that the "FreeXXX.Org" was responsible?  LOL  :)

paul...@gmail.com

unread,
Aug 25, 2024, 4:55:27 PM8/25/24
to rootsdev
Thanks Richard, Andrew, Wayne.
Love and (kisses)

Richard Light

unread,
Aug 25, 2024, 5:53:51 PM8/25/24
to root...@googlegroups.com
On 25/08/2024 14:00, paul...@gmail.com wrote:
The "clearly useful" and widely used FreeBMD could be massively more useful if only... Suppose each original source record (a handwritten or printed line, or manual addition) had its own unique source record ID. Suppose each transcription therefrom had its own unique transcription ID.

 

Then the collection of source IDs would constitute (something like) the full set of people born, died, or married in England & Wales since mid-1837. Ditto for GRO Online. Ditto for census records, etc, etc.

 

Those IDs *should* be carried through to all re-publishers, so we always know the provenance and never get confused if something seems "new". And the transcription IDs with their source ID association would ensure we never get suckered into believing that different readings are unrelated personae.

 

FreeXXX.org agreed there was merit in the idea but, as you would expect, did not envisage any retrospective coding - given their voluntary funding.


We do accept the value of having persistent identifiers, at least for the source records themselves. In fact, we would like to go beyond this to having full Linked Data URIs. This means that as well as being persistent these identifiers would be "dereferenceable", i.e. that they would give access to the sort of machine-processible data I described in an earlier reply.

We do actually have identifiers for each record - they are cited on the current site on the details page:

At present these URLs fall short in two respects: they don't deliver useful data (just the relevant web page) and they aren't guaranteed to be persistent. This is because the unique identifier is generated as a hash of the data itself. If the data changes (because of corrections) then the hash changes too.
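
The persistence problem can be shown in a few lines. This is only an illustration under stated assumptions: the `content_hash_id` function, the field names, and the sample entry are invented here and do not describe the site's real hashing scheme.

```python
import hashlib
import uuid


def content_hash_id(record: dict) -> str:
    """Hypothetical identifier derived purely from the record's content."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Invented sample entry, then the same entry after a transcription correction.
original = {"surname": "SMYTH", "forename": "John", "district": "Leeds", "quarter": "1850Q1"}
corrected = dict(original, surname="SMITH")

# Same data gives the same ID, but any correction produces a different one,
# so links that cite the old ID stop resolving.
print(content_hash_id(original) == content_hash_id(dict(original)))
print(content_hash_id(original) == content_hash_id(corrected))

# One alternative: mint an ID once and store it alongside the data.
minted_id = str(uuid.uuid4())
store = {minted_id: original}
store[minted_id] = corrected   # the correction is applied, the ID is unchanged
```

A minted identifier has to be stored and managed, which is exactly the retrospective work a content hash avoids, so the trade-off is real.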

This is frustrating, but as you say it is totally built in to the current system design. Please rest assured that I'm doing my level best to improve the situation.

Best wishes,

paul...@gmail.com

unread,
Aug 25, 2024, 7:14:11 PM8/25/24
to rootsdev
Oh dear, Richard, getting well past it and didn't recognise your name - so sorry!
Now won't forget again so easily, your being just a short range missile distance from IoW.

Thanks for the clarification and update. A standard GUID would be welcome.
I wish more organisations had similar good intentions and publicised that.

All the best
Paul White

Thomas Wetmore

unread,
Aug 26, 2024, 3:21:04 PM8/26/24
to root...@googlegroups.com
Briefly back to the original thread. GEDCOM 5.5.1 is the latest official LDS version. GEDCOM 5.5.5 is the creation of Tamura Jones, and has NO relationship with the LDS. Tamura has been persistent in his quest to make GEDCOM consistent across all platforms. He has published interminable screeds about how genealogical programs misunderstand and misuse GEDCOM. I think he eventually had enough and decided to make his own version. Of course, promulgating an unofficial version and thinking that the industry will beat a path to your door is the height of hubris.

In my opening steam salvo I mentioned that 5.5.5 was as inconsistent as earlier versions in its use of terms like "line" and "record." I went back to check again how 5.5.1, the official rendering, handles it and found I was very wrong. Here is the pertinent passage in 5.5.1:

"A GEDCOM transmission represents a database in the form of a sequential stream of related records. A record is represented as a sequence of tagged, variable-length lines, arranged in a hierarchy. A line always contains a hierarchical level number, a tag, and an optional value. A line may also contain a cross-reference identifier or a pointer. The GEDCOM line is terminated by a carriage return, a line feed character, or any combination of these.
The tag in the GEDCOM line, taken in its hierarchial context, identifies the information contained in the line, in the same sense that a field-name identifies a field in a database record. This means that the data is self-defining. Tags allow a field to occur any number of times within a record, including zero times. They also allow the use of different or new fields to be included in the GEDCOM data without introducing incompatibility, because the receiving system will ignore data which it does not understand and process only the data that it does understand. The hierarchical relationships are indicated by a level number. Subordinate lines have a higher level number. The hierarchy allows a line to have sub-lines, which in turn may have their own sub-lines, and so forth. A line and its sub-lines constitute a context or enclosure, that is, a cluster of information pertaining directly to the same thing. This hierarchical arrangement corresponds with the natural hierarchy found in most structured information. A series of one or more lines constitutes a record. The beginning of a new record is indicated by a
line whose level number is 0 (zero)."

That's excellent, and of course I think that because I agree with it. Here is how Tamura handles the overall view in 5.5.5:

"A GEDCOM files[sic] consists of records. A record that is contained within another record is a subrecord. Previous versions of GEDCOM have also referred to records and subrecords as structures and substructures, even “record structures”. They've also been referred to as “tag”, “tag context” and just “context”, while “context” is generally used to refer to the enclosing record, not the subrecord itself. A GEDCOM record consists of several parts, the tag is only part of the record. This version of GEDCOM consciously avoids using “tag” when “record” is meant. A GEDCOM record starts with a level number. Records with level number zero are known as top-level records. Previous GEDCOM versions have referred to top-level records as “zero-level records”, “logical GEDCOM records”, “logical records”, and “record at level zero”; this specification always uses “top-level record”. The term records includes subrecords, unless restricted by a modifier, as in “top-level records”, or “INDI records”."

This is what gets the steam rising, especially when he uses the CORRECT 5.5.1 definition later in 5.5.5. This quote clearly avoids using the term "line" (does not appear) and uses the term "record" to mean "line." Very awkwardly and confusingly.

So I was wrong. I thought that in 5.5.5 Tamura got the interpretation half right, correcting the errors in 5.5.1, when in fact 5.5.1 has it completely right, and Tamura messed it up. Since 5.5.5 is unofficial I have no right to criticize real GEDCOM because of someone's misrepresentation.
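
The 5.5.1 wording quoted above (a line has a level number, an optional cross-reference identifier, a tag, and an optional value; a record is a level 0 line plus every line up to the next level 0 line) can be sketched as follows. This is a minimal illustration, not a conforming parser: the regular expression, the `Line` type, and the sample data are assumptions made here, and issues such as character sets, CONT/CONC handling, and length limits are ignored.

```python
import re
from typing import List, NamedTuple, Optional


class Line(NamedTuple):
    level: int
    xref: Optional[str]    # cross-reference identifier such as @I1@, if present
    tag: str
    value: Optional[str]   # may itself be a pointer such as @F1@


# level, optional @xref@, tag, optional value
LINE_RE = re.compile(r"^\s*(\d+)(?: (@[^@]+@))? (\S+)(?: (.*))?$")


def parse_line(text: str) -> Line:
    """Parse ONE GEDCOM line."""
    m = LINE_RE.match(text)
    if m is None:
        raise ValueError(f"not a GEDCOM line: {text!r}")
    level, xref, tag, value = m.groups()
    return Line(int(level), xref, tag, value)


def split_records(text: str) -> List[List[Line]]:
    """Group lines into records: each level 0 line starts a new record."""
    records: List[List[Line]] = []
    for raw in text.splitlines():      # accepts CR, LF or CRLF terminators
        if not raw.strip():
            continue
        line = parse_line(raw)
        if line.level == 0:
            records.append([line])
        elif records:
            records[-1].append(line)
        else:
            raise ValueError("file does not start with a level 0 line")
    return records


# Invented sample data.
sample = """0 HEAD
1 GEDC
2 VERS 5.5.1
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1850
0 TRLR"""

records = split_records(sample)
print(len(records), [len(r) for r in records])   # 3 [3, 4, 1]
```

Eight lines, three records: in code the two words cannot be used interchangeably, which is the whole complaint.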

Tom Wetmore
~

John Cardinal

unread,
Aug 26, 2024, 6:59:04 PM8/26/24
to root...@googlegroups.com
GEDCOM 7 (7.0.14) is the current "LDS" version. See: https://gedcom.io/specifications/FamilySearchGEDCOMv7.html

John Cardinal


-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Monday, August 26, 2024 3:21 PM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

Re: "GEDCOM 5.5.1 is the latest official LDS version."



Thomas Wetmore

unread,
Aug 26, 2024, 9:38:38 PM8/26/24
to root...@googlegroups.com
John,

Thanks for correcting me. I thought 7 was at a dead end. There's a 2024 date on that document! I'm clearly out of touch. Does anyone know if any major genealogical program has committed to support 7 (or 5.5.5 for that matter)? I want to see what it says about the line vs record question.

I just looked at the FHISO (Family History Information Standards Organization) website to see where they are. They've been quiet since 2021. I was heavily involved with them for a couple of years (called BetterGEDCOM then), but after years of arguing and nothing happening I uninvolved myself. I see they have published a few obtuse documents that contain many words, all of which I don't understand. A case of academics coming in late and academic'ing things to death.

There is a lot of convoluted history in the GEDCOM area. There was an Event GEDCOM which was well suited for personas, as an event can be thought of as a generalization of any kind of historical record. And each event can be thought of as a collection of personas, one for each role player.

Early on in the history of BetterGEDCOM the LDS put together a push for a new LDS internal data format that they now use behind the scenes for internal purposes. It is called GEDCOM X and it has a web-based interface. This was fully developed and running in what seemed no more than a few months. Seeing the idealistic FHISO organization floundering, while the big boy came in and developed a new internal standard almost overnight was an object lesson in reality.

Sorry for the randomness.

Tom Wetmore

paul...@gmail.com

unread,
Aug 27, 2024, 12:40:34 AM8/27/24
to root...@googlegroups.com
*almost* any kind of historical record? Interesting debate.
Absolutely spot on, by my lights. Way to go, bro.
How come I never heard of Event GEDCOM? [he says, loading up Google]

-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Tuesday, August 27, 2024 2:38 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

Thomas Wetmore

unread,
Aug 27, 2024, 2:23:34 AM8/27/24
to root...@googlegroups.com
Paul,

If you cannot find it via Google (I couldn't), I have a copy. It was published in 1994 by COMMSOFT.

Tom Wetmore

paul...@gmail.com

unread,
Aug 27, 2024, 2:30:30 AM8/27/24
to root...@googlegroups.com
Well, not quite. I believe Event GEDCOM is right to view relationships as (often) defined by or derived from events.
However, it inherits from the (arguably deficient) Object Role Modelling (ORM) paradigm.
That essentially treats roles as "zero-dimensional" or scalars.
And, when they are treated as subordinate to events, those events must be "frozen" too.

To see why this matters, we can expose yet another flaw in GEDCOM itself.
Every standard event is effectively instantaneous (date periods do not apply).
So you either have a role in it or you do not.

On the other hand, Residence is an attribute and allows for extension in time.
Why should we not redefine this as a generalised "event"?
From a historical perspective that makes some sense. In close-up, events have duration, but in a history book the Battle of Hastings happened in 1066.

(A generalised event may also extend over space if we zoom in. Think wars, campaigns, even battles.)

A residence is usually a family home, but what is a family?
As a *household* it has varying composition over the years, even possibly including non-family. Its membership is in a state of flux.
Any in a "household member" role are there for a duration, sometimes more than once.
Some may have an additional role there (as I did once), sofa surfing after college.

For argument's sake, then, "events" may be extended in time and space.
Roles may come and go. One person may even play multiple parts in an event.
Roles can be re-cast as "participations" (I never thought of a better term).
They have attributes such as date (period) and possibly place (space).

Participations encode the many-many links between events and (e.g.) people.
Roles are a lossy projection down to a scalar.
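
A rough sketch of the difference, with a household modelled as an extended "event". The class names, fields, people, and dates are all invented for illustration; none of this comes from Event GEDCOM or any existing standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set, Tuple


@dataclass
class Participation:
    """A many-many link between an event and a person, with its own extent."""
    person: str
    role: str
    start: date
    end: Optional[date] = None     # None means open-ended or unknown
    place: Optional[str] = None


@dataclass
class Event:
    kind: str
    start: date
    end: Optional[date]
    participations: List[Participation] = field(default_factory=list)

    def present_on(self, day: date) -> List[str]:
        """Who was participating on a given day."""
        return sorted({p.person for p in self.participations
                       if p.start <= day and (p.end is None or day <= p.end)})

    def roles_flat(self) -> Set[Tuple[str, str]]:
        """The lossy projection: who held which role, with the periods discarded."""
        return {(p.person, p.role) for p in self.participations}


household = Event("residence", date(1950, 1, 1), date(1970, 12, 31))
household.participations += [
    Participation("Ann", "household member", date(1950, 1, 1), date(1970, 12, 31)),
    Participation("Bob", "household member", date(1960, 1, 1), date(1965, 12, 31)),
    Participation("Bob", "household member", date(1968, 1, 1), date(1969, 12, 31)),
]

print(household.present_on(date(1966, 6, 1)))   # ['Ann']
print(household.present_on(date(1968, 6, 1)))   # ['Ann', 'Bob']
print(len(household.roles_flat()))              # 2: Bob's two stays collapse into one role
```

Three participations flatten to two roles, and the gap in Bob's residence disappears, which is the information loss described above.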

I really should dredge up my old notes to check if anything's missing.

-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Tuesday, August 27, 2024 2:38 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

Event GEDCOM

Enno Borgsteede

unread,
Aug 27, 2024, 8:07:31 AM8/27/24
to root...@googlegroups.com
Hello Tom,

> Thanks for correcting me. I thought 7 was at a dead end. There's a 2024 date on that document! I'm clearly out of touch. Does anyone know if any major genealogical program has committed to support 7 (or 5.5.5 for that matter). I want to see what it says about the line vs record question.

Some are, but most are not, as you can see in this list:

https://www.familysearch.org/en/GEDCOM/implementation-progress

And I say that, because there's a lot of TBD in the available column.
And in general, I see more support in Europe than in the Americas.
Support is available in Ancestris, Family Historian, and a dozen or so
smaller programs, of which many are made in Germany.

> Early on in the history of BetterGEDCOM the LDS put together a push for a new LDS internal data format that they now use behind the scenes for internal purposes. It is called GEDCOM X and it has a web-based interface. This was fully developed and running in what seemed no more than a few months. Seeing the idealistic FHISO organization floundering, while the big boy came in and developed a new internal standard almost overnight was an object lesson in reality.

Well, sort of. GEDCOM X is there, and I have a couple of programs that
use it, like Ancestral Quest, Legacy, and RootsMagic, where it's hidden
in plain sight, as it is used to communicate with the FS tree, to
exchange tree data and associated citations.

I say sort of, because one of the parts that I'm interested in, which is
better support for citation elements, is not ready, which means that the
only things that we can download now, using these programs, or other
implementations of the FS API, are formatted citations, and I don't like
that.

Another thing is that, although GEDCOM X has support for personae,
a.k.a. extracted person records, they are not available in the FS API,
so I can't download them in the way that I can download persons,
relations, and citations.

Regards,

Enno

paul...@gmail.com

unread,
Aug 27, 2024, 8:58:12 AM8/27/24
to root...@googlegroups.com

Thanks Tom, found it last night without too much trouble – even though the original links were broken.

 

From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Tuesday, August 27, 2024 7:23 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines

 

Paul,

--

---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rootsdev/74D29505-7BDF-4D18-9B9F-E98EECC07506%40gmail.com.

paul...@gmail.com

unread,
Aug 27, 2024, 9:28:33 AM8/27/24
to rootsdev

Event participation/participants…

Example 1: A Zoom call. Event in the cloud with start and end times. Many participants at various locations with a variety of durations and in various roles.

Example 2: A funeral. Two component events. One a burial attended by close family. The other a wake.

Example 3: Chapel membership (modelled as an event). Building as venue, person as pastor, others as assisting, individuals of the flock belonging. Extensible to mainstream institutions with affiliation, regulation, licensing for marriage, etc.

John Cardinal

unread,
Aug 27, 2024, 9:31:20 AM8/27/24
to root...@googlegroups.com
Tom,

https://www.familysearch.org/en/GEDCOM/implementation-progress describes progress towards adoption of GEDCOM 7.

I believe some of the people involved in prior attempts to revise GEDCOM 5+ were involved in the GEDCOM 7 specification effort. See: https://gedcom.io/specifications/FamilySearchGEDCOMv7.html#contributors

John Cardinal


-----Original Message-----
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Monday, August 26, 2024 9:38 PM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines


Richard Light

unread,
Aug 27, 2024, 10:59:12 AM8/27/24
to root...@googlegroups.com
Paul,

This sort of event-centred modelling has been commonplace for a long time within the cultural heritage community. My own experience comes from working on and with museum data standards and frameworks. These share many of the features and challenges of genealogical research; in particular uncertainty and imprecision, and the absolute need to accommodate multiple, and possibly conflicting, interpretations of the same event.

I'm minded to have a go at using one of these frameworks to model one of FreeBMD's GRO index records, just to see what comes out of it. Although at first glance these are simple records, what we actually have is:

    - a scan, i.e. an image
        - of a page from a handwritten, typed or printed GRO index
        - containing an index entry
            - which asserts directly that a person with name [forename(s)] [surname] was B/D/M
            - which also points to a page in a volume in a GRO Register for a specific Registration District
                - which contains a copy of a B/D/M certificate recording this B/D/M event in more detail

We can infer from the index entry that the event took place before the end of the quarter/year specified in the index, and we [choose to] infer that the event took place in the Registration District specified therein. We can't infer that it took place after the start of the specified quarter/year, although in most cases this would be a reasonable working assumption.

Either way, we are forced to accept the imprecision, both geographical and temporal, which is built into this particular source.
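
To make the temporal imprecision concrete, here is a small sketch of the date bounds an index entry supports. The function name and the `assume_within_quarter` flag are hypothetical; the only rule taken from the discussion is that the entry fixes a latest date (end of the indexed quarter) but not an earliest one.

```python
from datetime import date
from typing import Optional, Tuple


def index_date_bounds(year: int, quarter: int,
                      assume_within_quarter: bool = False
                      ) -> Tuple[Optional[date], date]:
    """Return (earliest, latest) dates for an event indexed in a given quarter.

    earliest is None unless the caller opts in to the working assumption
    that the event fell inside the quarter in which it was indexed.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    end_month = quarter * 3                          # March, June, September, December
    last_day = 31 if end_month in (3, 12) else 30
    latest = date(year, end_month, last_day)
    earliest = date(year, end_month - 2, 1) if assume_within_quarter else None
    return earliest, latest


print(index_date_bounds(1850, 1))
print(index_date_bounds(1850, 1, assume_within_quarter=True))
```

Keeping the assumption as an explicit flag means the stored evidence stays honest, and the tighter range is clearly marked as an interpretation.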

The original GRO index from which the scan was made, and the GRO Register are both invisible to us, so shouldn't form part of our metadata.

If this is all expressed using a framework such as the CIDOC CRM or Linked Art, we can choose which aspect to give emphasis to, since all these connections are two-way and can run in either direction. My initial instinct is to focus on the person, and record a mini-biographical record for them, noting that they had a certain name and were associated with a B/D/M event for which we have the following evidence. That would make it much easier to merge other events into the same biographical record, if one wished to do so (and noting your concern to keep sources separate from interpretations/conclusions).

Best wishes,

Richard
