Thomas,
Following up on your steam-letting, I did a search for GEDCOM versions. I see that there is also a version 7 (https://gedcom.io/specifications/FamilySearchGEDCOMv7.html).
This isn't an area I know about, so I would welcome advice. When generating GEDCOM as a download from the updated FreeBMD web site (work in progress ...), which version of GEDCOM should I conform to? I see that at present I am producing 5.5.1 output.
My instincts are to go for a version which is Unicode-based (i.e. 5.5.5 or 7). However, there is the danger that too 'modern' an encoding might limit the usefulness of our GEDCOM, if the various applications/web sites which might consume it haven't kept up to date with the development of the specification.
I would welcome your thoughts.
Best wishes,
Richard Light
On Aug 24, 2024, at 11:38 AM, John Cardinal <jfcar...@gmail.com> wrote:
GEDCOM 7.x supports UTF-8 only. AFAIK, FamilySearch never released a standard named "GEDCOM 5.5.5", so I don't know what people are referring to with that. GEDCOM 5.5.1 supports UTF-8. GEDCOM 5.5 does *not* support UTF-8, but many programs that claim to write GEDCOM 5.5 files offer UTF-8 as an option, and many programs will read a GEDCOM 5.5 file even if it uses the (non-standard for 5.5) UTF-8 encoding.
You can see the specs for several GEDCOM versions here: https://gedcom.io/specs/
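For anyone wanting to check what a given file actually declares, a minimal Python sketch that sniffs the HEAD record for the version and encoding might look like this. It is not a parser: it ignores CONC/CONT continuation lines and byte-order marks, which real GEDCOM handling must deal with, and note that GEDCOM 7 drops the CHAR tag entirely (UTF-8 is mandatory there).

```python
def gedcom_header_info(lines):
    """Scan the HEAD record of a GEDCOM file for declared version and encoding."""
    info = {"version": None, "encoding": None}
    in_gedc = False
    seen_head = False
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            if seen_head:
                break          # first level-0 record after HEAD ends the header
            seen_head = True
            continue
        if tag == "GEDC":
            in_gedc = True     # version lives under GEDC, not under SOUR
        elif in_gedc and tag == "VERS":
            info["version"] = value
            in_gedc = False
        elif tag == "CHAR":
            info["encoding"] = value
    return info

sample = [
    "0 HEAD",
    "1 GEDC",
    "2 VERS 5.5.1",
    "1 CHAR UTF-8",
    "0 @I1@ INDI",
]
print(gedcom_header_info(sample))  # {'version': '5.5.1', 'encoding': 'UTF-8'}
```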
Unfortunately, though, FH makes no attempt to accommodate the genealogist's distinction between person and persona. That is presumably in part for lack of demand, but mainly (in my opinion) because GEDCOM does not support this explicitly. Merging personae into persons is surely not the answer: destroying (or at least masking) evidence and assessment, and obviating (or at least hindering) "undo" operations. Any future not-a-GEDCOM should build the distinction into its core structure, as well as supporting many other kinds of person-person relationship.
Well I wasn't expecting this when I asked an innocent question about which version of GEDCOM to publish ...
I looked at the 5.5.5 and 7 specs, and was pleased that there was movement in the direction of a sensible use of Unicode, for example. However, if there is a view that the model is not fit for some purposes, specifically for recording personae, I would say that is an argument for starting something fresh.
As it happens, the data published by Free UK Genealogy is all of the "persona" type (a.k.a. "evidence", a.k.a. "attestation"). Our largest resource - https://www.freebmd.org.uk - consists of transcriptions of scans of a hand-crafted index to the Register of birth, marriage and death certificates. In other words, it's a pretty indirect piece of evidence, with plenty of scope for mistakes to creep into the data, and a general lack of precision (dates accurate only to the quarter or month; places specified only as Registration Districts). It's clearly useful, but equally clearly it requires the sort of additional interpretation and theory-forming that this thread has been discussing.
It won't be easy or happen in my lifetime.
Well, let's see. One achievable goal might be to create a machine-processible data format for expressing the exact nature of "personae". For example, for FreeBMD we would want to say that we have a scan (and point to it), transcribed by person X (and possibly person Y), which asserts that in the GRO Register for births, Registration District R, quarter Q, volume V, page P there is a certificate which asserts that a person with the name F[orename(s)] S[urname] was born. Where our processing of the data flags up that the data might be suspect (wrong Volume for the District; Page number outside the expected range) this could be included as a caveat. All of this information could be delivered as a single JSON or XML payload, giving the researcher everything we ourselves know about that entry, and the means to double-check it for themselves by going back to the scan.
If we built this, would anyone come (i.e. use it)? One advantage of delivering it as JSON or XML is that you wouldn't have to write a parser to interpret it; conversely, you would need JSON- or XML-aware software tools to work with it.
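As a Python sketch, one such payload might look like the following. Every field name here is illustrative, not an agreed schema, and the scan URL is a made-up placeholder; R, Q, V, P, F and S stand in for the district, quarter, volume, page, forename(s) and surname exactly as in the description above.

```python
import json

# Hypothetical shape for a single FreeBMD index-entry payload.
entry = {
    "scan": "https://example.org/scans/1234.jpg",   # pointer to the image
    "transcribed_by": ["X", "Y"],                   # double-keyed where available
    "asserts": {
        "register": "GRO births",
        "district": "R",
        "quarter": "Q",
        "volume": "V",
        "page": "P",
        "forenames": "F",
        "surname": "S",
        "event": "birth",
    },
    "caveats": [
        # flagged by automated checks, as described above
        "volume outside the expected range for this district",
    ],
}
payload = json.dumps(entry, indent=2)
print(payload)
```

A consumer with any JSON library gets the whole evidential chain - scan, transcribers, assertion, caveats - without writing a custom parser.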
Apart from the GRO Index, our other projects relate to parish registers (baptisms, marriages and burials) and census data. So between them, I think, they cover a worthwhile subset of the sorts of persona data we might want to encode.
Interestingly ("we are not alone"), the Pelagios project (https://pelagios.org/activities/) is interested in working on pretty much the same problem in its People activity.
A religious obsession with "lineage" explains the original GEDCOM data model and why no amount of hacking will ever fix it. And, most often overlooked, it has only patchy support for "family history".
I'm not suggesting tackling the bigger issue of properly supporting "family history". However, if we can make progress on a relatively simple framework for recording primary and secondary evidence, that might point a way forward for this too.
Maybe we just leave GEDCOM behind ...
Richard
Hi Richard, so glad you reacted - and with a lot of insight!
Last thing first: yes, a complete GEDCOM replacement is what I'd aim for. And to the sceptic I'd also argue that its structural generality would allow for seamless import from old "standards", with the potential for software-guided or customised migration to newer structures for those who want to go down that road.
And to be clear, I'm not in any way endorsing certain well-known non-GEDCOM initiatives that in my view miss the point entirely. But more to the point, you've awakened another angry bee in my bonnet: data resources, collections, and records.
The "clearly useful" and widely used FreeBMD could be massively more useful if only... Suppose each original source record (a handwritten or printed line, or manual addition had its own unique source record ID. Suppose each transcription therefrom had its own unique transcription ID.
Then the collection of source IDs would constitute (something like) the full set of people born, died, or married in England & Wales since mid-1837. Ditto for GRO Online. Ditto for census records, etc, etc.
Those IDs *should* be carried through to all re-publishers, so we always know the provenance and never get confused if something seems "new". And the transcription IDs with their source ID association would ensure we never get suckered into believing that different readings are unrelated personae.
FreeXXX.org agreed there was merit in the idea but, as you would expect, did not envisage any retrospective coding - given their voluntary funding.
We have the same situation for *all* resources. It goes right back to every repository's initial cataloguing, through page annotation, filming/scanning, transcriptions, and indexing. Not a problem (in principle) for new systems, but a hopeless prospect for any retrospective recoding.
Then there are the data collections (re)published by the likes of Ancestry and FindMyPast, just to name a couple, where it's a case of "never mind the quality, feel the width". The most aggravating failure is the proportion of records that lack an original source reference (even if that is well known).
And re-publisher collections should tag every record with the collection and record ID it is derived from.
In short, every source record and stage of processing/publishing should be tagged with a unique ID. And every repository, collection and publisher should be represented in a public catalogue with successor links where appropriate.
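The chain Paul describes - source record, transcription, re-publication, each stage carrying its own ID plus a link to its predecessor - can be sketched in a few lines of Python. All the field names and collection names here are illustrative, not a proposed schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Stage:
    """One stage in a record's life: source line, transcription, re-publication."""
    kind: str                            # e.g. "source", "transcription"
    collection: str                      # owning collection in the public catalogue
    derived_from: Optional[str] = None   # ID of the predecessor stage, if any
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

source = Stage("source", "GRO birth index")
transcription = Stage("transcription", "FreeBMD", derived_from=source.id)
republication = Stage("republication", "AnAggregator", derived_from=transcription.id)

index = {s.id: s for s in (source, transcription, republication)}

def provenance(stage, index):
    """Walk the derived_from links back to the original source record."""
    chain = [stage.kind]
    while stage.derived_from:
        stage = index[stage.derived_from]
        chain.append(stage.kind)
    return chain

print(provenance(republication, index))
# ['republication', 'transcription', 'source']
```

Because every stage keeps its predecessor's ID, two different transcriptions of the same source line remain visibly linked to one source record, which is exactly the "unrelated personae" confusion the IDs are meant to prevent.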
Yep, I've probably missed plenty of details.
Paul
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Richard Light
Sent: Sunday, August 25, 2024 11:49 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines
We do accept the value of having persistent identifiers, at least for the source records themselves. In fact, we would like to go beyond this to having full Linked Data URIs. This means that as well as being persistent these identifiers would be "dereferenceable", i.e. that they would give access to the sort of machine-processible data I described in an earlier reply.
We do actually have identifiers for each record - they are cited on the current site on the details page.
At present these URLs fall short in two respects: they don't deliver useful data (just the relevant web page) and they aren't guaranteed to be persistent. This is because the unique identifier is generated as a hash of the data itself. If the data changes (because of corrections) then the hash changes too.
This is frustrating, but as you say it is built into the current system design. Please rest assured that I'm doing my level best to improve the situation.
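The failure mode is easy to demonstrate. Assuming a SHA-256-style content hash (the actual hash FreeBMD uses isn't stated here, and the sample records are invented), any correction to the record silently mints a new identifier and orphans the old one:

```python
import hashlib

def content_id(record: str) -> str:
    """An ID derived from a hash of the record's own content."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()[:16]

before = "SMITH, John  1881 Q2  Vol 5c  Page 312"
after  = "SMITH, John  1881 Q2  Vol 5c  Page 321"   # a correction is applied

# The corrected record gets a different ID, so any citation of the old
# ID now points at nothing - the identifier is not persistent:
print(content_id(before) == content_id(after))   # False
```

A persistent scheme needs the ID minted independently of the content (e.g. a sequence number or UUID assigned at first ingest), with the hash kept separately as a checksum if change-detection is wanted.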
Best wishes,
Thanks Tom, found it last night without too much trouble – even though the original links were broken.
From: root...@googlegroups.com <root...@googlegroups.com> On Behalf Of Thomas Wetmore
Sent: Tuesday, August 27, 2024 7:23 AM
To: root...@googlegroups.com
Subject: Re: [rootsdev] GEDCOM Records vs. Lines
Paul,
Event participation/participants…
Example 1: A Zoom call. Event in the cloud with start and end times. Many participants at various locations with a variety of durations and in various roles.
Example 2: A funeral. Two component events. One a burial attended by close family. The other a wake.
Example 3: Chapel membership (modelled as an event). Building as venue, person as pastor, others as assisting, individuals of the flock belonging. Extensible to mainstream institutions with affiliation, regulation, licensing for marriage, etc.
We can infer from the index entry that the event took place before the end of the quarter/year specified in the index, and we [choose to] infer that the event took place in the Registration District specified therein. We can't infer that it took place after the start of the specified quarter/year, although in most cases this would be a reasonable working assumption.
Either way, we are forced to accept the imprecision, both geographical and temporal, which is built into this particular source.
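The temporal side of that imprecision can be made explicit in code. A small sketch that turns a GRO quarterly index reference into a date window - with the caveat, as above, that only the end of the window is actually guaranteed, since registration lags the event:

```python
import datetime

def quarter_bounds(year: int, quarter: int):
    """Date window implied by a GRO quarterly index entry (quarter 1-4).

    The entry only guarantees the event happened *before* the end of the
    registration quarter; the start date is a working assumption, since
    the event may have occurred before the quarter began.
    """
    start_month = 3 * (quarter - 1) + 1
    start = datetime.date(year, start_month, 1)
    if quarter == 4:
        end = datetime.date(year, 12, 31)
    else:
        # last day of the quarter = day before the next quarter starts
        end = datetime.date(year, start_month + 3, 1) - datetime.timedelta(days=1)
    return start, end

print(quarter_bounds(1881, 2))  # (datetime.date(1881, 4, 1), datetime.date(1881, 6, 30))
```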
The original GRO index from which the scan was made, and the GRO Register are both invisible to us, so shouldn't form part of our metadata.
If this is all expressed using a framework such as the CIDOC CRM or Linked Art, we can choose which aspect to give emphasis to, since all these connections are two-way and can run in either direction. My initial instinct is to focus on the person, and record a mini-biographical record for them, noting that they had a certain name and were associated with a B/D/M event for which we have the following evidence. That would make it much easier to merge other events into the same biographical record, if one wished to do so (and noting your concern to keep sources separate from interpretations/conclusions).
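A loose sketch of such a person-centric record, borrowing the flavour of Linked Art / CIDOC CRM but not claiming validity against either model - every URI and property name here is hypothetical:

```python
import json

# Person-centric record: a name, plus a birth event whose only evidence
# is the index entry, kept distinct from any interpretation of it.
person = {
    "@id": "https://example.org/person/1",        # hypothetical URI
    "type": "Person",
    "identified_by": [{"type": "Name", "content": "F S"}],
    "born": {
        "type": "Birth",
        "timespan": {"begin": "1881-04-01", "end": "1881-06-30"},
        "referred_to_by": [{
            "type": "LinguisticObject",
            "content": "GRO birth index entry, Registration District R, 1881 Q2",
        }],
    },
}
serialized = json.dumps(person, indent=2)
print(serialized)
```

Because every link in such a model is two-way, the same data could equally be published event-first or source-first; the person-first framing just makes later merging of further events into one biographical record more natural.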
Best wishes,
Richard