Acid test for GEDCOM


Justin York

Sep 9, 2013, 10:28:34 AM
to root...@googlegroups.com
I think the industry could benefit from an Acid test (http://en.wikipedia.org/wiki/Acid3) for GEDCOM: a suite that validates the format of GEDCOM files as exported from popular products.

Do you think this would be valuable?

Anybody want to take this on? I don't have enough experience with GEDCOMs (nor enough time) to do it myself.

- Justin York

Andrew Hatchett

Sep 9, 2013, 10:48:51 AM
to root...@googlegroups.com
If anyone takes this on, they may find the information posted in 2010 and 2011 on the BetterGEDCOM Blog ( http://bettergedcom.blogspot.com/ ) somewhat useful.


--
 
---
You received this message because you are subscribed to the Google Groups "rootsdev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rootsdev+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Tony Proctor

Sep 9, 2013, 10:55:21 AM
to root...@googlegroups.com
 
    Tony Proctor

Ben Laurie

Sep 9, 2013, 10:58:17 AM
to root...@googlegroups.com
On 9 September 2013 15:55, Tony Proctor <to...@proctor.net> wrote:

The problem, though, is that there's no standardised interchange format that can capture the content of most genealogical packages, which means people extend GEDCOM to do so.

Thomas Wetmore

Sep 9, 2013, 12:25:00 PM
to root...@googlegroups.com
Justin,

I don't think this will generate much traction. It has been kicked around for years. A number of pathological GEDCOM files have been created over the years, and different studies have been done to evaluate the abilities of different products to handle them. The studies are nice to look at and tsk, tsk, tsk over, but nothing positive comes from them.

In my opinion the vendors generally couldn't care less about GEDCOM issues. Every vendor extends GEDCOM in different ways that are incompatible with other vendors. In fact each vendor tends to misunderstand and incorrectly implement GEDCOM in its own idiosyncratic ways. And they really don't care and have no vested interest in fixing things. From a vendor's point of view there is no disadvantage to locking in your customers with a version of GEDCOM that works poorly or not at all with others. Transporting data from one system to another using GEDCOM, with the high level of loss and mistranslation that occurs, is one of the most griped-about issues in the technical parts of genealogy. We've been griping about this for over 30 years, but with no strong backing for GEDCOM from anywhere, and absolutely no way to enforce the standard (if we should even call it that), we will simply continue to gripe and gripe.

The only hope is solutions like AncestorSync, or waiting for a more complete, and actually maintained and supported, standard to become, well, standard. Even if such a standard existed today (I would claim that GEDCOM-X is probably not that standard, and nothing real is on the horizon from FHISO), it would still take 20 years or more before it became prevalent enough throughout the industry to make any difference. I believe that products like AncestorSync are the only hope for true data sharing in the immediate future, meaning from now to perhaps five years out. [I am associated in an ancillary way with the people working on AncestorSync, so take my comments with a grain of salt.]

Tom Wetmore

Dallan Quass

Sep 9, 2013, 1:07:22 PM
to root...@googlegroups.com
I agree with Tom.  I believe the best we can hope for in the near term is to first come up with a "de facto" model for GEDCOM: a reasonably simple model that extends the GEDCOM standard along the same lines as popular vendors have extended it (balancing simplicity against how many extensions to include), and then ask how closely GEDCOMs adhere to that model.
 
I attempted to do that as part of my GEDCOM parser: https://github.com/DallanQ/gedcom . The object model is here: https://github.com/DallanQ/Gedcom/wiki/UML-Diagrams . It is based upon an analysis of the 7000 GEDCOMs that were submitted to WeRelate from 2007-2011.  In addition to extending the GEDCOM standard in various ways, certain parts of the standard are omitted ( https://github.com/DallanQ/Gedcom/wiki/Specification-differences ) because they do not generally appear in GEDCOMs in the wild and including them would complicate the model unnecessarily.
 
The model represents most of the information found in most of the GEDCOMs analyzed, and all of the information found in 50% of the GEDCOMs analyzed.  The model also allows for "extensions", which allows it to represent all of the information found in 98% of the GEDCOMs analyzed. The result is that 98% of 7000 GEDCOMs analyzed can be converted to the object model (representable in a JSON format) and then converted back to GEDCOM without any loss of information.
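The round-trip claim above can be sketched at the line level. This is only an illustrative Python stand-in, not the DallanQ/gedcom parser (which is Java), and a real check must go through the full object model rather than raw lines:

```python
# Minimal round-trip check at the GEDCOM line level: parse each line into
# (level, xref, tag, value), re-serialize it, and verify nothing was lost.
import re

LINE_RE = re.compile(r"^(\d+) (?:(@[^@]+@) )?(\w+)(?: (.*))?$")

def parse_line(line):
    """Split one GEDCOM line into its level, optional xref, tag, and value."""
    m = LINE_RE.match(line)
    if not m:
        raise ValueError("unparseable GEDCOM line: %r" % line)
    level, xref, tag, value = m.groups()
    return int(level), xref, tag, value

def emit_line(level, xref, tag, value):
    """Re-serialize the parsed parts back into a GEDCOM line."""
    parts = [str(level)]
    if xref:
        parts.append(xref)
    parts.append(tag)
    if value is not None:
        parts.append(value)
    return " ".join(parts)

def round_trips(gedcom_text):
    """True if every line survives a parse + emit cycle unchanged."""
    return all(emit_line(*parse_line(l)) == l
               for l in gedcom_text.splitlines())

sample = "\n".join([
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "0 TRLR",
])
print(round_trips(sample))  # True
```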
 
Dallan

Enno Borgsteede

Sep 9, 2013, 1:24:25 PM
to root...@googlegroups.com
Tony,
I just read your article at

http://parallax-viewpoint.blogspot.ie/2013/08/are-we-modelling-data-or-commerce.html

and would like to add that I agree with Tom. I see no way to enforce anything, and know from experience that GEDCOM implementations don't play a big role in the products that I choose to use. I switched from PAF to Gramps last year, and chose Ancestral Quest and RootsMagic to connect to the FS tree. And since the contents of that tree are mostly without sources now, there is no need to worry much about GEDCOM, because names and vitals are transferred well enough, and I run a modified Gramps that also supports the _FSFTID tag.

For me this means that in a way the FS API is the killer product now, so it is FS again. The FS API works for me right now, but it may not work so well when more sources are available, because the source model used on FS looks quite limited to me. I'd really like to see more attributes, but no EE.

That said, because the FS API is only used with the FS tree, it is also quite limiting. If I want to exchange data with better-sourced sites, like WeRelate.org, I have to switch back to GEDCOM, and that IS a problem when it comes to source formats.

So in the end, a product like AncestorSync may appeal more to me, provided that it supports Gramps and FS, and WeRelate, and wikitree, and so on. And that's a fantasy that I read on Randy Seaver's blog years ago.

In the long run, I don't think I care about GEDCOM that much. Importing large numbers of people into my main database is quite risky, so I don't do it often, and in most cases GEDCOM is good enough for uploading to Ancestry, Genealogie Online, Geneanet, and so on.

So I think sync is it, and that's proprietary now, at Ancestry, FS, and MyHeritage.

regards,

Enno

Louis Kessler

Sep 9, 2013, 8:06:40 PM
to root...@googlegroups.com
Justin,

I've found that single acid tests don't work well. It is difficult to detect everything in one test in a reasonable way.

I like the idea of having a large number of very small tests for specific GEDCOM issues. Then individual problems can easily be identified, and easily fixed, because there will be a small test to check the corrected program against.

For example, Tamura Jones has published a number of small tests for GEDCOM. Just today he published another good one, which half of the tested programs failed: http://www.tamurajones.net/IdentCONT.xhtml
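As an illustration of the "many small tests" idea (the actual IdentCONT test is Tamura Jones's; this stand-in only shows the general shape), one tiny single-issue check in Python might verify that a reader rejoins NOTE values split across continuation lines:

```python
# One tiny, single-issue test: does a reader rejoin a NOTE value split
# across CONT (newline) and CONC (no separator) continuation lines?
def join_note(lines):
    """Rejoin a NOTE value from (level, tag, text) tuples."""
    value = []
    for level, tag, text in lines:
        if tag == "NOTE":
            value.append(text)
        elif tag == "CONT":   # CONT = continue on a new line
            value.append("\n" + text)
        elif tag == "CONC":   # CONC = concatenate with no separator
            value.append(text)
    return "".join(value)

record = [
    (1, "NOTE", "First line, conti"),
    (2, "CONC", "nued without a break,"),
    (2, "CONT", "then a second line."),
]
expected = "First line, continued without a break,\nthen a second line."
print(join_note(record) == expected)  # True
```

A reader passes this one test if and only if the comparison prints True; a dashboard would simply collect hundreds of such pass/fail bits.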

Louis



Tony Proctor

Sep 10, 2013, 3:38:07 AM
to root...@googlegroups.com
It seems we're all a little pessimistic today. Although I believe there are vendors out there who only pay "lip service" to full import/export, I also know that I cannot generalise across all vendors.
 
Perhaps the only way out of this is to get the user-friendly vendors co-operating on a standard mechanism, and then embarrass the remainder into supporting it. It wouldn't be long before the word got around that 'product A' will lock you in, whereas 'product B' will not. It would take some 'pretty damn charming MF' features to persuade someone to go with 'product A'  :-)
 
    Tony Proctor

Justin York

Sep 10, 2013, 10:20:14 AM
to root...@googlegroups.com
Louis I like the way you're thinking. Assembling Tamura's tests into one dashboard would quickly reveal the most frequent offenders and enable us to track progress across versions of a given product. We wouldn't even have to tackle sources and citations to have a positive impact.

Doug Blank

Sep 10, 2013, 11:17:01 AM
to root...@googlegroups.com
Having an agreed-upon set of GEDCOM files for testing would indeed be great! Some of Tamura's tests would be a good start, but like any good set of unit tests, it would be a large undertaking to get full coverage. I see that we could break testing down into Import compatibility and Export compatibility. For Import, it would be nice to have both small and large test GEDCOMs, as those test two very different things.

For testing Import, it would be nice to see:

* messages produced during import
* evidence that the data was correctly parsed (maybe use Export to test, but that introduces additional possible errors)
* time it took to load
* assumptions made for unspecified data
* how it deals with non-official extensions (it should not crash, but just ignore them)
* overall score

For testing Export, it would be nice to see:

* differences between the resulting GEDCOM and the original (maintaining the original facts)
* a listing of non-official extensions (perhaps producing extra information should not be penalized?)
* time it took to save
* overall score

Users could self-report running the tests on different products:

* test file, version
* product, version
* operating system, version
* other relevant software versions
* computer specs (RAM, CPU, etc)

I think someone could put together a Ruby on Rails or Django site fairly quickly...
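The core of the Export check described above can be sketched in a few lines of Python. The file contents, scoring fields, and the underscore convention for extension tags are illustrative assumptions, not a real harness:

```python
# Sketch of the Export check: diff a product's re-exported GEDCOM against
# the original, and list any nonstandard extension tags (which by
# convention start with an underscore).
import difflib
import re

def extension_tags(gedcom_text):
    """Collect vendor extension tags (conventionally prefixed with '_')."""
    tags = set()
    for line in gedcom_text.splitlines():
        m = re.match(r"^\d+ (?:@[^@]+@ )?(_\w+)", line)
        if m:
            tags.add(m.group(1))
    return tags

def export_report(original, exported):
    """Count changed lines between original and re-export, list extensions."""
    diff = list(difflib.unified_diff(
        original.splitlines(), exported.splitlines(), lineterm=""))
    changed = sum(1 for d in diff
                  if d.startswith(("+", "-"))
                  and not d.startswith(("+++", "---")))
    return {
        "changed_lines": changed,
        "extensions": sorted(extension_tags(exported)),
    }

orig = "0 @I1@ INDI\n1 NAME Jane //\n0 TRLR"
expo = "0 @I1@ INDI\n1 NAME Jane //\n1 _UID 1234\n0 TRLR"
print(export_report(orig, expo))
# {'changed_lines': 1, 'extensions': ['_UID']}
```

The self-reported environment details (product, OS, machine specs) would be attached to each such report when submitted to the site.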

-Doug

Thomas Wetmore

Sep 10, 2013, 2:39:06 PM
to root...@googlegroups.com
My pessimism concerns GEDCOM, its possible future, and vendor involvement with it. Since GEDCOM is not a standard, with no standards body maintaining it or certifying applications that use it, claiming that something is GEDCOM-compatible means nothing.

The solution you mention here is the same one I listed as number one in my post, the presumed FHISO approach, of creating a new standard, maintaining it, and certifying applications that claim to support it. In this case a vendor would have to certify their implementation and the phrase XXXXX-compatible would mean something.

I am pessimistic about that solution also, though not nearly as much as about GEDCOM. I am pessimistic that FHISO can do the job, and I'm pessimistic about the twenty or more years it would take, even if the job were done, for the effects to spread through the industry.

What will probably happen is that the GEDCOM-X API will come to define how genealogical search results are returned from the big servers, others will start using that API, and the data parts of that API will morph into the data model used by new systems. This will come to be the "persona" data model; whether it will also become the conclusion-level data model remains to be seen.

The only useful thing that can be done with GEDCOM today is to study its different implementations from all the vendors, paying special attention to extensions that vendors have added, with the goal of understanding the "super-model" of genealogical data implied by all the variants and extensions. What is useful about understanding this super-model is that it provides one of the starting points that should be evaluated during the work of, presumably FHISO, when defining the requirements that any new data model and data format for genealogy should support.
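A first step toward that super-model survey could be as simple as tallying every tag that appears across a corpus of GEDCOM files, so the union of what vendors actually emit can be studied. A minimal sketch in Python, with the sample file contents invented for illustration:

```python
# Tally every tag (standard tags and vendor _extensions alike) across a
# collection of GEDCOM file contents, to surface the implied super-model.
import collections
import re

def tag_census(gedcom_texts):
    """Count each tag seen across an iterable of GEDCOM file contents."""
    counts = collections.Counter()
    for text in gedcom_texts:
        for line in text.splitlines():
            # level, optional @xref@, then the tag itself
            m = re.match(r"^\s*\d+ (?:@[^@]+@ )?(\w+)", line)
            if m:
                counts[m.group(1)] += 1
    return counts

files = [
    "0 @I1@ INDI\n1 _FSFTID ABCD-123\n0 TRLR",
    "0 @I1@ INDI\n1 BIRT\n0 TRLR",
]
census = tag_census(files)
print(census["INDI"], census["_FSFTID"])  # 2 1
```

Sorting the census by frequency, with the underscore-prefixed tags separated out, would give exactly the per-vendor extension picture that a requirements effort could start from.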

Collecting pathological GEDCOM files and evaluating what different programs do with them has been done many times already. Why waste time, effort, and mental energy on this, when GEDCOM has no future, when vendors couldn't care less about the results of these tests, when nothing effective would come from them, and when real work on formats that would go far beyond GEDCOM is so much more important?

Tom Wetmore