Message from discussion GEDCOM Requirements [was RE: [BeyondGen] What's wrong with GEDCOM?]

MIME-Version: 1.0
Message-ID: <97e6e7b9-218b-4cc8-91f7-a4265ec84152@2g2000hsn.googlegroups.com>
Date: Tue, 1 Apr 2008 19:45:56 -0700 (PDT)
Received: by 10.150.133.17 with SMTP id g17mr274998ybd.8.1207104356469; Tue, 
	01 Apr 2008 19:45:56 -0700 (PDT)
In-Reply-To: <94B0B6D99C2381418360349500613693D75346@northfacemail2.northface.local>
X-IP: 84.142.47.224
References: <abb33cbd0803280830p7e044503u1d1b5fb79e57aeaf@mail.gmail.com> 
	<94B0B6D99C2381418360349500613693D75346@northfacemail2.northface.local>
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; 
	.NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 
	1.1.4322),gzip(gfe),gzip(gfe)
Subject: Re: GEDCOM Requirements [was RE: [BeyondGen] What's wrong with 
	GEDCOM?]
From: j...@daubnet.com
To: BeyondGen <beyondgen@googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Before going too much into the subject, I'd like to introduce myself.
My name is Jörn Daub, author of the genealogy software "Ages!". (By the way,
Jorn is perfectly fine if your keyboard lacks German umlauts.) I've
been working with GEDCOM files for ten years now, and would like to
leave some (rather lengthy) comments on the various topics in this
post.

"ZIPPED up GEDCOM" Proposal
I totally agree that there should be a defined way to store
genealogical data along with binary media in a zipped archive format.
I also agree that such a format should have a distinctive file
extension. I am unsure, however, if a dual extension (like .ged.zip)
wouldn't better serve that purpose. Such an extension has the
advantage of being accessible to the operating system and/or standard
zip software, with no further action or knowledge about the format
required by the user.
The disadvantage, however, is that you cannot associate a program with a
dual extension on MS operating systems. So a double click will not
bring up the file in your genealogy software, but in a ZIP program
instead. I'd like to read your opinions on this.

Proposed Family Search API to GEDCOM mapping
I skimmed through the Word document; I will do more in-depth research
at a later point in time. Here's what came to my mind when reading the
doc.
[API Mapping / Section: Person]
Not having done anything with the FamilySearch API yet, I'm unsure
what the "version=" attribute does, but it seems reasonable to put a
tag inside of INDI.CHAN if, and only if, that number cannot be
converted to a date-time value pair and the version number serves some
purpose beyond date/time versioning. In your example the version number
and the date/time both end with 771, so I guess the former is just the
same data in a different format. If that is the case, it should just
be left out of the GEDCOM file. If a tag is needed, however, it should
have an underscore, because GEDC.VERS and CHAR.VERS both relate to
versions of a standard, which has little in common with record
modification versioning, so you would actually "overload" two meanings
onto the same tag.
[API Mapping / Section: Information]
I would prefer to use the REFN/TYPE combination with a fixed TYPE
value, such as "FamilySearchAPI" or something of that sort. Parsing
should not be an issue, and such information has a good chance of
surviving when traveling through programs that do not know about the
FSAPI.
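A small Python sketch of the REFN/TYPE idea follows. The TYPE value "FamilySearchAPI" is the one suggested above; the function names and the ID value used in the usage example are placeholders invented for illustration, and the reader assumes the REFN sits at level 1 of a record.

```python
def refn_lines(external_id, level=1, id_type="FamilySearchAPI"):
    """Build a REFN line with a TYPE sub-line that marks where the ID came from."""
    return [f"{level} REFN {external_id}", f"{level + 1} TYPE {id_type}"]


def find_refn(record_lines, id_type="FamilySearchAPI"):
    """Return the level-1 REFN value whose TYPE sub-line equals id_type, else None."""
    current = None  # value of the most recent level-1 REFN, if we are still inside it
    for line in record_lines:
        parts = line.split(" ", 2)
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "1":
            current = value if tag == "REFN" else None
        elif level == "2" and tag == "TYPE" and current is not None and value == id_type:
            return current
    return None


# Usage with a made-up ID:
record = ["0 @I1@ INDI", "1 NAME John /Doe/"] + refn_lines("EXAMPLE-ID-123")
print(find_refn(record))
```

A program that knows nothing about the FamilySearch API can carry these two standard lines along unchanged, which is why this combination has a good chance of surviving a round trip.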
[API Mapping / Assertions]
Adding a CHAN sub-record to SEX, NAME and events makes sense. Since
the information is both syntactically and semantically similar to the
standard GEDCOM tags, I would also use CHAN/DATE/TIME without
underscores. As stated above, I would try not to store the
version="xx" number, and I'd stick with _VERS and _USER instead of
re-using VERS and SUBM in a different context with different meanings.
Using SUBM would make sense if you also intend to include
"u. 100000168" as a SUBM record, but if you did, it should read
3 SUBM @u. 1000168@. I simply don't understand what the _FSID # is doing
there, but that may be due to my lack of FamilySearchAPI experience.
[API Mapping / Family Generation]
I am unsure why you specified how you create the GEDCOM pointers. The
way you do it is perfectly fine, but basically no software should rely
on GEDCOM pointers being constructed in a certain manner, so I don't
see any need to document it publicly. Most programs will start
with @I1@, @I2@, etc., but it could be @JOHN_DOE1@, @JOHN_DOE2@ as well,
without making any difference. These pointers should not be assumed to
have any meaning. Using record IDs for anything but reconstructing
record links is a trap that many programmers have fallen into, and it
is basically calling for trouble.
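The point about pointers can be sketched in Python as follows. The reader below treats every record ID as an opaque dictionary key and uses it only to follow links; the regular expression covers just the simple line shapes in the sample and is not a full GEDCOM grammar, and the function names are invented for this example.

```python
import re

# level, optional @XREF@, tag, optional value
LINE_RE = re.compile(r"^(\d+) (?:(@[^@]+@) )?(\S+)(?: (.*))?$")


def index_records(gedcom_text):
    """Map each level-0 record ID to (tag, list of (level, tag, value) sub-lines)."""
    records = {}
    current = None
    for raw in gedcom_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        level, xref, tag, value = int(match[1]), match[2], match[3], match[4] or ""
        if level == 0:
            # Records without an ID (HEAD, TRLR) are skipped in this sketch.
            current = [] if xref else None
            if xref:
                records[xref] = (tag, current)
        elif current is not None:
            current.append((level, tag, value))
    return records


def spouse_names(records, fam_id):
    """Follow HUSB/WIFE pointers of a FAM record and collect the NAME values."""
    _, lines = records[fam_id]
    names = []
    for level, tag, value in lines:
        if level == 1 and tag in ("HUSB", "WIFE"):
            _, person_lines = records[value]  # the ID is only a lookup key
            names.extend(v for lv, t, v in person_lines if lv == 1 and t == "NAME")
    return names
```

Renaming @I1@ to @JOHN_DOE1@ throughout a file changes nothing in the result, because the code never looks inside an ID.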

Assertions and GEDCOM
There has been much debate about facts, sources, evidence and
assertions and their corresponding elements in GEDCOM. In my personal
opinion, this whole debate is flawed, because there are no such things
as "facts" in genealogy. It is all about assertions of some sort.
There is not a single type of source that hasn't been proven wrong at
some point in time. Once you accept that there are no facts, but only
assertions with varying reliability, the GEDCOM structure does provide
for most of the relevant information. I have yet to be convinced as to
what the differences between so-called "facts" and so-called
"assertions" are, and where to draw a line between those two. To me,
it seems that all that is needed is available: INDI, FAM and events to
store your facts/assertions/conclusions (whatever you want to call
it), source records and pointers to document the underlying data
leading to your 'facts', and binary attachments (OBJE) to store the
evidence itself. GEDCOM does allow for multiple conflicting sets of
information (yes, you can be born twice :-) ), and it does provide
ways to document imprecise information. The only thing that I actually
miss here is a way to document disproven information. Yes, there are a few
spots to store such information, but that's way too little. So for the
most part, you will need to use notes to store information that is
known to be wrong. Documenting how you decided between conflicting
dates is such an individual task that probably any data structure
would both have its shortcomings and be "overkill". That's why I'd
stick to notes for that purpose. Validating assertions is a task for
humans anyhow, not for computer systems. And humans will be just fine
with notes.

XML and whatnot
In an ideal world, GEDCOM would be an XML format. But this is not an
ideal world, and ... you guessed it, GEDCOM is not XML. While GEDCOM
could be easily mapped to an appropriate XML schema, what you wind up
with is the same problems, just in a different appearance. Any new
format may solve a lot of problems, but XML in itself will not solve
much; the schema might. But once you move to another schema, you will
suffer from something else: lack of acceptance. Just see in the
paragraphs above what a task it is to map two semantically different
systems. That is why I'd rather take the pragmatic approach and talk
about small details in GEDCOM. In my opinion, that is much more likely
to solve real problems in the real world, than discussing a new
schema.
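To illustrate how mechanical such a mapping is, here is a minimal Python sketch that turns well-formed GEDCOM lines into an XML tree with the standard library. The root element name "gedcom" and the "id" attribute are arbitrary choices for this example, not an existing schema, and the function does no validation or CONT/CONC handling.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(gedcom_text):
    """Map GEDCOM levels to nested XML elements, one element per line."""
    root = ET.Element("gedcom")
    stack = [root]  # stack[n] is the parent element for level-n lines
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level = int(parts[0])
        attrs = {}
        if parts[1].startswith("@") and len(parts) > 2:
            # Record line such as "0 @I1@ INDI": keep the ID as an attribute.
            attrs["id"] = parts[1]
            rest = parts[2].split(" ", 1)
            tag, value = rest[0], (rest[1] if len(rest) > 1 else "")
        else:
            tag, value = parts[1], (parts[2] if len(parts) > 2 else "")
        del stack[level + 1:]  # drop elements deeper than this line's parent
        element = ET.SubElement(stack[level], tag, attrs)
        if value:
            element.text = value
        stack.append(element)
    return root


# Usage:
sample = "0 HEAD\n0 @I1@ INDI\n1 NAME John /Doe/\n1 BIRT\n2 DATE 1 JAN 1900\n0 TRLR\n"
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

The output carries exactly the same tags and the same semantics as the input, only with angle brackets, which is the sense in which the problems stay the same.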
Most talk about the deficiencies of GEDCOM and the advantages of XML
comes from people who don't know much about GEDCOM, and have not read
or understood what the specs allow for. No, it is not a perfect
format, but it is something you can work with. If you think that ANSEL
was a bit weird, you are right. But if you think ANSEL is a "real
problem"(tm), you are not. Maybe this is just me, but I don't think it
would help all too much if all genealogy programmers started to speak
Esperanto. Yes, XML would be nice, but moving the discussion to that
subject ignores the fact that most genealogy programs can effectively
parse GEDCOM files. The real problems in data transmission are defined
by the differences in the source and target system's data structures,
not by the intermediate transmission format. XML won't change that a
bit, it is more likely to just add complexity to the problem.
Thanks for listening... *monologue ends here* :)

Jörn Daub