What's wrong with GEDCOM?


Dan Hanks

Mar 28, 2008, 11:30:47 AM3/28/08
to beyo...@googlegroups.com
Hi folks,

At the recent FHT workshop at BYU, the topic of GEDCOM's deficiencies
was brought up. I know there are folks on both sides of the aisle on
this one: some feel GEDCOM works just fine, while others feel there
are significant deficiencies.

I fall just to the side of the line of those who feel there are
deficiencies in GEDCOM, though I think there's a lot that's good
there.

I would like to start a conversation to nail down a comprehensive list
of the strengths and weaknesses of the current GEDCOM standard. If
such a list already exists, please point me to it.

Do you love GEDCOM? Tell us all why.

Think GEDCOM is not sufficient? Tell us all why.

Know of folks who should be part of this conversation but aren't on
the list? Send them an invite to join, please.

Thanks,

-- Dan

John Finlay

Mar 28, 2008, 12:41:33 PM3/28/08
to beyo...@googlegroups.com
I've thought a lot about this topic, and I think that listing deficiencies is the wrong way to approach the problem. You could do the same thing with any format for encoding genealogical data (XML, object, binary, etc.) and you would come up with about the same number of deficiencies.

As was mentioned at the conference, the problem is not the encoding format; the biggest problem is that data is lost as it is translated between data models. That problem is going to exist no matter what encoding/transmission standard you choose, whether you are going from the internal model to GEDCOM or from GEDCOM to the internal model. I could have direct binary access to the file format of another system and I would still have exactly the same problem. (That is why PGV uses GEDCOM as its internal model: then we can guarantee that we never lose data, at least on the receiving end.)

As a model GEDCOM is actually very good. I've never been able to come up with something in genealogy that I wasn't able to model and encode in GEDCOM. There are a few minor tweaks I would make to bring it up to what people are doing with genealogy. For example, geocoding is really big right now in family history, and while GEDCOM can support geocoding, it could use a revamp in the area of encoding places.

I think the correct approach to take is to go back to "requirements". What are the requirements for a genealogical data transmission standard? Remember that requirements should be independent of any encoding format or technology. In other words you can't say that "XML" or "GEDCOM" is a requirement. Though you can say that "backwards compatibility" or "ubiquitous adoption" are requirements.

I think if you look at the "requirements" you will find that GEDCOM meets most of those requirements and meets them well, which is why it has persisted as the standard for so long.

--John


John Finlay

Mar 28, 2008, 12:41:41 PM3/28/08
to beyo...@googlegroups.com
Regarding GEDCOM requirements, there was one requirement I picked up on at the conference which I've heard from users for a long time but didn't recognize as a requirement until now. The comment was about the inability to transmit media and keep it attached to the data. GEDCOM has had the ability to encode links to media objects, but those links are useless on the receiving end without the media they point to (unless they point to a URL, which I've found to work quite nicely). Again, any text-based encoding format such as GEDCOM or XML is not going to meet this requirement on its own.

I think there is an easy solution to this particular problem: we could take the same approach that Java does with JARs, WARs, and EARs. A JAR is just a ZIP file with a certain internal file structure, and any zip program can open it.

It would be very easy to do the same thing with GEDCOM. My proposal would be simply to zip up a file structure something like this:
/
|__ gedcom.ged
|__ media/
    |__ picture.jpg
    |__ subfolder/
        |__ picture2.jpg

This structure would be zipped up and given a filename like "gedcom.gedz", where the base name of the archive matches the filename of the GEDCOM file inside it and we use the new ".gedz" extension.

But now I have a problem. Who do I submit this proposal to? No one :(

And here we arrive at what I think GEDCOM lacks the most and that is "authority". There is no standards body or committee behind it. It has for all intents and purposes been abandoned by the LDS Church. There are no good community backed tools such as parsers or validators. I also think that if a new XML standard were developed we would arrive at the same problem in another 5 years. What we need the most is an open GEDCOM community.

I've contemplated trying to start an open community around GEDCOM for a while now. I haven't because I don't want it to be viewed as PhpGedView's version of GEDCOM. It needs the backing of the entire industry in order to be successful. Up until now I didn't have the contacts in industry to make it happen. But now the cooperation around the new FamilySearch API has started to bring the industry together and I think the time has come to really push the initiative of a GEDCOM standards committee to promote the future of GEDCOM.

--John

________________________________

From: beyo...@googlegroups.com on behalf of Dan Hanks
Sent: Fri 3/28/2008 9:30 AM
To: beyo...@googlegroups.com
Subject: [BeyondGen] What's wrong with GEDCOM?


Dan Hanks

Mar 28, 2008, 2:48:03 PM3/28/08
to beyo...@googlegroups.com
On 3/28/08, John Finlay <John....@neumont.edu> wrote:
> As a model GEDCOM is actually very good. I've never been able to come up with something in genealogy that I wasn't able to model and encode in GEDCOM. There are a few minor tweaks I would make to bring it up to what people are doing with genealogy. For example, geocoding is really big right now in family history, and while GEDCOM can support geocoding, it could use a revamp in the area of encoding places.
>


I think this is the crux of the question I am asking the group: are
there others for whom GEDCOM is sufficient (as it is for you, John)?
And if not, why not? For the others, what is GEDCOM not able to do
that you would like to do, as far as modeling genealogical
information?

One question I have regarding GEDCOM is how well it can model
different kinds of parent-child relationships. For example, can you
indicate that child A of parent X is biological while child B of
parent X is adopted?

Thanks for your input John,

-- Dan

jay

Mar 28, 2008, 3:34:38 PM3/28/08
to BeyondGen
A community that I'm sure would have a lot to say about this is the
GenealogyXml community:

http://tech.groups.yahoo.com/group/GenealogyXML/

I echo what John said about locations in GEDCOM.

John Vilburn

Mar 28, 2008, 4:09:25 PM3/28/08
to beyo...@googlegroups.com
John,

I would support an open GEDCOM committee and I support you in your efforts
to create such a committee. I believe that I am sensing that same support
from the other major genealogy product vendors. The problem is that the LDS
Church currently owns the copyright for the GEDCOM spec. If we formed a
committee and approached them, I think that they would probably transfer
rights to the committee. It is certainly worth a try.

Aloha,
John Vilburn
Ohana Software LLC

John Finlay

Mar 28, 2008, 6:11:12 PM3/28/08
to beyo...@googlegroups.com

>
> I think this is the crux of the question I am asking to the group, are
> there others for whom GEDCOM is not sufficient (as it is for you,
> John)? And if not, why not? For the others, what is GEDCOM not able to
> do that you would like to do - as far as modelling genealogical
> information?
>

For anyone interested in GEDCOM place encoding, there is the GEDCOM 5.5 EL extension, used by some programs, which has better place support.

The other change I would like to make is the possibility of level 0 source citations, which would make it easier to take a source-centric approach. This has been a major point of discussion, though, and it boils down to personal preference, so it may not be an appropriate change.


>
> One question I have regarding GEDCOM is how well it can model
> different kinds of parent-child relationships? I.e., Can you indicate
> that child A of parent X is biological, while child B of parent X is
> adopted, for example?
>

Yes, the current GEDCOM spec supports this though I doubt most genealogy apps would read it if you included it. From the GEDCOM spec:
CHILD_TO_FAMILY_LINK:=
n FAMC @<XREF:FAM>@ {1:1}
+1 PEDI <PEDIGREE_LINKAGE_TYPE> {0:1}
+1 STAT <CHILD_LINKAGE_STATUS> {0:1}
+1 <<NOTE_STRUCTURE>> {0:M}
PEDIGREE_LINKAGE_TYPE:= {Size=5:7}
[ adopted | birth | foster | sealing ]
A code used to indicate the child to family relationship for pedigree navigation purposes.
Where:
adopted = indicates adoptive parents.
birth = indicates birth parents.
foster = indicates child was included in a foster or guardian family.
sealing = indicates child was sealed to parents other than birth parents.

CHILD_LINKAGE_STATUS:= {Size=1:15}
[challenged | disproven | proven]
A status code that allows passing on the user's opinion of the status of a child-to-family link.
challenged = Linking this child to this family is suspect, but the linkage has been neither proven nor
disproven.
disproven = There has been a claim by some that this child belongs to this family, but the linkage
has been disproven.
proven = There has been a claim by some that this child belongs to this family, and the
linkage has been proven.

Taking your example above:

0 @A@ INDI
1 NAME Child /A/
1 FAMC @F1@
2 PEDI birth

0 @B@ INDI
1 NAME Child /B/
1 FAMC @F1@
2 PEDI adopted
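As an illustration of how a program might consume these links, here is a rough line-oriented sketch (my own, not a real parser; it ignores CONT/CONC continuation lines and other details of the spec) that collects the PEDI value for each child's FAMC link:

```python
def child_pedigrees(gedcom_text):
    """Map each INDI xref to the PEDI values on its FAMC links.

    A minimal scan: tracks the current level-0 INDI record and picks up
    any "2 PEDI" line subordinate to a "1 FAMC" line.
    """
    result = {}
    current = None      # xref of the INDI record being read
    in_famc = False     # True while inside a FAMC substructure
    for line in gedcom_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0" and tag.startswith("@") and len(parts) > 2 and parts[2] == "INDI":
            current = tag.strip("@")
            in_famc = False
        elif level == "1":
            in_famc = (tag == "FAMC")
        elif level == "2" and tag == "PEDI" and in_famc and current and len(parts) > 2:
            result.setdefault(current, []).append(parts[2])
    return result
```

Running it over the two-child example above would yield {"A": ["birth"], "B": ["adopted"]}.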


Jesper Zedlitz

Mar 29, 2008, 6:29:31 AM3/29/08
to beyo...@googlegroups.com
Dan Hanks wrote:
> Think GEDCOM is not sufficient? Tell us all why.
>
The GEDCOM data model is (nearly) perfect for what (at least I think) it was
invented for: the presentation of genealogical data. A GEDCOM file is
something like an electronic family tree.

But this is not what a lot of people want to transfer. They want to store and
exchange their research progress. This includes not only the final result but
also the provenance of how the information was found.

In a couple of days I am going to give a talk at a conference in Germany
about genealogical data models, so I have already collected some information
on the topic.

1) In genealogy we explore the past (everything that existed and happened). We
know that we won't be able to cover it completely. No matter how hard we try,
we can only get very close.

2) All we know about what happened in the past is written in documents. Not
everything has been documented, and there are wrong documents (created by
mistake or on purpose), too.

3) Finally, we combine information from these sources and draw conclusions
because they seem rational to us. During the research we might find other
sources that confirm these conclusions or disprove them.

For a complete documentation of genealogical research we need to provide
information about layers 2 (the sources) and 3 (the researcher's assertions)
without mixing them up. This gives us a (limited) view of layer 1.

Using GEDCOM, you try to model layer 1 directly, using as much information
from layer 2 as is needed to prove the created layer 1 data, and
possibly "correcting" some of it to match. Duplicate and conflicting
information is ignored. Research progress from layer 3 is ignored
completely.

Jesper

--
Jesper Zedlitz E-Mail : jes...@zedlitz.de
Homepage : http://www.zedlitz.de
ICQ# : 23890711


Jay Askren

Mar 29, 2008, 9:30:38 AM3/29/08
to beyo...@googlegroups.com
I believe that is the idea behind the GenTech model:
http://www.ngsgenealogy.org/ngsgentech/projects/Gdm/Gdm.cfm

Along the same lines as Jesper's point, it has annoyed me for some time that the modern genealogy software I know of doesn't really help me do my genealogy.  All it does is help me make a pretty family tree after I've done the genealogy.  I have to gather all of the source information and make my assertions, and then I can enter the data in my family tree software.  Gathering the source data and making the assertions is much more difficult and time consuming than creating the family tree.  It would be nice if the software helped me earlier on, during the hard part of the process.  Because software doesn't generally work this way, it's very easy for the user to get sloppy and not document sources.  It's also very easy for information to get lost, because typically only one assertion can be made at a time about an individual.  I think this is a very important requirement.

Jay

John Finlay

Mar 29, 2008, 11:26:45 AM3/29/08
to beyo...@googlegroups.com

> Along the same lines as Jesper's point, it has annoyed me for some time that the modern genealogy software that I know of
> doesn't really help me do my genealogy. All it does is help me make a pretty family tree after I've done the genealogy. I
> have to gather all of the source information, make my assertions and then I can enter the data in my family tree software.
> Gathering the source data and making the assertions, is much more difficult and time consuming than creating the family
> tree. It would be nice if the software helped me earlier on during the hard part of the process. Because software doesn't
> generally work this way, it's very easy for the user to get sloppy, and not document sources. It's also very easy for
> information to get lost, because typically only one assertion can be made at a time about an individual. I think this is a very
> important requirement.
>

While this moves away from transmission and model requirements, I completely agree with this assessment.

How would you like software to help with your research?

--John


Jay Askren

Mar 29, 2008, 8:54:26 PM3/29/08
to beyo...@googlegroups.com
I'm not by any means a professional genealogist, just a software engineer with an interest in usability who has worked on my own genealogy and worked in the family history center helping others with their genealogy. I'm also currently working on an open source project to implement my ideas, but it's still early and I haven't thought everything through yet.  Off the top of my head, I can think of five main things I want my genealogy software to do. 

First, I want to collect "facts" from sources.  When I'm looking through the 1900 Census, it would be nice if I could bring up a screen that looks like the 1900 Census form.  I could type in the "facts" straight from the census, with a citation saying precisely where each came from, and save a copy of the image with it.  Ex: I find a person in the 1900 census for Harrison County, Indiana, named Samuel Askren.  He had a wife and four dependents.

Second, I want to attach each fact to the appropriate individuals/families in my family tree. Ex:  I tell my program that the Samuel Askren in the census is the same person as my great-grandfather.  (Later I may decide that he isn't the same person.  I don't want to delete the fact, but I do want to un-attach it from him.  The fact itself may be useful later.)

Third, I want to make conclusions. In order to decide which source I believe more, it's very important to be able to see conflicting information and judge which piece of information is more reliable.  I also want to assign a value indicating how confident I am that the data is correct.   Ex: The "facts" about Samuel Askren include a birth certificate saying he was born in 1873 and an obituary saying he was born in 1872. The birth certificate seems more trustworthy.  Both facts are visible, but I am able to mark the fact from the birth certificate as the one I am using and give reasons why.  (Again, I want to be able to change my mind as I find more sources.)

Fourth, I want to visualize the results, but that's another topic altogether.

Fifth, I want to share my research with others: not only the final results of my research (step 3 above), but also steps 1 and 2.


This brings us back to the topic at hand.  So, does GEDCOM allow me to say Samuel Askren has a birthdate of 1873 according to the obituary and 1872 according to his birth certificate, and that the birth certificate is more likely to be correct?  I don't believe GEDCOM allows me to keep my "facts" or evidence separate from my conclusions, as Jesper was talking about.  I don't believe that standard GEDCOM allows me to store conflicting opinions about an individual, though John, you mentioned at the FHT workshop that you've been able to create a GEDCOM variation to store conflicting opinions about an individual.  I would like more information about that.


As a side note, I think Mark Tucker also has some good ideas about sources.  We should look at genealogy best practices and model our software after those best practices:
http://www.thinkgenealogy.com/wp-content/uploads/Genealogy%20Research%20Map%201280x800.jpg


Finally, I haven't investigated it as much as I would like, but I believe that GenXml was created based on the GenTech model which is supposed to be a model for doing genealogy research rather than just creating family trees.  Unfortunately, I don't know enough about GenXml to say whether it's good or bad, but it may be something to look into as an alternative to GEDCOM.


Jay

Annette Harper

Mar 30, 2008, 2:21:54 PM3/30/08
to beyo...@googlegroups.com

Hi. I’m new to this list and don’t want to appear to be pushing my own product. I’m here because I’m interested in improving the GEDCOM standard and my product’s integration with family tree applications via GEDCOM.

 

That said, I’d like to tell Jay that my product, facTree, is an attempt to meet his first two points. facTree forms currently exist for all of the US population schedules and allow you to input data on a form that looks like the census form while it converts that information into GEDCOM records behind the scenes. You can then import the resulting GEDCOM file into whatever family tree application that you use and merge it using that application’s merge capabilities. A free version that includes the form for the 1880 census is available at http://www.thegenealogyshop.com/Downloads.html. We plan to expand into other form types including birth, death, marriage, draft registration, etc.

 

facTree basically creates a GEDCOM file that incorporates all of the facts available from a single source, both direct and indirect (the latter based on user-customizable assumptions). The impetus was the desire Jay stated: to have a data entry mechanism that mimics the source document. This increases accuracy and speed.

 

Annette


Dan Hanks

Mar 30, 2008, 8:24:12 PM3/30/08
to beyo...@googlegroups.com
On Sat, Mar 29, 2008 at 7:30 AM, Jay Askren <jay.a...@gmail.com> wrote:
> I believe that is the idea behind the GenTech model:
> http://www.ngsgenealogy.org/ngsgentech/projects/Gdm/Gdm.cfm
>
> Along the same lines as Jesper's point, it has annoyed me for some time that
> the modern genealogy software that I know of doesn't really help me do my
> genealogy. All it does is help me make a pretty family tree after I've done
> the genealogy. I have to gather all of the source information, make my
> assertions and then I can enter the data in my family tree software.
> Gathering the source data and making the assertions, is much more difficult
> and time consuming than creating the family tree. It would be nice if the
> software helped me earlier on during the hard part of the process. Because
> software doesn't generally work this way, it's very easy for the user to get
> sloppy, and not document sources. It's also very easy for information to
> get lost, because typically only one assertion can be made at a time about
> an individual. I think this is a very important requirement.

Mark Tucker mentioned something in his presentation that I thought was
a pretty important point. He had been speaking with Elizabeth Shown Mills
(author of "Evidence" and "Evidence Explained"), who suggested that
people learn how to do genealogy research from the software they use. In
most cases that means we start with an empty form and start filling
in names, dates, and places, and source materials are an afterthought.

I've given some thought to software that starts the other way around,
by asking you, "What are you looking at right now?" and guiding you
through the process to extract the information from the document you
happen to be looking at, making sure it's properly cited so that
others can more easily find your sources when they come after you. Or
perhaps the software would start even earlier in the process, by
asking you,

"What do you want to find out?"
"I want to find out more information about my grandfather."
"Do you know about when and where he was born?"
"Around 1870 in Park City, Utah".
"Ok, at that point in time Park City was known as Parley's Park. Here
are some records you may wish to search to find out more...when you
have obtained copies of these documents, come back and we'll gather
the information about those records into your database..."

I believe there was/is some software called GenSmarts that does this
kind of thing based on the data you already have in your record
manager. I'd like to see software that walks you through the research
process, perhaps getting less and less verbose as you gather more
information and become more familiar with the process. All the while
as you are gathering documents it's helping you accurately store the
information, cite the source it came from and so forth. As mentioned
in another post, this kind of software would be helping you accumulate
"facts" from source documents, and helping you to make conclusions
about those facts you have gleaned.

Underlying all this has to be some form of "fact engine" that allows
you to store arbitrary facts about individuals, events, places, and so
forth, each of which is linked to the specific source document from
which the fact came, all the while allowing you to overlay the
conclusions you have made based on those facts.

I think that's why I've been discouraged a bit by trying to use a
'one-size-fits-all' sort of data model where the data fields are
already defined. I see the solution as some kind of model that allows
you to define the fields as you go, as it were.

Thanks for your input,

-- Dan

Dan Hanks

Mar 30, 2008, 8:42:13 PM3/30/08
to beyo...@googlegroups.com
On Fri, Mar 28, 2008 at 9:30 AM, Dan Hanks <danh...@gmail.com> wrote:
> Think GEDCOM is not sufficient? Tell us all why.

How well does GEDCOM allow for storing more than one name for an
individual, based on different sources? I have a relation for whom
I found four different variations of her name in various sources (a
DUP history, her gravestone, the census, etc.).

The open-source Gramps software allows you to store name variants,
each tied to a specific source. Does GEDCOM allow you to do this?

<looking up the answer myself.../>
Consulting the GEDCOM standard, I see that an individual record allows
{0:M} <<PERSONAL_NAME_STRUCTURE>> records, each of which may have
{0:M} source citations associated with it. So it looks like yes,
GEDCOM supports this very thing. +1 for GEDCOM :-).
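For the curious, an individual with several sourced name variants might look something like the fragment below. This follows the pattern of the PERSONAL_NAME_STRUCTURE and source-citation grammar mentioned above; the xref IDs and names are invented for illustration.

```
0 @I2@ INDI
1 NAME Mary /Smith/
2 SOUR @S1@
1 NAME Mary Ann /Smith/
2 SOUR @S2@
1 NAME Polly /Smith/
2 SOUR @S3@
```

Each level-1 NAME carries its own citation, so the variant from the gravestone, the census, and so on each stays tied to the source it came from.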

-- Dan

Logan Allred

Mar 31, 2008, 8:25:35 AM3/31/08
to beyo...@googlegroups.com

On Mar 28, 2008, at 10:41 AM, John Finlay wrote:
> But now I have a problem. Who do I submit this proposal to? Noone :(
>
> And here we arrive at what I think GEDCOM lacks the most and that
> is "authority". There is no standards body or committee behind
> it. It has for all intents and purposes been abandoned by the LDS
> Church. There are no good community backed tools such as parsers
> or validators. I also think that if a new XML standard were
> developed we would arrive at the same problem in another 5 years.
> What we need the most is an open GEDCOM community.

I mostly agree with John that the biggest detriment to GEDCOM is the
lack of authority, and thus of the ability to grow and adapt, as well
as the ability to measure compliance. With some care and feeding,
GEDCOM could be updated to handle most needs, as well as tidy up its
eccentricities and ambiguities.

However, I'd really like to see it move towards XML. I believe that
would allow for better extensibility (namespaces) and a wealth of
tools to parse and manipulate the data. I realize that deprecates the
wealth of experience in existing GEDCOM parsers, but for future
growth I think it's worth it.

The GEDCOM data model is adequate, but I think a richer model that
distinguishes conclusions from extractions and adds detail about the
research process would increase progress in research and
collaboration, and likely spawn a whole new set of tools and services.

However, without "authority" or some sort of community agreement,
moving to XML or new data models is mostly moot, as they will revert
to the same fate as GEDCOM--poor compliance, ambiguity, non-standard
extensions, and the perception (or reality) of lost data.

It's not an easy problem to solve, but the lack of good data transfer
mechanisms really hinders collaboration and hurts everyone.

At worst, GEDCOM needs some modernization and clarification in a
standardized process. At best, I'd like to see a richer more
extensible data model.

Logan

John Finlay

Mar 31, 2008, 11:11:28 AM3/31/08
to beyo...@googlegroups.com
> This brings back to the topic at hand. So, does GEDCOM allow me to say Samuel Askren has a birthdate of 1873 according to the obituary
> and 1872 according to his birth certificate and that one birth certificate is more likely to be the correct one? I don't believe GEDCOM allows
> me to keep my "facts" or evidence separate from my conclusions as Jesper was talking about. I don't believe that standard GEDCOM
> allows me to store conflicting opinions about an individual, though John you mentioned at the FHT workshop that you've been able to create
> a GEDCOM variation to store conflicting opinions about an individual. I would like more information about that.
>

It seems that there is a lot of misconception about what GEDCOM can and cannot do.

GEDCOM already allows for multiple opinions and it is very good at keeping source citations on those opinions. Taking your "Samuel Askren" example and encoding it in GEDCOM you would have the following:
0 @I1@ INDI
1 NAME Samuel /Askren/
2 GIVN Samuel
2 SURN Askren

1 BIRT
2 DATE 1872
2 SOUR @S1@
3 DATA
4 TEXT Text from birth certificate

1 BIRT
2 DATE 1873
2 SOUR @S2@
3 DATA
4 TEXT Text from obituary

I've added the blank lines to this example for readability.

Notice the two BIRT records. In GEDCOM, when you have multiple events of the same type, by convention the event that comes first in the record is the preferred one. This is the one that should appear in any charts or reports where only one is shown.

You'll also notice that the first BIRT record has a source citation pointing to source S1 which in this case would be the birth certificate. The second BIRT is sourced by S2 which would point to the obituary.

At the conference I talked about some changes that I made to further improve this. The changes were largely intended to improve integration and translation between the FamilySearch API XML model and the GEDCOM model and to prevent data loss. You can find a copy of a document describing the alterations needed to support this here:
http://code.google.com/p/php-fsapi/source/browse/trunk/PHP-FamilySearchAPI/FamilySearch%20API%20XML%20to%20GEDCOM%20Mapping.doc

In the area of multiple opinions, the additions that were needed were support for assertion-level modification tracking (modified dates, versioning, and contributors). Basically, the changes allow you to keep track of who submitted each of the two BIRT assertions above and when they were last changed. This information is really only important in a multi-user system like PhpGedView or FamilySearch. As far as your requirement to keep track of multiple assertions based on source citations goes, the pure GEDCOM 5.5 spec already allows for that.
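The "first occurrence is preferred" convention described above is simple to implement. As a rough sketch (my own helper, assuming the record has already been split into lines; it does not handle the full GEDCOM grammar), a program could pick the preferred date like this:

```python
def preferred_event(record_lines, tag="BIRT"):
    """Return the DATE of the first (preferred) occurrence of an event tag.

    By convention, when a record holds several events of the same type,
    the first one listed is the preferred opinion.
    """
    lines = [l.strip() for l in record_lines if l.strip()]
    for i, line in enumerate(lines):
        if line == "1 " + tag:
            # scan the event's subordinate lines for a DATE
            for sub in lines[i + 1:]:
                if sub.startswith("1 ") or sub.startswith("0 "):
                    break  # next structure; stop scanning this event
                if sub.startswith("2 DATE "):
                    return sub[len("2 DATE "):]
    return None
```

Run against the Samuel Askren record above, it would return "1872", the birth-certificate opinion.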

--John


John Finlay

Mar 31, 2008, 11:39:58 AM3/31/08
to beyo...@googlegroups.com
> However, I'd really like to see it move towards XML. I believe that
> would allow for better extensibility (namespaces) and a wealth of
> tools to parse and manipulate the data. I realize that deprecates the
> wealth of experience in existing GEDCOM parsers, but for future
> growth I think it's worth it.
>

I have had this discussion many times too ;)

The *only* advantage that I see XML having over GEDCOM is that XML is natively supported in more of the upcoming RIA frameworks, and that shouldn't be ignored. But other than that, GEDCOM already has many years of development over XML in the family history domain, and it would take a long time for XML to catch up.

GEDCOM is just as extensible as XML. GEDCOM uses about 40% fewer bytes to encode the same data. GEDCOM is already an industry standard. An open community could provide the open tooling that XML currently enjoys; the industry already has these tools, we just need to open them up so that anyone can use them.

So I think we would all get to where we want to be faster by sticking with GEDCOM instead of switching to another technology. If XML really is better then it will naturally emerge into its rightful place. (Though in a recent training on marketing I learned that it is "better to be first than to be better").

I honestly see both XML and GEDCOM having a role in the future of the industry.


--John

wcstarks

Apr 1, 2008, 1:16:02 PM4/1/08
to BeyondGen
Hello everybody.

Steven Law and I (Wade Starks), on the Information Architecture team
for the Family & Church History Department, have been working on a
project which closely relates to the topic being discussed here. We
were tasked to develop a model which would allow us to describe
historical documents in a normalized fashion. Our legacy database
for storing all the extraction data generated for the department
currently has over 800 fields. Many of these fields are really
dealing with the same data types. For example, to store the names of
various individuals in an event record, the following fields and more
have been created.

Principal's Given Name, Principal's Surname, Father's Given Name,
Father's Surname, Mother's Given Name, Mother's Surname, Maternal
Grandfather's Given Name, Maternal Grandfather's Surname, etc. All
are dealing with the names of the individuals implied: their given
names and surnames. Further, we also have fields like Principal's
Baptism Date and Principal's Baptism Place. Then we compound all of
these with actual versions and standard versions, e.g., Principal's
Given Name Actual and Principal's Given Name Standard.

As you can see, in the current legacy models, participants in the
event records are inferred from the naming of the fields. This creates
a nearly unending variety of fields required to describe very few
unique data types. What we did as we developed this normalized model
was to instantiate the individuals implied in the documents, giving
them roles, much as is done in lineage-linked databases and in
GEDCOM. This allowed us to take all the event type, role, and
relationship information out of the field names. This, and other
normalizing principles developed for this model, makes it possible to
describe, store, and communicate information about historical documents
with as few as 18 unique entities.

One result of this model is that all "field" data is stored in a
single entity called Form. The meaning of the data stored in Form is
controlled by Form's type and by the context Form is found in among the
other entities. Not only can this model work in the evidentiary
domain, but it can work in the assertion domain as well. In fact, it
very naturally supports citing sources in the evidentiary domain from
the assertion domain. This model allows us to manage our research in
the evidentiary domain similar to the way we have traditionally
managed our lineage-linked data in the assertion domain.

This model is highly normalized, data-driven, and extensible. It
appears able to describe any historical document. Not only can it
store abstracts of historical documents, it can also perform GEDCOM's
task as a communication vehicle. Being data-driven, it relies
heavily on standardized controlled vocabularies. As mentioned by
another in this forum, a robust, centralized standards authority is
required, not only for this model, but for the future of effective
genealogy data management.

Last month, we started a Google Group, where we posted a PowerPoint
presentation describing this model and the principles fundamental to
it. While these principles are conveniently described in XML, they
need not be restricted to XML encoding; they could just as well be
implemented in a relational database application. This model could be
used to communicate data from one system to another, or GEDCOM could
be modified and enhanced to also support communicating this model.

Please visit

http://groups.google.com/group/evidentiary-document-model?hl=en

and view the presentation to get a high-level perspective of the model
and the principles it supports. The presentation has evolved from the
original Word document, which is also available in the
evidentiary-document-model group. That document, however, needs to be
reworked to match the present state as represented in the
presentation. We are currently reworking it and will post the revised
version as soon as it is completed.

We would appreciate any feedback you might have about the model and
its principles of normalizing the representation of historical
documents, as well as its extensibility.

Wade Starks

j...@daubnet.com

unread,
Apr 1, 2008, 10:45:56 PM
to BeyondGen
Before going too much into the subject, I'd like to introduce myself.
My name is Jörn Daub, author of the genealogy software "Ages!" (By the
way, "Jorn" is perfectly fine if your keyboard lacks German umlauts.)
I've been working with GEDCOM files for ten years now, and would like
to leave some (rather lengthy) comments on the various topics in this
post.

"ZIPPED up GEDCOM" Proposal
I totally agree that there should be a defined way to store
genealogical data along with binary media in a zipped archive format.
I also agree that such a format should have a distinctive file
extension. I am unsure, however, if a dual extension (like .ged.zip)
wouldn't better serve that purpose. Such an extension has the
advantage of being accessible to the operating system and/or standard
zip software, with no further action or knowledge about the format
required by the user.
The disadvantage of this however would be that you cannot associate a
dual extension on MS operating systems. So a double click will not
bring up the file in your genealogy software, but in a ZIP software
instead. I'd like to read your opinions on this.
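The "zipped-up GEDCOM" idea is easy to prototype with ordinary tooling, which is part of its appeal. A minimal sketch follows; the archive layout (`family.ged` at the root, media under `media/`) is my own invention for illustration, not part of any proposed spec.

```python
# Minimal sketch of a "zipped-up GEDCOM": a standard ZIP archive holding
# the GEDCOM file plus its binary media.  Because it is a plain ZIP, a
# dual extension like .ged.zip keeps it openable by ordinary zip tools.
import io
import zipfile

def pack_gedcom(gedcom_text, media):
    """Bundle GEDCOM text and a {filename: bytes} media dict into one archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("family.ged", gedcom_text)        # illustrative layout
        for name, data in media.items():
            zf.writestr("media/" + name, data)
    return buf.getvalue()

archive = pack_gedcom("0 HEAD\n1 GEDC\n2 VERS 5.5\n0 TRLR\n",
                      {"photo1.jpg": b"\xff\xd8..."})
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    print(zf.namelist())   # the .ged plus the media files
```

The file-association problem described above is orthogonal to the format itself: the archive is identical whether it is named `.ged.zip` or given a single dedicated extension that genealogy software registers with the OS.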

Proposed FamilySearch API to GEDCOM mapping
I scanned through the Word document; I will do more in-depth research
at a later point in time. Here's what came to mind while reading it.
[API Mapping / Section: Person]
Not having done anything with the FamilySearch API yet, I'm unsure
what the "version=" attribute does, but it seems reasonable to put a
tag inside of INDI.CHAN if, and only if, that number cannot be
converted to a date/time value pair and the version number serves some
purpose beyond date/time versioning. In your example the version
number and the date/time both end with 771, so I guess the former is
just the same data in a different format. If that is the case, it
should simply be left out of the GEDCOM file. If a tag is needed,
however, it should have an underscore, because GEDC.VERS and CHAR.VERS
both relate to versions of a standard, which has little in common with
record modification versioning; you would actually "overload" two
meanings onto the same tag.
[API Mapping / Section: Information]
I would prefer to use the REFN/TYPE combination with a fixed TYPE
value, such as "FamilySearchAPI" or something of that sort. Parsing
should not be an issue, and such information has a good chance of
surviving when traveling through programs that do not know about the
FSAPI.
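The REFN/TYPE round-tripping argument can be shown concretely. In this sketch the TYPE label "FamilySearchAPI" is the fixed value suggested in the post, while the helper names and the sample ID "XYZ-1234" are my own illustrations:

```python
# Sketch of carrying an external record ID via REFN/TYPE.  Programs that
# know nothing about the FamilySearch API will still round-trip these
# lines, because REFN and TYPE are ordinary GEDCOM tags.
def refn_lines(external_id, type_label="FamilySearchAPI"):
    """Emit the REFN/TYPE sub-structure for one individual."""
    return "1 REFN {}\n2 TYPE {}".format(external_id, type_label)

def find_refn(lines, type_label="FamilySearchAPI"):
    """Recover the external ID on the receiving side."""
    for i, line in enumerate(lines[:-1]):
        if line.startswith("1 REFN ") and lines[i + 1] == "2 TYPE " + type_label:
            return line[len("1 REFN "):]
    return None

sent = refn_lines("XYZ-1234").splitlines()
print(find_refn(sent))   # XYZ-1234
```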
[API Mapping / Assertions]
Adding a CHAN sub-record to SEX, NAME and events makes sense. Since
the information is both syntactically and semantically similar to the
standard GEDCOM tags, I would also use CHAN/DATE/TIME without
underscores. As stated above, I would try to not store the
version="xx" number, and I'd stick with _VERS and _USER instead of re-
using VERS and SUBM in a different context with different meanings.
Using SUBM would make sense if you also intend to include
"u.100000168" as a SUBM record, but if you did, it should read
3 SUBM @u.1000168@. I simply don't understand what the _FSID # is
doing there, but that may be due to my lack of FamilySearch API
experience.
[API Mapping / Family Generation]
I am unsure why you specified how you create the GEDCOM pointers. The
way you do it is perfectly fine, but basically no software should rely
on GEDCOM pointers being constructed in a certain manner, so I don't
see any need to document it publicly. Most programs will start with
@I1@, @I2@, etc., but it could be @JOHN_DOE1@, @JOHN_DOE2@ as well,
without making any difference. These pointers should not be assumed to
have any meaning. Using record IDs for anything but reconstructing
record links is a trap that many programmers have fallen into, and it
is basically asking for trouble.
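A two-pass reader makes the opacity point concrete: collect records keyed by their cross-reference ID, then resolve links purely by dictionary lookup. This is a simplified sketch (it only handles level-0 record lines and ignores CONT/CONC and the rest of the grammar):

```python
# Sketch of treating GEDCOM pointers as opaque keys: whether the ID is
# "@I1@" or "@JOHN_DOE1@" never matters, because links are resolved by
# lookup, not by parsing meaning out of the ID.
import re

def parse_records(gedcom_text):
    """Group level-0 records by xref ID; IDs are opaque dictionary keys."""
    records = {}
    current = None
    for line in gedcom_text.splitlines():
        m = re.match(r"0 (@[^@]+@) (\w+)", line)
        if m:
            current = m.group(1)
            records[current] = {"tag": m.group(2), "lines": []}
        elif current:
            records[current]["lines"].append(line)
    return records

sample = "0 @JOHN_DOE1@ INDI\n1 FAMS @F1@\n0 @F1@ FAM\n1 HUSB @JOHN_DOE1@"
recs = parse_records(sample)

# Resolve the HUSB link by lookup -- no assumptions about the ID's shape.
husb_id = recs["@F1@"]["lines"][0].split()[-1]
print(recs[husb_id]["tag"])   # INDI
```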

Assertions and GEDCOM
There has been much debate about facts, sources, evidence, and
assertions and their corresponding elements in GEDCOM. In my personal
opinion, this whole debate is flawed, because there are no such things
as "facts" in genealogy. It is all about assertions of some sort.
There is not a single type of source that hasn't been proven wrong at
some point in time. Once you accept that there are no facts, only
assertions of varying reliability, the GEDCOM structure does provide
for most of the relevant information. I have yet to be convinced of
what the differences between so-called "facts" and so-called
"assertions" are, and where to draw the line between the two. To me,
it seems that all that is needed is available: INDI, FAM, and events
to store your facts/assertions/conclusions (whatever you want to call
them); source records and pointers to document the underlying data
leading to your "facts"; and binary attachments (MEDI) to store the
evidence itself. GEDCOM does allow for multiple conflicting sets of
information (yes, you can be born twice :-) ), and it does provide
ways to document imprecise information.

The only thing I actually miss here is a way to document disproven
information. Yes, there are a few spots to store such information, but
that's far too little, so for the most part you will need to use notes
to store information that is known to be wrong. Documenting how you
decided between conflicting dates is such an individual task that
probably any data structure would both have its shortcomings and be
overkill. That's why I'd stick to notes for that purpose. Validating
assertions is a task for humans anyhow, not for computer systems, and
humans will be just fine with notes.
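The "born twice" point is literal: nothing stops an INDI record from carrying two BIRT events, each with its own date and source citation. A toy reader that collects all of them, instead of assuming a single "fact", might look like this (the sample dates and source IDs are invented):

```python
# Sketch of GEDCOM's tolerance for conflicting assertions: one individual
# with two BIRT events, each tied to its own source.  A reader simply
# collects them all rather than picking a single "fact".
indi = """1 BIRT
2 DATE 12 MAR 1850
2 SOUR @S1@
1 BIRT
2 DATE ABT 1851
2 SOUR @S2@"""

births = []
current = None
for line in indi.splitlines():
    if line == "1 BIRT":
        current = {}
        births.append(current)
    elif current is not None and line.startswith("2 DATE "):
        current["date"] = line[7:]
    elif current is not None and line.startswith("2 SOUR "):
        current["source"] = line[7:]

print(births)
# [{'date': '12 MAR 1850', 'source': '@S1@'}, {'date': 'ABT 1851', 'source': '@S2@'}]
```

Deciding which of the two assertions to believe is exactly the human-judgment step the paragraph above argues belongs in notes, not in the data structure.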

XML and whatnot
In an ideal world, GEDCOM would be an XML format. But this is not an
ideal world, and... you guessed it, GEDCOM is not XML. While GEDCOM
could easily be mapped to an appropriate XML schema, what you wind up
with are the same problems in a different appearance. Any new format
may solve a lot of problems, but XML in itself will not solve much;
the schema might. And once you move to another schema, you will suffer
from something else: lack of acceptance. Just see the paragraphs above
for what a task it is to map two semantically different systems. That
is why I'd rather take the pragmatic approach and talk about small
details in GEDCOM. In my opinion, that is much more likely to solve
real problems in the real world than discussing a new schema.

Most talk about the deficiencies of GEDCOM and the advantages of XML
comes from people who don't know much about GEDCOM and have not read
or understood what the specs allow for. No, it is not a perfect
format, but it is something you can work with. If you think that ANSEL
was a bit weird, you are right. But if you think ANSEL is a "real
problem"(tm), you are not. Maybe this is just me, but I don't think it
would help all too much if all genealogy programmers started to speak
Esperanto. Yes, XML would be nice, but moving the discussion to that
subject ignores the fact that most genealogy programs can effectively
parse GEDCOM files. The real problems in data transmission are defined
by the differences between the source and target systems' data
structures, not by the intermediate transmission format. XML won't
change that a bit; it is more likely to just add complexity to the
problem.

Thanks for listening... *monologue ends here* :)

Jörn Daub