Beyond GEDCOM 5.5/6.0


wcstarks

Jan 12, 2009, 9:53:45 PM
to Open Ancestry
I was invited to participate in the group. My name is Wade Starks. I
recently retired from the Family History Department, where I worked as
an Information Architect. The current GEDCOM implementation has many
shortcomings. Perhaps its greatest problem is that it cannot deal with
source documents. Communicating a document model from one database to
another via GEDCOM can only be done by converting document data into
a lineage-linked data model. In so doing, information is inevitably
lost. The amount of lost data can be quite extensive depending on the
type of data being transferred.

Take a census record, for example. If the census happens to give
relationships, the problem is not as great as it is with census records
that do not show relationships. In either case there are still
significant problems in attempting to pass along all that is found in
the census record.

The census enumerator went about documenting households, not just
individuals. A household is not, by definition or in practice,
necessarily the same as a family. A household may contain a family,
no family at all, or a combination of family and non-family
individuals. When attempting to communicate a household via GEDCOM,
it becomes difficult to pass on individuals who are not family members
without losing the context of the household. Individuals such as
laborers get passed as individuals fully out of context and are
essentially lost in the receiving database.

Ideally, when building a data model for documenting census records, we
would want to be able to build a grouping record which represents the
household. We can then associate individuals to the household by the
roles/relationships they played in the household. If we wanted to be
even more faithful to the census enumeration, we might also wish to
create another grouping record which represents the dwelling, with
which we can associate multiple households, as in fact often occurs
in the enumerations. Try to pass that along with GEDCOM's
lineage-linked model.
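
As a rough illustration of the grouping idea above, here is a minimal
Python sketch. It is not the model itself: the class names, field names,
role labels and the sample people are all assumptions made up for this
example. It shows a dwelling holding two households, with a non-family
laborer kept in the context of his household.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Persona:
    """A person exactly as one record describes them (hypothetical class)."""
    name: str


@dataclass
class Role:
    """Ties a persona to a group and says what part they played in it."""
    persona: Persona
    label: str  # e.g. "head", "wife", "laborer" (assumed labels)


@dataclass
class Group:
    """A grouping record such as a household or a dwelling."""
    kind: str
    roles: List[Role] = field(default_factory=list)
    subgroups: List["Group"] = field(default_factory=list)

    def members(self) -> List[str]:
        """List every persona in this group and its subgroups, with role."""
        found = [f"{r.persona.name} ({r.label})" for r in self.roles]
        for sub in self.subgroups:
            found.extend(sub.members())
        return found


# Invented sample data: one dwelling containing two households.
first = Group("household", roles=[
    Role(Persona("John Smith"), "head"),
    Role(Persona("Mary Smith"), "wife"),
    Role(Persona("Tom Jones"), "laborer"),  # not family, but still in context
])
second = Group("household", roles=[Role(Persona("Ann Brown"), "head")])
dwelling = Group("dwelling", subgroups=[first, second])

print(dwelling.members())
```

The laborer is attached to the household by his role rather than by a
family link, which is the part a lineage-linked transfer has no place
for.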

In the last couple of years before I retired, I worked on developing a
data model that specifically addresses the needs of documenting source
records of all kinds. The current internal database structures for
managing extraction projects, the IGI and the other systems the FHD
is working on include over 800 fields, and the number is still growing.
Many of these fields are about the same data, but in different contexts.
For example, a given name is not just one given name field. It is many.
There is the principal's given name, the principal's father's given
name, the principal's mother's given name, the principal's father's
father's given name, the principal's father's mother's given name,
and on and on. In other words, part of the name of the field is the
event of the source record and the relationships of the individuals in
the record. Then on top of that we have actual and standard dates and
places, and actual and standard names, for each individual.
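
To make the multiplication concrete, here is a small Python sketch of
how such field names pile up. The event types, relationship paths, name
parts and the naming pattern are invented for illustration; they are not
the actual FHD field names.

```python
from itertools import product

# Hypothetical vocabularies; the real systems have many more of each.
EVENTS = ["christening", "marriage", "burial"]
RELATIONS = [
    "principal",
    "principal_father",
    "principal_mother",
    "principal_father_father",
    "principal_father_mother",
]
PARTS = ["given_name", "surname"]
FORMS = ["actual", "standard"]

# Every combination becomes its own field when context lives in the name.
flat_fields = [
    f"{event}_{relation}_{part}_{form}"
    for event, relation, part, form in product(EVENTS, RELATIONS, PARTS, FORMS)
]

print(len(flat_fields))  # 3 * 5 * 2 * 2 = 60
print(flat_fields[0])    # christening_principal_given_name_actual
```

Even with these tiny lists, two kinds of name part already turn into 60
fields, and each new event type or relationship multiplies the count
again.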

My team was asked to see how we might be able to reduce these 800 plus
fields down to some more manageable number. When we started, we
weren't sure just where this project would take us. Over time a data
model evolved which not only reduced the number of fields
significantly, but could manage lineage-linked data just as easily as
it did source event records. It can also communicate almost any kind of
genealogical information between various systems.

What we ended up with was a model that can describe any event document
we could find with just 17 different classes. These classes are
divided into two groups. One we called authority classes, for lack of
a better term, and the other we called structural classes.

The authority classes we ended up with are: "Record, Event, Date,
Time, Name, Age, Place, Address, Persona, Group".

The structural classes are "Role, Piece, Form, Attribute, Reference,
Label, Note".

With these classes we were unable to find a document we could not
document fully. And it goes well beyond the current implementations
of GEDCOM.

If there is any interest in discussing this model to find out what it
is about and how it might be refined, and validated, I am prepared to
lead the discussion.


Dave Lester

Jan 12, 2009, 11:46:36 PM
to Open Ancestry
Hi Wade,

Welcome to the group. I'm very interested in hearing more about this. By
any chance could you provide us with an example implementation or
diagram?

Best,
Dave Lester

Mathieu Steele

Jan 13, 2009, 12:06:30 AM
to open-a...@googlegroups.com
Wade:

How do you see the future (if any) of GEDCOM in public use?

Obviously with the New FamilySearch, there will tend to be an abandonment of PAF as it becomes less integral.

Moving forward, do you expect an all-purpose data format to be adopted for general consumption?

Mathieu

Dan Hanks

Jan 13, 2009, 4:01:49 AM
to open-a...@googlegroups.com
On Mon, Jan 12, 2009 at 7:53 PM, wcstarks <wcst...@gmail.com> wrote:
> What we ended up with was a model that can describe any event document
> we could find with just 17 different classes. These classes are
> divided into two groups. One we called authority classes, for lack of
> a better term, and the other we called structural classes.
>
> The authority classes we ended up with are: "Record, Event, Date,
> Time, Name, Age, Place, Address, Persona, Group".
>
> The structural classes are "Role, Piece, Form, Attribute, Reference,
> Label, Note".
>
> With these classes we were unable to find a document we could not
> document fully. And it goes well beyond the current implementations
> of GEDCOM.

Can you provide some samples of records and how they would be
represented by this model?

--
Kiva.org - Loans That Change Lives

Annette

Jan 13, 2009, 9:00:13 AM
to Open Ancestry
Wade,

I would like to add my interest to those already expressed in seeing
the details of your source data model.

I am the author of facTree, a modest little application that attempts
to take census data and convert it to GEDCOM. My goal in facTree is to
simply make it easier for a user to enter source data and get it into
their family tree application with ease, and with all genealogical
facts that can be gleaned from it. Because it must use GEDCOM as the
conduit, there are "issues" to overcome, and assumptions must be made.
I would love to see a better data model for describing source data
adopted by the industry.

Annette Harper

wcstarks

Jan 13, 2009, 11:01:59 AM
to Open Ancestry
The data model is described using XML. I have examples from a variety
of source documents, including vital events, a census, a wedding
anniversary announcement, a citation of a chapter in a book, a
manuscript in an archive, a rural cemetery headstone, an obituary, and
a source citation from a lineage-linked record in the assertion domain
to a source document domain, as an example of communicating data. We
have also explored describing source citation templates using an
extension of this model.

This model seems to be able to work in three areas of concern
expressed here: managing 1) source document descriptions (source/event
records), 2) lineage-linked records and 3) communication of this data
from one system to another, all without the need to convert from one
form to another. It should also be able to manage library catalog
system records, with the addition of several new authority classes.

Most current data models are person-centric. This model is
document-centric. As we developed this model, it became immediately
apparent that it would be necessary to get the event, relationship and
function information out of the personal field names. The only
reasonable strategy seemed to be to manage document descriptions much
as we deal with the entities in the lineage-linked models, where all
relationship info is "external" to and removed from the individual
data.

That required making "real" the individuals implied in the documents
and managing the relationships and functions of the individuals in
their Roles. This normalization also extended beyond just the
individuals, to include making "real" the other implied authority
entities in the documents, such as Events, Names, Dates, Places,
Ages, etc.

This model has been extensively normalized, with the result that only
one Piece entity services all the authority entities having multiple
parts. There is also only one entity in which to store all document
data. In the process of developing this model, we chose to fully
normalize the entities so as to better understand the elements of the
model. It is easy enough to back up from this degree of normalization
if needed, e.g., to create separate classes for name pieces, place
pieces and date pieces, instead of using one generic piece for all.
While the Attribute entity in this model handles gender, since gender
is such an important property of individuals, it may be useful to
create a special class dedicated to gender.
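
For anyone trying to picture the single generic Piece, here is a minimal
Python sketch. It is only an illustration: the `Authority` wrapper, the
`kind` labels and the sample values are assumptions for this example and
are not taken from the XML description of the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Piece:
    """One part of a multi-part value; `kind` says which part it is."""
    kind: str   # e.g. "given", "county", "year" (assumed labels)
    value: str


@dataclass
class Authority:
    """Any authority entity with multiple parts, e.g. a Name, Place or Date."""
    cls: str
    pieces: List[Piece] = field(default_factory=list)

    def text(self, sep: str = " ") -> str:
        """Join the piece values in order into one display string."""
        return sep.join(p.value for p in self.pieces)


# The same Piece class serves a name, a place and a date.
name = Authority("Name", [Piece("given", "John"), Piece("surname", "Smith")])
place = Authority("Place", [Piece("town", "Provo"), Piece("state", "Utah")])
date = Authority("Date", [Piece("day", "12"), Piece("month", "Jan"),
                          Piece("year", "1880")])

print(name.text())       # John Smith
print(place.text(", "))  # Provo, Utah
print(date.text())       # 12 Jan 1880
```

Backing off from this degree of normalization would mean replacing
`Piece` with separate name-piece, place-piece and date-piece classes
while leaving the rest alone.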

The key to this more fully normalized schema is in the use of Role to
manage all the relevant functional and relationship information of
individuals in the document. This frees up the representation of each
person in the record to use the same terms for the same type of
attributes, just as is done in current lineage-linked schemas. In
fact, Role can be considered equivalent to the links between
individuals in the lineage-linked model.
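
The equivalence between Roles and lineage links can be sketched in a few
lines of Python. This is a hedged illustration only: the `Role` fields,
the role labels, the event identifier and the people are invented, and
`parent_links` is a hypothetical helper, not part of the model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Role:
    """Says what part a persona played in one event of one record."""
    persona: str
    event: str
    label: str  # e.g. "principal", "father", "mother" (assumed labels)


# Invented sample: the roles found in a single christening record.
roles = [
    Role("William Smith", "christening-1", "principal"),
    Role("John Smith", "christening-1", "father"),
    Role("Mary Smith", "christening-1", "mother"),
]


def parent_links(roles: List[Role]) -> List[Tuple[str, str]]:
    """Derive (parent, child) links from roles that share an event."""
    links = []
    for child in roles:
        if child.label != "principal":
            continue
        for other in roles:
            if other.event == child.event and other.label in ("father", "mother"):
                links.append((other.persona, child.persona))
    return links


print(parent_links(roles))
# [('John Smith', 'William Smith'), ('Mary Smith', 'William Smith')]
```

Each persona is described with the same plain terms; only the Role
records say who is the father and who is the principal.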

Before we get into some examples of this model, I am wondering if we
should not first talk about the various objects, how they work and how
they are used. I also have a PowerPoint presentation we developed
last spring, which shows why the model was developed and how the model
works in a more simplified approach.

How would I post a ppt to this site for all to see? What is the best
way to manage posts to this topic? I just responded to my first post.
I am not sure how to make an additional post at the same level as my
first post.

wcstarks

Jan 13, 2009, 11:07:08 AM
to Open Ancestry
I forgot to mention that this model is only a blueprint, so to speak;
it has not been implemented.

wcstarks

Jan 13, 2009, 11:13:27 AM
to Open Ancestry
I can't speak for the direction the FHD is taking or will take in the
future, but I do expect some form of this model could become an
all-purpose model to describe, manage and transmit all types of data,
genealogical or other. I believe that an implementation of some form of
this model would be much easier for developers to use to transmit their
data than GEDCOM as currently implemented.

wcstarks

Jan 13, 2009, 3:50:25 PM
to Open Ancestry
I have created a new subject, "An Evidentiary Model . . .", where I
have posted all the classes mentioned here. Please view that posting.