Gedcom X

Richard Smith

Apr 27, 2012, 9:16:54 AM

Has anyone here taken the time to investigate the Gedcom X project?
It's aiming to produce a replacement for GEDCOM based on XML and RDF,
and as it is being driven by FamilySearch and therefore has the LDS
behind it, it seems likely that it will gain market traction. From
what I can see, they're fixing a lot of the big problems in GEDCOM.
It seems much more evidence-based, with support for contradictory
evidence, and reasoning based on that. It appears to be completely
extensible. And it has a standard vocabulary that encompasses all of
the existing GEDCOM tags to allow existing GEDCOM files to be
imported.

However, unless I'm missing quite a lot (which is entirely possible),
it appears woefully short on documentation, has no examples, and the
website is full of dead links and holding pages with no content. But
despite all this, the project appears to be quite active with quite a
lot of work going on. Is anyone aware of a decent introduction to it,
ideally with examples?

Richard

Tim Powys-Lybbe

Apr 27, 2012, 10:57:46 AM

I agree that the site is more than a little opaque but I really do think
they are using some fastidious methodology.

There are all sorts of interesting facets in the Self Guided Tour:

<https://github.com/FamilySearch/gedcomx/wiki/Self-Guided-Tour>

I think the project has to be completed before they can provide the
examples that you want.

And there is a feedback page you can complete to express the need
for such at:
<http://familysearch.github.com/gedcomx/feedback/2012-03-23.html>

--
Tim Powys-Lybbe t...@powys.org
for a miscellany of bygones: http://powys.org/

Richard Smith

Apr 27, 2012, 11:59:00 AM

On Apr 27, 3:57 pm, Tim Powys-Lybbe <t...@powys.org> wrote:

> I agree that the site is more than a little opaque but I really do think
> they are using some fastidious methodology.

Yes, it does look like it's probably very well thought through, which
is why I'm persisting in trying to understand it, despite the
difficulty of finding anything on the website. In particular, the
fact that they're using Semantic Web technologies seems very
promising.

> There are all sorts of interesting facets in the Self Guided Tour:
>
> <https://github.com/FamilySearch/gedcomx/wiki/Self-Guided-Tour>
>
> I think the project has to be completed before they can provide the
> examples that you want.

I hope not. They say that they're now at the stage where they're
interested in public feedback, but it's really quite difficult to do
that without being clear what it currently does. It's still early
days, but if they're serious about engaging with the public (and I
believe they are), they do need to make it a bit easier for the public
to see what it's capable of.

> And there is a feedback page you can complete to express the need
> for such at:
> <http://familysearch.github.com/gedcomx/feedback/2012-03-23.html>

Thanks. I've put a request for some simple examples in their bug
tracking system as they mention on the website that requests for
website improvements should go there.

Richard

Tony Proctor

Apr 28, 2012, 5:23:42 AM

"Richard Smith" <ric...@ex-parrot.com> wrote in message
news:b1eede2e-feef-45a0...@fv28g2000vbb.googlegroups.com...

GEDCOM-X is an API project (i.e. actual code) as opposed to a generic data
reference model (from which file formats can be defined).

FS is one of the organisations currently talking with FHISO
(http://fhiso.org) and we hope there will be several press releases in the
very near future. I'll keep you posted as things develop.

Tony Proctor

Richard Smith

Apr 28, 2012, 6:47:12 AM

On Apr 28, 10:23 am, "Tony Proctor" <tony@proctor_NoMore_SPAM.net>
wrote:

> GEDCOM-X is an API project (i.e. actual code) as opposed to a generic data
> reference model (from which file formats can be defined).

Not according to their website. The first sentence of (one of) their
home pages, <http://www.gedcomx.org/Home.html>, reads: "GEDCOM X defines
an open data model and an open serialization format for exchanging the
components of the genealogical proof standard." Yes, there is also
work on APIs, which I'm sure will be valuable, together with open-source
Java that seems to implement them, though right now those are of less
interest to me.

> FS is one of the organisations currently talking with FHISO
> (http://fhiso.org) and we hope there will be several press releases in the
> very near future. I'll keep you posted as things develop.

That's promising news, and I look forward to reading more in due
course.

Richard Smith
Mythic Beasts Ltd

Ian Goddard

Apr 28, 2012, 1:01:45 PM

Richard Smith wrote:
> On Apr 28, 10:23 am, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>
> wrote:
>
>> GEDCOM-X is an API project (i.e. actual code) as opposed to a generic data
>> reference model (from which file formats can be defined).
>
> Not according to their website. The first sentence of (one of) their
> home page,<http://www.gedcomx.org/Home.html> reads: "GEDCOM X defines
> an open data model and an open serialization format for exchanging the
> components of the genealogical proof standard."

But where is the open data model? All I found was a number of XML
schemas and Java source. XML is one of their serialisation formats
(they also mention JSON), not their model and Java is an implementation,
again not a model.

It's not very encouraging when they say that they maintain UML diagrams
by hand and that they're out of sync with the model. It seems that what
they're really doing is hacking XML & Java. I don't call that
developing a data model of any sort, let alone an open one.

--
Ian

The Hotmail address is my spam-bin. Real mail address is iang
at austonley org uk

Richard Smith

Apr 28, 2012, 5:53:17 PM

On Apr 28, 6:01 pm, Ian Goddard <godda...@hotmail.co.uk> wrote:
> Richard Smith wrote:
> > On Apr 28, 10:23 am, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>
> > wrote:
>
> >> GEDCOM-X is an API project (i.e. actual code) as opposed to a generic data
> >> reference model (from which file formats can be defined).
>
> > Not according to their website. The first sentence of (one of) their
> > home page,<http://www.gedcomx.org/Home.html> reads: "GEDCOM X defines
> > an open data model and an open serialization format for exchanging the
> > components of the genealogical proof standard."
>
> But where is the open data model?

Well, there is a bit (and only a bit) of documentation on the data
model on these two pages:

http://www.gedcomx.org/Conclusion-Model.html
http://record.gedcomx.org/Record-Model.html

> All I found was a number of XML
> schemas and Java source.

And RDF schemas too, which is one of the bits I'm most interested in,
as that's higher level than a serialisation format. An RDF schema is
really a machine-readable representation of a data model, rather than
a serialisation format for one. RDF/XML is one serialisation format
commonly used for RDF, and it appears that much of their XML is
supposed to be RDF/XML.

> It's not very encouraging when they say that they maintain UML diagrams
> by hand and that they're out of sync with the model. It seems that what
> they're really doing is hacking XML & Java. I don't call that
> developing a data model of any sort, let alone an open one.

I agree, though I would mitigate that view with a few caveats. First,
there's clearly a lot of content on the site that's hard to find, so
there may be stuff on the data model side that we haven't found.
Second, the project has only been public for two months, but has been
in development for a lot longer: there may still be stuff that is not
currently publicly accessible. Third, given who's behind it, it seems
that this is likely to go somewhere, even if it does turn out to be
half-baked. And finally, since starting this thread, I've had an
(off-list) email from someone involved in the Gedcom X development, and
that seemed quite positive. So I'm prepared to give it the benefit of
the doubt at the moment.

Richard

Ian Goddard

Apr 29, 2012, 6:14:16 AM

Richard Smith wrote:
> On Apr 28, 6:01 pm, Ian Goddard<godda...@hotmail.co.uk> wrote:
>> Richard Smith wrote:
>>> On Apr 28, 10:23 am, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>
>>> wrote:
>>
>>>> GEDCOM-X is an API project (i.e. actual code) as opposed to a generic data
>>>> reference model (from which file formats can be defined).
>>
>>> Not according to their website. The first sentence of (one of) their
>>> home page,<http://www.gedcomx.org/Home.html>

Wow. That page has changed dramatically since I looked previously by
the inclusion of a diagram which actually gives an overview of their
thinking.

>>> reads: "GEDCOM X defines
>>> an open data model and an open serialization format for exchanging the
>>> components of the genealogical proof standard."
>>
>> But where is the open data model?
>
> Well, there is a bit (and only a bit) of documentation on the data
> model on these two pages:
>
> http://www.gedcomx.org/Conclusion-Model.html
> http://record.gedcomx.org/Record-Model.html

Thanks, I'd missed that. It separates Persona (record) from Person
(conclusion) which is a good start.

>> All I found was a number of XML
>> schemas and Java source.
>
> And RDF schemas too, which is one of the bits I'm most interested in,
> as that's higher level than a serialisation format. An RDF schema is
> really a machine-readable representation of a data model, rather than
> a serialisation format for one. RDF/XML is one serialisation format
> commonly used for RDF, and it appears that much of their XML is
> supposed to be RDF/XML.

Nevertheless, we have several versions of the same thing. Which is
the primary one?

>> It's not very encouraging when they say that they maintain UML diagrams
>> by hand and that they're out of sync with the model. It seems that what
>> they're really doing is hacking XML & Java. I don't call that
>> developing a data model of any sort, let alone an open one.
>
> I agree, though I would mitigate that view with a few caveats. First,
> there's clearly a lot of content on the site that's hard to find, so
> there may be stuff on the data model side that we haven't found.
> Second, the project has only been public for two months, but has been
> in development for a lot longer: there may still be stuff that is not
> currently publicly accessible. Third, given who's behind it, it seems
> that this is likely to go somewhere, even if it does turn out to be
> half-baked. And finally, since starting this thread, I've had an
> (off-list) email from someone involved in the Gedcom X development, and
> that seemed quite positive. So I'm prepared to give it the benefit of
> the doubt at the moment.

My concern is that the process seems to be largely a matter of writing
implementations and back fitting a model to them. This might be a good
process for some types of development but I don't think it's a good one
here.

I think the risk is that we end up with a data model which depends on
programmers' convenience and maybe even presentation. I'd much prefer to
start with an abstract data model which arises from a consideration of a
large and varied body of sample data. That may be difficult to program
but much better that software developers solve the problems than
throwing them over the wall to users who are then left trying to
force-fit data onto a model that they don't really fit at all.

Tony Proctor

Apr 29, 2012, 6:22:44 AM

"Richard Smith" <ric...@ex-parrot.com> wrote in message

news:57fc00ab-0f9e-462f...@w5g2000vbp.googlegroups.com...

-----------------

A standard reference model is a generic description of the structure of the
data, and should be at the core of all subsequent physical manifestations of
the data (or physical data schemas), including file formats. This is the
primary goal of FHISO. Those physical manifestations will probably
constitute subsequent standards, or appendices to the reference model
standard.

A physical manifestation will include serialisation formats (amongst
others). A serialisation format might be used for wire transmission or for
file storage. XML is a good example and is used in both wire transmission
(e.g. XMLHTTP) and file storage. Hence, file formats can be considered a
subset of the more general serialisation formats.
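
To make the distinction between a model and its serialisations concrete, here is a minimal sketch in Python. The Person class is a hypothetical, deliberately tiny model invented for illustration; it is not the GEDCOM X model or anything FHISO has defined. The point is only that the same in-memory structure can be written out as JSON or as XML without the model itself changing.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class Person:
    # Hypothetical two-field model, for illustration only.
    name: str
    gender: str


def to_json(person):
    """Serialise the model as a JSON object."""
    return json.dumps(asdict(person), sort_keys=True)


def to_xml(person):
    """Serialise the same model as an XML element with one child per field."""
    root = ET.Element("person")
    for field_name, value in asdict(person).items():
        ET.SubElement(root, field_name).text = value
    return ET.tostring(root, encoding="unicode")


john = Person(name="John", gender="male")
as_json = to_json(john)
as_xml = to_xml(john)
```

Either string could be sent over the wire or stored in a file; which one is used is a decision about the physical manifestation, not about the model.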

Physical manifestations may also include indexed systems, either in memory
or in a database. These are not serialisation formats but can still be
designed around a standard reference model.

An API would usually be a Web service in this context, and would be used to
front an indexed manifestation of the data. Hence, the structure of an API
is also impacted by a standard reference model. This is the basic thread
being discussed between FHISO and FS.

RDF is a method of 'semantic tagging' used by the Semantic Web. If
genealogical data is exchanged between genealogical software then such
tagging is not required - they each understand the data through the data
tags that are defined by its schema. However, generic, non-genealogical
software needs a standard set of semantic tags added to the data to give it
a clue as to how to handle (e.g. search, correlate, & combine) that, and
other forms of, data. This therefore includes genealogical data that may be
part of the Semantic Web. However, it is arguable whether that should be
considered a separate physical manifestation, and so the RDF tags would be
inappropriate in other physical manifestations such as file formats. There
are other semantic tagging schemes (e.g. Dublin Core which is relevant to
citation references in our world) and so the overall design has to be
thought out very carefully.

If you're as interested in all this as I am, then I would strongly recommend
approaching FHISO regarding membership. We could tease out lots of good
stuff here but that's one of the issues that FHISO is trying to rectify -
i.e. uncoordinated discussions, proposals, and research. :-)

Tony Proctor

Tim Powys-Lybbe

Apr 29, 2012, 6:42:13 AM

On 29 Apr at 11:22, "Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote:

> A standard reference model is a generic description of the structure
> of the data,

"The data"? Which data? Which process? What is the context? What is
the objective? Until these are stated, we cannot begin to say whether
the data is even appropriate to the purpose.

> and should be at the core of all subsequent physical manifestations of
> the data (or physical data schemas), including file formats. This is
> the primary goal of FHISO. Those physical manifestations will probably
> constitute subsequent standards, or appendices to the reference model
> standard.

Agreed, of course. But secondary to the purpose of the data.

Tony Proctor

Apr 29, 2012, 6:55:11 AM

"Tim Powys-Lybbe" <t...@powys.org> wrote in message
news:mpro.m38lqd0l...@powys.org...

You're right Tim. I'm writing from the idealistic view that the FHISO
members had already done the necessary research to create a 'requirements
catalog' - one that will support genealogy for at least another decade, and
for all cultures and all time periods. That's a huge task in itself, and one
which FHISO believes needs to be done by the community rather than any
single contributor, and requires input from people with relevant expertise
and technical skills, in addition to people from those relevant cultures. I
wouldn't profess to know, for instance, all the requirements of an East
Asian culture.

...I even struggle with English some days ;-)

Tony Proctor

Tim Powys-Lybbe

Apr 29, 2012, 7:12:42 AM

On 29 Apr at 11:55, "Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote:

> "Tim Powys-Lybbe" <t...@powys.org> wrote in message
> news:mpro.m38lqd0l...@powys.org...
> > On 29 Apr at 11:22, "Tony Proctor" <tony@proctor_NoMore_SPAM.net>
> > wrote:
> >
> > > A standard reference model is a generic description of the
> > > structure of the data,
> >
> > "The data"? Which data? Which process? What is the context? What
> > is the objective? Until these are stated, we cannot begin to say
> > whether the data is even appropriate to the purpose.
>
> You're right Tim. I'm writing from the idealistic view that the FHISO
> members had already done the necessary research to create a
> 'requirements catalog' - one that will support genealogy for at least
> another decade, and for all cultures and all time periods.

Aha, this is beginning to resolve some of a puzzle. I had assumed that
the purpose of the GEDCOM analysis was to facilitate data transfer
between systems.

What you are suggesting is that the purpose of GEDCOM is to hold all
data relevant to a genealogy system. The system will hold all records
of as many people as are required. The context of the data to be held
is identifying the people, to some extent what they did in their lives
and in particular their ancestry and descendants.

Curiously the GEDCOMX website pages include a few statements that
various LDS events are not part of this database and are the subject of
a separate study. So, how much closer can we get to the purpose of the
GEDCOMX database?

And, far more importantly for me, is there any way the GEDCOMX study,
whatever it is, can be extended to cover portability of data between
systems?

I have a strong suspicion that portability will have to have a few
constraints on the data entities and attributes to be included.

(I've remembered to use the spelling checker this time!)

Ian Goddard

Apr 29, 2012, 8:23:23 AM

Tim Powys-Lybbe wrote:
> On 29 Apr at 11:55, "Tony Proctor"<tony@proctor_NoMore_SPAM.net> wrote:
>
>> "Tim Powys-Lybbe"<t...@powys.org> wrote in message
>> news:mpro.m38lqd0l...@powys.org...
>>> On 29 Apr at 11:22, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>
>>> wrote:
>>>
>>>> A standard reference model is a generic description of the
>>>> structure of the data,
>>>
>>> "The data"? Which data? Which process? What is the context? What
>>> is the objective? Until these are stated, we cannot begin to say
>>> whether the data is even appropriate to the purpose.
>>
>> You're right Tim. I'm writing from the idealistic view that the FHISO
>> members had already done the necessary research to create a
>> 'requirements catalog' - one that will support genealogy for at least
>> another decade, and for all cultures and all time periods.
>
> Aha, this is beginning to resolve some of a puzzle. I had assumed that
> the purpose of the GEDCOM analysis was to facilitate data transfer
> between systems.
>
> What you are suggesting is that the purpose of GEDCOM is to hold all
> data relevant to a genealogy system.

This goes back to your original question of "which data". I don't think
the development of the existing line of GEDCOMs even considered
that question fully, or at least not until it was too late to do
anything about it. The consequence is that you can look at a GEDCOM
entry and not know whether it tells you: (a) what an original record
said, (b) what someone thought an original record said or (c) what
someone has concluded from a consideration of a lot of records which
you're not being told about.

You can't actually communicate data without having some sort of data
model even if it's one that's simply implicit in the data format.
Existing GEDCOM is in the situation of having such an implicit model and
that implicit model isn't really adequate. What's worse, that implicit
model has, I think, influenced the thinking of software designers.*

In order to get a transfer format that communicates clearly we have to
go back and do things in the right order which is to understand that
data, produce a good model on the basis of our understanding and then
design a transfer format on top of that.

*Do you have S/W which requires you to "merge" "people" from various
records when you think that the records belong to the same historical
person? The concept of merging people has no basis in reality so your
S/W shouldn't expect you to do it. If your S/W does that I think it's
likely that it's because existing GEDCOM doesn't make the distinctions I
mentioned in my first paragraph and the S/W developer has followed that.
If you're using something like that how easy is it to disentangle the
situation when you change your mind and realise that some of your merges
were wrong?

Tony Proctor

Apr 29, 2012, 8:30:11 AM

"Tim Powys-Lybbe" <t...@powys.org> wrote in message

news:mpro.m38n560m...@powys.org...

You're almost there Tim. The new reference model (& its physical
manifestations) would not be called GEDCOM, though, and not even look like
GEDCOM. Supporting 'family history' in the general sense is crucially
important these days, and that on its own suddenly increases the scope
dramatically. I wouldn't expect FS to say they've accounted for all the
requirements, any more than any other researcher could.

A basic requirement is that there's a well-defined way of converting a
GEDCOM dataset into the new form but that should go without saying.

That new standard has to be open, free, culturally neutral,
locale-independent, and developed through consensus.

Tony Proctor

Ian Goddard

Apr 29, 2012, 8:31:49 AM

I just found myself writing "existing GEDCOM" several times which covers
everything up to and including GEDCOM 5.x (and 6, except that never
really got off the ground). We really need a shorthand to differentiate
all these from GEDCOM X. How about GEDCOM E?

Ian Goddard

Apr 29, 2012, 8:34:00 AM

Tony Proctor wrote:
>
> A basic requirement is that there's a well-defined way of converting a
> GEDCOM dataset into the new form but that should go without saying.

As a piece of evidence which needs to be evaluated before being used as
the basis for conclusions, and no more.

Tony Proctor

Apr 29, 2012, 8:38:53 AM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:a04q5r...@mid.individual.net...

You're getting very specific Ian, although I see your point. I believe a
standard reference model should not be biased with regard to the way the
data was collected, processed, or any specific software product. I'd
already made this point in my own research before I joined FHISO
(www.parallaxview.co/familyhistorydata/research-notes/musings-standardisation).

Similarly with the concept of 'persona' and the merging of "evidence
persons"
(www.parallaxview.co/familyhistorydata/research-notes/evidence-conclusion),
although there are more varied viewpoints on the BetterGEDCOM wiki
(http://bettergedcom.wikispaces.com/).

All of this still has to be discussed for a new standard.

Tony Proctor

Tony Proctor

Apr 29, 2012, 8:56:15 AM

"Richard Smith" <ric...@ex-parrot.com> wrote in message

news:b1eede2e-feef-45a0...@fv28g2000vbb.googlegroups.com...

Probably a bit OT but did anyone read the article on GEDCOM in Your Family
History last month? FHISO got permission to make a pdf copy available on
their blog at:
http://fhiso.org/2012/04/did-someone-say-download-more-on-building-a-bettergedcom-april-2012/.

Tony Proctor

Tim Powys-Lybbe

Apr 29, 2012, 9:16:50 AM

I can't see what the difference is between (a) and (b). All statements
are made by people and thus represent their thoughts.

> You can't actually communicate data without having some sort of data
> model even if it's one that's simply implicit in the data format.
> Existing GEDCOM is in the situation of having such an implicit model
> and that implicit model isn't really adequate. What's worse, that
> implicit model has, I think, influenced the thinking of software
> designers.*
>
> In order to get a transfer format that communicates clearly we have to
> go back and do things in the right order which is to understand that
> data, produce a good model on the basis of our understanding and then
> design a transfer format on top of that.

And do you think that is being achieved within the GEDCOMX site? If
not, please make some corrections. (I am not yet clear enough on what
the purpose of the data model is to be to allow myself to make any
serious comments on its content.)

> *Do you have S/W which requires you to "merge" "people" from various
> records when you think that the records belong to the same historical
> person? The concept of merging people has no basis in reality so your
> S/W shouldn't expect you to do it. If your S/W does that I think it's
> likely that it's because existing GEDCOM doesn't make the distinctions
> I mentioned in my first paragraph and the S/W developer has followed
> that.
>
> If you're using something like that how easy is it to disentangle the
> situation when you change your mind and realise that some of your
> merges were wrong?

I have never merged other people's GEDCOMs. To do so is a travesty of
genealogical research and just has to lead to problems. I think any new
standard which includes transferring data between systems should
explicitly declare that it is not for merging data from any other
database.

But I think this discussion is certainly making progress for me and I
might make another stab at putting something somewhere on the GEDCOMX
site. So thanks.

singhals

Apr 29, 2012, 9:20:53 AM

to gen...@rootsweb.com

The underlying pitfall there is -- while the developer can
be as anal-retentive as he likes in setting up his Utopian
model, no one can force the user to abide by it. Once that
fact is acknowledged, then transferring data between any two
users will be fraught with the same uncertainty as it is
now. Because, no matter how you slice it, (a) and (b) are
implicit in any original document as is (d) what someone
thought the record SHOULD HAVE said.

Since there's little point to a system which locks out users
...?

Cheryl

Ian Goddard

Apr 29, 2012, 9:54:55 AM

OK, here's an example.

There's an entry in the local chapel registers which says simply:

"Wife of John Goddard ch" (a)

This appeared on the old IGI as the wife of John Goddard being baptised
as Christiana (b).

I think you'll agree there's a difference between (a) and (b) in that
the original doesn't say baptised and it doesn't give her name. The
register, by the way, is entitled "Christenings and Churchings".

Had GEDCOM provided a means of uploading what the record actually said
then, although it might have left a lot of people puzzled, it wouldn't
have left them misinformed.
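
As a rough sketch of what that might look like, the Python below keeps the verbatim text of the register entry separate from any interpretation placed on it, and records who made the interpretation. The class and field names are invented for illustration and are not part of any GEDCOM version.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interpretation:
    author: str   # who read the record this way
    claim: str    # what they took it to mean


@dataclass
class RecordEntry:
    source: str
    transcript: str   # what the record actually says, verbatim
    interpretations: list = field(default_factory=list)


entry = RecordEntry(
    source='Chapel register, "Christenings and Churchings"',
    transcript="Wife of John Goddard ch",
)

# The IGI reading is stored alongside the transcript, not in place of it.
entry.interpretations.append(
    Interpretation(author="IGI", claim="Wife of John Goddard baptised as Christiana")
)
```

Anyone receiving the entry can still see that the name and the baptism come from the interpretation and not from the register, and other readings can be added beside it.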

>
> I have never merged other people's GEDCOMs. To do so is a travesty of
> genealogical research and just has to lead to problems.

It's not a matter of merging GEDCOMs. It's a matter of what happens
when, in the same database, you acquire a record which says "John, son
of William Smith bapt" and then, twenty years further on, there's another
record which says "John Smith and Mary Brown nupt" followed a year or so
later by "Robert son of John Smith bapt". You now have three records
which all name John Smith. If, following GEDCOM's model, your S/W
doesn't differentiate between name and person you actually have three
*person* records in your database, all named "John Smith". If you reach
the conclusion that all these names refer to the same person John Smith
and the S/W acts as described it will probably require you to merge (or
some equivalent term) two of these John Smiths into the remaining one.
The messiness of this is revealed if you then discover the burial of
John, son of William Smith 6 weeks after the baptism. As you say, this
is a travesty of genealogical research but seems to be how some S/W works.
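
One possible alternative to merging, sketched in Python under my own assumptions: each mention in a record stays as an unalterable entry, and a concluded person is just a set of links to those entries. The class names echo the persona/person split mentioned earlier in the thread, but the code is illustrative only, and the record identifiers R1 to R3 are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Persona:
    """One mention of a name in one record; never altered by conclusions."""
    record_id: str
    text: str


@dataclass
class Person:
    """A conclusion: the record mentions believed to be one individual."""
    label: str
    evidence: set = field(default_factory=set)

    def link(self, persona):
        self.evidence.add(persona.record_id)

    def unlink(self, persona):
        self.evidence.discard(persona.record_id)


baptism = Persona("R1", "John, son of William Smith bapt")
marriage = Persona("R2", "John Smith and Mary Brown nupt")
son = Persona("R3", "Robert son of John Smith bapt")

# Initial conclusion: all three records refer to the same John Smith.
john = Person("John Smith")
for persona in (baptism, marriage, son):
    john.link(persona)

# The burial six weeks after the baptism shows R1 is a different John.
john.unlink(baptism)
infant = Person("John Smith (died in infancy)")
infant.link(baptism)
```

Because nothing was merged, changing your mind is a matter of removing one link; the three record entries are exactly as they were.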

Ian Goddard

Apr 29, 2012, 10:30:12 AM

singhals wrote:
> The underlying pitfall there is -- while the developer can be as
> anal-retentive as he likes in setting up his Utopian model, no one can
> force the user to abide by it.

Can we agree on a hierarchy of reliability:

Best, an image of the document.

Next best, a transcript of the document.

Not so good, a restricted interpretation of what the document said.

If we're agreed on that, tell me what's anally-retentive about a system
that enables best and/or next best to be used as opposed to one that
enforces not so good.
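
For what it's worth, that hierarchy is trivial to express in software. A small Python sketch, with names of my own choosing:

```python
from enum import IntEnum


class Form(IntEnum):
    # A higher value means closer to the original document.
    INTERPRETATION = 1
    TRANSCRIPT = 2
    IMAGE = 3


def best_available(forms):
    """Pick the most reliable form of a record that is actually held."""
    return max(forms)


held = [Form.INTERPRETATION, Form.TRANSCRIPT]
best = best_available(held)
```

A system built this way never forces the user down to the interpretation; it only falls back to it when nothing better is held.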

Richard Smith

Apr 29, 2012, 12:30:52 PM

On Apr 29, 11:22 am, "Tony Proctor" <tony@proctor_NoMore_SPAM.net>
wrote:

> RDF is a method of 'semantic tagging' used by the Semantic Web.

No. That's just one use of RDF. At its most general, RDF is no more
than a way of saying thing S (the subject) is related in way P (the
predicate) to another thing O (the object). Effectively it's just a
formalism for putting statements in a computer-readable form. For
example, the English statement "John is 23" which we can ordinarily
infer means "the male person whose name is John is aged 23 years"
could be represented as a series of four RDF statements:

Subject   Predicate    Object
---------------------------------
X         is a         Person
X         has gender   male
X         has name     John
X         has age      23 years

It's not just properties, either. I can also write "John's mother,
Mary, is 57", thereby linking two people X (John) and Y (Mary):

Subject   Predicate    Object
---------------------------------
X         has parent   Y
Y         has gender   female
Y         has name     Mary
Y         has age      57 years

By itself RDF is so abstract as to be almost useless. Nothing stops
me from saying that John's gender is "cactus", that his age is "21
January 2101", and that his mother is the word "blue".
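
To show how little machinery is involved, here is a minimal sketch of the statements above as plain Python tuples. It uses no RDF library; "X" and "Y" are just placeholder identifiers, and the objects() helper is something I've made up for the example.

```python
# Each statement is a (subject, predicate, object) triple.
# "X" and "Y" are placeholder identifiers for John and Mary.
triples = [
    ("X", "is a", "Person"),
    ("X", "has gender", "male"),
    ("X", "has name", "John"),
    ("X", "has age", "23 years"),
    ("X", "has parent", "Y"),
    ("Y", "has gender", "female"),
    ("Y", "has name", "Mary"),
    ("Y", "has age", "57 years"),
]


def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the "has parent" link from John to find his mother's name.
mother = objects("X", "has parent")[0]
mother_name = objects(mother, "has name")[0]

# Nothing at this level rejects nonsense: this is accepted like any other triple.
triples.append(("X", "has gender", "cactus"))
```

The last line is the problem in miniature: the list now asserts two genders for John, one of them a cactus, and nothing complains.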

That's where RDF Schema comes in. You can write an RDF Schema that
defines what the various terms mean and the constraints on their use.
So you might say that "has parent" must have a person as both its
subject and its object, and that the value (object) of "has age" must
be a duration. This is the stage where you make data model decisions
like does a person have an age, or does a person have a set of
characteristics that may include an age? In UML terms, is 'age' a
property in the 'person' box, or is it in a separate 'characteristics'
box linked to possibly more than one person box? Effectively, the RDF
Schema acts like UML in more traditional data modelling; indeed, the
two can be thought of as basically the same thing but in a different
format. The W3C discuss this relationship more thoroughly here:

http://www.w3.org/TR/NOTE-rdf-uml/

So as soon as we have a data model, we more or less have an RDF
Schema. (The "more or less" is because an RDF Schema forces you to
name certain concepts that might have been anonymous in your abstract
data model.) And once we have the RDF Schema for the data model, any
data conforming to the data model can be expressed as RDF conforming
to the RDF Schema.
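
A rough sketch of the idea in Python, again without any RDF library. The schema, the type table and all the names in it are hypothetical and are not taken from GEDCOM X. Note also that I'm treating the rules as constraints to be checked, which is a simplification: as I understand it, RDF Schema proper uses domain and range declarations to infer what type a thing is, and leaves rejection to other tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyRule:
    subject_type: str   # what kind of thing the subject must be
    object_type: str    # what kind of thing the object must be


# Hypothetical schema covering two predicates.
SCHEMA = {
    "has parent": PropertyRule(subject_type="Person", object_type="Person"),
    "has age": PropertyRule(subject_type="Person", object_type="Duration"),
}

# What kind of thing each identifier or value is known to be.
TYPES = {"X": "Person", "Y": "Person", "23 years": "Duration", "blue": "Word"}


def violations(triples):
    """Return the triples that break a subject-type or object-type rule."""
    bad = []
    for s, p, o in triples:
        rule = SCHEMA.get(p)
        if rule is None:
            continue  # predicate not covered by this schema
        if TYPES.get(s) != rule.subject_type or TYPES.get(o) != rule.object_type:
            bad.append((s, p, o))
    return bad


data = [
    ("X", "has parent", "Y"),
    ("X", "has age", "23 years"),
    ("X", "has parent", "blue"),   # the parent is a word, not a person
]
rejected = violations(data)
```

Only the statement that makes a word someone's parent is rejected; the other two conform to the schema.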

Why would you want to do that? One reason is that by doing so, you
can immediately leverage existing RDF technologies. For example, RDF
comes with several exchange formats, the two principal ones being
RDF/XML and N3. That immediately gives you the new genealogical file
format, and you can start using existing parsers for it.

There are existing RDF query tools that allow you to do sophisticated
searching in conjunction with other existing data sources. For
example you could write "find all people called John Smith baptised
between 1780 and 1790 in a parish within twenty miles of Dunny-on-the-
Wold". That could use a copy of the Ordnance Survey's RDF on parish
locations and boundaries (downloadable from their website), to find
out whether Inverdunnaidh is within twenty miles of Dunny-on-the-Wold
and whether therefore to include it in the result set.
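
To give a flavour of that query, here is a stand-in written in plain Python. A real RDF toolchain would use a query language over the actual data; this sketch does neither. The coordinates, the baptism records and the parish "Farawayton" are all invented for illustration, and no Ordnance Survey data is involved.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical (latitude, longitude) pairs; every figure here is invented.
PARISHES = {
    "Dunny-on-the-Wold": (52.60, 1.20),
    "Inverdunnaidh": (52.70, 1.35),
    "Farawayton": (54.00, -2.00),
}

# Hypothetical baptism records: (name, year, parish).
BAPTISMS = [
    ("John Smith", 1783, "Inverdunnaidh"),
    ("John Smith", 1785, "Farawayton"),
    ("John Smith", 1795, "Dunny-on-the-Wold"),
    ("Mary Brown", 1784, "Dunny-on-the-Wold"),
]

EARTH_RADIUS_MILES = 3958.8


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def find(name, first_year, last_year, centre, radius_miles):
    """Baptisms of `name` in the year range, in parishes within the radius."""
    origin = PARISHES[centre]
    return [
        (n, y, p)
        for n, y, p in BAPTISMS
        if n == name
        and first_year <= y <= last_year
        and miles_between(origin, PARISHES[p]) <= radius_miles
    ]


matches = find("John Smith", 1780, 1790, "Dunny-on-the-Wold", 20)
```

With these made-up figures only the 1783 baptism at Inverdunnaidh qualifies: the Farawayton one is too far away and the 1795 one is too late.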

And you get all that for free. To me the question isn't why would you
want to think of genealogical data in terms of RDF, but rather why
wouldn't you? And I think the main answer is that RDF is unfamiliar
to many people.

> If
> genealogical data is exchanged between genealogical software then such
> tagging is not required - they each understand the data through the data
> tags that are defined by its schema.

I think perhaps you're confusing RDFa, which is a specific means of
applying RDF tags to other documents (typically HTML, but in principle
anything XML-based) to pick out semantic information, with RDF, which
is a general framework, or with RDF/XML, which is the standard XML-based
serialisation of RDF. For example, the following bit of XML is (once a
default namespace is declared on the root element) a perfectly good
RDF/XML serialisation of the four RDF statements I gave
earlier on for "the male person whose name is John is aged 23 years":

<person>
<name>John</name>
<gender>male</gender>
<age>23</age>
</person>

I rather suspect you'd end up with something looking quite like that
even if you didn't know the first thing about RDF.
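
To make the link between the XML and the underlying statements concrete, here is a minimal Python sketch that writes them as subject/predicate/object triples. The four statements are not reproduced in this message, so the sketch assumes they are one type statement plus the three properties; the identifier "_:p1" and the predicate names are hypothetical and simply mirror the element names above.

```python
# "_:p1" stands for an anonymous node representing the person.
# Predicate names are assumptions that mirror the XML element names.
triples = [
    ("_:p1", "type", "person"),
    ("_:p1", "name", "John"),
    ("_:p1", "gender", "male"),
    ("_:p1", "age", 23),
]


def describe(subject, statements):
    """Collect every predicate/object pair asserted about one subject."""
    return {pred: obj for subj, pred, obj in statements if subj == subject}


person = describe("_:p1", triples)
print(person)
```

Each child element of the XML corresponds to one triple, and the enclosing element supplies the type statement.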

> If you're as interested in all this as I am, then I would strongly recommend
> approaching FHISO regarding membership.

Thanks for the suggestion. I have now done so.

Richard

Richard Smith

Apr 29, 2012, 12:47:03 PM

On Apr 29, 11:14 am, Ian Goddard <godda...@hotmail.co.uk> wrote:

> My concern is that the process seems to be largely a matter of writing
> implementations and back fitting a model to them. This might be a good
> process for some types of development but I don't think it's a good one
> here.
>
> I think the risk is that we end up with a data model which depends on
> programmers convenience and maybe even presentation. I'd much prefer to
> start with an abstract data model which arises from a consideration of a
> large and varied body of sample data. That may be difficult to program
> but much better that software developers solve the problems than
> throwing them over the wall to users who are then left trying to
> force-fit date onto a model that they don't really fit at all.

Conversely, the danger of developing the data model in vacuo and only
later implementing it is that it turns out to be too cumbersome to use
in the simpler real-world examples, or sufficiently hard to program
that no-one is willing to do it. Arguably it is issues such as these
that have led to the Gentech data model largely being ignored,
despite seeming very well thought out on paper. (And arguably the
other major factor that's led to it being ignored is that it doesn't
have a data exchange format.)

I think my preferred strategy is to have example code to test the data
model while developing it. Gedcom X seem to be doing that in Java
which isn't what I would have chosen, but nor is it obviously a bad
choice.

Richard

singhals

Apr 29, 2012, 3:28:23 PM

to gen...@rootsweb.com

Ian Goddard wrote:
> singhals wrote:
>> The underlying pitfall there is -- while the developer can be as
>> anal-retentive as he likes in setting up his Utopian model, no one can
>> force the user to abide by it.
>
> Can we agree on a hierarchy of reliability:
>
> Best, an image of the document.
>

Assuming said image hasn't been photo-shopped or otherwise
enhanced, and is otherwise legible, OK, that's best.

> Next best, a transcript of the document.
>

Assuming the transcription was done by someone competent
with experience in the language and handwriting of the
document, OK.

> Not so good, a restricted interpretation of what the document said.
>

Probably.

But your hierarchy, which I have just effectively endorsed,
tends to assume that all parties are equally competent.
That is not in universal evidence. If the person making the
restricted interpretation is more experienced in the era,
handwriting, and language than the person doing the
transcription of the document, then the interpretation will
outrank the transcription.

> If we're agreed on that, tell me what's anally-retentive about a system
> that enables best and/or next best to be used as opposed to one than
> enforces not so good.
>

I've already said: the end user cannot be forced to abide by
the developer's definitions. So long as user 1 _assumes_
that everyone else is using "best", but user 2 is grateful
to have "not-so-good" there is going to be miscommunication
that can be blamed on the transfer media (i.e.,
GED-whatever). And these surety-ranking systems suck...it
is possible to be both POSITIVE and WRONG. Ask the woman
who said "Yeah I'm sure -- del *.*" thinking she was on A:
not C:

Cheryl

Richard Smith

Apr 29, 2012, 5:04:01 PM

On Apr 29, 8:28 pm, singhals <singh...@erols.com> wrote:
> Ian Goddard wrote:
> > singhals wrote:
> >> The underlying pitfall there is -- while the developer can be as
> >> anal-retentive as he likes in setting up his Utopian model, no one can
> >> force the user to abide by it.
>
> > Can we agree on a hierarchy of reliability:
>
> > Best, an image of the document.
>
> Assuming said image hasn't been photo-shopped or otherwise
> enhanced, and is otherwise legible, OK, that's best.
>
> > Next best, a transcript of the document.
>
> Assuming the transcription was done by someone competent
> with experience in the language and handwriting of the
> document, OK.
>
> > Not so good, a restricted interpretation of what the document said.
>
> Probably.
>
> But your hierarchy, which I have just effectively endorsed,
> tends to assume that all parties are equally competent.
> That is not in universal evidence. If the person making the
> restricted interpretation is more experienced in the era,
> handwriting, and language than the person doing the
> transcription of the document, then the interpretation will
> outrank the transcription.

All very good points. And there are reasons why a genuinely less
accurate form can be advantageous if it allows you to do something
that can't be done with the original. I can use a computer to search
the text in a transcription, but I can't do that with an image, for
example.

The fact that different versions of a resource have different
qualities is a reason why it's important to distinguish them. I want
my system to be able to store an image, a transcription made from it,
and some interpretation based on it. Often I won't have all these,
but when I do, I want them to be known to the system. I want to be
able to record who made the resource (that is, the copy, the
transcription, the interpretation), when and how. I want to be able
to record what resources were used to produce the resource, for
example, was the interpretation made from the transcription, the image
or the original? Was it made with reference to other, additional
sources?
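
A minimal Python sketch of what such a record might look like follows. The class name Resource, its field names and the example values are all hypothetical; every provenance field is optional, and each resource simply lists the resources it was made from.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    kind: str                        # e.g. "original", "image", "transcription"
    made_by: Optional[str] = None    # who produced this resource
    made_on: Optional[str] = None    # when it was produced
    method: Optional[str] = None     # how it was produced
    derived_from: List["Resource"] = field(default_factory=list)


def ancestry(resource, seen=None):
    """List every resource this one was directly or indirectly made from."""
    seen = [] if seen is None else seen
    for parent in resource.derived_from:
        if not any(parent is known for known in seen):
            seen.append(parent)
            ancestry(parent, seen)
    return seen


original = Resource("original")
image = Resource("image", method="scan", derived_from=[original])
transcription = Resource("transcription", made_by="me", derived_from=[image])
interpretation = Resource("interpretation", made_by="me",
                          derived_from=[transcription, image])

print([r.kind for r in ancestry(interpretation)])
```

Walking the derived_from links answers the question of what an interpretation was actually based on, even when some of the optional fields were never filled in.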

Obviously it won't be workable if all of this information is
required. But it should be possible. Otherwise how can a future
researcher evaluate what I've done? For that matter, how can I
understand what prompted me to come to a particular conclusion some
years in the future? I frequently find myself returning to old
conclusions wondering whether I was aware of some particular piece of
evidence when I reached that conclusion, and therefore whether I need
to review the conclusion in light of that evidence.

If you don't want to record this extra level of detail, that's fine.
Perhaps you have a better memory than me, or rely more heavily on
paper notes, or whatever. No problem. And if you then publish your
research and I download it, I can still use it as I can at the moment,
but I have far less knowledge of how and why you came to the
conclusions you did, especially if I think I may have found evidence
that you missed. That means I'm going to be more suspicious of your
conclusions and, to some extent, have to take them on trust.

And what happens if I find a piece of evidence that shows, beyond
reasonable doubt, that you have an error in some obscure branch of
your research? Perhaps you've assumed two records refer to the same
person when in fact they were separate people with the same name. It
would be a great waste to throw away all of your research on that
branch of the family simply because of that one mistake, but if I
cannot unpick your conclusions to return to the evidence, what choice
do I have?

Nor would such a system require a wholesale change to everything at
once. Let's say I have a body of research in the GEDCOM data model
(i.e. as recorded in almost any current genealogy program). From time
to time I'll have cause to review and extend various areas of the
family, and as I do that I can take the opportunity to store more
information about how and why I reached the conclusions I did, and to
import the various sources I might have.

The problem here isn't whether the users choose to use the new
facilities. Most probably never will, and their research will be
viewed with circumspection by serious researchers, much as a lot of the
research on the Internet currently is. But the new system will also
make it possible to do and publish serious genealogy work in a way that
can be used to its full potential by other serious researchers.

Richard

Richard Smith

Apr 29, 2012, 6:58:24 PM

On Apr 29, 1:38 pm, "Tony Proctor" <tony@proctor_NoMore_SPAM.net>
wrote:

> You're getting very specific Ian, although I see your point. I believe a
> standard reference model should not be biased with regard to the way the
> data was collected, processed, or any specific software product. I'd
> already made this point is my own research before I joined FHISO

> (www.parallaxview.co/familyhistorydata/research-notes/musings-standard...).

>
> Similarly with the concept of 'persona' and the merging of "evidence
> persons"
> (www.parallaxview.co/familyhistorydata/research-notes/evidence-conclusion),
> although there are more varied viewpoints on the BetterGEDCOM wiki
> (http://bettergedcom.wikispaces.com/).

I'm inclined to think that the distinction between evidence personae
and conclusion persons is a false dichotomy that artificially limits
the model. Let me explain by way of a fictional example.

Suppose I go to some distant library and locate a book, 'The Fitzbalricks
of Dunshire', by some long-dead researcher. The book looks very
professional. His conclusions are persuasively argued, he cites the
supporting evidence, and where he has found a piece of evidence that
seems to contradict his conclusion he gives a thorough discussion of
his reasons for dismissing it. This book tells me that Edmund
Fitzbalrick of Dun Hall married Patty Miggins and they had three
daughters, Alice, Betty and Caroline. I look up the baptisms of
Alice, Betty and Caroline in the Dunny-on-the-Wold parish register,
and find them each exactly as the book says, each giving just the
father's name. But the Dunny Magna marriage register, where the book
says Edmund and Patty married, was destroyed during the Blitz, and the
will that the author uses to show that Edmund in these four register
entries is the same person is written in Welsh, a language I can't
read. I do however take a photocopy of the will and verify that the
names Patty, Alice, Betty and Caroline all appear in it.

How do I enter all this, if I want to do it thoroughly? I'm sure I
want to add the baptism records normally, including an Edmund source
persona for each baptism, as these are primary sources I've checked
myself. What about the marriage? I'll probably add the marriage
record, as interpreted by the book, and add an Edmund source persona
for that too. The book concludes that the four Edmunds are all the
same person, so I want to merge them, presumably into a 'conclusion
person', and I'll cite the book and perhaps even include a paragraph
or two of text from it justifying this. I'll also include a scan of
the photocopy of the will in case some day I want to get it
professionally translated and attach a note saying what the book
infers from it.

In my own research, I find an illegitimate son of Edmund Fitzbalrick
of Dun Hall mentioned in Dunmouth's maintenance orders for bastards.
The reference to Dun Hall makes it clear that this is the same Edmund
mentioned in the book, though based solely on the primary evidence
I've personally consulted, I couldn't come to that conclusion because
I've not seen 'Dun Hall' mentioned in any other primary source. How
do I enter this? Obviously I add a new evidence persona for the
father of the illegitimate child, but how do I combine this into a
conclusion person? I don't want to merge all the Edmund personae into
a single conclusion person because I've not seen the marriage and
don't understand the will. I can't merge the baptism personae with
the maintenance order persona because it's the reference to 'Dun Hall'
that ties them together and that doesn't appear in any of the
baptisms. What I really want to do is merge the book's conclusion
person with the new maintenance order persona to produce my own
conclusion person.

But that gives three levels of personae. The Edmund of the book is
both a conclusion person from the point of view of the reasoning in
the book, and a source persona from my point of view. My view,
therefore, is that we should allow arbitrarily deep merging of
personae into higher level 'conclusion' personae as the conclusions of
one stage of research are fed in as the source material of the next
stage, which may be done by the same person later on, or someone
else. In a similar vein, I regard primary sources, secondary sources,
and summaries of my conclusions on a given matter as different
subtypes of genealogical resource. The processes of genealogical
research -- extraction, transcription, translation, interpretation,
analysis, and so on -- are then the synthesis of new resources from
existing ones. As we analyse our new information, we'll find
ourselves grouping together the personae in the preceding resources in
new ways, and occasionally admitting that we've made a mistake,
disregarding an earlier conclusion and grouping the personae
differently.
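
The layering can be sketched in a few lines of Python. This is only an illustration of the idea, not of any existing data model: the class name Persona, its fields and the labels given to the Edmund personae are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)          # frozen: a persona is never edited once made
class Persona:
    label: str
    source: str
    merged_from: Tuple["Persona", ...] = ()   # empty for a persona taken from a record


def evidence(persona):
    """Flatten a persona, however deeply layered, to the leaf personae beneath it."""
    if not persona.merged_from:
        return [persona]
    leaves = []
    for part in persona.merged_from:
        leaves.extend(evidence(part))
    return leaves


def depth(persona):
    """Number of layers from this persona down to the deepest leaf."""
    if not persona.merged_from:
        return 1
    return 1 + max(depth(part) for part in persona.merged_from)


baptisms = tuple(
    Persona(f"Edmund, father of {child}", "Dunny-on-the-Wold register")
    for child in ("Alice", "Betty", "Caroline")
)
marriage = Persona("Edmund, groom", "Dunny Magna register, as reported by the book")
book_edmund = Persona("Edmund of Dun Hall", "The Fitzbalricks of Dunshire",
                      baptisms + (marriage,))
order = Persona("Edmund, father", "Dunmouth maintenance order")
my_edmund = Persona("Edmund Fitzbalrick", "my own conclusion", (book_edmund, order))

print(depth(my_edmund), len(evidence(my_edmund)))
```

Here the book's Edmund is a conclusion built from four personae and, at the same time, one of the two inputs to my own conclusion, which is what gives the three levels.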

At each stage, we are creating new conclusion resources: we never edit
them or delete them, so we have a complete history of what we thought
at each stage and why, and those nagging "did I know X when I
concluded Y" doubts can now easily be answered. We may choose not to
share all of that information with others. We might, for example,
create 'summarised conclusions' that cut through the multiple layers
of conclusions as we refine them; and we may similarly choose to keep
discarded conclusions to ourselves. But I rather hope researchers are
willing to share it all. Sometimes, knowing all the blind alleys a
researcher pursued is as valuable as knowing their ultimate
conclusions, if only so we don't explore the same blind alleys
ourselves.

Richard

Ian Goddard

Apr 30, 2012, 9:55:34 AM

Richard Smith wrote:
> On Apr 29, 1:38 pm, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>
> wrote:
>
>> You're getting very specific Ian, although I see your point. I believe a
>> standard reference model should not be biased with regard to the way the
>> data was collected, processed, or any specific software product. I'd
>> already made this point is my own research before I joined FHISO
>> (www.parallaxview.co/familyhistorydata/research-notes/musings-standard...).
>>
>> Similarly with the concept of 'persona' and the merging of "evidence
>> persons"
>> (www.parallaxview.co/familyhistorydata/research-notes/evidence-conclusion),
>> although there are more varied viewpoints on the BetterGEDCOM wiki
>> (http://bettergedcom.wikispaces.com/).
>
> I'm inclined to think that the distinction between evidence personae
> and conclusion persons is a false dichotomy that artificially limits
> the model.

A Persona represents the person who played a role in a specific event in
a specific record. All we know about a Persona is what we can find in
the original source: the name, the event and the role. There are, of
course, implied relationships to other Personae in the same event as
defined by the roles those Personae played. I don't know what the
Gentech project's basis was for the name but I came up with the same
concept and name before I'd encountered Gentech and my name was based on
the cast of characters in a Shakespeare play being headed Dramatis
Personae. This seemed to sum up the situation well: there's a name
associated with a role which is not to be confused with the identity of
the actor who might be playing the role.

This enables us to handle a situation when we have the Person but aren't
sure which events belong to it. I'll pick a real example, 3xggfather
John Goddard. The key event for me is his role as father in the baptism
of 2xggfather so I have one event and one Persona to start off the
reconstruction. It wasn't difficult to pick up the baptisms of a few
other children all with fathers named John Goddard, born in the same
well-defined locality at likely looking intervals and the last of these
fell after the local curate very helpfully recorded mothers' maiden
names so I could find a marriage giving the two names albeit somewhat
earlier than the first of the baptisms. I could also add a burial of
the wife of John Goddard living in a nearby locality (where 2xggfather
subsequently lived) and then a burial of John Goddard giving age at
death. So this is a collection of events all featuring Personae named
John Goddard and all reasonably played by the same Person, to continue
the Shakespeare analogy. In data terms one would have a set of links
between the Person and the various Personae.

Up to this point you might feel that the separation of Persona & Person
objects doesn't add anything, it's merely a hub for a series of links to
the Personae (but see below). However, the date of burial and age at
death gave me an approximate date of birth and, in fact, this gives me
two candidate baptisms with fathers Jonathan & William and, given that
there were two Williams, three candidate 4xggfathers and then, given
that one of the Williams was Jonathan's brother, two candidate
5xggfathers and that's only considering the male line.

There's also the matter of the gap between the marriage and the closely
linked cluster of baptisms already established. Given the two baptisms
there were clearly two separate Persons. Some of the baptisms and other
events (including the churching event misinterpreted on IGI) falling
into the gap may relate to the alternative Person.

How do we record all this in a system which is purely Person based?
There are clearly two real people to be represented and a number of
events which clearly belong to one or the other but which cannot
confidently be assigned to either. We can make a stab at this and
decide that the John Goddards of two of the events are one and the same
Person and we would have to merge them to record this, discarding one of
the separate putative identities. But how would we then handle a
correction if we decided we were wrong?

(The genie S/W I use is Gramps. It doesn't really address the problem
of recording the fact that alternatives are being kept in mind. To some
extent it makes an attempt to handle the problem by supporting merges
with a history mechanism so that they can be undone. However certain
operations, including importing a GEDCOM, wipe this history so a
delayed change of mind requires re-entering the lost data. And, of
course, if I wanted to send you a GEDCOM there's no way for it to
communicate the discarded alternatives even if they're still in the
history.)

In the absence of S/W that can record alternatives as of right how do
you handle them? Maintain separate databases in the S/W you use?
Doubtful; you'd have to cope with a combinatorial explosion of databases.

You'd do better keeping it off-line with pencil and paper - and a
rubber to take out discarded alternatives. And such a pencil-and-paper
approach is what the Person/Persona split models. In my example I'd
give the system two Person objects and as many Personae objects as it
takes and add a set of light-weight link objects between them. The
links are the equivalent of the lines you'd pencil in or rub out as required.
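
A minimal Python sketch of the split, assuming hypothetical class and method names (Persona, Person, link, unlink) and made-up labels for the two John Goddards:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str     # exactly as spelled in the record
    event: str
    role: str


class Person:
    def __init__(self, label):
        self.label = label
        self.links = []            # personae currently pencilled in for this Person

    def link(self, persona):
        """Pencil in a line between this Person and a Persona."""
        if persona not in self.links:
            self.links.append(persona)

    def unlink(self, persona):
        """Rub the line out again; the Persona itself is left untouched."""
        self.links.remove(persona)


john_a = Person("John Goddard (son of Jonathan)")
john_b = Person("John Goddard (son of William)")
baptism = Persona("John Goddard", "baptism of a child", "father")

john_a.link(baptism)       # first guess
john_a.unlink(baptism)     # change of mind...
john_b.link(baptism)       # ...so the event moves to the other candidate
```

Nothing is merged or discarded when the guess changes: both Person objects and the Persona survive, and only the link moves.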

One thing I alluded to above was an additional function for the Person
object. That's to hold a standardised name for an individual. A prize
example of this is another of my 3xggfathers for whom there are about 9
overall spelling variations discovered so far with about half a dozen
each for both his Christian name and surname. Personae records would
allow me to record each variation as found in the original records and
the Person would allow me to record him with my preferred standard form
of Dearnley for his surname, this being the current standard spelling
hereabouts, and Amond for his Christian name, this being the spelling of
his only known signature on his wedding entry (the clerk wrote it up as
Hammond).

Finally, let me deal very quickly with your fictional example:

> Let me explain by way of an fictional example.
>
> Suppose I go some distant library and locate a book, 'The Fitzbalricks
> of Dunshire', by some long-dead researcher. The book looks very
> professional. His conclusions are persuasively argued, he cites the
> supporting evidence, and where he has found a piece of evidence that
> seems to contradict his conclusion he gives a through discussion of
> his reasons for dismissing it. This book tells me that Edmund
> Fitzbalrick of Dun Hall married Patty Miggins and they had three

> daughters, Alice, Betty and Caroline. etc.

>
> How do I enter all this, if I want to do it thoroughly? I'm sure I
> want to add the baptism records normally, including an Edmund source
> persona for each baptism,

%><

> The book concludes that the four Edmunds are all the
> same person, so I want to merge them, presumably into a 'conclusion
> person',

No you don't merge them, you link them.

> do I enter this? Obviously I add a new evidence persona for the
> father of the illegitimate child, but how do I combine this into a
> conclusion person? I don't want to merge all the Edmund personae into
> a single conclusion person because I've not seen the marriage and
> don't understand the will. I can't merge the baptism personae with
> the maintenance order persona because it's the reference to 'Dun Hall'
> that ties them together and that doesn't appear in any of the
> baptisms. What I really want to do is merge the book's conclusion
> person with the new maintenance order persona to produce my own
> conclusion person.

If you think there was only one Edmund then you only have one person to
represent and therefore only one Person at the conclusion level. The
Person object represents the real historical person. It links to
everything you think you know about him.

> But that gives three levels of personae.

No, you just have one level with one persona for each record of an
event. You may have varying degrees of confidence in different links or
different reasons for making the link: "Blackadder says 'blah bla'",
"Maintenance order specifies Edmund of Dun Hall" etc. A good model
would allow you to record this on a link-by-link basis.
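
As a sketch of what recording this link by link could look like in Python: the Link class, its field names, the persona labels and the confidence values below are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    person: str
    persona: str
    reason: str
    confidence: str    # free-text grading; the scale is an assumption


links = [
    Link("Edmund Fitzbalrick", "father at Alice's baptism",
         "The book identifies him", "medium"),
    Link("Edmund Fitzbalrick", "father in maintenance order",
         "Maintenance order specifies Edmund of Dun Hall", "high"),
]


def reasons_for(person, all_links):
    """Map each linked persona to the reason recorded for that link."""
    return {link.persona: link.reason for link in all_links if link.person == person}


print(reasons_for("Edmund Fitzbalrick", links))
```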

Ian Goddard

Apr 30, 2012, 10:56:56 AM

Richard Smith wrote:
> On Apr 29, 11:14 am, Ian Goddard<godda...@hotmail.co.uk> wrote:
>
>> My concern is that the process seems to be largely a matter of writing
>> implementations and back fitting a model to them. This might be a good
>> process for some types of development but I don't think it's a good one
>> here.
>>
>> I think the risk is that we end up with a data model which depends on
>> programmers convenience and maybe even presentation. I'd much prefer to
>> start with an abstract data model which arises from a consideration of a
>> large and varied body of sample data. That may be difficult to program
>> but much better that software developers solve the problems than
>> throwing them over the wall to users who are then left trying to
>> force-fit date onto a model that they don't really fit at all.
>
> Conversely, the danger of developing the data model in vacuo and only
> later implementing it is that it turns out to be too cumbersome

IME a well-done simple model can fit complex data easily. An
ill-thought-out model can be cumbersome, fit all data badly and be a pig
to implement & maintain: been there, had one imposed, the clients were
for ever having to take the back off it and fiddle with the innards for
every change in requirement.

> to use in the simpler real-world examples,

This depends partly on the thought that went into implementation. I've
experience of an ERP system which had hundreds of tables. My clients
deployed it in wildly divergent businesses ranging from order processing
and warehouse management to booking service visits. Because it had a
consistent interface for each screen and a place for everything and
everything in its place (i.e. properly normalised) it was, AFAICR,
pretty slick.

> or sufficiently hard to program that no-one is willing to do it.

IME getting the model right is likely to make implementation easier.

> Arguably it is issues such as these
> that have lead to the Gentech data model largely being ignored,
> despite seeming very well thought out on paper.

I'm not so sure about that:

- It doesn't separate Person from Persona.

- The Assertion seems to be applied to join all sorts of data types.

- The Repository entity is, IMV, superfluous - it would be far simpler
and more flexible to have a single self-referential entity for all
levels of a provenance chain, the repositories, publishers, etc. being
those instances with a null parent field.

- The citation chain is on the face of it equally superfluous as the
material it contains should be part of the provenance chain; I think
it's there to contain the damage from ESW fanboys who can't separate
presentation and data layers.

- But I think its main problem is that it's an ER design in what ought
to be an OO domain. Consider, for instance how much easier it would be
to be able to simply specify "PersonalName" in the first cut of the
design and specify it as being implemented by its own class knowing that
you'll not only be able to sub-class it later to deal with the
different name structures of different cultures but also add sub-classes
whenever the demand for a new cultural requirement is added.
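
The last point can be sketched in Python as follows. Only the name PersonalName comes from the text above; the subclass names, the display method and the example names are hypothetical.

```python
from abc import ABC, abstractmethod


class PersonalName(ABC):
    """First-cut abstraction: the rest of the model only relies on display()."""

    @abstractmethod
    def display(self) -> str:
        ...


class WesternName(PersonalName):
    def __init__(self, given_names, surname):
        self.given_names = list(given_names)
        self.surname = surname

    def display(self):
        return " ".join(self.given_names + [self.surname])


class PatronymicName(PersonalName):
    """Added later for another naming culture; no existing code has to change."""

    def __init__(self, given_name, patronymic, surname):
        self.given_name = given_name
        self.patronymic = patronymic
        self.surname = surname

    def display(self):
        return f"{self.given_name} {self.patronymic} {self.surname}"


names = [WesternName(["John"], "Goddard"),
         PatronymicName("Ivan", "Ivanovich", "Petrov")]
displayed = [name.display() for name in names]
print(displayed)
```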

> (And arguably the
> other major factor that's lead to it being ignored is that it doesn't
> have a data exchange format.)

Very likely.

> I think my preferred strategy is to have example code to test the data
> model while developing it. Gedcom X seem to be doing that in Java
> which isn't what I would have chosen, but nor is it obviously a bad
> choice.

Agreed up to a point but I don't get the impression that that's what
they're doing. ISTM that they're defining schemas and Java classes in
parallel and that's their model. The Java isn't a test of the model,
it's part of it. And harking back to your earlier point, a good test
would be code that sets out to test how well the model can be wrapped in
a user-friendly task-oriented interface.

Richard Smith

Apr 30, 2012, 6:52:43 PM

Ian Goddard wrote:

> Richard Smith wrote:
> > Conversely, the danger of developing the data model in vacuo and only
> > later implementing it is that it turns out to be too cumbersome
>
> IME a well-done simple model can fit complex data easily. An
> ill-thought-out model can be cumbersome, fit all data badly and be a pig
> to implement & maintain: been there, had one imposed, the clients were
> for ever having to take the back off it and fiddle with the innards for
> every change in requirement.

I don't think we're really disagreeing here. I suspect that what I'm
referring to when I talk about a model developed in vacuo and without
real-world testing is an example of what you're calling an ill-thought-
out model.

> > Arguably it is issues such as these
> > that have lead to the Gentech data model largely being ignored,
> > despite seeming very well thought out on paper.
>
> I'm not so sure about that:
>
> - It doesn't separate Person from Persona.

It does in a way. Its persona object can be grouped into higher-level
personae. In your terminology, the lower-level one is the Persona and
the higher level one is the Person. The difference is that Gentech
allows multiple layers of such groupings whereas your model does
not. However as I've argued elsewhere in this thread (and so won't
repeat again here), I think the ability to have multiple layers of
personae is a good thing.

> - The Assertion seems to be applied to join all sorts of data types,

Yes, in many ways that's a mess, especially as not all the
combinations are really meaningful.

> - The Repository entity is, IMV superfluous - it would be far simpler
> and more flexible to have a single self-referential entity for all
> levels of a provenance chain, the repositories, publishers, etc. being
> those instances with a null parent field.

I can't make up my mind whether I agree with you here. I think on
balance there is a useful distinction here, but that Gentech doesn't
quite get it right.

The Gentech source object is hierarchical. A source can represent a
page, another source can be the parish register enclosing the page, a
third source can be the collection of records deposited for a given
parish, and each of these sources has a single, possibly null, parent
field (the Higher-Source-ID field). So a page cannot be in two
different registers.
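
A minimal Python sketch of that hierarchy, in which the higher_source field stands in for the Higher-Source-ID field; the class name, the chain function and the titles are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:
    title: str
    higher_source: Optional["Source"] = None   # single parent, or None at the top


def chain(source):
    """Walk from a source up through its enclosing sources."""
    titles = []
    while source is not None:
        titles.append(source.title)
        source = source.higher_source
    return titles


deposit = Source("Records deposited for the parish")
register = Source("Parish register", deposit)
page = Source("Page", register)

print(chain(page))
```

Because each source holds at most one parent, there is no way to say that the same page belongs to two registers.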

By contrast, a repository is the location of a source, and a source
may exist in multiple repositories. For example, a book or microfiche
may exist in several libraries. In its original intent, it's clearly
meant as a physical place: somewhere with opening hours, an address
and a phone number. But almost certainly any modern implementation
would extend it to include websites, so that Archive.org or
Ancestry.com would be a repository. That would require certain
changes to the repository data model (at the very least to include its
URL) which could have been pre-empted by using a more general and
extensible contact information model.

This way Gentech distinguishes between a source and a specific copy of
a source. If I'm including a citation in a published report, I just
want to reference the source -- ordinarily I wouldn't state where I
accessed it, unless the only copies happened to be in obscure places.
But it's nevertheless useful to record where I found a copy of that
source so that I can plan future research: for example, I can ask my
system to give me a list of the tasks on my 'to-do' list that need
doing in the Public Record Office and that can't also be done online
or in my local library.
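
That to-do query might look something like the following Python sketch. The holdings table, the task list and the function name are all invented for illustration.

```python
# Hypothetical data: which repositories hold a copy of each source.
HOLDINGS = {
    "parish register": {"Public Record Office", "local library"},
    "maintenance orders": {"Public Record Office"},
    "county history": {"Public Record Office", "online"},
}

# Hypothetical to-do list: (task, source needed).
TODO = [
    ("check Alice's baptism", "parish register"),
    ("read the maintenance order", "maintenance orders"),
    ("look up Dun Hall", "county history"),
]


def tasks_only_at(repository, easier_places):
    """Tasks whose source is held at `repository` but at none of `easier_places`."""
    return [
        task
        for task, source in TODO
        if repository in HOLDINGS[source]
        and not (HOLDINGS[source] & set(easier_places))
    ]


trip = tasks_only_at("Public Record Office", ["online", "local library"])
print(trip)
```

The query only works because the repository is recorded separately from the source itself.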

But with online sources this gets complicated. Gentech makes the
medium a property of the source: a source might be a book, or a
collection of loose leaves, or a photograph, or a map. That makes
sense, but an online copy clearly has a different type: it has a MIME
type (e.g. image/jpeg), a resolution, whether it's colour or
greyscale, and so on. And different online copies will have different
properties. Gentech has no facility to store this sort of
per-instance metadata. The alternative view is that each functionally
distinct online copy is a separate source, and there's some mechanism
for recording that one source is derived from another. But again,
Gentech has no means of recording that one source is derived from
another. Either way, it's a deficiency.

> - The citation chain is on the face of it equally superfluous as the
> material it contains should be part of the provenance chain; I think
> it's there to contain the damage from ESW fanboys who can't separate
> presentation and data layers.

I expect you're right. When I implemented some of that, I certainly
wasn't able to come up with a good reason why the chain of Citation-
Parts was separate from the chain of Sources. But given the lack of
examples and rationale in the Gentech spec, I wouldn't rule out there
being a valid reason.

> - But I think its main problem is that it's an ER design in what ought
> to be an OO domain. Consider, for instance how much easier it would be
> to be able to simply specify "PersonalName" in the first cut of the
> design and specify it as being implemented by its own class knowing that
> you'll be not only be able to sub-class it later to deal with the
> different name structures of different cultures but also add sub-classes
> whenever the demand for a new cultural requirement is added.

I agree that this is a big problem, but not for the reason you say.
The ER design is really just the way in which the data model is
documented. It doesn't preclude an OO implementation. Personal names
are complicated because they're partly folded into the general
characteristics subsystem. The Personal-Name attribute on the Persona
entity is simply a text string recording what the source said the name
was. If Persona is associated with a source which refers to him as
'John Smith', then that's what goes there. If the Persona entity
represents a conclusion person, then the Personal-Name attribute
is your preferred display name. Arguably that whole Personal-Name
attribute is redundant and should be dropped; but in any case it's not
really pertinent to Gentech's ER name mechanism which is done with
characteristics.

A name is a type of characteristic. In the case of a name with
multiple components, such as "Homer J Simpson", this is a single name
Characteristic, but each word is a separate Characteristic-Part.
"Homer" is a given name; "J" is an initial; "Simpson" is a surname.
The correct order of these parts is ensured by the Sequence-Number
attribute on the Characteristic-Part. The list of possible name
components is extensible because "given name", "initial", "surname",
together with things like "patronym" and "regnal number" are all
Characteristic-Part-Type entities. A program implementing this in a
language with OO duck-typing (such as Python or Perl) may very well
implement these as dynamically generated subclasses of an abstract
name class.
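
A minimal sketch of that structure, assuming Python dataclasses. The
class names follow the Gentech entity names, but the attribute names and
the helper methods are made up for this example and are not taken from
the spec.

```python
from dataclasses import dataclass, field


@dataclass
class CharacteristicPart:
    part_type: str        # name of a Characteristic-Part-Type, e.g. "surname"
    value: str
    sequence_number: int  # position of this part within the whole name


@dataclass
class Characteristic:
    parts: list = field(default_factory=list)

    def text(self) -> str:
        """Join the parts in Sequence-Number order."""
        ordered = sorted(self.parts, key=lambda p: p.sequence_number)
        return " ".join(p.value for p in ordered)

    def parts_of_type(self, part_type: str) -> list:
        """Return the values of all parts with the given type."""
        return [p.value for p in self.parts if p.part_type == part_type]


# The parts are deliberately stored out of order to show that the
# sequence number, not storage order, decides how the name reads.
name = Characteristic([
    CharacteristicPart("surname", "Simpson", 3),
    CharacteristicPart("given name", "Homer", 1),
    CharacteristicPart("initial", "J", 2),
])

print(name.text())                    # prints "Homer J Simpson"
print(name.parts_of_type("surname"))  # prints "['Simpson']"
```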

But I suspect that's not quite what you mean. I suspect you mean more
hardcoded classes for "western name" (being one or more given names,
followed by a surname), "Slavic name" (given name, patronymic,
surname), "royal name" (given name, regnal number) and so on. That's
entirely achievable in this sort of ER design too. In the ER paradigm
you would simply add a Characteristic-Type entity which would name and
aggregate a series of Characteristic-Part-Type entities. Translated
into OO terms, the *-Type entities describe the class hierarchy, the
other entities contain the data in them.
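
To make the "*-Type entities describe the class hierarchy" point
concrete, here is a sketch in Python that builds name classes at run
time from rows of type data. The NAME_TYPES table stands in for whatever
the Characteristic-Type and Characteristic-Part-Type entities would
hold; its layout and the class names are assumptions for illustration
only.

```python
class Name:
    """Abstract name: subclasses say which parts they expect, in order."""
    part_types = ()

    def __init__(self, *values):
        if len(values) != len(self.part_types):
            raise ValueError(
                f"{type(self).__name__} needs {len(self.part_types)} parts"
            )
        self.parts = dict(zip(self.part_types, values))

    def __str__(self):
        return " ".join(self.parts.values())


# Hypothetical rows: one Characteristic-Type naming an ordered series of
# Characteristic-Part-Types.
NAME_TYPES = {
    "SlavicName": ("given name", "patronymic", "surname"),
    "RoyalName": ("given name", "regnal number"),
}

# Generate one subclass of Name per row of type data.
classes = {
    type_name: type(type_name, (Name,), {"part_types": part_types})
    for type_name, part_types in NAME_TYPES.items()
}

royal = classes["RoyalName"]("Henry", "VIII")
print(type(royal).__name__, "->", royal)  # prints "RoyalName -> Henry VIII"
```

Adding a new naming system is then a matter of adding a row of data
rather than writing a new class by hand.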

To summarise my point, an ER specification is entirely compatible with
an OO implementation and an OO exchange format. In the specific case
of names, I think their specification isn't ideal, but only in its
details. The ER specification may be less familiar to those used to
the OO paradigm, but I'm not sure it's inappropriate here. After all, I
suspect a lot of people will be thinking in terms of an SQL-backed
implementation, in which case the ER formalism is much more natural.

Richard

singhals

May 1, 2012, 7:29:21 AM

to gen...@rootsweb.com

Richard Smith wrote:
> Ian Goddard wrote:
>> Richard Smith wrote:
>>> Conversely, the danger of developing the data model in vacuo and only
>>> later implementing it is that it turns out to be too cumbersome
>>
>> IME a well-done simple model can fit complex data easily. An
>> ill-thought-out model can be cumbersome, fit all data badly and be a pig

>> to implement& maintain: been there, had one imposed, the clients were

How does it handle each of the following scenarios:

1) The Cherokee and Navajo nations change a person's name at
various stages of his/her life. Baby Girl is born and is
called Morning Glory until some (unknown to me) milestone is
reached; at that point Baby Girl becomes Flying Hare; later
she's Dove Lite and then Sleeping Doe, Evening Mist and
Shining Star. Her parents also have similar name changes, as
do her husband and children. One is unlikely to find any
two events for her citing the same name/label. (Names are
fictional for illustration and probably do not accurately
reflect the facts as known to Cherokee or Navajo nationals.)

2) The Dravidians of South India use a name structure of
(birth-place)(father's formal name)(baby's formal name),
with the first two parts generally reduced to a single
letter -- M. K. Kumari, for instance.

Cheryl

Ian Goddard

May 1, 2012, 3:58:25 PM

singhals wrote:

> 1) The Cherokee and Navajo nations change a person's name at various
> stages of his/her life. Baby Girl is born and is called Morning Glory
> until some (unknown to me) milestone is reached; at that point Baby Girl
> becomes Flying Hare; later she's Dove Lite and then Sleeping Doe,
> Evening Mist and Shining Star. Her parents also have similar name
> changes, as do her husband and children. One is unlikely to find any two
> events for her citing the same name/label. (Names are fictional for
> illustration and probably do not accurately reflect the facts as known
> to Cherokee or Navajo nationals.)

It would be difficult to find a better example of the advantage of
retaining the Personae untouched as each would be labelled with its own
particular name. Presumably the process of linking the correct records
together is pretty difficult and subject to a lot of "oops, no" moments.
Do genealogists working in these cultures have a system of assigning a
canonical name which one could then use for the Person?

>
> 2) The Dravidians of South India use a name structure of
> (birth-place)(father's formal name)(baby's formal name), with the first
> two parts generally reduced to a single letter -- M. K. Kumari, for
> instance.

I recall reading of one Scandinavian country, but can't remember which,
where, until fairly recent times, the family name was the name of the farm.

These different naming systems are one reason I favour an
object-oriented approach. One of the characteristics of an object
oriented system is that one can declare a class (say PersonalName) and
then have sub-classes for names of different naming systems such as
AngloAmericanPersonalName, IcelandicPersonalName, DravidianPersonalName
etc. The designer can then specify that some variable is a PersonalName
during the development stage but when the program runs it will actually
use the sub-class appropriate to the circumstances.
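
A small sketch of that idea in Python. The class names are the ones
suggested above; the constructor arguments and the display() method are
assumptions, and the birth-place and father's name used to expand
"M. K. Kumari" are invented for the example.

```python
from abc import ABC, abstractmethod


class PersonalName(ABC):
    @abstractmethod
    def display(self) -> str:
        """Return the name as it would conventionally be written."""


class AngloAmericanPersonalName(PersonalName):
    def __init__(self, given: str, surname: str):
        self.given, self.surname = given, surname

    def display(self) -> str:
        return f"{self.given} {self.surname}"


class DravidianPersonalName(PersonalName):
    def __init__(self, birth_place: str, father_name: str, given: str):
        self.birth_place = birth_place
        self.father_name = father_name
        self.given = given

    def display(self) -> str:
        # The first two parts are generally reduced to a single letter.
        return f"{self.birth_place[0]}. {self.father_name[0]}. {self.given}"


def label(name: PersonalName) -> str:
    # The caller only knows it has a PersonalName; the right sub-class
    # behaviour is picked when the program runs.
    return name.display()


print(label(AngloAmericanPersonalName("Homer", "Simpson")))
# prints "Homer Simpson"
print(label(DravidianPersonalName("Madurai", "Krishnan", "Kumari")))
# prints "M. K. Kumari"
```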

Ian Goddard

May 1, 2012, 4:10:20 PM

Richard Smith wrote:
> To summarise my point, an ER specification is entirely compatible with
> an OO implementation and an OO exchange format. In the specific case
> of names, I think their specification isn't ideal, but only in its
> details. The ER specification may be less familiar to those used to
> OO paradigm, but I'm not sure it's inappropriate here.

I'm long enough in the tooth to have used both. However, AICR Gentech
uses a number of type indicators. An out-and-out OO approach would
subsume those within the Class system but I do agree they give it an OO
flavour

> After all, I
> suspect a lot of people will be thinking in terms of an SQL-backed
> implementation, in which case the ER formalism is much more natural.

My own inclination would be to store the objects within an application
either as binary blobs or maybe as XML fragments, possibly within an RDB
database indexed by a UUID and then use some other database, maybe SQL,
maybe some other text search engine, for indexing.

Bob Melson

May 1, 2012, 5:21:31 PM

On Tuesday 01 May 2012 13:58, Ian Goddard (godd...@hotmail.co.uk) opined:

> singhals wrote:
>
>> 1) The Cherokee and Navajo nations change a person's name at various
>> stages of his/her life. Baby Girl is born and is called Morning Glory
>> until some (unknown to me) milestone is reached; at that point Baby Girl
>> becomes Flying Hare; later she's Dove Lite and then Sleeping Doe,
>> Evening Mist and Shining Star. Her parents also have similar name
>> changes, as do her husband and children. One is unlikely to find any two
>> events for her citing the same name/label. (Names are fictional for
>> illustration and probably do not accurately reflect the facts as known
>> to Cherokee or Navajo nationals.)
>
> It would be difficult to find a better example of the advantage of
> retaining the Personae untouched as each would be labelled with its own
> particular name. Presumably the process of linking the correct records
> together is pretty difficult and subject to a lot of "oops, no" moments.
> Do genealogists working in these cultures have a system of assigning a
> canonical name which one could then use for the Person?
>
>>
>> 2) The Dravidians of South India use a name structure of
>> (birth-place)(father's formal name)(baby's formal name), with the first
>> two parts generally reduced to a single letter -- M. K. Kumari, for
>> instance.
>
> I recall reading of one Scandinavian country, but can't remember which,
> where, until fairly recent times, the family name was the name of the
> farm.

Finland. Up until about 1900 there were several naming conventions in
Finland, one being that surnames came from the steading on which the
individuals were currently living. If the individual changed steadings ..
yep, you got it, the surname changed, as well. Finns on the west coast
and along the border with Sweden followed the Swedish (Scandinavian?)
practice of <given name> <father's given name>{son|dottir} and "townies"
and military had yet another system, not unlike the "English" and
continental practice of taking surnames from occupations or individual
characteristics or, even, because they sounded good.

Let me tell you, tracing my Finn "befores" has been, is, and will likely
continue to be a real detective job, notwithstanding really good records.

<snip>

Stumped Ol' Bob

--
Robert G. Melson | Rio Grande MicroSolutions | El Paso, Texas
-----
The greatest tyrannies are always perpetrated
in the name of the noblest causes -- Thomas Paine

Shmuel Metz

May 1, 2012, 10:06:02 AM

In <mpro.m3588a05...@powys.org>, on 04/27/2012
at 03:57 PM, Tim Powys-Lybbe <t...@powys.org> said:

>I agree that the site is more than a little opaque

I had trouble trying to find from their web site the degree to which
they dealt with some issues specific to Jewish genealogy. GEDCOM 5.51
added some facilities that are relevant, but that's just a draft.

Is there provision for recording an event with both Gregorian and
Hebrew dates?

Is there provision for recording a person with both a secular name and
a religious name, and for providing the religious name in both a Hebrew
character set and in a transliteration?

In a more general context, is there provision for tagging a
transliteration with the particular transliteration scheme used?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spam...@library.lspace.org

singhals

May 1, 2012, 9:18:14 PM

to gen...@rootsweb.com

Ian Goddard wrote:
> singhals wrote:
>
>> 1) The Cherokee and Navajo nations change a person's name at various
>> stages of his/her life. Baby Girl is born and is called Morning Glory
>> until some (unknown to me) milestone is reached; at that point Baby Girl
>> becomes Flying Hare; later she's Dove Lite and then Sleeping Doe,
>> Evening Mist and Shining Star. Her parents also have similar name
>> changes, as do her husband and children. One is unlikely to find any two
>> events for her citing the same name/label. (Names are fictional for
>> illustration and probably do not accurately reflect the facts as known
>> to Cherokee or Navajo nationals.)
>
> It would be difficult to find a better example of the advantage of
> retaining the Personae untouched as each would be labelled with its own
> particular name. Presumably the process of linking the correct records
> together is pretty difficult and subject to a lot of "oops, no" moments.
> Do genealogists working in these cultures have a system of assigning a
> canonical name which one could then use for the Person?
>

AFAICT they use the Anglo name to aggregate, if only because
most records show the Anglo name not the tribal one. Not,
however, my field since the last AmerInd in my family was
Pocahontas's niece.

>>
>> 2) The Dravidians of South India use a name structure of
>> (birth-place)(father's formal name)(baby's formal name), with the first
>> two parts generally reduced to a single letter -- M. K. Kumari, for
>> instance.
>
> I recall reading of one Scandinavian country, but can't remember which,
> where, until fairly recent times, the family name was the name of the farm.
>

Sweden did it for quite a while, but then so did Germany.
Still, the number of people who ever lived on "Old Farm" has
to be significantly smaller than the number of people who
ever lived in, say, Madras/Chennai.

> These different naming systems are one reason I favour an
> object-oriented approach. One of the characteristics of an object
> oriented system is that one can declare a class (say PersonalName) and
> then have sub-classes for names of different naming systems such as
> AngloAmericanPersonalName, IcelandicPersonalName, DravidianPersonalName
> etc. The designer can then specify that some variable is a PersonalName
> during the development stage but when the program runs it will actually
> use the sub-class appropriate to the circumstances.

And how will the program know whether R K Anand is Dravidian
or not?

Ian Goddard

May 2, 2012, 5:21:55 AM

singhals wrote:
>
> And how will the program know whether R K Anand is Dravidian or not?

Because the genealogist will tell it. And if the genealogist doesn't
know that's a bigger problem than the program not knowing.

singhals

May 2, 2012, 10:12:43 AM

to gen...@rootsweb.com

Ian Goddard wrote:
> singhals wrote:
>>
>> And how will the program know whether R K Anand is Dravidian or not?
>
> Because the genealogist will tell it. And if the genealogist doesn't
> know that's a bigger problem than the program not knowing.
>

And doesn't that circle us back to: no matter how stringent
the definitions, you can't control the user's use of the field?

Cheryl

Ian Goddard

May 2, 2012, 12:03:22 PM

If the user wants to enter a name that works as a Dravidian name it's
going to be /easier/ to make the explicit choice if the S/W then throws
up a name form with birth-place, father's formal name and child's name -
and pre-populates the first two if they've been entered already. If the
genealogist is working with a purely Dravidian family then this could be
made a default name format. If not the program could keep the Dravidian
option conveniently accessible to choose on a case-by-case basis; it
would be easier for the user to make that choice rather than try to
force the data into, say, a Portuguese default.

If S/W is written to encourage best practice then it becomes -easier-
natural for the user to follow best practice. To a large extent this is
up to the developer of the S/W and you can't guarantee that any given
developer or team will write good stuff no matter how good the
underlying data model. But it becomes more difficult to produce
suitable S/W if the underlying data isn't well designed.

It's worth reflecting on possible causes of data mis-entry. (Wes's
"familysearch.org not to be trusted" thread in s.g.methods makes
interesting reading in this respect.)

One is that the user doesn't understand the data or just doesn't care.
From a genealogical PoV we've suspected this of commercial
transcriptions; Wes's experience suggests that we have to push the blame
further up the managerial food-chain and in respect of GEDCOM X this is
worrying. But if we're dealing with genealogists entering data
primarily for their own use and only secondarily for sharing we can
maybe eliminate that.

Next is ill-fitting data formats. For instance US-centric S/W for any
application that requires an address always seems to assume every one
lives in a city and provides a city field. I've yet to find any ancestor
of mine who lived in a city. So when I use Gramps to record places I
use the city field to hold townships for which Gramps doesn't provide a
field. But a township isn't really a city and this comes to a head if I
try to incorporate my wife's data. She is Irish and the Irish
equivalent of a township would be a townland and for her ancestors I
need both townland and city fields. However determined the user is to
enter data correctly into the given structure, and however much the
programmer would like to assist, it can't be done if the structure
doesn't fit the data.

Finally there's less-than-helpfully-written S/W. At a simple level this
may just be obtuse presentation, for instance a web-site following the
same city-centric approach when it asks me for my city in my postal
address; the correct term would be "post town". At a more advanced
level we can take name formats as an example. A well-written program
might use a wizard to set up a new genealogy for the geographical area
the user anticipates as being relevant and then pre-populate a short
choice of name-formats. For instance working in India that list might
include Dravidian, other Indian format and colonial formats such as
English and Portuguese with a mechanism allowing the user to add others
as required. A badly written program might hide the Dravidian option
amongst a long and complete list of all known naming systems at all
known dates. The user might recognise the need to enter the Dravidian
format, try to do so and yet fail to locate the option. I don't think
it's reasonable to blame either data model or user if the programmer
produces the latter as an implementation rather than the former.

singhals

May 2, 2012, 1:02:32 PM

to gen...@rootsweb.com

Well, now, wait a sec. I'm not blaming anyone but the user
for most of the problems. In some cases, it's a matter of
the user being too-fussy and in others it's the reverse.

Akin to your city/town/township/townland problem, I deal
with people who don't understand how Virginia and her
daughter states work. We have no townships. We have
Districts. And if any of those districts EVER functioned as
an administrative entity making and keeping records, it's
news to me and lots of other Virginians. Meanwhile, we've
got these Independent Cities, which are generally the
county-seat/county-town of the county in which they are
physically located but which *do* make and keep completely
separate records.

You can't designate the entire database as "this" "that" or
"tother" because the Irish insist on occasionally marrying
an Englishman, and Dravidians sometimes marry Californians,
and Virginians marry Louisianians. And, when they do, you
do need two systems within the same entry.

Yet -- having to choose a system for each field is going to
get painful after the first dozen fields. Been there,
refused to finish doing that. :)

C

Richard Smith

May 3, 2012, 4:01:20 PM

On Apr 30, 2:55 pm, Ian Goddard <godda...@hotmail.co.uk> wrote:

[snip excellent description of the need for separate persona and
person objects]

I've not commented on this part of your email because I don't think
anyone in this thread is disputing the advantage of keeping source
personae separate from conclusion persons. The point I've been making
is that, while two levels (source personae and conclusion persons) are
necessary, they are not always sufficient.

> Richard Smith wrote:
> > On Apr 29, 1:38 pm, "Tony Proctor"<tony@proctor_NoMore_SPAM.net>

> > The book concludes that the four Edmunds are all the
> > same person, so I want to merge them, presumably into a 'conclusion
> > person',
>
> No you don't merge them, you link them.

I think we're talking at cross purposes. When I said "I want to merge
them", I'm not talking about the sort of destructive merge we have to
do in current GEDCOM. I do just mean linking them together into a
higher-level entity. The individual source records still point to the
individual personae, and if in the future I believe I was mistaken in
believing they referred to the same person, I can just discard or
ignore the merged person to go back to the underlying records.

>
> > do I enter this? Obviously I add a new evidence persona for the
> > father of the illegitimate child, but how do I combine this into a
> > conclusion person? I don't want to merge all the Edmund personae into
> > a single conclusion person because I've not seen the marriage and
> > don't understand the will. I can't merge the baptism personae with
> > the maintenance order persona because it's the reference to 'Dun Hall'
> > that ties them together and that doesn't appear in any of the
> > baptisms. What I really want to do is merge the book's conclusion
> > person with the new maintenance order persona to produce my own
> > conclusion person.
>
> If you think there was only one Edmund then you only have one person to
> represent and therefore only one Person at the conclusion level.

Indeed so. One person at *my* conclusion level. But if we try to
shoehorn this into a two-level persona vs person model, what are the
source personae? Do I have one per baptism, one for the marriage
(which I've not seen), one for the will (which I can't verify), and
one for the maintenance order? It's not really accurate to say that
the marriage or the will are my sources, but on the other hand the
book is a source. So perhaps I should have the persona in the book,
together with one per baptism, and one for the maintenance order. But
if I do that, I have no record of the marriage.

I certainly don't want to make it look like I've used the marriage
register, not just because it's claiming something that's not true,
but also because it might confuse me in the future. Where was the
marriage register? Why can't I now find it? But if I don't record
the marriage entry at my source persona level, I don't have a direct
record of what it (purportedly) said.

My mental view of genealogy is that conceptually I'm researching a
family history narrative that would, if I ever wrote and published it,
in turn become a secondary source for someone else. In that context,
there's no fundamental difference between the person or personae who
appear in my sources and in my conclusions. They're both limited
representations of certain events in the life of the real person.

When I synthesise a conclusion persona, I do it by aggregating the
sources I'm using at the time, and one of those sources may well
(indeed, probably will) be earlier research I've done. If I discover,
say, another child baptised that fits neatly into the sequence of
children I already know about, I doubt I will go back and re-evaluate
all the other individual records I have for the parent: I'll just look
at my previous conclusion, see that the new evidence is entirely
consistent with it, and so conclude the child is part of that family
too. So I have a new conclusion persona for the parent which is
created by the aggregation of two sources: my earlier conclusion, and
the new baptism.

This has a number of advantages, not least a complete 'version
history' of my research, and the simpler data model with just a single
type of person-like entity.
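
A sketch of what a single person-like entity might look like, in Python.
The Persona class, its derived_from list and the evidence() helper are
hypothetical names chosen for this example, not part of Gentech or
GEDCOM X.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    label: str
    # Personae this one was built from. Empty for a persona taken
    # straight from a source; a conclusion lists what it aggregates,
    # which may include earlier conclusions.
    derived_from: list = field(default_factory=list)

    def evidence(self) -> list:
        """Return the source-level personae this one ultimately rests on."""
        if not self.derived_from:
            return [self]
        found = []
        for persona in self.derived_from:
            found.extend(persona.evidence())
        return found


baptism_1 = Persona("father in first child's baptism")
baptism_2 = Persona("father in second child's baptism")
earlier = Persona("parent, earlier conclusion", [baptism_1, baptism_2])

# A newly found baptism is aggregated with the earlier conclusion
# rather than with each underlying record again.
baptism_3 = Persona("father in newly found baptism")
revised = Persona("parent, revised conclusion", [earlier, baptism_3])

print(len(revised.evidence()))  # prints "3"
```

Nothing is destroyed by the aggregation: the earlier conclusion is still
there, so the chain of derived_from links doubles as the version
history.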

Richard

Richard Smith

May 3, 2012, 5:09:44 PM

On May 1, 9:10 pm, Ian Goddard <godda...@hotmail.co.uk> wrote:
> Richard Smith wrote:
> > To summarise my point, an ER specification is entirely compatible with
> > an OO implementation and an OO exchange format. In the specific case
> > of names, I think their specification isn't ideal, but only in its
> > details. The ER specification may be less familiar to those used to
> > OO paradigm, but I'm not sure it's inappropriate here.
>
> I'm long enough in the tooth to have used both. However, AICR Gentech
> uses a number of type indicators. An out-and-out OO approach would
> subsume those within the Class system but I do agree they give it an OO
> flavour

But that's just an implementation detail. Whether the OO
infrastructure knows the type itself, or whether a separate field is
needed to specify the type, the important thing is that something
knows the type. There's absolutely no reason why an OO implementation
and an ER implementation shouldn't interoperate fine. (Indeed, we see
that happening every day when we use OO to manipulate objects in the
code, and a relational database to store them underneath.)

> > After all, I
> > suspect a lot of people will be thinking in terms of an SQL-backed
> > implementation, in which case the ER formalism is much more natural.
>
> My own inclination would be to store the objects within an application
> either as binary blobs or maybe as XML fragments, possibly within an RDB
> database indexed by a UUID and then use some other database, maybe SQL,
> maybe some other text search engine, for indexing.

That's one way of doing it, but it's no good if you want the
database to be able to access the bits of the object. For example,
storing '<givenname>Joe</givenname> <surname>Bloggs</surname>' is all
well and good, right up to the point where you want to use the
database to fetch the 27th page of 20 people with surname 'Bloggs'.
I'm not saying it's wrong, but it's a different way of doing it, and
punts much stuff that could be done in the database to the code. That
may be a good thing or a bad thing.
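
For comparison, this is roughly what the query looks like when the name
parts are ordinary columns. It's a sketch using Python's built-in
sqlite3 module; the person table, its columns and the generated test
data are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY, given_name TEXT, surname TEXT)"
)

# 600 invented people called Bloggs, plus one who isn't.
rows = [(f"Given{i:04d}", "Bloggs") for i in range(600)] + [("Joe", "Smith")]
conn.executemany(
    "INSERT INTO person (given_name, surname) VALUES (?, ?)", rows
)


def page_of_surname(conn, surname, page, per_page=20):
    """Fetch one page (numbered from 1) of people with the given surname."""
    offset = (page - 1) * per_page
    cursor = conn.execute(
        "SELECT given_name, surname FROM person"
        " WHERE surname = ?"
        " ORDER BY given_name, id"
        " LIMIT ? OFFSET ?",
        (surname, per_page, offset),
    )
    return cursor.fetchall()


page_27 = page_of_surname(conn, "Bloggs", 27)
print(len(page_27), page_27[0])  # prints "20 ('Given0520', 'Bloggs')"
```

With the names stored as opaque blobs, the filtering, sorting and paging
in that SELECT would all have to be done in application code or in a
separate index.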

Richard

Richard Smith

May 3, 2012, 7:00:53 PM

On May 2, 6:02 pm, singhals <singh...@erols.com> wrote:

> You can't designate the entire database as "this" "that" or
> "tother" because the Irish insist on occasionally marrying
> an Englishman, and Dravidians sometimes marry Californians,
> and Virginians marry Louisianians. And, when they do, you
> do need two systems within the same entry.
>
> Yet -- having to choose a system for each field is going to
> get painful after the first dozen fields. Been there,
> refused to finish doing that. :)

These are frequently problems, but in a well-written system they
needn't be problems.

Let me introduce the term 'locale' to refer to a set of conventions --
cultural norms, if you like -- that might be associated with a
place, or a period of history, or a religion. This is a standard
piece of terminology in software development. In the context of
genealogy, we might have an 'England' locale that knows that the
Julian calendar was used until 1752 and the Gregorian one after that,
that knows the country is divided into counties which are in turn
divided into parishes, that knows the people typically have one or
more given names followed by a surname, and so on.

As you rightly point out, applying a single locale to the whole
database causes problems when you have branches of the family from
different cultures. When I'm documenting my Prussian ancestors, I'd
ideally like the app to assume that "4 Jun 1706" is a Gregorian date,
but when I'm working with my English or Irish ancestors, it should
assume it's a Julian date. But as you also point out, quite
correctly, if I have to manually choose the locale every time I add a
person or event, this becomes incredibly tedious.

So a good user interface would try to guess the correct locale and
allow the user to override it when necessary. Most of the time, I'll
be adding a person, event or source that's linked somehow to an
existing one. For example, I might be adding a child (person) to a
marriage (event), or transcribing all the entries (events) in a parish
register (source). In doing this, the software could simply copy the
default locale from the previous entity. In other words, the software
should assume that, unless the user states otherwise, children of an
Irish couple will also be Irish, entries in an English register will
be English, and so on. The user can override this on the odd occasion
it's wrong, and in cases where there is no single correct choice of
locale (e.g. if a person moves backwards and forwards between places),
we apply a 'null' locale, telling the system to make no assumptions at
all.
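
A minimal sketch of that inheritance rule in Python. The Entity class,
the add_linked() helper and the use of None for the null locale are
assumptions made for the example; a real program would attach far more
to a locale than a name.

```python
from dataclasses import dataclass
from typing import Optional

_INHERIT = object()  # sentinel meaning "the user didn't choose a locale"


@dataclass
class Entity:
    label: str
    locale: Optional[str] = None  # None is the 'null' locale: assume nothing


def add_linked(label, linked_to, locale=_INHERIT):
    """Create an entity linked to an existing one.

    The locale is copied from the existing entity unless the user
    overrides it, either with another locale or with None.
    """
    chosen = linked_to.locale if locale is _INHERIT else locale
    return Entity(label, chosen)


marriage = Entity("marriage of an Irish couple", locale="Ireland")
child = add_linked("child of the marriage", marriage)
wanderer = add_linked("child who moved back and forth", marriage, locale=None)

print(child.locale)     # prints "Ireland"
print(wanderer.locale)  # prints "None"
```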

From a software development point of view, that would be pretty
easy to implement. I'm not aware of any mainstream app that does
this, but then, there's a lot wrong with most of the current
generation of software, so that's perhaps not so surprising.

Richard

singhals

May 3, 2012, 8:38:14 PM

to gen...@rootsweb.com

As another group frequently remarks, just because the cat
had kittens in the oven, it doesn't make 'em biscuits.

OK, the null locale will work, but down where the rubber
meets the road, that's what we've got now: a program that
takes what we give it and swallows it whole without giving
anyone a clue what we meant by "born".

Cheryl

Richard Smith

May 3, 2012, 9:09:37 PM

On May 4, 1:38 am, singhals <singh...@erols.com> wrote:

> OK, the null locale will work, but down where the rubber
> meets the road, that's what we've got now: a program that
> takes what we give it and swallows it whole without giving
> anyone a clue what we meant by "born".

Yes, the null locale is quite like what we have at the moment (though
not precisely: more on that later). But the suggestion is that a
decent program should implement a variety of different locales and
users would use them in the vast majority of cases. The null locale
is just to cope with awkward corner cases. Imagine a modern-day
American family. All those people will be in the American locale
because the first person entered (probably the researcher himself, if
it's his own family tree) will have had the locale explicitly set, and
everyone else entered will have inherited that locale by default.

Now imagine the researcher discovers that an ancestor emigrated from
Russia. The user will probably set the Russian locale for the
emigrant's parents who (in our example) stayed in Russia. As Russian
grandparents, aunts, uncles, cousins, and so on, get added, they'll
automatically be tagged with the Russian locale because they inherit
the value from the emigrant's parents who had it explicitly set.

In this example, the null locale would be used on the immigrant
himself. Some records pertaining to him will be Russian records, and
others will be American records, and it's in situations like this that
we don't want our program to choose a default that may be wrong. For
example, throughout the nineteenth century, America used the Gregorian
calendar and Russia the Julian one. If we add a marriage on 19 Sept
1823 for our immigrant, should the program treat that as Julian or
Gregorian? Best in this case that it does neither and requires the
user to state it explicitly. But for anyone else, if they're in the
Russian branch of the family, a Julian default would be assumed, and
if they're in the American branch, the default would be Gregorian.
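
To show why the default matters, here is a sketch in Python that
interprets the same written date under each locale. The conversion uses
the standard Julian Day Number arithmetic; the DEFAULT_CALENDAR table
and the interpret() function are invented for the example and only cover
the nineteenth-century case described above.

```python
from datetime import date


def julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar date to the equivalent Gregorian date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    # Julian Day Number of the Julian-calendar date.
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # date.fromordinal() counts days from Gregorian 0001-01-01 as day 1,
    # which is Julian Day Number 1721426.
    return date.fromordinal(jdn - 1721425)


# Hypothetical per-locale defaults for nineteenth-century dates.
DEFAULT_CALENDAR = {"Russia": "julian", "America": "gregorian"}


def interpret(year, month, day, locale):
    """Return the Gregorian equivalent of a date as written in a record."""
    if locale is None:
        raise ValueError("null locale: the calendar must be stated explicitly")
    if DEFAULT_CALENDAR[locale] == "julian":
        return julian_to_gregorian(year, month, day)
    return date(year, month, day)


print(interpret(1823, 9, 19, "Russia"))   # prints "1823-10-01"
print(interpret(1823, 9, 19, "America"))  # prints "1823-09-19"
```

The two readings of "19 Sept 1823" are twelve days apart, which is
exactly the sort of silent error a wrongly assumed default would
introduce for the immigrant.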

I said that the null locale was not quite like the current state, and
that's because I don't know of a single program that will require the
user to explicitly state the calendar for a date. All the programs I
know will either default to Gregorian, or will apply a single switch-
over date (typically 14 Sept 1752, but sometimes configurable). In
doing so, it is implicitly applying a locale, frequently an American
one. If you allow an inheritable, per-person locale, the use of the
null locale will be very rare, and there's no reason to apply any
implicit assumptions in those cases.

I've used dates throughout this post as examples of things with locale-
specific properties, but I could equally have chosen names (Russian
names typically used patronymics) or addresses (Russia wasn't divided
into states).

Richard

Ian Goddard

May 8, 2012, 10:32:46 AM

Richard Smith wrote:
> On May 1, 9:10 pm, Ian Goddard<godda...@hotmail.co.uk> wrote:
>> Richard Smith wrote:
>>> To summarise my point, an ER specification is entirely compatible with
>>> an OO implementation and an OO exchange format. In the specific case
>>> of names, I think their specification isn't ideal, but only in its
>>> details. The ER specification may be less familiar to those used to
>>> OO paradigm, but I'm not sure it's inappropriate here.
>>
>> I'm long enough in the tooth to have used both. However, AICR Gentech
>> uses a number of type indicators. An out-and-out OO approach would
>> subsume those within the Class system but I do agree they give it an OO
>> flavour
>
> But that's just an implementation detail. Whether the OO
> infrastructure knows the type itself, or whether a separate field is
> needed to specify the type, the important thing is that something
> knows the type.
>

Knowing the type is only part of the battle. Something needs to know
the structure of the type and that knowledge needs to be shared by each
implementation. A sub-class spec. will do that. In the case of
Gentech, however, this is a moot point because, AFAICS, Gentech's
persona-name is a simple string which you can structure as you please
with punctuation; the example given in their doc. is to put a nickname
in parentheses.

>
>>> After all, I
>>> suspect a lot of people will be thinking in terms of an SQL-backed
>>> implementation, in which case the ER formalism is much more natural.
>>
>> My own inclination would be to store the objects within an application
>> either as binary blobs or maybe as XML fragments, possibly within an RDB
>> database indexed by a UUID and then use some other database, maybe SQL,
>> maybe some other text search engine, for indexing.
>
> That's one way of doing it, but it's no good if you want the
> database to be able to access the bits of the object. For example,
> storing '<givenname>Joe</givenname> <surname>Bloggs</surname>' is all
> well and good, right up to the point where you want to use the
> database to fetch the 27th page of 20 people with surname 'Bloggs'.
> I'm not saying it's wrong, but it's a different way of doing it, and
> punts much stuff that could be done in the database to the code. That
> may be a good thing or a bad thing.

It's not difficult to extract out the bits you want to index. One of
the reasons for thinking along these lines was that the object store
might be a local cache of part of a genealogical commons. In fact, it
need not even be locally cached.
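
A minimal sketch of that arrangement in Python with SQLite: the objects are stored whole, keyed by a UUID, and the indexable bits are copied into a side table. The table names, the wrapping <person> element and the helper functions are assumptions made for the example.

```python
import sqlite3
import uuid
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE name_index ("
    "surname TEXT, givenname TEXT, id TEXT REFERENCES objects(id))"
)
conn.execute("CREATE INDEX idx_surname ON name_index (surname, givenname)")


def store_person(xml_fragment: str) -> str:
    """Store the fragment untouched and index the extracted name parts."""
    obj_id = str(uuid.uuid4())
    root = ET.fromstring(xml_fragment)
    conn.execute("INSERT INTO objects VALUES (?, ?)", (obj_id, xml_fragment))
    conn.execute(
        "INSERT INTO name_index VALUES (?, ?, ?)",
        (root.findtext("surname"), root.findtext("givenname"), obj_id),
    )
    return obj_id


def page_by_surname(surname: str, page: int, page_size: int = 20) -> list:
    """Return one page of stored fragments; pages are numbered from 1."""
    rows = conn.execute(
        "SELECT o.body FROM name_index n JOIN objects o ON o.id = n.id "
        "WHERE n.surname = ? ORDER BY n.givenname, n.id LIMIT ? OFFSET ?",
        (surname, page_size, (page - 1) * page_size),
    ).fetchall()
    return [row[0] for row in rows]
```

With this, page_by_surname("Bloggs", 27) would answer Richard's query from the index alone, while the object store itself could just as well be a remote one.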

Ian Goddard

May 8, 2012, 11:32:57 AM

Richard Smith wrote:
> My mental view of genealogy is that conceptually I'm researching a
> family history narrative that would, if I ever wrote and published it,
> in turn become a secondary source for someone else. In that context,
> there's no fundamental difference between the person or personae who
> appear in my sources and in my conclusions. They're both limited
> representations of certain events in the life of the real person

I don't think you need a multi-level persona model to accomplish this.
The evidence part of the model should provide a provenance. Where
you've consulted the actual registers that provenance will be the
registers. Where you've consulted a published transcript of the
registers or an electronic copy of a published transcript (e.g.
http://archive.org/download/registerofparish30thor/registerofparish30thor_bw.pdf)
the provenance will be the published transcript, its electronic version
or whatever. Each record will yield its own set of personae. There may
be duplicates, for instance you may initially get the published
transcript and then consult the original.

In the example you gave you'd have your own extracts from the registers
and the provenance would lead to the registers themselves and thence to
the archive which holds them. You'd also have the book author's version
of the same sources some of which might duplicate your direct evidence
but also with the case where the author has something you don't have and
for these you'll have a provenance leading to the book and its publisher
or the library where you found it. So you now have duplicate records of
some events and personae which contain the same information but with
different provenances.
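
The duplicate-records-with-different-provenance case can be sketched as follows. This is only a Python illustration under my own assumptions: the Source and Persona classes, their fields and the sample titles are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    title: str
    derived_from: Optional["Source"] = None   # what this source was copied from

    def provenance(self) -> List[str]:
        """Walk back from this source to the original holding."""
        chain, current = [], self
        while current is not None:
            chain.append(current.title)
            current = current.derived_from
        return chain


@dataclass(frozen=True)
class Persona:
    name: str
    event: str
    source: Source


def same_information(a: Persona, b: Persona) -> bool:
    """Two personae may agree on content yet differ in provenance."""
    return (a.name, a.event) == (b.name, b.event)


archive = Source("County archive")
registers = Source("Parish registers", derived_from=archive)
transcript = Source("Published transcript", derived_from=registers)
scan = Source("Electronic copy of transcript", derived_from=transcript)

direct = Persona("John Smith", "baptism 1701", registers)
via_book = Persona("John Smith", "baptism 1701", scan)
```

Here same_information(direct, via_book) is true, but direct.source.provenance() has two steps and via_book.source.provenance() has four, so neither record need replace the other.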

The question then arises as to whether to add the author's conclusions
as opposed to his evidence. Although the ability to undo "merges" is a
valuable utilitarian feature of a separation it's not, in my view, the
primary reason for making that separation. Half a working lifetime in
scientific investigation has dinned into me the view that maintaining a
mental and formal separation between evidence and conclusion is best
practice for any sort of investigation and it follows that such a
separation should be a feature of any data model we devise to support
such investigations.

It follows that personally I wouldn't incorporate the author's
conclusions as part of my evidence without very good cause. (There may
be good causes. For instance the author may have better palaeographic
or linguistic skills.) If I were to include one of the author's
conclusions then I'd quote as much of his text as necessary, with, of
course, the provenance and the analysis of this would sit alongside
those of the rest of the records.

If you and I were separately working on the same material we would, of
course, each construct our own set of conclusion objects and links (in
an ideal world, of course, the evidence, including the provenance would
be part of a genealogical commons) which we might then publish. If the
original author had had the same set of tools available he might have
done the same. Assuming he hasn't and given that he's no longer around
to make good on the omission, we could, of course, transcribe his
reasoning and conclusion into electronic form which could then be
published alongside our own versions. And maybe the data model needs
some component which can cross-link what purport to be reconstructions
of the same historic individuals.

Tom Wetmore

May 8, 2012, 3:44:35 PM

I’m sorry I missed this thread during its formative days. (You might not be, as you would have had to listen to me.)

I’ve been convinced that the evidence-based person record (the “persona”) is the key to any advance to be made in genealogical data models. There are other topics that deserve attention, but without the persona there is no point in even pondering a new data model. The Better GEDCOM effort argued the persona issue for over a year, and could not decide the matter. Maybe in consequence that effort is effectively over. If the FHISO ever rises out of the ashes its acid test will be how it comes to grips with this concept.

I’ve been promoting the persona concept for 15 years. I’ve written reams about it in many fora. What I find so wonderful about this thread is that Richard and Ian have proven to be far more eloquent than I have ever been in describing and arguing the concept.

I also believe very strongly in multi-level persona (as versus the pure two level evidence persona and conclusion person). Richard presented a great argument for the idea.

I wrote reasonably sophisticated automatic combining algorithms that take billions of persona records (extracted from the world wide web by natural language processing) and combine them (by linking, not destructive merging) into 100s of thousands of person records (see the zoominfo.com website for the results). The complexity of the ZoomInfo combination algorithms required a multi-layered view of personas for purely organizational reasons, as combination, to avoid horrendous order-N algorithmic issues, had to be broken down into many phases. But those reasons carry over into the genealogical application just as importantly. It is best to view intermediate personas as intermediate conclusions, all building up to the final linking at the top. This way a person with ten or more personas can be built up by a number of different decisions, and the overall structure of that decision making “tree” is locked into the structure of the persona tree. It is much more natural for a genealogist to decide that two or three personas out of, say ten available, are the same person, before they reach a later conclusion that other personas of the ten also refer to the same person. Those later personas are very naturally linked in on a new level that describes why they are the same person as the one represented by a tree of personas already linked.

Ian Goddard

May 9, 2012, 10:07:13 AM

Tom Wetmore wrote:
> I’m sorry I missed this thread during its formative days. (You might not be, as you would have had to listen to me.)
>
> I’ve been convinced that the evidence-based person record (the “persona”) is the key to any advance to be made in genealogical data models. There are other topics that deserve attention, but without the persona there is no point in even pondering a new data model. The Better GEDCOM effort argued the persona issue for over a year, and could not decide the matter. Maybe in consequence that effort is effectively over. If the FHISO ever rises out of the ashes its acid test will be how it comes to grips with this concept.
>
> I’ve been promoting the persona concept for 15 years. I’ve written reams about it in many fora. What I find so wonderful about this thread is that Richard and Ian have proven to be far more eloquent than I have ever been in describing and arguing the concept.

Thank you, kind sir.

> I also believe very strongly in multi-level persona (as versus the pure two level evidence persona and conclusion person). Richard presented a great argument for the idea.
>

> I wrote reasonably sophisticated automatic combining algorithms that take billions of [...]

What you describe sounds very much like an agglomerative clustering
algorithm in which you cluster sub-sets of the original data in order to
tame the combinatorial explosion. It's also different from the
multi-level arrangement Richard described.

However:

After some pass you have intermediate clusters J, comprising Personae a,
b & c and K comprising Personae d & e.

After the next pass the algorithm joins J & K to form a new cluster R.

Is there any merit in saying that R comprises J & K which respectively
comprise a, b & c and d & e as opposed to disposing of J & K and saying
that R comprises a, b, c, d & e? Or alternatively, of detaching d & e
from cluster K, attaching them to cluster J, thus avoiding creating a
new cluster and then discarding the now empty K? After all the separate
J & K clusters were simply a computational convenience - if d & e had
happened to fall into the same subset as a, b & c when the data was
originally partitioned there'd never have been a separate cluster K.
ISTM that after the last pass the remaining clusters are each candidate
Person objects but any intermediate clusters which contributed to them
were temporary objects of no further significance.
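
To make the two alternatives concrete, here is a small Python sketch that holds R either as a tree (R comprising J and K) or flattened (R comprising a to e directly). The Cluster class and flatten function are hypothetical and only illustrate the question; they are not a reconstruction of Tom's algorithm.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Cluster:
    label: str
    members: List[Union[str, "Cluster"]]   # personae (as labels) or sub-clusters

    def leaves(self) -> List[str]:
        """All personae reachable from this cluster, in order."""
        out: List[str] = []
        for member in self.members:
            if isinstance(member, Cluster):
                out.extend(member.leaves())
            else:
                out.append(member)
        return out


def flatten(cluster: Cluster) -> Cluster:
    """Discard intermediate clusters and keep only the personae."""
    return Cluster(cluster.label, list(cluster.leaves()))


J = Cluster("J", ["a", "b", "c"])
K = Cluster("K", ["d", "e"])
R_nested = Cluster("R", [J, K])
R_flat = flatten(R_nested)
```

Both forms yield the same five personae from leaves(); the only thing the nested form adds is the record that J and K once existed.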

Ian Goddard

May 9, 2012, 12:45:52 PM

Both Richard & Tom have advocated using multi-level personae in which
the conclusion person object from one piece of work gets recycled as
another. The amount of junk recycled through IGI as member submissions
by this means should be warning enough for anyone. However there's a
really splendid example which crops up in IGI but is much older -
possibly a few centuries older. I'm afraid I'll have to set the scene
at some length based on the limited contemporary documentation.

Sir John Godard married Constance, widow of Sir Peter de Mauley sixth
but born Sutton, some time prior to 12 Dec 1384 when he was pardoned for
marrying without the King's licence as recorded in a Patent Roll. Sir
John is rather difficult to find in any records prior to this but his
marriage seems to have launched him on a public career recorded in the
Fine Rolls & Patent Rolls including becoming the High Sheriff of Yorks
some four years later and also serving as an MP. The last mention I can
find of him in these rolls was on 01 Mar 1392 (new-style dating). A
secondary (warning!) source (a list of burials in Dominican priories in
an old copy of the Antiquary) gives the date of his Will as 25 April
1392 and of its proving as 13 Mar 1392 (presumably old-style).
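
For anyone puzzled by a will proved before it was dated: assuming the source numbered its years from 25 March (Lady Day), which is my reading and not something the source states, the adjustment is simple enough to sketch in Python. The function name is made up for the example.

```python
def new_style_year(year: int, month: int, day: int) -> int:
    """Year number with the year starting on 1 January, given a date
    written with the year starting on 25 March (an assumed convention)."""
    # 1 January to 24 March still carried the previous year's number.
    if (month, day) < (3, 25):
        return year + 1
    return year


print(new_style_year(1392, 4, 25))  # 1392: the will
print(new_style_year(1392, 3, 13))  # 1393: the proving
```

On that reading the proving falls nearly eleven months after the will.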

According to her IPM (Inquisition Post Mortem) his wife died 09 Jun
1401. The IPM names a son, John Godard, as Constance's heir.
Contemporary records, therefore, make it quite clear that he was still
married to Constance at the time of his death. Constance herself is
described as "wife of Peter Malu Lacu sixth" (a synonym for de Maulay)
in her IPM, this being the name of her most prestigious husband.

Children of the marriage are not entirely clear.

- The heir, John, must have been a son of the marriage. (John died
young leaving a son, a further John, seemingly last of the male line,
who died aged 13 according to his IPM).

- Constance had a daughter Margaret who is named de Mauley in
contemporary sources but she and her descendants seem to have inherited
no de Maulay property. So taking into account the description of
Constance above it seems likely on balance that Margaret was also a
daughter of the marriage.

- Sir John had a further son, Henry, who seems to have died without
issue. Some secondary sources state that he also was a son of the
marriage but I can't find evidence one way or the other.

- Sir John also had a daughter Agnes, for our purposes his only
important child. The existence of Agnes is not in dispute. She was not
heir to any Sutton property AFAICT but according to the IPM of the
youngest John she was heir to Godard property. I think the lack of
Sutton inheritance rules her out as the daughter of this marriage; as
far as can be determined Sir John was in his 40s when he married
Constance so a previous marriage is entirely feasible. Certainly she
cannot have been the daughter of a later marriage as Constance outlived
Sir John so if she wasn't a child of the marriage she must have been
from an earlier unrecorded marriage.

It's also not in dispute that Agnes Godard married Sir Brian Stapleton.
And this is where the fun starts.

There are numerous claims by Stapleton descendants that they descend
from the Neville family by Maud (Matilda) Neville who married Sir John
and who was the mother of Agnes. We have seen that if Agnes wasn't the
daughter of Constance she must have been the daughter of an earlier
marriage. But Maud outlived both Sir John & Constance by many years so
couldn't have been the earlier wife.

It's not too difficult to see how this misconception occurred. Maud was
also the widow of a Peter de Maulay. The genealogist's syllogism is at
work: Sir John married the widow of a Peter de Maulay, Maud was the
widow of a Peter de Maulay therefore Sir John married Maud.

So we now have an erroneous conclusion - that Agnes was daughter of Maud
Neville - which became the "evidence" for all sorts of later nonsense.

One piece of nonsense is that a Neville pedigree in a heraldic
visitation bears against Maud a note that it was she who married Sir
John & was mother of Agnes. That was probably "evidence" for many.
However the visitation was roughly a couple of centuries later than Sir
John's later life and survives only in a copy which must be at least
some decades later still. What's more the editor of the published
version of this copy comments that this annotation was a later addition.
Although the note must have been at least about three centuries after
Sir John's time it could still be as old as C17th. It's worth noting
that there are two Stapleton pedigrees in the same visitation which name
Agnes but don't give her mother, just Sir John as father.

It seems that some people noticed the contradictions and attempted to
resolve them whilst keeping the prestigious Neville connection. This
attempt concluded that there must have been a "filius Godard" to replace
or supplement Sir John. Some variations of this are in IGI but it seems
that the original tampering must have been no later than the C19th.

As we know "filius" is often abbreviated and it seems likely that this
abbreviation was the "evidence" that gave rise to the most bizarre twist
of all by treating the abbreviation as an initial. In 1886 the editor
of Vol I of the Surtees Society's Testamenta Eboracensia, who certainly
should have known better, added a footnote to Maud's Peter's Will. This
says that his widow, by her second husband Sir Francis Godard, left a
daughter Anne, the wife of Sir Brian Stapleton. This precocious
daughter would have been born, married, had several children and herself
widowed, AFAICR, only a couple of years or so after her alleged mother.
The only surprising thing about this is that the fictitious Sir
Francis seems to have avoided appearing in the IGI.

And that, dear readers, is what happens when you try to recycle someone
else's conclusions as evidence.

Tony Proctor

May 9, 2012, 12:58:44 PM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:a0vl9v...@mid.individual.net...

> Both Richard & Tom have advocated using multi-level personae in which the
> conclusion person object from one piece of work gets recycled as another.
> The amount of junk recycled through IGI as member submissions by this
> means should be warning enough for anyone. However there's a really
> splendid example which crops up in IGI but is much older - possibly a few
> centuries older. I'm afraid I'll have to set the scene at some length
> based on the limited contemporary documentation.
>

... SNIP...

>
> And that, dear readers, is what happens when you try to recycle someone
> else's conclusions as evidence.
>
> --
> Ian
>
> The Hotmail address is my spam-bin. Real mail address is iang
> at austonley org uk

Wow! Thanks for that Ian. Is there a book and a film following this ;-)

Tony Proctor

Tom Wetmore

May 9, 2012, 1:32:37 PM

>What you describe sounds very much like an agglomerative clustering
>algorithm in which you cluster sub-sets of the original data in order to
>tame the combinatorial explosion. It's also different from the
>multi-level arrangement Richard described.

In the algos I wrote the N was so very large that a main reason for the
phasing was to handle the combinatorial explosion. There were functional
reasons for the phasing also, that is, each phase concentrated
on some specific properties of the personas.

However:
>After some pass you have intermediate clusters J, comprising Personae a,
>b & c and K comprising Personae d & e. After the next pass the algorithm
>joins J & K to form a new cluster R.

>Is there any merit in saying that R comprises J & K which respectively
>comprise a, b & c and d & e as opposed to disposing of J & K and saying
>that R comprises a, b, c, d & e?

I believe it is important to keep J & K. This is because J & K each
represent a conclusion that the researcher has made, and by removing J & K
you lose the history of that conclusion.

In the ZoomInfo application, I first did as you suggest, and removed the
J & K layer. It's okay for the Zoom application since no one cares about
the combination history. However, for ME THE DEVELOPER, it became very
important when debugging and tuning the phases to be able to follow the
exact sequences that brought personas together. For example if the algo
merged some businessman named George Bush with the president Bushes,
it was important that I be able to isolate the phase and reason quickly.
I developed a whole suite of software tools just to make this kind of
discovery simple. The analog to genealogy is not clear cut but there is
a connection.

>Or alternatively, of detaching d & e
>from cluster K, attaching them to cluster J, thus avoiding creating a
>new cluster and then discarding the now empty K? After all the separate
>J & K clusters were simply a computational convenience - if d & e had
>happened to fall into the same subset as a, b & c when the data was
>originally partitioned there'd never have been a separate cluster K.
>ISTM that after the last pass the remaining clusters are each candidate
>Person objects but any intermediate clusters which contributed to them
>were temporary objects of no further significance.

I have the same argument against this -- you don't want to lose the
history of your decision making.

AND, AND, AND, by having the history your ability to UNDO incorrect
conclusions is greatly enhanced.
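
A minimal Python sketch of what keeping the history buys you: each link records its reason, and undoing a conclusion just hands back the parts it joined. The Persona class, the link and undo helpers and the sample reasons are all hypothetical; this is not the ZoomInfo code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Persona:
    label: str
    reason: str = ""                       # why the parts were linked
    parts: List["Persona"] = field(default_factory=list)

    def evidence(self) -> List[str]:
        """Labels of the evidence-level personas underneath this one."""
        if not self.parts:
            return [self.label]
        return [e for part in self.parts for e in part.evidence()]

    def history(self) -> List[Tuple[str, str]]:
        """Every linking decision below and including this one, oldest first."""
        steps = [step for part in self.parts for step in part.history()]
        if self.parts:
            steps.append((self.label, self.reason))
        return steps


def link(label: str, reason: str, *parts: Persona) -> Persona:
    """Non-destructive combination: the parts are kept, not merged away."""
    return Persona(label, reason, list(parts))


def undo(conclusion: Persona) -> List[Persona]:
    """Withdraw a conclusion and get its parts back intact."""
    return list(conclusion.parts)


a, b, c, d, e = (Persona(x) for x in "abcde")
J = link("J", "same parish and same spouse", a, b, c)
K = link("K", "same occupation and address", d, e)
R = link("R", "K's burial matches J's will", J, K)
```

R.history() lists the three decisions with their reasons, and undo(R) returns J and K untouched, which a flattened R could not do.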

Tom Wetmore

Tom Wetmore

May 9, 2012, 1:54:50 PM

Ian,

I don't consider your long example as having any major bearing on
the two-level versus n-level personas argument. It's a nice story
though. Wholly orthogonal in my opinion.

There are no rules in my mind as to where or when multi-levels should
be used. I view the multi-levels as simply the best way that a current
researcher can structure their data in a non-destructive way, and in
such a way that allows them to record their data and the history of their
conclusions about personas at any level and of any type that they
have gleaned from any source.

I look at it as a one size fits all solution.

You would want to nearly always use a two-level approach where the
bottom level is personas extracted directly from evidence records,
and the top level is conclusion persons representing your final
beliefs about the real human beings that lived in the past.

But there are so many special and funny cases that come up, that I
find having the multi-level mechanism available for those cases
is just too handy to ignore.

Tom Wetmore

Bob LeChevalier

May 10, 2012, 6:51:04 AM

Ian Goddard <godd...@hotmail.co.uk> wrote:
>In the example you gave you'd have your own extracts from the registers
>and the provenance would lead to the registers themselves and thence to
>the archive which holds them. You'd also have the book author's version
>of the same sources some of which might duplicate your direct evidence
>but also with the case where the author has something you don't have and
>for these you'll have a provenance leading to the book and its publisher
>or the library where you found it. So you now have duplicate records of
>some events and personae which contain the same information but with
>different provenances.
>
>The question then arises as to whether to add the author's conclusions
>as opposed to his evidence.

Is there really a difference?

>Although the ability to undo "merges" is a
>valuable utilitarian feature of a separation it's not, in my view, the
>primary reason for making that separation. Half a working lifetime in
>scientific investigation has dinned into me the view that maintaining a
>mental and formal separation between evidence and conclusion is best
>practice for any sort of investigation and it follows that such a
>separation should be a feature of any data model we devise to support
>such investigations.

The problem is that there isn't really any legitimate distinction
between genealogical evidence and conclusions, from the perspective of
scientific investigation. It is all conclusion, unless we have
personal time travel and the ability to go back and personally witness
the events in question.

One can arbitrarily say that a primary source is "evidence", and then
everything else is some level of "conclusion" derived from some
primary or secondary source (which may or may not be reported by the
concluder). In that case, neither the author's reported evidence nor
conclusions above is a primary source, and hence both are
"conclusion".

But of course all the primary sources are conclusions, as well. A
birth certificate probably reports the mother correctly (but the
recorder still had to conclude that the mother was giving her correct
identity), but he is taking the mother's word for who the father is,
since DNA testing isn't part of the birth certificate process.

lojbab
---
Bob LeChevalier - artificial linguist; genealogist
loj...@lojban.org Lojban language www.lojban.org

Ian Goddard

May 11, 2012, 6:06:33 AM

Tony Proctor wrote:
> Wow! Thanks for that Ian. Is there a book and a film following this ;-)

The film may already be out. Sir John seems to have been a typical
knight of his time & involved in any major punch-up from the mid 1360s
onwards. So look carefully in the battle-scenes of any film set at that
time. He may be there....

Ian Goddard

May 11, 2012, 6:21:06 AM

Bob LeChevalier wrote:
> The problem is that there isn't really any legitimate distinction
> between genealogical evidence and conclusions, from the perspective of
> scientific investigation. It is all conclusion, unless we have
> personal time travel and the ability to go back and personally witness
> the events in question.

By the same criterion you would have to write off the whole of
palaeontology and palaeoecology as sciences.

Tim Powys-Lybbe

May 11, 2012, 6:59:52 AM

On 11 May at 11:21, Ian Goddard <godd...@hotmail.co.uk> wrote:

> Bob LeChevalier wrote:
> > The problem is that there isn't really any legitimate distinction
> > between genealogical evidence and conclusions, from the perspective
> > of scientific investigation. It is all conclusion, unless we have
> > personal time travel and the ability to go back and personally
> > witness the events in question.
>
> By the same criterion you would have to write off the whole of
> palaeontology and palaeoecology as sciences.

Quite. And Evolution. And any consideration whether or not light is
instantaneous. And, in fact, almost every scientific theory propounded
in the past.

--
Tim Powys-Lybbe t...@powys.org
for a miscellany of bygones: http://powys.org/

Bob LeChevalier

May 11, 2012, 4:36:21 PM

Ian Goddard <godd...@hotmail.co.uk> wrote:
>Bob LeChevalier wrote:
>> The problem is that there isn't really any legitimate distinction
>> between genealogical evidence and conclusions, from the perspective of
>> scientific investigation. It is all conclusion, unless we have
>> personal time travel and the ability to go back and personally witness
>> the events in question.
>
>By the same criterion you would have to write off the whole of
>palaeontology and palaeoecology as sciences.

Not quite, since we have radioactive dating, etc. (But of course
radioactive dates are conclusions, too, but of a different branch of
science). But, yes, it is more challenging to treat those as hard
sciences. My original field was astrophysics, and since we'll never
visit and measure the interior of a star, much of that is also
"conclusions" serving as "evidence".

The answer is of course to redefine what constitutes evidence, with
most "evidence" being based on conclusions from earlier research. But
I think the concept of evidence becomes something other than what I
see used in genealogy, being more or less "anything derived
independently from, and corroborating, a hypothesis". The assumption
being that independence of observation or conclusion is what turns
"conclusions" into "objective evidence". Three different forms of
radioactive dating giving the same result makes those "conclusion"
dates into "evidence".

But all this means that the evidence/conclusion distinction is really
rather arbitrary, except with reference to a specific hypothesis (and
hence hardly generalizable across a data structure used for all
hypotheses). Which is what I was trying to say.

I don't have any problem with considering the information from one
source as "evidence" and then creating personas as "conclusions" from
combining that evidence. But the personas themselves can be used as
"evidence" for later conclusions.

Tom Wetmore

May 11, 2012, 4:59:03 PM

An interesting discussion, I suppose, getting far-fetched.

The context of this discussion IMHO was the two level (evidence persona and conclusion person) versus the multi level (personas all the way up) issue.

Richard and I think it's a great idea. Ian and Tim seem to think it could destroy science as we know it, or maybe more fairly, would have destroyed science had Bacon come up with it.

Multi-level personas are simply a way to give a little internal structure to the conclusion making process. It would be this no matter what the true nature of evidence was. And the nature of evidence in the genealogical world is so wholly screwed-up anyway, and so completely unfixable, that any argument on doing things based on the way evidence SHOULD be, is meaningless.

This is an issue of pragmatics, nothing more. What is the best way to represent the information we want to record about persons, and what is the best way to represent the conclusions we make when reasoning about that information? Personas all the way up is a complete solution to both.

Tom Wetmore

Ian Goddard

May 13, 2012, 5:50:45 AM

I think there is a big difference in science. To a large extent
conclusions in science become theories. Theories are relied on because
they not only give the simplest explanation for the original evidence on
which they were built but also because they have been used to make
predictions which can then be tested and that such tests have succeeded.
They also remain consistent with ongoing observations. If some new,
verified observation comes along which contradicts the prediction of a
theory then the theory falls and is replaced. Remember the hoo-hah a
few months ago about the possibility of faster-than-light neutrinos.
And, of course, this was a by-product of the search for the Higgs boson
at the LHC which is a massive program to test a prediction.

The key thing about conclusions in science as theories is that they are,
in essence, temporary.

You are correct in that in some sciences, particularly palaeo sciences,
conclusions might be described as anecdotal in that a typical piece of
work will aim at describing something local. And it's certainly true
that in such fields there's scope for research that looks at large
bodies of such anecdotal research to pull out common trends. But
although such research may be prompted by the anecdotal conclusions it
should go back and look at the original evidence and arrive at
independent conclusions.

Richard Smith

May 16, 2012, 3:37:04 PM

On May 9, 6:54 pm, Tom Wetmore <t...@verizon.net> wrote:

> You would want to nearly always use a two-level approach where the
> bottom level is personas extracted directly from evidence records,
> and the top level is conclusion persons representing your final
> beliefs about the real human beings that lived in the past.
>
> But there are so many special and funny cases that come up, that I
> find having the multi-level mechanism available for those cases
> is just too handy to ignore.

And indeed Ian has given some good examples of such cases elsewhere in
this thread: "For instance the author may have better palaeographic or
linguistic skills." I've certainly relied on expert secondary sources
to interpret the exact meaning of an obscure term or phrase,
especially if it's in Latin.

But another of Ian's examples is also good. Ian refers to the large
quantities of dross in member contributions of the IGI. However much
we'd like it to be otherwise, some people are going to produce and
publish nonsense, and others will recycle that in their own so-called
research. So wouldn't we prefer it if we had a clear way of tracking
where these conclusions have come from? As we've seen above, there
are genuine reasons for recycling some contributions.

Richard