Considerations for a better Import/Export Format


Tony Proctor

Sep 28, 2011, 2:37:42 PM
Some thoughts on a better textual import/export format for genealogical use.

I know I'm setting myself up as a target for trolls here but I promised
Cheryl I would have a go. Please read it as a genuine attempt to help. All
constructive suggestions welcome of course.

Reference is made to XML purely for illustration. Although it would be a
good candidate, it is not the only possible one and the recommendations
should be as generic as possible.


Goals
Define a universal import/export format
Flexibility. Store virtually anything without having to bend any rules
Locale independence
Potential use as a definitive backup-format or a load-format for databases
Zero-loss when operating between different software units


Locale-independence
The character set should be global which nowadays means UTF-8. This is also
the default with XML. Although the header could explicitly provide a
non-default character set name (again, similar to XML) I think that would
over-complicate the processing and would put an onus for all possible
translations on the receiving software unit.

Another possibility is to use Unicode "escape sequences". There are zillions
of these, e.g. HTML uses a format like "&#x20AC;" whilst Java uses "\u20AC".
The problem with these is having to reserve a magic escape character for
their use ('&' and '\' in these 2 cases). See
http://billposer.org/Software/ListOfRepresentations.html for a good summary.

Data values should be in a locale-neutral format, as with the source code
for programming languages. For this reason, this is sometimes described as
using a 'programming locale'. This effectively means using a period in all decimal
numbers (not a comma), ISO 8601 format for (Gregorian-)dates (e.g.
yyyy-mm-dd), and unlocalised true/false or just 1/0 for booleans (e.g. for
option selections).
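
For illustration, a fragment sticking to these conventions might look as
follows (the element and attribute names here are invented for the example
rather than being a proposal):

<Event Type="Birth" Date="1851-03-09"/>   <!-- ISO 8601 date -->
<Option Name="ShowSiblings" Value="1"/>   <!-- unlocalised boolean -->
<Item Name="Height" Value="1.75"/>        <!-- period as decimal point -->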

Time Zones (TZ) and Daylight Saving Time (DST) were discussed in this
thread. Although usually applicable to local clock times, they can also
apply to local calendar dates. The importance for genealogy is going to be
slim at best but the area should be clarified. ISO 8601 does not include
named TZ designators, although it does allow a numeric UTC offset (e.g.
+01:00) and a 'Z' suffix indicating UTC (Coordinated Universal Time) as
opposed to the default of a 'local date/time'. Local
date/times should be interpreted in the context of the data location rather
than the current location of the user but this would only be significant
when creating a timeline across TZ boundaries.
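
For example, these two ISO 8601 values differ only in that suffix; the first
would be read as a local date/time in the context of the data location,
whilst the second is pinned to UTC:

1911-04-02T22:30:00
1911-04-02T22:30:00Z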


Main Elements
The definition of a person (or place) should consist of a set of discrete
elements representing

<Events> - Something happening on a particular date (may be approx - see
below). Predefined ones must include BMD events but it should be open-ended
<Notes> - General narrative notes
<Extension> - Extensions to the set of elements, e.g. PlaceOfEducation.
These names should be interpreted only within the particular "namespace"
associated with the current dataset, thus preventing clashes with datasets
from other sources, or with any newer element names appearing in a future
revision of the schema.
Lineage - see below

All of these elements can contain narrative text, and the narrative text can
have references to people, places, dates, events, references, resources,
etc, embedded within them. This would be a powerful feature allowing a
viewing tool to provide hyperlinks to the associated material.

Each element should have a key associated with it (i.e. a simple name that
is local to the dataset) by which it can be referenced from other elements,
e.g. <Person Name=Tony123>.

If references need to be made across datasets (e.g. when comparing them)
then local names should be decorated with the dataset name to make them
unique, e.g. TonysTree:Tony123.
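
To illustrate both of these with well-formed XML (quoted attribute values;
the <PersonRef> element name is invented for the example):

<Person Name="Tony123">
  ...
</Person>

<!-- a reference from within the same dataset -->
<PersonRef Name="Tony123"/>

<!-- a reference from another dataset, decorated with the dataset name -->
<PersonRef Name="TonysTree:Tony123"/>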


Lineage
Formats like XML provide an automatic way of depicting a top-down
hierarchical relationship. Unfortunately, genealogical lineage is really a
'network' rather than a pure 'hierarchy'. In effect, a simple nesting of
"offspring" under their associated "parents" is insufficient.

There's also a problem with a top-down approach unless a specific union
between two people has a single representation in the data, but that then
causes further problems with the nature and the lifetime of that union. For
instance, if the father and mother have separate representations in the
data, and they each have links to their associated common offspring, then it
makes it difficult to bring the information together to identify family
units, and also to ensure that there exist two links to each offspring.

I believe it's easier to use a bottom-up representation. Each person has
just one progenitive father and one progenitive mother and so can have
upward links to their appropriate parents (where known). For instance,
<Father> and <Mother> elements. This also makes it easy to have other types
of parentage including <Guardian>, <FosterMother>, <AdoptedMother>, etc.
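
A minimal sketch of that bottom-up shape (the keys are invented):

<Person Name="Tony123">
  <Father Name="Jim045"/>
  <Mother Name="Ann046"/>
</Person>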


References
A 'reference' would be a reference to some item of information in an
external catalogue that's not directly accessible from your software. For
example, a BMD reference or a TNA reference. Computers use an extensible
mechanism for such varied names called a URN or Uniform Resource Name. This
is a subset of a URI (Uniform Resource Identifier), similar to the
more-familiar URL and so has the same structure.
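
For example, index entries might be wrapped up along the following lines.
These URNs are invented for illustration - a real URN namespace identifier
has to be registered - so the exact form would need to be agreed:

urn:gro-bmd:1881:q2:marriages:5b:123    (a hypothetical BMD index reference)
urn:tna:PROB-11.123.456                 (a hypothetical TNA catalogue reference)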


Resources
When referring to resources that may be part of your data collection, or to
resources which may be accessed over the Internet, a URL should be employed.
URLs and URNs are both subsets of URIs and have similar formats. The 'scheme'
prefix makes them applicable to different stores, protocols and access
methods, e.g. file:// for local files and http:// which we all know about.
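
For example (the element name and paths are invented):

<ResourceRef URL="file:///genealogy/images/1881-census-page.png"/>
<ResourceRef URL="http://www.example.com/parish/baptisms-1750.htm"/>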


Attributes
All element types should accept certain attributes to modify their
interpretation. Suggestions might be:

Sensitivity: Public, Family, Private, Very Sensitive. Default=Public
Surety: Some percentage of how certain the data is. Default=100%
Source: Identify the source of a piece of data

For instance, <Note Sensitivity=Private Surety=20%>...some sensitive note
that I'm not sure about...</Note>
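
In well-formed XML the attribute values would have to be quoted, i.e.

<Note Sensitivity="Private" Surety="20%">...some sensitive note that I'm not
sure about...</Note>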


Names
As mentioned already in this thread, names around the world are not used in
the same way. As well as alternative spellings, nicknames, spellings in
alternative languages, and optional parts, the very structure may be
variable leaving the name with little uniqueness and no obvious
interpretation for our forename/middlename/surname concepts.

One possibility is to offer a prioritised set of patterns to match. There are
lots of 'pattern definition' languages around but I'll present a very simple
one that can be used for illustration. The stored format doesn't have to use
this syntax itself but it's very convenient when discussing the pattern and
showing written examples.

Let a 'full name' be defined by a list of possible 'sequences'. These would
be in priority order and indicate which should be tested first. Each
'sequence' would be an ordered set from the following:

name - simple name element, e.g. Tony
{name, ...} - 1-or-many alternatives
[name, ...] - 0-or-many alternatives

The following example might belong to someone called Grace Ann Murphy who
doesn't always use her middle name and sometimes goes as Gracie. However,
she's Irish and also has an Irish version of her name. This would require
two 'sequences':

{Grace,Gracie} [Ann] Murphy
Gráinne [Ann] "Ní Murchú"

An interesting issue here concerns the variations of individual name parts.
In this example, Grace accepts "Gracie" as an informal version of her
forename. However, the difference between Ann and Anne is more of a spelling
error, either during recording or a subsequent lookup. I think this should
be handled by the software unit, just as a soundex might. The same could
apply to using a middle initial but that is a very Western convention.
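
To show how the patterns might be serialised, here is one possible XML
rendering of Grace's two 'sequences'. It is only a sketch and the element
names are invented; the {}/[] syntax above remains just a discussion
shorthand:

<PersonNames>
  <Sequence Priority="1">
    <Part><Alt>Grace</Alt><Alt>Gracie</Alt></Part>
    <Part Optional="1">Ann</Part>
    <Part>Murphy</Part>
  </Sequence>
  <Sequence Priority="2">
    <Part>Gráinne</Part>
    <Part Optional="1">Ann</Part>
    <Part>Ní Murchú</Part>
  </Sequence>
</PersonNames>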


Dates
The general representation of dates is mentioned above. When a date is
referenced in an element, it should have a margin of error associated with
it. This could be a +/- representation such as a day, a month, or a year,
etc., or a more explicit min/max representation.

When deterministic dates, such as our normal Gregorian ones, are loaded into
some type of indexing system like a database, it is expected that they can
all be stored as a pair of internal 'timestamp' values. Since these would be
binary long-integer representations, it would mean that issues of date format,
uncertainty, TZ, etc., all become irrelevant and they can be handled
efficiently in the same manner.
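
For example, the two styles might be written as follows (attribute names
invented; the +/- term borrows the ISO 8601 duration syntax, where P1M means
one month):

<Date Value="1851-03-09" PlusMinus="P1M"/>
<Date Min="1851-01-01" Max="1851-12-31"/>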

Dates expressed in different Calendars are a bit more challenging and I'll
skip that for now :-)


Tony Proctor


Tony Proctor

Sep 28, 2011, 2:40:47 PM

"Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote in message
news:j5vp67$t2b$1...@reader01.news.esat.net...
> [...]

Sorry, the existing thread referred to here is "WHAT do you want to GED
in/out?" posted by singals. I decided to start a new thread for clarity and
forgot to change the reference

Tony Proctor


Ian Goddard

Sep 30, 2011, 4:58:43 AM
Tony Proctor wrote:

> Lineage
> Formats like XML provide an automatic way of depicting a top-down
> hierarchical relationship. Unfortunately, genealogical lineage is really a
> 'network' rather than a pure 'hierarchy'. In effect, a simple nesting of
> "offspring" under their associated "parents" is insufficient.
>
> There's also a problem with a top-down approach unless a specific union
> between two people has a single representation in the data, but that then
> causes further problems with the nature and the lifetime of that union. For
> instance, if the father and mother have separate representations in the
> data, and they each have links to their associated common offspring, then it
> makes it difficult to bring the information together to identify family
> units, and also to ensure that there exist two links to each offspring.
>
> I believe it's easier to use a bottom-up representation. Each person has
> just one progenitive father and one progenitive mother and so can have
> upward links to their appropriate parents (where known). For instance,
> <Father> and <Mother> elements. This also makes it easy to have other types
> of parentage including <Guardian>, <FosterMother>, <AdoptedMother>, etc.

I agree with you. Trees provide a ready-made structure which is
probably very attractive to anyone setting out to write genealogical
S/W. But it's a siren's song. Any substantial database structured that
way is going to have duplicated sub-trees when pedigree collapse is
encountered - or even worse, sub-trees that ought to be duplicate &
aren't quite - e.g. one copy gets Fred Flintstone's date of death &
another doesn't.

If you have the bottom-up elements of people & links the tree is
implicit in the data & can be created on the fly for reporting or display.

--
Ian

The Hotmail address is my spam-bin. Real mail address is iang
at austonley org uk

Ian Goddard

Sep 30, 2011, 5:47:52 AM
Tony Proctor wrote:
> Goals
> Define a universal import/export format
> Flexibility. Store virtually anything without having to bend any rules
> Locale independence
> Potential use as a definitive backup-format or a load-format for databases
> Zero-loss when operating between different software units

From what I've written elsewhere it should be no surprise that I want
to add another. It should distinguish between evidence and conclusions
drawn from that evidence.

This distinction seems very obvious and of prime importance to me but
seems to pass others by so I guess I'll have to have another try:

Evidence is real. If you, I, Cheryl, smart ol' Bob & anyone else views
the same document we all have the same thing in front of us.

Conclusions are mental constructs. Having seen the same document we may
come to different conclusions about its meaning. If the writing's in a
difficult hand we might not even be able to agree about the text. It
would be great if we all agreed although, of course, that may mean
simply that we're all wrong. However the structure of our data
shouldn't impede us in being able to share the same piece of evidence
and and record different conclusions; in fact we should even be able to
record the fact that as individuals we can seen more than one
interpretation and aren't able to distinguish which, if any, is correct.
One of the goals should be to enable us to enable us to do this. As a
sub-goal I think there should also be a means of expressing how we
derived out conclusion from the evidence and our confidence in it and
this should be part of the overall structure and not stuck away in a
note somewhere.

A few other comments:
IMV universal implies that any unique IDs we give to some part of the
data should be universally unique.

Flexibility does have some impact on locale independence. If we're to
be flexible there's going to be a need for adding ad hoc pieces of
information, which seems to demand the ability to invent new types of data
items, albeit in some carefully controlled manner.

If something is part of the standard structure then its meaning is
understood even if the structure is defined in a language which we don't
speak, irrespective of whether that's embedded in the data as with XML
tags or in some external document. An application using this would be
able to present the user with a localised caption.

If, however, the data item is ad hoc then the format is going to have to
make provision for ad hoc labels & unless the application has a set of
translation dictionaries it's just going to have to caption the data as
found. Even in the English-speaking world we're going to have to put up
with some records mentioning, to take one of Cheryl's examples, eye
color & others mentioning eye colour.

Tony Proctor

Sep 30, 2011, 6:02:06 AM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:9elhi9...@mid.individual.net...
The fact-versus-inference distinction is an excellent point Ian. I agree totally.

Regarding localisation of extended elements, I think any good software unit
would provide configuration options to define appropriate labels, and any
dataset that makes use of them should probably provide a default one. I'm
not sure if you're implying the person/company defining the extensions
should define the labels too or the person using the software unit should
configure it appropriately. As long as that information is separate from the
main dataset - thus keeping it locale independent - I would be happy.

Tony Proctor


Ian Goddard

Sep 30, 2011, 6:28:35 AM
Tony Proctor wrote:
> Names
> As mentioned already in this thread, names around the world are not used in
> the same way. As well as alternative spellings, nicknames, spellings in
> alternative languages, and optional parts, the very structure may be
> variable leaving the name with little uniqueness and no obvious
> interpretation for our forename/middlename/surname concepts.
>
> One possibility is to offer a prioritised set of patters to match. There are
> lots of 'pattern definition' languages around but I'll present a very simple
> one that can be used for illustration. The stored format doesn't have to use
> this syntax itself but it's very convenient when discussing the pattern and
> showing written examples.
>
> Let a 'full name' be defined by a list of possible 'sequences'. These would
> be in priority order and indicate which should be tested first. Each
> 'sequence' would be an ordered set from the following:
>
> name - simple name element, e.g. Tony
> {name, ...} - 1-or-many alternatives
> [name, ...] - 0-or-many alternatives
>
> The following example might belong to someone called Grace Ann Murphy who
> doesn't always use her middle name and sometimes goes as Gracie. However,
> she's Irish and also has an Irish version of her name. This would require
> two 'sequences':
>
> {Grace,Gracie} [Ann] Murphy
> Gráinne [Ann] "Ní Murchú"
>
> An interesting issue here concerns the variations of individual name parts.
> In this example, Grace accepts "Gracie" as an informal version of her
> forename. However, the difference between Ann and Anne is more of a spelling
> error, either during recording or a subsequent lookup. I think this should
> be handled by the software unit, just as a soundex might. The same could
> apply to using a middle initial but that is a very Western convention.

My point about evidence and conclusions bears heavily on this.

Take, for instance a marriage in which Hannah Kaye (as spelled in the
register & indexes) married George Fawley. Examination of the register,
however, shows that she signed rather than made a mark and spelled her
surname Kay. So we have two /names/ in the document, "Hannah Kaye" and
"Hannah Kay" but clearly these both refer to the one historical
/person/. I don't, however, think that this is a trivial difference to
be smudged over by soundex. The predominant spelling hereabouts is
"Kaye". However there was one family in the community which used the
"Kay" spelling and seems to have been quite punctilious about it (it
originated in a Kay/Kaye marriage; I haven't been able to resolve the
groom's identity but wonder whether he may have adopted the alternative
spelling to gloss over a fairly close degree of cousinship). As both
this family and at least one of the Kayes had daughters called Hannah
it's important to recognise both spellings in /analysis/ of the
/evidence/ and, of course, make use of it in my /conclusion/ about this
particular ancestor.

What this means, of course, is that we require more than one "name"
entity. One is the name as found in the original and one which
identifies the historical reconstruction. The former doesn't
necessarily present us with any formal structure such as "Given name[s]
Surname". It may well be something along the lines of "John son of
Jonathan Goddard" in which only the father's name is given in the
expected formal structure. On the whole I'm in favour of expressing the
evidential name as a simple string as found and restricting the
structured form to the reconstruction. Apart from anything else this
gets round the fact that some PRs Latinised the descriptions so that the
spelling as father is systematically different from the name as a data
subject and maybe different again from that in everyday life, e.g.
"Guillielmus f Guillielmi", AKA "William". The form used in
reconstruction may be usefully extended to include some additional
epithet which doesn't necessarily have any historical use but serves to
de-duplicate for our purposes, e.g. "William Goddard IV of Upperthong".

Ian Goddard

Sep 30, 2011, 6:40:39 AM
Tony Proctor wrote:
> "Ian Goddard"<godd...@hotmail.co.uk> wrote in message
> news:9elhi9...@mid.individual.net...
>> Tony Proctor wrote:
>>> Goals
>>> Define a universal import/export format
>>> Flexibility. Store virtually anything without having to bend any rules
>>> Locale independence
>>> Potential use as a definitive backup-format or a load-format for
>>> databases
>>> Zero-loss when operating between different software units
>>
>> From what I've written elsewhere it should be no surprise that I want to
>> add another. It should distinguish between evidence and conclusions drawn
>> from that evidence.
>>
>> This distinction seems very obvious and of prime importance to me but
>> seems to pass others by so I guess I'll have to have another try:
>>
>> Evidence is real. If you, I, Cheryl, smart ol' Bob& anyone else views
>> make provision for ad hoc labels& unless the application has a set of
>> translation dictionaries it's just going to have to caption the data as
>> found. Even in the English-speaking world we're going to have to put up
>> with some records mentioning, to take one of Cheryl's examples, eye color
>> & others mentioning eye colour.
>>
>> --
>> Ian
>>
>> The Hotmail address is my spam-bin. Real mail address is iang
>> at austonley org uk
>
> The fact-versus-inference distinction is an excellent point Ian. I agree totally.
>
> Regarding localisation of extended elements, I think any good software unit
> would provide configuration options to define appropriate labels, and any
> dataset that makes use of them should probably provide a default one. I'm
> not sure if you're implying the person/company defining the extensions
> should define the labels too or the person using the software unit should
> configure it appropriately. As long as that information is separate from the
> main dataset - thus keeping it locale independent - I would be happy.

Putting it in XML terms you might have something like:

<Person>
<PersonName>......</PersonName>
....
<OtherItems>
<Item name=.... value=..../>
<Item name=..... value=.../>
</OtherItems>
</Person>

so that the expected structured stuff all has its place and it doesn't
really matter whether the user speaks English or not, if the application
is properly localised it will be able to present a localised caption for
the contents of the PersonName element. But the only way I can see to
provide for "virtually anything without having to bend any rules" is
some mechanism such as the OtherItems element where the originating user
would be able to define named items on the fly.

The best stab that the application would be able to make at localising
these would be to extend the Item element to include the language, have
a dictionary and still from time to time make hilarious mistranslations.

Tony Proctor

Sep 30, 2011, 7:54:23 AM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:9eljuk...@mid.individual.net...
I've read this a few times and it sounds like you're making a case for
storing other name-like references to a person which would not automatically
be used by a software unit during the pattern matching, i.e. something the
user would have access to and could potentially make use of. Would you agree
Ian?

Tony Proctor


Ian Goddard

Sep 30, 2011, 8:35:54 AM
Tony Proctor wrote:
> I've read this a few times and it sounds like you're making a case for
> storing other name-like references to a person which would not automatically
> be used by a software unit during the pattern matching, i.e. something the
> user would have access to and could potentially make use of. Would you agree
> Ian?

What I'm saying is that there are two distinctly different types of
entity which have names sensu lato as attributes.

One is what have been labeled as personae in previous discussions, they
fill roles identified in events described by original texts. Clearly
these relate to real historical people. The other are the historical
people who we reconstruct from the evidence and will appear in family
trees etc. The first are best dealt with as strings representing
exactly what's found in the original. The others are best dealt with as
properly structured, disambiguated names.

For instance a baptism might record "Johanes f Willemi Goddard", which
gives us two personae. One would be recorded in that way, filling
the role of subject; the other would be recorded as "Willelmi Goddard",
filling the role of subject's father. The second entity will, for the
first persona, record something along the lines of

<PersonalName type="Modern English">
<GivenNames>John</GivenNames>
<Surname>Goddard</Surname>
</PersonalName>

and a corresponding element for the second. Clearly the John entity
will have a link to the Johanes persona, but also links to the personae
which might use the form "Johanis" where John's children are baptised.

In fact, that's only for people. You could say the same thing about places.

This is a distinction which Gedcom fails to make. Sure, the additional
information could be plonked in a note entity but I think it's far more
significant than that. Both entities are distinct parts of the data.

Tony Proctor

Sep 30, 2011, 9:31:57 AM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:9elrda...@mid.individual.net...
OK, I see now Ian. That's a very subtle point that I think a lot of people
might miss - including myself. Very interesting point though.

We're implicitly heading towards something for professional usage but I
think that's the way it should be. A lot of hobbyists will eventually hit
brick walls with their popular desktop tools as their experience grows.
There's a lot more to this than being able to draw a pedigree chart that
matches your wallpaper. ;-)

Tony Proctor



Tony Proctor

Sep 30, 2011, 9:58:30 AM

"Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote in message
news:j5vp67$t2b$1...@reader01.news.esat.net...
> [...]

Any suggestions for handling place names?

My post suggested that persons and places should have a similar structure,
including a 'key' by which they can be referenced. However, although I
considered using the same list of pattern 'sequences' for place-names as for
person-names, I'm less than convinced now.

There is a similar goal in being able to give each unique person or place
just a single entity in the stored data - no duplicates. Most products use a
sequence of name-parts for a place name that begins with the most local
(e.g. a street address) and continues to the most global (e.g. a county or
country).

There is a similarity in there being a sequence of name-parts, and we could
use a list of sequences to define them (e.g. Sunderland, {Durham,"Co.
Durham"}), but taking advantage of it would lose something... the
hierarchical nature of places.

If you were viewing data on Sunderland, for instance, then it would be
convenient to be able to move up a level and look generally at Durham, e.g. a
county map.

In effect, each place has a type of parentage but I'm not sure how far to
take the analogy.
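
For instance, the upward-link device suggested for people might serve here
too (the keys and element names are invented):

<Place Name="Sunderland1" Type="Town">
  <PlaceName>Sunderland</PlaceName>
  <ParentPlace Name="Durham1"/>
</Place>
<Place Name="Durham1" Type="County">
  <PlaceName>Durham</PlaceName>
  <PlaceName>Co. Durham</PlaceName>
</Place>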

Tony Proctor


NigelBufton

Sep 30, 2011, 10:34:46 AM
Tony Proctor used his keyboard to write :
> [...]

Having followed this thread for a few days, I have seen very little that is not catered
for in the existing GEDCOM 5.5 standard (or the 5.5.1 proposal).

The issue, as always, is not so much an issue with the current standard, but the fact
that very few programs (I am aware of only two) adhere to the standard that exists.

Therefore, discussions of what might be a better standard would seem moot until
the creators of most products take the time to read the existing standard (which
they have had well over a decade to do).

The worst examples are those that create custom tags to do the job of a perfectly
appropriate standard construct, and the very worst are those that deny that
something is supported - for example Family Tree Maker's refusal to export
links to media files with a help file that says that it is because GEDCOM does not
support it!

Nigel Bufton


Tony Proctor

Sep 30, 2011, 11:38:15 AM

"NigelBufton" <ni...@bufton.org> wrote in message
news:j64k26$2pg$1...@dont-email.me...
I have to disagree with you there Nigel.

For a start, GEDCOM is not standards based, even to the point of inventing
its own character set. The post I made cited several different modern
standards, including ISO 8601 which Gedcom 5.5.1 still doesn't address.

The method of extending elements proposed here uses proper namespaces. If
XML were used, for instance, then it would have a standard method of
defining the schema, a way of automated validatation using a schema files,
and a way of defining new elements through a standard namespace with a
unique URI identifying it.

GEDCOM might be fine for simple pedigrees but it leaves no room for the
general narrative which forms a good part of family history. The proposal
here not only allows such narrative to be associated with people and places
but allows it to be qualified according to Surety, Sensitivity, Source, and
(taking Ian's input) fact/inference. It also allows nesting of elements to
provide a way of putting hyperlinks in the presentation of the narrative in
a viewer.

I'm sorry but I see very little common ground at all.

Tony Proctor


Ian Goddard

Sep 30, 2011, 12:15:39 PM
> OK, I see now Ian. That's a very subtle point that I think a lot of people
> might miss - including myself. Very interesting point though.

As I wrote before, I spent half my working life in science, in two
disciplines involved in investigation of the past, knowing my reports
might be scrutinised quite closely, so this is a distinction which is
burned into my thinking. And the other half was spent in IT, mostly
involving RDBMSs & more latterly XML....

> We're implicitly heading towards something for professional usage but I
> think that's the way it should be. A lot of hobbyists will eventually hit
> brick walls with their popular desktop tools as their experience grows.

Indeed. As one works on those brick walls one starts to find a lot in
common with the material local historians use. For instance I don't
think it's a coincidence that, just after the Civil War & the Restoration,
a couple of new surnames appear in what was largely a strongly
parliamentarian area, having arrived from a few miles away where the Lords
of the Manor were from a dynasty which was & still is notably RC.

> There's a lot more to this than being able to draw a pedigree chart that
> matches your wallpaper. ;-)

Oh damn! Forgot to put that into the statement of requirements.

Tony Proctor

Sep 30, 2011, 6:00:22 PM

"Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote in message
news:j643n7$j15$1...@reader01.news.esat.net...
I didn't expand on the structure of the elements I'd proposed. However, if
XML were going to be used then their design would have to follow best
practices to ensure that a schema-based validation was possible. I've fallen
into the trap before of defining XML in a way that feels natural, and then
finding that it cannot adequately be described in a schema definition.

For the 'Notes', how about something along the lines of <Notes> including a
sequence of <Note> elements, each of which is a "mixed" element with both
narrative text and embedded reference-type elements, e.g.

<Notes>
  <Note>
    100% sure fact with public sensitivity
  </Note>
  <Note Surety="80%" Fact="0">
    80% sure inference
  </Note>
</Notes>

The embedded reference-type nodes should have a different element name to
that in the definition of the thing being referenced - another trap it's
easy to fall into. For instance, <PersonRef> instead of <Person>, and
<PlaceRef> instead of <Place>.
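
A mixed Note might then end up looking something like this (the keys are
invented):

<Note Surety="80%" Fact="0">
  Probably the John baptised at <PlaceRef Name="Snaith1">Snaith</PlaceRef> in
  1723, son of <PersonRef Name="Will045">William</PersonRef>.
</Note>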

If it's done properly then a viewing tool could present the sections of
narrative in clearly different ways. In fact, a relatively simple XSLT (if
such a thing exists) could generate HTML directly from it.

Tony Proctor



Peter J. Seymour

Oct 1, 2011, 4:52:34 AM
On 2011-09-30 15:34, NigelBufton wrote:
> [...]
As has been pointed out many times before, the main problem is that it
does not seem to be in the vested interests of software producers to
follow the standard. This suggests that whatever standard there is, it
will be ignored to varying extents. Gedcom itself works very well
within a defined context.
As I have suggested previously, a way of dealing with the lack of
adherence to standards is to have a "universal" gedcom reader. You can
then massage the data into whatever form you want.

Peter

Peter J. Seymour

Oct 1, 2011, 5:04:15 AM
On 2011-09-30 16:38, Tony Proctor wrote:
>
> I have to disagree with you there Nigel.
>
> For a start, GEDCOM is not standards based, even to the point of inventing
> its own character set.

You are perhaps getting a bit too argumentative here. Gedcom is its own
standard. Presumably regarding character sets, you are referring to
ANSEL. Are you claiming that ANSEL is not a standard?

.....

> The post I made cited several different modern
> standards, including ISO 8601 which Gedcom 5.5.1 still doesn't address.
......

And presumably never will, but that doesn't matter. It has a defined
date format and it converts readily to ISO 8601 or whatever.

Peter

Tony Proctor

Oct 1, 2011, 5:35:54 AM

"Peter J. Seymour" <Newsg...@pjsey.demon.co.uk> wrote in message
news:rIAhq.4$Mi...@newsfe18.ams2...
OK, I probably misrepresented the character set. Although ANSEL has an ANSI
designation, there isn't a lot that uses it, is there? Plus I question the
validity of using it as a computer exchange format in the first place. It is
no surprise that software such as FTM doesn't acknowledge it Peter.

Regarding standards in general, GEDCOM is an "isolated standard". Good
standards are built on other standards and the point I was trying to make is
that there is nothing in the definition that acknowledges modern standards.

Perhaps the main thrust of my original post, though, was not so much a
standards one as the applicability to family history in general as opposed
to simple pedigrees and discrete properties. Much of my own research
contains narrative and I have no option but to store it separately in
Word/pdf/etc documents. It then becomes sidelined and wouldn't get used by a
desktop tool.

Tony Proctor


Ian Goddard

Oct 1, 2011, 5:40:26 AM
Ian Goddard wrote:
>
> Putting it in XML terms you might have something like:
>
> <Person>
> <PersonName>......</PersonName>
> ....
> <OtherItems>
> <Item name=.... value=..../>
> <Item name=..... value=.../>
> </OtherItems>
> </Person>

An additional point I should have mentioned is that if some data items
seem to be introduced fairly regularly by this means they could be given
their own elements in subsequent revisions.

Tony Proctor

Oct 1, 2011, 5:45:55 AM

"Peter J. Seymour" <Newsg...@pjsey.demon.co.uk> wrote in message
news:vxAhq.140$h45...@newsfe22.ams2...
It's true that adoption is probably governed more by prevailing usage than
standards in cases like this Peter. I wouldn't expect software vendors to
use a new format simply because it was standards based.

However, there are many advantages to a major redesign, including things I
mentioned like applicability to narrative history, globalisation, automated
validation of file structure, zero-loss exchange, acknowledged IT methods
for registering URI-schemes/namespaces/schema-revisions etc.

No one seemed to be putting that first foot forward and suggesting what
might be done. Fool that I am, that's what I volunteered to do ;-)

Tony Proctor


Ian Goddard

Oct 1, 2011, 5:51:06 AM
Tony Proctor wrote:
>
> I didn't expand on the structure of the elements I'd proposed. However, if
> XML were going to be used then their design would have to follow best
> practices to ensure that a schema-based validation was possible.

Agreed. One of the advantages of XML is that validation against a
schema makes it possible to reject a document outright even if only one
small part of it fails. That's what will keep the unofficial variations
out.

Because schema references are in the form of URLs an application could
keep abreast of the latest schemas even if it wasn't able to use newly
defined elements. This would, of course, enable a company to define its
own extended schema but unless it published it on the web it would be
automatically failed. And a program would be able to check that the
schema came from the official site and reject it if it didn't.
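
In XML terms that might look like the following, where the namespace URI and
schema location are of course invented for illustration:

<GenealogyData
    xmlns="http://www.example.org/genealogy/v1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.org/genealogy/v1
                        http://www.example.org/genealogy/v1/schema.xsd">
  ...
</GenealogyData>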

Ian Goddard

Oct 1, 2011, 9:24:30 AM
Tony Proctor wrote:
> Lineage
> Formats like XML provide an automatic way of depicting a top-down
> hierarchical relationship. Unfortunately, genealogical lineage is really a
> 'network' rather than a pure 'hierarchy'. In effect, a simple nesting of
> "offspring" under their associated "parents" is insufficient.

Whilst XML does offer this alternative you don't have to use it. It's
quite acceptable to have something along the lines of:

<Wrapper>
<Person>...</Person>
<Person>...</Person>
<Family>...</Family>
<Place>...</Place>
</Wrapper>

In fact Gramps uses something along these lines.

> There's also a problem with a top-down approach unless a specific union
> between two people has a single representation in the data, but that then
> causes further problems with the nature and the lifetime of that union. For
> instance, if the father and mother have separate representations in the
> data, and they each have links to their associated common offspring, then it
> makes it difficult to bring the information together to identify family
> units, and also to ensure that there exist two links to each offspring.
>
> I believe it's easier to use a bottom-up representation. Each person has
> just one progenitive father and one progenitive mother and so can have
> upward links to their appropriate parents (where known). For instance,
> <Father> and <Mother> elements. This also makes it easy to have other types
> of parentage including <Guardian>, <FosterMother>, <AdoptedMother>, etc.

Taking an OO approach to design I'd start off with a very broad concept
such as Association which could then have a Family subclass, a
Guardianship subclass etc. You'd then have links of various types to
associate the individuals with the association and their role in the
association - father, mother, child and a set of rules - 0 or 1 father,
0 or 1 mother, 0 to many children.

However, it would also be possible to add further subclasses with their
own rule sets for things like BusinessPartnership to describe the family
business. It's an extensible approach.
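
Serialised into XML, such an Association might look something like this.
It's only a sketch - the role rules would live in the schema and all the
names are invented:

<Association Type="Family" ID="A1">
  <Role Type="Father" PersonRef="I1"/>   <!-- 0 or 1 -->
  <Role Type="Mother" PersonRef="I2"/>   <!-- 0 or 1 -->
  <Role Type="Child" PersonRef="I3"/>    <!-- 0 to many -->
</Association>
<Association Type="BusinessPartnership" ID="A2">
  <Role Type="Partner" PersonRef="I1"/>
  <Role Type="Partner" PersonRef="I4"/>
</Association>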

singhals

Oct 1, 2011, 10:03:56 AM10/1/11
to gen...@rootsweb.com
Tony Proctor wrote:

> OK, I probably misrepresented the character set. Although ANSEL has an ANSI
> designation, there isn't a lot that uses it, is there? Plus I question the
> validity of using it as a computer exchange format in the first place. It is
> no surprise that software such as FTM doesn't acknowledge it Peter.

Ummm, I wouldn't /exactly/ call FTM the best choice as an
example of what's best in computer programs.

It's certainly widely popular and widely available, but it
does have its little eccentricities which can drive you
crazy. So, the fact that it does or doesn't do P Q or R
doesn't mean other programs do or don't.

Cheryl

Peter J. Seymour

Oct 1, 2011, 11:28:10 AM
On 2011-10-01 10:35, Tony Proctor wrote:
> "Peter J. Seymour"<Newsg...@pjsey.demon.co.uk> wrote in message
> news:rIAhq.4$Mi...@newsfe18.ams2...
>> On 2011-09-30 16:38, Tony Proctor wrote:
>>>
>>> I have to disagree with you there Nigel.
>>>
>>> For a start, GEDCOM is not standards based, even to the point of
>>> inventing
>>> its own character set.
>>
>> You are perhaps getting a bit too argumentative here. Gedcom is its own
>> standard. Presumably regarding character sets, you are referring to ANSEL.
>> Are you claiming that ANSEL is not a standard?
>>
>> .....
>>
>> The post I made cited several different modern
>>> standards, including ISO 8601 which Gedcom 5.5.1 still doesn't address.
>> ......
>>
>> And presumably never will, but that doesn't matter. It has a defined date
>> format and it converts readily to ISO 8601 or whatever.
>>
>> Peter
>
> OK, I probably misrepresented the character set. Although ANSEL has an ANSI
> designation, there isn't a lot that uses it, is there? Plus I question the
> validity of using it as a computer exchange format in the first place. It is
> no surprise that software such as FTM doesn't acknowledge it Peter.

As I understand it, ANSEL originated in the early days of computing as a
standard for American library computer systems and focussed on
accommodating the character sets of certain languages. It was
effectively obsoleted by UTF-8. Another example of Gedcom showing its age.

.....


> Much of my own research
> contains narrative and I have no option but to store it separately in
> Word/pdf/etc documents. It then becomes sidelined and wouldn't get used by a
> desktop tool.
>
.....

I sympathise. I have a similar problem. I intend to improve the text
handling facilities in Gendatam Suite but I have had other priorities.

Peter

Tony Proctor

Oct 1, 2011, 12:38:28 PM

"Ian Goddard" <godd...@hotmail.co.uk> wrote in message
news:9eoike...@mid.individual.net...
Sounds like a reasonable approach Ian. I've tried to steer clear of any OO
aspects in this discussion because the storage format has some specific
requirements of its own.

When I gave this stuff a lot more thought - a couple of years ago - I
separated 'storage format' (i.e. interchange, import/export, backup, or load
format) from the run-time 'object model', and from the 'indexed storage'
(e.g. a database). Stuff I'd read before then never gave a clear cut
distinction between these and what requirements they might each have
separately. For instance:

Storage format. This is a definitive storage format - as discussed
throughout this thread - and not the indexed format. Giving this a standard
would allow import/export and other types of exchange without having to
mandate a particular database format. Similar in concept to GEDCOM but a lot
more far-reaching.

Indexed storage. All or part of the storage format could be loaded into a
database. That might be a standard relational one or a proprietary one. It
doesn't really matter since it's the choice of the designer of the software
unit. If they felt SQL databases were hamstrung then they might invent
another one, although it would be a mammoth task in itself. This is how
multi-dimensional OLAP databases came about - something I was heavily
involved in once upon a time.

Object model. This is the run-time object model, used in memory and
communications. This is where the OO aspect comes in. I believe there should
be a standard object model for run-time interoperability. This is a step
beyond offline import/exchange. It would allow live co-operation between
software units holding separate trees - whether on the same computer or not,
and irrespective of whether they were from the same vendor or not - and
allow comparison, merging, etc. This is probably a pipedream but I can foresee
family history being published "in the cloud" and your software being able
to connect to it and access parts of it in a controlled way. This is way
different to viewing someone's published pedigree on Ancestry, or wherever,
using a thin-client interface.


...I may go back to this when I finally kick the paid job :-)

Tony Proctor


Tony Proctor

Oct 1, 2011, 12:43:52 PM

"Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote in message
news:j643n7$j15$1...@reader01.news.esat.net...
>
I went back through some of my research, Ian, to check whether I'd
distinguished facts from inference. Of course I knew which was which but I
hadn't made it explicit. Bad boy Tony!

It was useful because I also found a third variation: conjecture. Whereas
inference could be linked to a logical analysis of available facts, and was
merely awaiting some level of substantiation, conjecture is really a
poorly-disguised alternative term for 'guess'.

Starting to sound like something that could be generalised.

Tony Proctor


Ian Goddard

Oct 1, 2011, 3:21:38 PM
An interesting experience.

I just paid a visit to the BetterGEDCOM project. They seem to have
en-mired themselves in a waterfall approach and have spent weeks if not
months wrangling a WHAT. What WHAT are they wrangling? Source type,
something which can be simply identified as a piece of data. All they
need is to concentrate on HOW. I left them an example:

<wrapper>
  <Source type="archive" ID="660f78b6-ec5a-11e0-b261-001636e96075">
    <ParentID/>
    <SourceName>Yorkshire Archaeological Society Archive</SourceName>
    <ShortName>Yorks Arch Soc Archive</ShortName>
    <BriefName>YAS Archive</BriefName>
    <AdHoc>
      <Item Name="Address" Value="Claremont"/>
      <!-- Add as many Items as required -->
    </AdHoc>
  </Source>
  <Source type="collection" ID="2a2ffc84-ec5b-11e0-a2f8-001636e96075">
    <ParentID>660f78b6-ec5a-11e0-b261-001636e96075</ParentID>
    <SourceName>H. L. Bradfer-Lawrence Collection</SourceName>
    <ShortName>Bradfer-Lawrence Collctn</ShortName>
  </Source>
  <Source type="collection" ID="9b4a55ea-ec5b-11e0-a42e-001636e96075">
    <ParentID>2a2ffc84-ec5b-11e0-a2f8-001636e96075</ParentID>
    <SourceName>Millar Collection</SourceName>
  </Source>
  <Evidence ID="0a77ffee-ec5c-11e0-b798-001636e96075">
    <ParentID>9b4a55ea-ec5b-11e0-a42e-001636e96075</ParentID>
    <EvidenceName>Gift with warranty MD335/5/108</EvidenceName>
    <Date>13th century</Date>
    <References>
      <Reference source="2a2ffc84-ec5b-11e0-a2f8-001636e96075">MD335/5/108</Reference>
      <Reference source="9b4a55ea-ec5b-11e0-a42e-001636e96075">Box 64 Millar 108</Reference>
    </References>
    <EvidentialObject mimeType="text/plain">
Gift with warranty MD335/5/108 [13th century]

Contents:
1. William de Fonte of Hennesale 2. Michael son of John de
Heck William has given to Michael one toft in the vill of Hennesale
(description given). To hold to Michael, rendering yearly to William 6
d. for all services. Witnesses: William son of Thomas de Povlington,
John de Heck, Henry de Goudale, Hugh his brother, John son of Adam de
Wittelay, William son of Adam of the same, William son of Mabel de
Snaith, Gamel son of Richard of the same, Ylard clerk of the same,
Thomas son of Godard de Mora. Bag for seal. Former number, in pencil
'202' [Former ref: Box 64 Millar 108]
    </EvidentialObject>
  </Evidence>
</wrapper>

and wondered if they'll produce anything usable before I join all the
links between me and Godard, father of Thomas.

Wes Groleau

Oct 2, 2011, 12:03:33 AM
On 09-28-2011 14:37, Tony Proctor wrote:
> Data values should be in a locale-neutral format, as with the source code
> for programming languages. For this reason, it is sometimes called using a
> 'programming locale'. This effectively means using a period in all decimal
> numbers (not a comma), ISO 8601 format for (Gregorian-)dates (e.g.
> yyyy-mm-dd), and unlocalised true/false or just 1/0 for booleans (e.g. for
> option selections).

For dates, I personally prefer GEDCOM's approach of allowing multiple
calendars, as long as the one being used is identified. ISO8601 (are
those the right digits? doesn't look right) isn't locale-neutral any
more than dd mmm yyyy is. Both are different locales, and it doesn't
matter which you use as long as which is clearly identified.

GEDCOM's flaws are not syntactic nor lexical, they are semantic.

XML does have the advantage of a wider selection of tools to work
with it, but if you put GEDCOM's data model into an XML syntax,
you haven't really accomplished anything.

Syntactically, one advantage of GEDCOM is that in a pinch,
humans can read it much more easily. In fact, for several
years, my database was a GEDCOM file, and my genealogy program
was Apple's TextEdit (similar to WordPad).

--
Wes Groleau

There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157

Wes Groleau

Oct 2, 2011, 12:20:46 AM
On 09-28-2011 14:37, Tony Proctor wrote:
> [a] Define a universal import/export format
> [b] Flexibility. Store virtually anything without having to bend any rules
> [c] Locale independence
> [d] Potential use as a definitive backup-format or a load-format for databases
> [e] Zero-loss when operating between different software units

[a] & [e] - Vendors won't comply with GEDCOM. Who's going to
make them comply with anything else?

[b] Impossible--though it ought to be possible to get closer
to this than GEDCOM does.

[c] Unnecessary and impossible. Whatever format is used _is_
either a new locale or a pre-existing one. The important thing
is that the format be defined. GEDCOM at least does that.

[d] GEDCOM can do this, too, except where it can't. :-) That's just
as much a matter of the DB being incompatible with GEDCOM as
it is GEDCOM being incompatible with the DB. The same thing
can happen with any other format.

It's not that I want to defend GEDCOM--there's a lot wrong with it.
But some of the proposals I've seen reappear from time to time are
fixing things that aren't broken while preserving what IS broken.

And the biggest problem of all is adoption.

Wes Groleau

Oct 2, 2011, 12:55:09 AM
On 09-28-2011 14:37, Tony Proctor wrote:
> Some thoughts on a better textual import/export format for genealogical use.

Most of these give the impression of being presented as something better
than GEDCOM when in fact they are things GEDCOM already supports.

But you have an important exception:
> I believe it's easier to use a bottom-up representation. Each person has
> just one progenitive father and one progenitive mother and so can have
> upward links to their appropriate parents (where known). For instance,
> <Father> and <Mother> elements. This also makes it easy to have other types
> of parentage including <Guardian>, <FosterMother>, <AdoptedMother>, etc.

I have long wished for almost this. Only, change "parentage" to
"relationship." Sibling, Uncle, Godfather, Mistress, Teacher, .....

Originally, GEDCOM said nobody is related to anybody, instead we're
all related to families, and there are only three relationships: HUSB,
WIFE, CHIL. Eventually, they recognized that there _are_ other
relationships, so they invented ASSO [1]. So now, there are two
classes of relationships, and the "main" ones are still required
to be indirect.

Let people be directly related to other people, and let them be put
in all sorts of groups, not merely one kind, i.e., the traditional family.

[1] Which I am unable to look at without imagining giving the person
a derogatory anatomical title. You have a husband, a wife, and some
children. Everyone else is an ASSO. :-)

Wes Groleau

Oct 2, 2011, 1:00:12 AM
On 10-01-2011 05:51, Ian Goddard wrote:
> Agreed. One of the advantages of XML is that validation against a
> schema makes it possible to reject a document outright even if only one
> small part of it fails. That's what will keep the unofficial variations
> out.

I doubt it. Having an "official schema" doesn't stop Microsoft from
changing things. They just create their own schema and pretend everyone
else is non-standard. (And they're just an example--others
do it, too.)

Wes Groleau

Oct 2, 2011, 1:06:39 AM
PLEASE! For a discussion of this complexity,
it would be very helpful to see

> point K

Response to point K

(instead of)

> point A
> point B
> .....
> point ZY
> point ZZ

Response to a point that is up there somewhere

Wes Groleau

Oct 2, 2011, 1:14:33 AM
On 09-30-2011 11:38, Tony Proctor wrote:
> GEDCOM might be fine for simple pedigrees but it leaves no room for the
> general narrative which forms a good part of family history. The proposal

Of course it does.

> here not only allows such narrative to be associated with people and places
> but allows it to be qualified according to Surety, Sensitivity, Source, and

Surety: from zero to 100% is just as subjective as quality (QUAY) from
one to four. Perhaps less, since some attempt was made to define what
each of the four levels meant.

Sensitivity: restriction is _already_ in the GEDCOM spec. You are
proposing a refinement, not something totally new.

Source: Surely you are not unaware of this in GEDCOM?

> (taking Ian's input) fact/inference. It also allows nesting of elements to

Distinguishing content of documents from conclusions in a structured
manner is definitely something GEDCOM lacks.

Wes Groleau

Oct 2, 2011, 1:18:30 AM
On 10-01-2011 11:28, Peter J. Seymour wrote:
> As I understand it, ANSEL originated in the early days of computing as a
> standard for American library computer systems and focussed on
> accommodating the character sets of certain languages. It was
> effectively obsoleted by UTF-8. Another example of Gedcom showing its age.

And indeed, ANSEL handled European languages better than modern
"eight-bit" codes. But it doesn't handle scripts with non-Latin
characters very well.

Wes Groleau

unread,
Oct 2, 2011, 1:23:48 AM10/2/11
to
On 10-01-2011 09:24, Ian Goddard wrote:
> Taking an OO approach to design I'd start off with a very broad concept
> such as Association which could then have a Family subclass, a
> Guardianship subclass etc. You'd then have links of various types to
> associate the individuals with the association and their role in the
> association - father, mother, child and a set of rules - 0 or 1 father,
> 0 or 1 mother, 0 to many children.

I'd rather have a wide variety of _relationships_ from one person to
another [1] directly [2], and a wide variety of types of groups that
might contain people in various _roles_.

[1] Perhaps they might allow one-to-many, i.e., a list, instead of
one to one.

[2] Instead of GEDCOM's indirect INDI->FAM->INDI model.

NigelBufton

unread,
Oct 2, 2011, 3:01:34 AM10/2/11
to
Tony Proctor pretended :
GEDCOM 5.5 provides for ANSEL, UNICODE and ASCII. UNICODE was added in
GEDCOM 5.3; before that ANSEL had to be used for non-ASCII characters.

Nigel Bufton


NigelBufton

unread,
Oct 2, 2011, 3:09:52 AM10/2/11
to
It happens that Ian Goddard formulated :
GEDCOM does this via the ASSO tag in the INDI record. Programs that are
compliant use this construct to do exactly that:
1 ASSO @I1234@
2 TYPE INDI
2 RELA Godfather
or
1 ASSO @F789@
2 TYPE FAM
2 RELA Witness at marriage

Nigel Bufton


NigelBufton

unread,
Oct 2, 2011, 3:31:38 AM10/2/11
to
Wes Groleau has brought this to us :
> On 10-01-2011 09:24, Ian Goddard wrote:
>> Taking an OO approach to design I'd start off with a very broad concept
>> such as Association which could then have a Family subclass, a
>> Guardianship subclass etc. You'd then have links of various types to
>> associate the individuals with the association and their role in the
>> association - father, mother, child and a set of rules - 0 or 1 father,
>> 0 or 1 mother, 0 to many children.
>
> I'd rather have a wide variety of _relationships_ from one person to
> another [1] directly [2], and a wide variety of types of groups that
> might contain people in various _roles_.
>
> [1] Perhaps they might allow one-to-many, i.e., a list, instead of
> one to one.
>
> [2] Instead of GEDCOM's indirect INDI->FAM->INDI model.

Although there is no Guardianship method, GEDCOM does provide for
adoption and fostering (and LDS sealing):
0 @I1@ INDI
1 FAMC @F1@
1 FAMC @F2@
2 PEDI adopted

The adoptive relationship can be further specified by the ADOP event:
1 ADOP
2 FAMC @F2@
3 ADOP HUSB

Admittedly we have two different sub-structures relating @I1@ to @F2@
which could lead to lack of integrity if a program did not manage the
situation according to the standard. However, GEDCOM is a communication
format, so programs should ensure that they communicate data in a state
of integrity.

Nigel Bufton


Tony Proctor

unread,
Oct 2, 2011, 6:01:17 AM10/2/11
to

"Wes Groleau" <Grolea...@FreeShell.org> wrote in message
news:j68nqm$qbg$1...@dont-email.me...
> There are two types of people in the world …
> http://Ideas.Lang-Learn.us/barrett?itemid=1157

It's definitely ISO 8601, Wes. See http://en.wikipedia.org/wiki/ISO_8601

I use this a lot at work. It was purposely defined for situations like
this. The ordering of elements is part of the standard rather than the
current locale. Also, the all-numeric format (yyyy-mm-dd) doesn't contain
any localised names such as Jan, January, etc.
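
For instance (the <Date> element name here is only for illustration):

<Date Value="1911-04-02"/>              a plain calendar date
<Date Value="1911-04-02T21:30:00Z"/>    a date/time pinned to UTC

Both forms sort chronologically as plain strings, which is part of the
appeal.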

Tony Proctor


Tony Proctor

unread,
Oct 2, 2011, 6:04:16 AM10/2/11
to

"Wes Groleau" <Grolea...@FreeShell.org> wrote in message
news:j68oqu$u3f$1...@dont-email.me...
> There are two types of people in the world …
> http://Ideas.Lang-Learn.us/barrett?itemid=1157

I disagree strongly with your assessment of (c) Wes. It not only is possible
but it is being done all the time by (good-)XML designers, and there are
standards supporting it. Anyone putting locale-dependent data in public XML
content has sort of missed the point.

Tony Proctor


Tony Proctor

unread,
Oct 2, 2011, 6:05:36 AM10/2/11
to

"Wes Groleau" <Grolea...@FreeShell.org> wrote in message
news:j68qre$76c$1...@dont-email.me...
> There are two types of people in the world …
> http://Ideas.Lang-Learn.us/barrett?itemid=1157

Agreed, Wes. I was using "parentage" more in the graphical sense than the
biological one. Apologies for that.

Tony Proctor


Tony Proctor

unread,
Oct 2, 2011, 6:09:20 AM10/2/11
to

"Wes Groleau" <Grolea...@FreeShell.org> wrote in message
news:j68rh0$9sb$1...@dont-email.me...
> PLEASE! For a discussion of this complexity,
> it would be very helpful to see
>
> > point K
>
> Response to point K
>
> (instead of)
>
> > point A
> > point B
> > .....
> > point ZY
> > point ZZ
>
> Response to a point that is up there somewhere
>
> --
> Wes Groleau
>
> There are two types of people in the world …
> http://Ideas.Lang-Learn.us/barrett?itemid=1157

Lost me on this one, I'm afraid...

Tony Proctor


Ian Goddard

unread,
Oct 2, 2011, 9:54:09 AM10/2/11
to
Wes Groleau wrote:
> On 10-01-2011 09:24, Ian Goddard wrote:
>> Taking an OO approach to design I'd start off with a very broad concept
>> such as Association which could then have a Family subclass, a
>> Guardianship subclass etc. You'd then have links of various types to
>> associate the individuals with the association and their role in the
>> association - father, mother, child and a set of rules - 0 or 1 father,
>> 0 or 1 mother, 0 to many children.
>
> I'd rather have a wide variety of _relationships_ from one person to
> another [1] directly [2], and a wide variety of types of groups that
> might contain people in various _roles_.
>
> [1] Perhaps they might allow one-to-many, i.e., a list, instead of
> one to one.
>
> [2] Instead of GEDCOM's indirect INDI->FAM->INDI model.
>

I think there are three possible models:

1. Direct - Ind to Ind.

2. Indirect - Ind to Assoc to Ind.

3. Double indirect - Ind to Link to Assoc to Link to Ind

I prefer the latter.

Say I send you a file which shows Young Fred as son of Old Fred (1st
form) or as a member of Old Fred's family (2nd form). This requires the
relationship to be expressed as some sort of pointer in either Young
Fred, Old Fred or the Family entity depending on the model and maybe in
more than one entity if we decide pointers have to be reciprocal. I
then realise that there was a different Old Fred so I update my data to
reflect this. We now have two versions of at least one entity floating
about: the original, of which you have a copy, and my corrected one.
This is not a good situation.

If we use the double indirect version none of these entities need to be
changed. All the potentially labile information is contained in the
Link entities and we can then be free about changing our minds. If, in
the example, you're of the view from the off that Young Fred is actually
the son of Old Bill, you can simply discard my link and substitute your
own without changing any of the core entities.
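
To sketch the double indirect form in XML (all names invented for the
purpose):

<Individual ID="I1"/>                                 <!-- Young Fred -->
<Individual ID="I2"/>                                 <!-- Old Fred -->
<Association ID="A1" Type="Family"/>
<Link Individual="I1" Association="A1" Role="Child"/>
<Link Individual="I2" Association="A1" Role="Father"/>

Deciding that Young Fred's father was really Old Bill means replacing
the last <Link> element only; the Individual and Association entities
are never touched.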

Tony Proctor

unread,
Oct 2, 2011, 10:37:43 AM10/2/11
to

"Tony Proctor" <tony@proctor_NoMore_SPAM.net> wrote in message
news:j67fkb$7bs$1...@reader01.news.esat.net...
Sorry, ignore that. A conjecture, of course, is simply an inference on very
limited evidence. Hence, it is already catered for using the <Surety>
attribute :-)
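
For example, something along these lines (the element name and values
are merely illustrative):

<Note Surety="20">Possibly the same James Smith who appears in the 1841
census</Note>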

Tony Proctor


Ian Goddard

unread,
Oct 2, 2011, 12:41:50 PM10/2/11
to
Wes Groleau wrote:
> On 10-01-2011 05:51, Ian Goddard wrote:
>> Agreed. One of the advantages of XML is that validation against a
>> schema makes it possible to reject a document outright even if only one
>> small part of it fails. That's what will keep the unofficial variations
>> out.
>
> I doubt it. Having an "official schema" doesn't stop Microsoft from
> changing things. They just create their own schema and pretend everyone
> else is non-standard. (And they're just an example--others
> do it, too.)
>

I suppose one determining factor is whether they can get away with it.
Clearly nobody could get away with an attempt to make their own tweaks
to something like TCP/IP because it just wouldn't work at all.

ISTR that MS had their own version of some XML technology - schemas or
XSL - because the official version wasn't out quickly enough but enabled
use of the standard when it came along; I'd guess the non-standard
version must be dead by now. Again, in multi-vendor situations a
non-standard implementation would lead to exclusion.

There's also a legal option here - have the format owned by a
foundation, trademark the name and grant a licence to use it only on
condition that a product claiming compatibility validates all documents
on import and export, rejects invalid documents and validates only
against schemas from the foundation's site. It wouldn't stop anyone
from using tweaked versions but they could be sued for trademark
violations by the trademark owner and, depending on the jurisdiction,
sued or even prosecuted under consumer protection legislation if they
tried to claim compatibility.

The other factor is need. I presume the driver for GEDCOMish (for want of
a better word) variations is the inability of developers to represent
their data adequately using only the standard. If a protocol were developed
which was sufficient for their needs there would be no need for vendors
to tweak it except to lock in users. I'd like to think that eventually
users are going to get smart enough to realise that lock-in isn't to
their advantage although this may be simple optimism on my part.

Bob Melson

unread,
Oct 2, 2011, 2:42:17 PM10/2/11
to
On Saturday 01 October 2011 23:00, Wes Groleau (Grolea...@FreeShell.org)
opined:

> On 10-01-2011 05:51, Ian Goddard wrote:
>> Agreed. One of the advantages of XML is that validation against a
>> schema makes it possible to reject a document outright even if only one
>> small part of it fails. That's what will keep the unofficial variations
>> out.
>
> I doubt it. Having an "official schema" doesn't stop Microsoft from
> changing things. They just create their own schema and pretend everyone
> else is non-standard. (And they're just an example--others
> do it, too.)
>

Hear, hear! M$oft has the habit of creating its own competing standards
out of whole cloth and attempting to force others to use/accept them.
Anybody remember the not-so-distant controversy over the open
document standard?
--
Robert G. Melson | Rio Grande MicroSolutions | El Paso, Texas
-----
The greatest tyrannies are always perpetrated
in the name of the noblest causes -- Thomas Paine

Tony Proctor

unread,
Oct 2, 2011, 3:44:25 PM10/2/11
to

"Bob Melson" <amia...@mypacks.net> wrote in message
news:MeWdnR3NLrYULRXT...@earthlink.com...
Yup! Happened to me too with OLAP databases. Didn't matter whether there
were "open standards" created by a representative group, or whether you had
patents registered. They just trump the whole lot with something tied into
their own technology stack and then rely on volume of sales to force it to
the top, whether it's better or worse.

Tony Proctor


Wes Groleau

unread,
Oct 2, 2011, 4:52:41 PM10/2/11
to
On 10-02-2011 12:41, Ian Goddard wrote:
> There's also a legal option here - have the format owned by a
> foundation, trademark the name and grant a licence to use it only on
> condition that a product claiming compatibility validates all documents
> on import and export, rejects invalid documents and validates only
> against schemas from the foundation's site. It wouldn't stop anyone
> from using tweaked versions but they could be sued for trademark
> violations by the trademark owner and, depending on the jurisdiction,
> sued or even prosecuted under consumer protection legislation if they
> tried to claim compatibility.

A foundation like ISO? This approach worked for Java, until Microsoft
decided to tamper with it and made their own version that broke the
portability between them and everyone else. Sun sued them to make
them stop using the name. Eventually, MS lost, but their solution
was to market "dot-Net" which was effectively the same thing in
functionality but still not portable.

LDS had a half-hearted version of the approach. AFAIK, they never
stopped anyone from using "GEDCOM", but they did offer an "official
certification" of your software if you jumped through a few hoops.

As you try to supplant GEDCOM with something better, learn from a little
history: Sun was trying to create a product, but kept having
delays because they were using the most error-prone popular language
there is (C). Finally, they decided to solve the problem by creating
a language in which the kinds of errors these experienced C programmers
kept making were impossible. They didn't do much about the kinds of
errors INexperienced C programmers make. They also deliberately banned
things they considered unsafe without bothering to do their homework
and find out how other languages had made those things safe.

Result: a language much better than C but far worse than it could have
been. But an improvement is an improvement, right? But is everyone
programming in Java now? No, C is just as popular as ever, C# may be
more popular than Java, and plenty of other languages (some of them
better than Java or C#) are still in wide use.

Short version: They made a better language, but it didn't get the
adoption they hoped, and it was largely replaced by another that
was not a significant improvement but was incompatible.

Wes Groleau

unread,
Oct 2, 2011, 4:56:44 PM10/2/11
to
On 10-02-2011 06:09, Tony Proctor wrote:
> Lost me on this one I'm afraid...

http://www.dmoz.org/Computers/Usenet/Etiquette/

No big deal on short posts, but when you quote five screens of stuff and
at the end put a comment on something somewhere in the middle, ....

--
Wes Groleau

There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157

Wes Groleau

unread,
Oct 2, 2011, 4:59:09 PM10/2/11
to
On 10-02-2011 03:31, NigelBufton wrote:
> Admittedly we have two different sub-structures relating @I1@ to @F2@
> which could lead to lack of integrity if a program did not manage the

Three. There is also the ASSO.

Tony Proctor

unread,
Oct 2, 2011, 5:02:33 PM10/2/11
to

"Wes Groleau" <Grolea...@FreeShell.org> wrote in message
news:j6aiuq$4ea$1...@dont-email.me...
> There are two types of people in the world …
> http://Ideas.Lang-Learn.us/barrett?itemid=1157

Hmm. I don't want to sidetrack this thread but Java is very good nowadays,
especially since 'generics' were added. I agree they could have made it this
good right from the start, and their support classes did undergo a few
serious revisions.

Most people I know continued to use C because of familiarity, or fears over
performance. It may have changed recently but M$soft were once accused of
not using their own .Net languages for anything that they sold themselves.

I never did like C anyway, and C++ has a vile syntax IMHO. We can all
pick and choose our history lessons, but without any progress at all
we'd still be writing in assembler :-)

Tony Proctor


Wes Groleau

unread,
Oct 2, 2011, 5:04:43 PM10/2/11
to
On 10-02-2011 06:01, Tony Proctor wrote:
> I use this a lot with work. It was purposely defined for situations like
> this. The ordering of elements is part of the standard rather than the
> current locale. Also, the all-numeric format (yyyy-mm-dd) doesn't contain
> any localised names such as Jan, January, etc

Not for situations "like this." "This" needs support for ranges,
approximations, uncertainties, one or two of the three parts being
unknown (which the current GEDCOM only _partly_ handles).
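
For comparison, GEDCOM already accepts forms like these (all legal 5.5
date values, if I remember the spec correctly):

2 DATE ABT 1870
2 DATE BET 1870 AND 1875
2 DATE BEF MAR 1871
2 DATE JUL 1914

That is: approximate, a range, an open-ended "before", and a month and
year with the day unknown. A replacement format needs at least that much.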

Localization is trivial, and has been solved in several ways already.

As for ordering, every date representation scheme has only one ordering
that makes any sense, whether it is explicitly stated or not.

--
Wes Groleau

There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157

Wes Groleau

unread,
Oct 2, 2011, 5:13:57 PM10/2/11
to
On 10-02-2011 06:04, Tony Proctor wrote:
> I disagree strongly with your assessment of (c) Wes. It not only is possible
> but it is being done all the time by (good-)XML designers, and there are
> standards supporting it. Anyone putting locale-dependent data in public XML
> content has sort of missed the point.

A "locale" is a scheme for representing one or more of dates,
numbers, times, etc. How is what you want us to use not a
"scheme for representing …"

There needs to be a standard way of representing them, yes.
But that way is effectively another "locale".

For dates, changing from the GEDCOM "standard" to ISO 8601 would
make implementation of sorting simpler at the cost of dropping
a lot of existing flexibility.

--
Wes Groleau

There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157

Wes Groleau

unread,
Oct 2, 2011, 5:20:41 PM10/2/11
to
On 10-02-2011 17:02, Tony Proctor wrote:
> I never did like C anyway, and C++ has a vile syntax IMHO. We can all
> pick-and-choose our history lessons but without any progress at all then
> we'd still be writing in assembler :-)

I didn't say we have to stick to GEDCOM. Just a caution about thinking
that your/our/that other alternative is going to save the day.

Adoption depends on people, and people are hard to predict.
Decades of griping about the flaws in GEDCOM and dozens of
alternative proposals have so far failed to make any
significant difference.

I was once rather vocal about my gripes, but I've pretty
much given up. Haven't changed my opinions about its
deficiencies, but I also haven't changed my opinion that,
bad as it is, it's still BETTER than most of the implementations
of it.

--
Wes Groleau

There are two types of people in the world …
http://Ideas.Lang-Learn.us/barrett?itemid=1157

Ian Goddard

unread,
Oct 3, 2011, 5:32:18 AM10/3/11
to
Wes Groleau wrote:
> On 10-02-2011 12:41, Ian Goddard wrote:
>> There's also a legal option here - have the format owned by a
>> foundation, trademark the name and grant a licence to use it only on
>> condition that a product claiming compatibility validates all documents
>> on import and export, rejects invalid documents and validates only
>> against schemas from the foundation's site. It wouldn't stop anyone
>> from using tweaked versions but they could be sued for trademark
>> violations by the trademark owner and, depending on the jurisdiction,
>> sued or even prosecuted under consumer protection legislation if they
>> tried to claim compatibility.
>
> A foundation like ISO?

No. I had in mind various foundations from the FOSS world.

> This approach worked for Java, until Microsoft
> decided to tamper with it and made their own version that broke the
> portability between them and everyone else. Sun sued them to make
> them stop using the name. Eventually, MS lost, but their solution
> was to market "dot-Net" which was effectively the same thing in
> functionality but still not portable.
>
> LDS had a half-hearted version of the approach. AFAIK, they never
> stopped anyone from using "GEDCOM", but they did offer an "official
> certification" of your software if you jumped through a few hoops.

AIUI one of the requirements of trademark law is that you do make proper
efforts to enforce it.

> As you try to supplant GEDCOM with something better, learn from a little
> history:

Indeed. Recent history shows that, providing the standard is good enough,
it tends not to get fractured. The market would reject deviants. For
instance I haven't heard of anyone trying to impose their own variations
of PDF, MP3, JPEG, etc; it just wouldn't be worth the effort. Even the
example you quoted above proves the point although that needed a trip to
court.

But you missed my main point, which is that the validation mechanism of
XML makes it straightforward to confirm whether the product does what it
says on the tin - or in this case, on the shrink-wrapped box. YMMV but
over here one remedy for an aggrieved -punter- consumer is to take his
complaint to Trading Standards who have considerable powers including
prosecution.

This line of discussion ignores one issue, however. Are the variations
on GEDCOM really attempts to lock in customers, or simply uncoordinated
attempts to extend the original format beyond its intended scope?

tms

unread,
Oct 3, 2011, 2:46:45 PM10/3/11
to
On Oct 1, 5:51 am, Ian Goddard <godda...@hotmail.co.uk> wrote:
> Tony Proctor wrote:
>
> > I didn't expand on the structure of the elements I'd proposed. However, if
> > XML were going to be used then their design would have to follow best
> > practices to ensure that a schema-based validation was possible.
>
> Agreed.  One of the advantages of XML is that validation against a
> schema makes it possible to reject a document outright even if only one
> small part of it fails.  That's what will keep the unofficial variations
> out.

That will also keep anyone from using such a program. Imagine, you
just downloaded a database from somewhere that contains a vital clue
you have been searching for for years, but your genealogy program
refuses to load the data because it contains an unrecognized tag, say
one of the tags I add to SOUR records to help BibTeX format them
nicely. Will you: 1) praise your program for being so diligent, or 2)
curse it for not letting you get the data you want, and switch to
another program?

XML schema are useful in some circumstances, but not when the data are
coming from multiple uncontrolled sources.
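
You _can_ write a schema that tolerates unknown tags, something like
this XSD fragment (a minimal sketch only):

<xs:complexType name="SourceType">
  <xs:sequence>
    <xs:element name="Title" type="xs:string"/>
    <xs:any namespace="##other" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

But the moment a schema leans on that loophole, strict validation is no
longer keeping the "unofficial variations" out.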

> Because schema references are in the form of URLs an application could
> keep abreast of the latest schemas even if it wasn't able to use newly
> defined elements.  This would, of course, enable a company to define its
> own extended schema but unless it published it on the web it would be
> automatically failed.  And a program would be able to check that the
> schema came from the official site and reject it if it didn't.

So users could not import data unless they were connected to the net?
And users would not be allowed to create their own tags, as Gedcom
allows?

tms

unread,
Oct 3, 2011, 3:07:53 PM10/3/11
to
On Oct 1, 5:35 am, "Tony Proctor" <tony@proctor_NoMore_SPAM.net>
wrote:
>
> Much of my own research
> contains narrative and I have no option but to store it separately in
> Word/pdf/etc documents. It then becomes sidelined and wouldn't get used by a
> desktop tool.

What's wrong with putting it in NOTEs? That's what I do, complete
with LaTeX markup. Works just fine.
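
A made-up sample of the sort of thing I mean (the citation key is
invented):

1 NOTE He was a \emph{cordwainer}, not a cobbler; see the apprenticeship
2 CONT indenture \cite{smith1822indenture}.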

Tony Proctor

unread,
Oct 3, 2011, 3:58:01 PM10/3/11