Attribute date sorting revisited and date ranges explained


Swanfoth

Feb 19, 2012, 9:54:49 PM
to GEDitCOM II Discussions
Attribute dates are still not sorted properly!
When a date range starts in the same year as a single-year date, the
range comes first, which is wrong.
For example, 1855 should come before 1855–1866. It's so simple. No fuzzy
logic or artificial intelligence required. Can we please have this already…

Although the GEDCOM standard doesn't say it explicitly, it's
semantically obvious that the two date range expressions have different
purposes and are not interchangeable:
BET x AND y
is meant for single events like birth or death where the exact date is
unknown and only an approximate date range can be estimated.
FROM x TO y
is meant for attributes that have duration, like residence and
occupation. I.e., people can live in a given residence for various lengths
of time, and the beginning and ending dates of the stay are given with
this expression. As I use only year accuracy for attributes, it's
sometimes a single year, because they soon move to another place, so a
single year is enough and no range is needed.
Why would there be two different date range expressions if there's no
difference between them?
QED
Cheers

John Nairn

Feb 20, 2012, 12:47:08 AM
to geditcom-ii...@googlegroups.com
On Feb 19, 2012, at 6:54 PM, Swanfoth wrote:

> Attribute dates are still not sorted properly!
> When date range starts with a same year as a single year date, it
> comes first which is wrong.
> Like 1855 should be before 1855–1866. It's so simple. No fuzzy logic
> nor artificial intelligence required. Can we please have this already…

Actually, it is not so simple, but it is solvable, and GEDitCOM II is supposed to be using the most rational approach. Given two date ranges, you can construct a probability distribution for date 1 being separated from date 2 by t days. For dates like 1855 and 1855-1866, that function crosses zero. In other words, 1855 might be after 1855 to 1866 (such as if the first turns out to be Oct 1855 while the second turns out to be Mar 1855). You can integrate this function to find the probability that t is less than 0 or greater than 0. For 1855 and 1855-1866, fuzzy math methods say:

1. 1855 is after 1855-1866 with a probability of 1/24
2. 1855 is before 1855-1866 with a probability of 23/24

Working out all the details, date 1 is more than 50% likely to be before date 2 if the midpoint of date 1 is before the midpoint of date 2. GEDitCOM II uses this fuzzy logic in date sorts. It therefore should be sorting 1855 before 1855-1866, but I tried it and it didn't. I will have to look into what is happening.
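
As a rough check on the 1/24 and 23/24 figures, here is a minimal Python sketch. It is not GEDitCOM II's code. It assumes each date is equally likely to fall on any day of its range and that the two dates are independent; the function name `prob_after` is made up for this example.

```python
from datetime import date
from fractions import Fraction

def prob_after(range1, range2):
    """P(date1 falls strictly after date2), each uniform over the days of its inclusive range."""
    s1, e1 = range1[0].toordinal(), range1[1].toordinal()
    s2, e2 = range2[0].toordinal(), range2[1].toordinal()
    n1, n2 = e1 - s1 + 1, e2 - s2 + 1
    # For each possible day x of date1, count the days of range2 strictly before x.
    earlier = sum(min(max(x - s2, 0), n2) for x in range(s1, e1 + 1))
    return Fraction(earlier, n1 * n2)

year_1855 = (date(1855, 1, 1), date(1855, 12, 31))
span_1855_1866 = (date(1855, 1, 1), date(1866, 12, 31))

p_after = prob_after(year_1855, span_1855_1866)   # 1855 after 1855-1866
p_before = prob_after(span_1855_1866, year_1855)  # 1855 before 1855-1866
print(float(p_after), float(p_before))  # close to 1/24 and 23/24
```

The two values do not add up to exactly 1 because counting whole days leaves a small chance that both dates land on the same day.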

>
> Although the GEDCOM standard doesn't say it explicitly, it's
> semantically obvious the two date range expressions have different
> purpose and are not interchangeable:
> BET x AND y
> Is meant for single events like birth or death where the exact date is
> unknown and only an approximate date range can be estimated.
> FROM x TO y
> is meant for attributes that have duration like residence and
> occupation. Ie. people can live in a given residence various lengths
> of time and the beginning and ending dates of this stay is given with
> this expression. As I use only year accuracy for attributes it's
> sometimes a single year, because they move soon to another place, so a
> single year is enough and no range is needed.
> Why would there be two different date range expressions if there's no
> difference between them?
> QED
> Cheers

The GEDCOM standard is explicit, although GEDitCOM II does not really distinguish these two types of dates. Both are considered a date range for date calculations. Users can use them to indicate range or period as needed.

Here is how the standard describes it:

BET = Event happened some time between date 1 AND date 2. For example, bet 1904 and 1915 indicates that the event state (perhaps a single day) existed somewhere between 1904 and 1915 inclusive.

FROM = Indicates the beginning of a happening or state.
TO = Indicates the ending of a happening or state.

The BET/AND is called a "date range" and the FROM/TO is called a "date period."

John Nairn

John Nairn, Developer

Feb 20, 2012, 1:35:47 AM
to GEDitCOM II Discussions
I looked closer and GEDitCOM II actually does treat FROM 1855 TO 1866
differently than BET 1855 AND 1866. The former is a "Date Period" and
the latter is a "Date Range" as described in the GEDCOM standard. I
think it is doing it a good way (and the current way was in part a
request by users).

1. When comparing two date ranges, they are compared using their
midpoint. This is using fuzzy math to rank dates as "before" if their
chance of being before is more than 50%.

2. When comparing dates to a "Date Period", the period is compared
using the start of the period rather than the midpoint. The logic
being that "FROM 1855 TO 1866" means a state that existed on all those
days. Thus most likely (with a probability of 364/365) that state started
before the date range 1855 (meaning one date in 1855).

This logic was introduced a while ago, because in certain cases (I
forget which) it was much better than other logic. The sorting you get
will depend on whether the date 1855-1866 is a date range or a date period
and also on whether 1855 is a date range or a date period. Here are the
results:

1. 1855 is after FROM 1855 TO 1866 (because second date mostly started
before first date)
2. 1855 is before BET 1855 AND 1866 (more than 50% chance it comes
first)
3. FROM 1855 TO 1855 is same order as FROM 1855 TO 1866 (both started
on the same day)

One issue is that 1855 by itself is always treated as a range (BET 1
JAN 1855 AND 31 DEC 1855). If you want to specify it as a date period,
you have to explicitly use the awkward FROM 1855 TO 1855 or the
identical FROM 1 JAN 1855 TO 31 DEC 1855.
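
To make the two rules concrete, here is a small Python sketch of a sort key for year-only dates. It is only an illustration of the rules as described in this post, not the actual GEDitCOM II code; the function name `sort_key` and the year-only parsing are assumptions.

```python
import re
from datetime import date

def sort_key(text):
    """Return one number for sorting a year-only GEDCOM-style date expression."""
    years = [int(y) for y in re.findall(r"\d{4}", text)]
    start = date(years[0], 1, 1).toordinal()     # first day of the first year
    end = date(years[-1], 12, 31).toordinal()    # last day of the last year
    if text.strip().upper().startswith("FROM"):
        return float(start)       # date period: compare by its start
    return (start + end) / 2      # single date or BET/AND range: compare by midpoint

print(sort_key("1855") > sort_key("FROM 1855 TO 1866"))   # plain 1855 sorts after the period
print(sort_key("1855") < sort_key("BET 1855 AND 1866"))   # plain 1855 sorts before the range
print(sort_key("FROM 1855 TO 1855") == sort_key("FROM 1855 TO 1866"))  # same start, a tie
```

Because the key is a single number, two periods with the same start are indistinguishable to the sort, which is the tie in result 3.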

John Nairn

Jim Eggert

Feb 20, 2012, 10:31:54 AM
to geditcom-ii...@googlegroups.com
If all you're interested in is being above the 50% point on the probability distribution (for date sorting purposes), and if you are using a uniform probability distribution for date ranges, then it is sufficient simply to compare the midpoints of the ranges. Whichever range has the greater midpoint, then a random date selected from that range with a uniform distribution is mathematically more likely to be later than one similarly selected from the other range.

=Jim

John Nairn

Feb 20, 2012, 12:25:53 PM
to geditcom-ii...@googlegroups.com
On Feb 20, 2012, at 7:31 AM, Jim Eggert wrote:

> If all you're interested in is being above the 50% point on the probability distribution (for date sorting purposes), and if you are using a uniform probability distribution for date ranges, then it is sufficient simply to compare the midpoints of the ranges. Whichever range has the greater midpoint, then a random date selected from that range with a uniform distribution is mathematically more likely to be later than one similarly selected from the other range.
>
> =Jim
>

And that is exactly how GEDitCOM II compares date ranges. The catch is for date periods indicated as FROM date1 TO date2. It is kind of like comparing apples and oranges because a date range means an event happened one day in that range while a date period means a state that existed for all those dates.

For example, if a family lived in a residence FROM 1900 TO 1920 and one child was born in 1908, was that child born before or after they lived in the house? The full answer is both. The child was born after they moved into the house and before they moved out. Computer sorting algorithms, however, frown on ambiguous answers and some other decision is needed. If you use midpoints, the answer would be 1908 is before FROM 1900 TO 1920, but intuitively saying the "residence event" happened before the 1908 birth makes more sense (in my opinion). For that reason, GEDitCOM II compares date periods by using their start date rather than their midpoint.

If you really want a date range and not a date period, the date should be entered as BET date1 AND date2. The latter will be sorted by its midpoint.

John

Swanfoth

Feb 20, 2012, 11:16:35 PM
to GEDitCOM II Discussions
Still the results are wrong!
This is an elementary case within attributes—not mixing with events.
Let me explain again:

According to church communion books during 1844–1855 a person lives in
residence A and then in 1855 moves to residence B and stays only a few
months there and within the same year moves to another residence C
where he stays 1855–1866. I enter these numbers in GEDitCOM and sort
the residence attributes but GEDitCOM puts residence B last! That's
what you get with too much fuzzy logic and probability distributions
because they have nothing to do in this simple case! There is some
overlap within a year due to accuracy because exact dates are not
available and thus ignored. But that should not put the short stay
last as it logically should be in the middle…
Have I finally made myself clear?

Simon Robbins

Feb 21, 2012, 3:31:20 PM
to GEDitCOM II Discussions

I think GEDitCOM gets it right most of the time but clearly there will
be occasions where it doesn't. If fuzzy logic always produced the
right result it wouldn't really be "fuzzy".

It's not difficult to edit the GEDCOM manually for the few occasions
when the result is not as desired but perhaps a better way would be to
have a button next to each event attached to a script to "nudge" the
event up or down the list. I thought of writing such a script once
before but it didn't seem that much of an issue to me.

John Nairn

Feb 21, 2012, 4:55:21 PM
to geditcom-ii...@googlegroups.com
Actually GEDitCOM will get it right here if you are careful to enter dates for residences (which should be entered as a date period using FROM date1 TO date2) differently than dates for events (which are one date or a range BET date1 AND date2).

You should be entering

Residence A: FROM 1844 TO 1855
Residence B: FROM 1855 TO 1855
Residence C: FROM 1855 TO 1866

These will sort almost correctly. As I wrote before, date periods are sorted by their first date. Here FROM 1855 TO 1855 and FROM 1855 TO 1866 have the same start date (1 JAN 1855) and therefore may sort in either direction (depending on how they were ordered at the start). To solve this, you said you know that this person lived in Residence B for a few months in 1855 before moving into Residence C. Therefore Residence C cannot be from 1 JAN 1855. To document this knowledge, you could change it to

Residence A: FROM 1844 TO 1855
Residence B: FROM 1855 TO 1855
Residence C: FROM MAR 1855 TO 1866

This will now sort as you want for these residences. Residence B and C overlap. If you have more information on the dates, it should be entered into the residence events. Also note that Residence B must be entered with the strange FROM 1855 TO 1855 to have GEDitCOM II recognize it as a date period. This entry actually means FROM 1 JAN 1855 TO 31 DEC 1855 (and could be entered that way if you prefer). If you enter just 1855, it implies a date range and is identical to BET 1 JAN 1855 AND 31 DEC 1855. That date will sort by its midpoint and therefore will come after the start of Residence C.
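
Here is a short Python sketch of why the tie happens and how the month resolves it. It only illustrates sorting periods by their start date and is not GEDitCOM II's code; `period_start` is a made-up helper, Residence A uses the 1844 start given earlier in the thread, and Python's sort is stable, which may or may not match what GEDitCOM II does with ties.

```python
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

def period_start(text):
    """Earliest day covered by the start of a 'FROM ... TO ...' date period."""
    words = text.upper().split()
    start_words = words[1:words.index("TO")]   # tokens between FROM and TO
    year = int(start_words[-1])
    month = MONTHS[start_words[-2]] if len(start_words) >= 2 else 1
    day = int(start_words[-3]) if len(start_words) >= 3 else 1
    return date(year, month, day)

# Entered newest first, as happens when research works backward in time.
entered = [("C", "FROM 1855 TO 1866"),
           ("B", "FROM 1855 TO 1855"),
           ("A", "FROM 1844 TO 1855")]

tied = sorted(entered, key=lambda r: period_start(r[1]))
print([name for name, _ in tied])       # B and C tie on 1 JAN 1855, so entry order decides

fixed = [(n, "FROM MAR 1855 TO 1866") if n == "C" else (n, d) for n, d in entered]
resolved = sorted(fixed, key=lambda r: period_start(r[1]))
print([name for name, _ in resolved])   # C now starts 1 MAR 1855 and follows B
```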

A definition of fuzzy data is that it is not always clear how to sort - if it was the data would not be fuzzy. Another issue with fuzzy data is what is clear by human interpretation is not always clear to computer code. If you follow how GEDitCOM II is dealing with date periods and date ranges, however, you should always be able to get a sorting that makes sense.

Potentially GEDitCOM II could treat all residence dates as date periods, but it does not. All date fields can have a date range or a date period at the user's control. If you want residences to be a date period, they have to be entered that way. I will think about whether it makes sense to change that, but in general, I think it is better to give users control of data entry.

Swanfoth

Feb 22, 2012, 1:38:47 AM
to GEDitCOM II Discussions
Actually you should treat all attribute dates as date periods. A
single year just means that it started and ended within that year. And
if there is a true period date with the same start then the single
year comes first. No need to compute range midpoints. Entering 1855–
1855 looks just silly…
These moves are recorded in the source books as [page#/year] so
months are not available and I'd hate to invent artificial months just
to get the correct order when one if-statement in the code could handle it
easily.
Problem solved…
PS
I wonder if the date expressions can be nested by using other
expressions in either or both ends of the period dates like
FROM CAL 1777 TO BET 1784 AND 1786
That may bring too much complexity but could sometimes be handy… just
a thought not a request


John Nairn

Feb 22, 2012, 2:32:13 PM
to geditcom-ii...@googlegroups.com
That is an option, but not the simple if statement you suggest. Date fields are now all handled the same. To make attribute date fields different than event date fields, each one would have to look up its parent tag and compare it to a list of attributes vs. events. The parent is not always obvious, depending on where the date is accessed in the code. It would not be horribly difficult, but it would mean a lot more coding and a loss of elegance in the date field code. It might also slow down calculations that scan many dates at once.

I will look at other options instead. For example, perhaps it only matters to the sort command and can be done in that code and would not be needed anywhere else. Also, distinguishing 1855 from 1855 to 1866 is difficult in the current sort method because it is based on one number. Accounting for start date and end date would require the addition of a secondary sorting criterion (unless the range can be encoded into a single number somehow for efficiency).
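
For what it is worth, here is a minimal Python sketch of both ideas: a two-part key (start, then end) and the same information packed into one number. This is only an illustration under the assumption of year-only dates between 0 and 9999; the names `pair_key` and `packed_key` are made up and nothing here reflects how GEDitCOM II stores dates.

```python
# Year-only residences from the example in this thread, deliberately out of order.
residences = {"C": (1855, 1866), "B": (1855, 1855), "A": (1844, 1855)}

def pair_key(span):
    """Secondary criterion: compare by start year, then by end year."""
    start, end = span
    return (start, end)

def packed_key(span):
    """Same ordering in a single integer (assumes 0 <= year <= 9999)."""
    start, end = span
    return start * 10_000 + end

by_pair = sorted(residences, key=lambda name: pair_key(residences[name]))
by_packed = sorted(residences, key=lambda name: packed_key(residences[name]))
print(by_pair, by_packed)   # the short 1855 stay lands between the two longer ones
```

With day-level dates the same packing would need a larger multiplier than 10,000, for example one based on day counts instead of years.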

On nested dates such as FROM CAL 1777 TO BET 1784 AND 1786

These cannot be entered because GEDitCOM II has adopted all GEDCOM options for dates, and that option is not one of them. You do, however, have some ways to record such a date:

1. Any date can be followed by a comment in parentheses such as

FROM 1777 TO 1786 (1777 was calculated, end date may have been 1784)

The comment is never used in date calculations, but will be there for your records.

2. To do the same thing in a hidden field, you can use the custom GEDitCOM II option to attach a memo to the date (or any) field:

a. Enter FROM 1777 TO 1786 into the field
b. Control click on the field and choose "Attach/Edit Memo" from the pop-up menu
c. Enter any line of text (such as "1777 was calculated, end date may have been 1784") and then type return or enter

Now when you hover the cursor over that field, the comment will appear in a pop-up window rather than the help string for date fields. You can control click again to change the memo.

If you are exporting GEDCOM data to share with others using different software, the first option might be better. The second option exports a custom tag right after that date. If you want to both use memos and export GEDCOMs to share in other software, you can run the script "Export Data/Move Memos to Notes" before exporting your data and all memos for a record will appear in notes for the record. They will not be next to the actual date, but will be available for reference.

Swanfoth

Feb 23, 2012, 1:21:29 PM
to GEDitCOM II Discussions
I'm not talking about global sorting but sorting within attributes.
I have several farmhands and maids who switched residences almost
yearly in their youth. As the research goes backward in time and new
items are added last to the list, this is needed often because I want
to keep the events in chronological order.
Midpoint calculation should also work if you take the single-year date
as such and use the midpoints of period dates.
BTW how do you define the midpoint of BEF x or AFT x? Just wondering…

John Nairn

Feb 23, 2012, 2:01:39 PM
to geditcom-ii...@googlegroups.com
It actually doesn't matter if you globally sort (which the sort events command does now) or sort events and attributes separately. The sort command changes the order in which those events and attributes appear in the source GEDCOM data (note - you can even manually sort data in the GEDCOM source editor pane of the index window if the built-in sort does not get what you want). The Events pane in the individual record, however, displays events, followed by residences, followed by attributes. Even if an attribute was sorted before an event, it would appear among the attributes in the window and not among the events.

Residences in GEDCOM files are also a schizophrenic type of event/attribute. Over the years of developing GEDitCOM/GEDitCOM II, I have included residences in either events or attributes. The current version treats them as a unique type of event. The GEDCOM standard calls a residence an attribute, but unlike every other attribute, no text is allowed on the main RESI line to describe it as an attribute. The residence information is all in the subordinate detail for date, place, and address. Residences are therefore more like events because they also have no text on the first line, although an event can have the text "Y" to indicate that the event has occurred and residences do not formally allow that option.

Attribute dates and places are fairly uncommon. About the only attributes I ever create with a date and/or place are residences (if called an attribute), occupation, and education. Dates could make sense for other attributes, but not very often (e.g., a conversion of religious affiliation).

BEF x and AFT x are another challenge in fuzzy dates. I used to set the start time for BEF x to a small number (i.e., the beginning of time) and the end date for AFT x to a large number in the future. This did not work well with midpoint sorting (they would always be first or last). Currently BEF and AFT are ignored and the date is sorted without using that information. The real problem is that BEF x and AFT x just do not give computer code enough information. They need to be supplemented with some idea of how long before or after. For example, someone born BEF x, where x was from a baptism record, was probably born a few days or at most a month before that date. But someone who died AFT 1881 (when they appeared in the 1881 census) may have died many years after that date. You could comment the date:

AFT 1881 (maybe several years)

or make up a range

BET 1881 AND 1891 (he was not in the 1891 census)

The first would sort by the middle of 1881, the latter would sort by the middle of 1886.
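
A quick Python sketch of those two midpoints, assuming year-only dates, an optional comment in parentheses, and BEF/AFT simply dropped as described above. The helper name `midpoint` is made up; this is not the GEDitCOM II implementation.

```python
import re
from datetime import date

def midpoint(text):
    """Midpoint day of a year-only date expression; BEF/AFT and comments are ignored."""
    expression = text.split("(")[0]               # drop any trailing (comment)
    years = [int(y) for y in re.findall(r"\d{4}", expression)]
    start = date(min(years), 1, 1).toordinal()    # first day of the earliest year
    end = date(max(years), 12, 31).toordinal()    # last day of the latest year
    return date.fromordinal((start + end) // 2)

commented = midpoint("AFT 1881 (maybe several years)")
made_up_range = midpoint("BET 1881 AND 1891 (he was not in the 1891 census)")
print(commented.year, made_up_range.year)   # 1881 and 1886
```

Stripping the comment before looking for years matters here, since the comment itself can mention a year.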
