
Announce: DateTime::Format::Gedcom V 1.00


Ron Savage

Sep 13, 2011, 9:12:28 PM
to List Gedcom
Hi Folks

Released to CPAN.

Docs: http://savage.net.au/Perl-modules/html/DateTime/Format/Gedcom.html

--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

Mike Hamilton

Sep 14, 2011, 6:44:56 AM
to List Gedcom
DateTime::Format::Gedcom (thanks, Ron!) is installed, and am just starting to play with it. I have a plethora of torture tests in the form of weird and wonderful dates in GEDCOMs from my far-flung and less than computer-literate umpteenth cousins.

Am a little surprised by the "month_names_in_" (Dutch, French, Gregorian, Hebrew, Julian) stuff, with the language name hard-coded.

Wikipedia says (http://en.wikipedia.org/wiki/Lists_of_languages) :

"According to SIL International, there are 6,309 spoken languages, as cataloged and described in the book Languages of the World (ISBN 0883128152). The International Organization for Standardization (ISO) assigns codes for most languages: for example, ISO 639-3 uses "eng" for English and "apk" for Plains Apache, one of the five Apache languages of North America."

Of course, one would only ever encounter a tiny fraction of those 6,309 languages. But (say) you want Spanish, German, Italian; this would require month_names_in_spanish, month_names_in_german, month_names_in_italian, which seems rather silly.

So, my very simple suggestion is that month_names_in_dutch(), month_names_in_french() [...] be replaced with month_names_in(language).

Mike

Ron Savage

Sep 14, 2011, 7:17:46 PM
to List Gedcom
Hi Mike

On Wed, 2011-09-14 at 20:44 +1000, Mike Hamilton wrote:
> DateTime::Format::Gedcom (thanks, Ron!) is installed, and am just starting to play with it. I have a plethora of torture tests in the form of weird and wonderful dates in GEDCOMs from my far-flung and less than computer-literate umpteenth cousins.

I'm glad you're going to torture the code, apart from the fact I'll
probably have to improve it afterwards :-).

Make sure you're using V 1.01.

> Am a little surprised by the "month_names_in_" (Dutch, French, Gregorian, Hebrew, Julian) stuff, with the language name hard-coded.

Understand. I don't really like the design. It's due to 2 factors:

o The GEDCOM standard specifically lists only a few language escapes, on
p 45: Gregorian, Julian, Hebrew, French, Roman and Unknown. Make of that
what you will.

o The previous author of Gedcom::Date added some Dutch words to his
code, so I added the Dutch month names.

I should have put the URL I used in the POD. I'll fix that:

http://wordinfo.info/unit/3233?letter=C&spage=2

But there's another problem: Accents on letters aka I18N. My first
attempt to include modern French month names resulted in a Perl syntax
error, so I haven't yet learned how to make the source UTF-8, even
though I /thought/ that was my default (under Emacs). Moral: Still
learning - Must do better.

> Wikipedia says (http://en.wikipedia.org/wiki/Lists_of_languages) :
>
> "According to SIL International, there are 6,309 spoken languages, as cataloged and described in the book Languages of the World (ISBN 0883128152). The International Organization for Standardization (ISO) assigns codes for most languages: for example, ISO 639-3 uses "eng" for English and "apk" for Plains Apache, one of the five Apache languages of North America."
>
> Of course, one would only ever encounter a tiny fraction of those 6,309 languages. But (say) you want Spanish, German, Italian; this would require month_names_in_spanish, month_names_in_german, month_names_in_italian, which seems rather silly.
>
> So, my very simple suggestion is that month_names_in_dutch(), month_names_in_french() [...] be replaced with month_names_in(language).

I'll have a think about your suggestion, but the reason I listed all
languages in separate methods was so that people could easily sub-class
my code and add 1 method for their favourite language (before I
ripped-off - errr incorporated - their code into mine :-).

That way, a sub-class and the addition of @#DNewLanguage@ to their
GEDCOM file meant no code changes from me would be necessary to get them
up and going.

Anyway, glad to hear the code is of interest to someone...

Mike Elston

Sep 14, 2011, 9:13:55 PM
to Ron Savage, List Gedcom
Hi Ron,

On 15 Sep 2011, at 00:17 BST, you wrote:

> o The GEDCOM standard specifically lists only a few language escapes, on
> p 45: Gregorian, Julian, Hebrew, French, Roman and Unknown. Make of that
> what you will.

These are not languages, they are fundamentally different calendars:

o the Julian calendar (the well-known plan of the days and months
created by Julius Caesar in 46 B.C., but finally stabilized about 8
A.D.; the numbering of the years, with 1 A.D. being the year of the
birth of Jesus Christ, came about centuries later, when it was agreed
the calendar begins in 4714 B.C.);

o The Gregorian calendar (in use around most of the western and new
world today, this is a modification of the Julian calendar made in
October 1582 to correct for the discovery that the Julian calendar
was out by about 3 days in every 400 years; in some countries it
replaced the Julian calendar much later, eg in England in 1752, and
not until 1923 in Greece);

o the Hebrew calendar (the Jewish year 1 is 3761 B.C. by the Julian
calendar; this calendar has lunar months 29-30 days long, of which
there are 12 most years, but 13 in its leap years);

o the French Revolutionary calendar (in use from October 1793 ("year
1") but abandoned in January 1806 in Gregorian terms; it has 12
months of 30 days each, plus an extra 5 or 6 days grouped at the end
of the year);

o the Roman calendar (I assume by this the GEDCOM standard means
pre-Christian, and maybe pre-Julian);

o GEDCOM makes provision for dates from an unknown calendar.

The GEDCOM standard does not refer at all to oriental calendars,
about which I know very little.

Of course, the Julian/Gregorian months have different names in many
languages (including French and Hebrew!), but the calendar is always
the same.

HTH

/mike
--
Mike Elston
Email: mike....@one-name.org



Ron Savage

Sep 14, 2011, 9:25:23 PM
to List Gedcom
Hi Mike

On Wed, 2011-09-14 at 20:44 +1000, Mike Hamilton wrote:
> So, my very simple suggestion is that month_names_in_dutch(), month_names_in_french() [...] be replaced with month_names_in(language).

The trouble with this is what does the one true method do with this
language name parameter?

For instance, if it used it as a key into a hash of month names per
language - the most obvious choice - how would that hash be extended
with new languages? Only by updating the source code, right? The way
I've done it is to let the user use sub-classing without my code needing
changes.

Alternately, we could hit a per-language file shipped with the module,
but then new languages need new files (and hence a new version), and
there's also the problem of where to store such files.

Just recently I tried this with Locale::Country::SubCountry, and had
CPAN tester failures all over the place. Admittedly, I did try shipping
the one file and installing it automatically - during module
installation - in a dir which was writable by the end user, not in the
module's own dir. But CPAN tester set-ups sometimes don't have home dirs
or they're not writable, or .... So, nope, not that way.

Or, the code could scrape the names directly from the web page I
mentioned in the last email. Messy, /and/ you have to be on-line. Nope,
not that way either :-(.

So, unless you can come up with another way of solving this problem,
I'll leave the design as is.

Mike Hamilton

Sep 14, 2011, 10:28:57 PM
to List Gedcom
I'm afraid Mike Elston has totally nailed it when he writes "These are not languages, they are fundamentally different calendars" (re. Gregorian, Julian, Hebrew, French, Roman and Unknown). It's obvious (now!), but I admit that the penny didn't drop for me even though "Gregorian" and "Julian" in particular are dead giveaways.

Simply put, calendar != language. I suppose one could (in theory) have a date in the Swahili language using the Hebrew calendar.

>> So, my very simple suggestion is that month_names_in_dutch(), month_names_in_french() [...]
>> be replaced with month_names_in(language).
>
> The trouble with this is what does the one true method do with this language name parameter?
>
> For instance, if it used it as a key into a hash of month names per language - the
> most obvious choice - how would that hash be extended with new languages?

With "getter" and "setter" methods: get_month_names(language) and set_month_names(language).

I do hope that you may yet reconsider your preference for hard-coding the language names into the method names. Also, subclassing is IMHO a very inelegant and clumsy solution. If need be, a user could create their own month_names_in_swahili() which would return get_month_names(swahili).
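
The getter/setter idea can be sketched roughly as below. This is only an illustration, written in Python rather than Perl; the class `MonthNames` and its methods are hypothetical names and are not part of DateTime::Format::Gedcom. The point it shows is that a language can be added at run time, without sub-classing and without changing the module's source.

```python
class MonthNames:
    """Hypothetical registry mapping a language name to its 12 month names."""

    def __init__(self):
        # One built-in default; other languages are registered by the caller.
        self._names = {
            "english": ["January", "February", "March", "April", "May",
                        "June", "July", "August", "September", "October",
                        "November", "December"],
        }

    def set_month_names(self, language, names):
        names = list(names)
        if len(names) != 12:
            raise ValueError("expected 12 month names, got %d" % len(names))
        # Keys are stored lower-cased so lookups are case-insensitive.
        self._names[language.lower()] = names

    def get_month_names(self, language):
        if language.lower() not in self._names:
            raise KeyError("no month names registered for %r" % language)
        # Return a copy so callers cannot modify the stored list by accident.
        return list(self._names[language.lower()])


registry = MonthNames()
registry.set_month_names("French", [
    "janvier", "février", "mars", "avril", "mai", "juin",
    "juillet", "août", "septembre", "octobre", "novembre", "décembre",
])
french = registry.get_month_names("french")
```
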

Cheers,

Mike Hamilton


Mike Hamilton

Sep 15, 2011, 12:02:02 AM
to Mike Hamilton, List Gedcom
Further thoughts on DateTime::Format::Gedcom language issues:

This may be heretical, but it seems to me that attempting to provide *universal* language support opens up a can of worms that (unless one is prepared to devote a lifetime or three to the task) is too large to be digested. The required knowledge of individual languages surely presents an insurmountable hurdle.

Ron mentions (in another context) modern French. As it happens, I have some French ancestry, and enough elementary knowledge to know my avril from my elbow; but I have no idea how dates are represented in Swahili, Farsi, Mandarin or the inverse click of the Kalahari Bushmen. I'll bet that there are many, many weird and wonderful (to Western mindsets) ways of describing dates.

LANGUAGE_ID in the GEDCOM spec has:

Afrikaans | Albanian | Anglo-Saxon | Catalan | Catalan_Spn | Czech | Danish | Dutch | English | Esperanto | Estonian | Faroese | Finnish | French | German | Hawaiian | Hungarian | Icelandic | Indonesian | Italian | Latvian | Lithuanian | Navaho | Norwegian | Polish | Portuguese | Romanian | Serbo_Croa | Slovak | Slovene | Spanish | Swedish | Turkish | Wendic

plus ("other languages not supported until UNICODE")

Amharic | Arabic | Armenian | Assamese | Belorusian | Bengali | Braj | Bulgarian | Burmese | Cantonese | Church-Slavic | Dogri | Georgian | Greek | Gujarati | Hebrew | Hindi | Japanese | Kannada | Khmer | Konkani | Korean | Lahnda | Lao | Macedonian | Maithili | Malayalam | Mandrin |Manipuri | Marathi | Mewari | Nepali | Oriya | Pahari | Pali | Panjabi | Persian | Prakrit | Pusto | Rajasthani | Russian | Sanskrit | Serb | Tagalog | Tamil | Telugu | Thai | Tibetan | Ukrainian | Urdu | Vietnamese | Yiddish ]

Now, I hear you saying "that's ridiculous - I've never seen a GEDCOM in Navaho, Faroese, or Rajasthani, and never will", which is a very fair point.
But DateTime::Format::Gedcom claims to parse GEDCOM dates; it doesn't say "some conditions apply."

Therefore, I reluctantly and unhappily suggest that DateTime::Format::Gedcom should be a base class, from which DateTime::Format::Gedcom::English, DateTime::Format::Gedcom::French, DateTime::Format::Gedcom::Sanskrit [...] would derive.

Yes, it's ghastly. The old joke about "surpasseth all understanding" = "understands all parsers" applies.

Mike Hamilton

Ron Savage

Sep 15, 2011, 12:40:04 AM
to List Gedcom
Hi Mike

OK! I admit my code does not really check days per month for, say, any
of the Julian Calendars, as per
http://en.wikipedia.org/wiki/Roman_calendar.

In order to do that, the code would have to permit the user to specify
month names for their chosen (say) Julian calendar, and days per month,
and the same for any other calendar.

To that end, Mike Hamilton's suggestion of set_* and get_* sounds like a
better mechanism than my default.

Let me think about it. After all, I've released the /first/ version of
this code, not the /last/...

Ron Savage

Sep 15, 2011, 12:49:20 AM
to List Gedcom
Hi Mike

On Thu, 2011-09-15 at 12:28 +1000, Mike Hamilton wrote:
> I'm afraid Mike Elston has totally nailed it when he writes "These are not languages, they are fundamentally different calendars" (re. Gregorian, Julian, Hebrew, French, Roman and Unknown). It's obvious (now!), but I admit that the penny didn't drop for me even though "Gregorian" and "Julian" in particular are dead giveaways.

Sure.

> Simply put, calendar != language. I suppose one could (in theory) have a date in the Swahili language using the Hebrew calendar.
>
> >> So, my very simple suggestion is that month_names_in_dutch(), month_names_in_french() [...]
> >> be replaced with month_names_in(language).
> >
> > The trouble with this is what does the one true method do with this language name parameter?
> >
> > For instance, if it used it as a key into a hash of month names per language - the
> > most obvious choice - how would that hash be extended with new languages?
>
> With "getter" and "setter" methods: get_month_names(language) and set_month_names(language).

That's a promising idea. I'll have a look more deeply tomorrow, but
right now I'd say that'll be the technique I switch to. Thanx.

I think it'd be set_month_names($calendar, $array_ref_of_names), which
would store those names into the pre-existing hash of default month
names per calendar.

> I do hope that you may yet reconsider your preference for hard-coding the language names into the method names. Also, subclassing is IMHO a very inelegant and clumsy solution. If need be, a user could create their own month_names_in_swahili() which would return get_month_names(swahili).

Some default names, e.g. Gregorian, will be hard-coded (surely).

I'm astonished you think sub-classing is inelegant and clumsy, although
I certainly concede vast amounts of code have been written in such a way
that it's awkward to use. I would claim that's not the fault of the
sub-classing concept, but of those specific implementations.

Mike Hamilton

Sep 15, 2011, 1:43:02 AM
to List Gedcom
Ron writes:

> I think it'd be set_month_names($calendar, $array_ref_of_names), which would
> store those names into the pre-existing hash of default month names per calendar.

I'd prefer set_calendar(calendar) and set_month_names(language,month_names). Completely decouple the calendar from the month names.

> I'm astonished you think sub-classing is inelegant and clumsy

Real-life example: I have French and German names in my tree, so I need French accents (acute, grave, etc) and, maybe one day, German umlauts. So I have to create new classes for French and German, and then some sort of multiple inheritance class FrenchGerman ?

Non-real-life: tomorrow I receive a new tree in Swahili, which requires a FrenchGermanSwahili class. The day after, my long-lost Tibetan cousin emails with his tree. He has data in Tibetan, Thai, Manipuri, Konkani and Marathi, so do I create a new class named FrenchGermanSwahiliTibetanThaiManipuriKonkaniMarathi ?

Yes, that's of course an extreme, contrived and absurd example.

Please don't take any of my comments as being negative. DateTime::Format::Gedcom is already a worthwhile and useful module. GEDCOM parsers seem easy at first glance, but many have given up after facing the nitty-gritty bits, the "ifs and buts" and the edge cases !

Must sign off now, there are some angels dancing on the heads of pins that I have to count ...

Mike Hamilton
(in Melbourne)




Ron Savage

Sep 15, 2011, 1:48:12 AM
to List Gedcom
Hi Mike

On Thu, 2011-09-15 at 14:02 +1000, Mike Hamilton wrote:
> Further thoughts on DateTime::Format::Gedcom language issues:
>
> This may be heretical, but it seems to me that attempting to provide *universal* language support opens up a can of worms that (unless one is prepared to devote a lifetime or three to the task) is too large to be digested. The required knowledge of individual languages surely presents an insurmountable hurdle.

I certainly don't want to support anything even vaguely approaching
'universal'.

I'm hoping no knowledge of /languages/ is required, but rather only
of /calendars/. And even that will be minimal.

> Ron mentions (in another context) modern French. As it happens, I have some French ancestry, and enough elementary knowledge to know my avril from my elbow; but I have no idea how dates are represented in Swahili, Farsi, Mandarin or the inverse click of the Kalahari Bushmen. I'll bet that there are many, many weird and wonderful (to Western mindsets) ways of describing dates.

True, but the problem is presumably tractable precisely because we're
dealing with nothing but GEDCOM dates.

> LANGUAGE_ID in the GEDCOM spec has:
>
> Afrikaans | Albanian | Anglo-Saxon | Catalan | Catalan_Spn | Czech | Danish | Dutch | English | Esperanto | Estonian | Faroese | Finnish | French | German | Hawaiian | Hungarian | Icelandic | Indonesian | Italian | Latvian | Lithuanian | Navaho | Norwegian | Polish | Portuguese | Romanian | Serbo_Croa | Slovak | Slovene | Spanish | Swedish | Turkish | Wendic
>
> plus ("other languages not supported until UNICODE")
>
> Amharic | Arabic | Armenian | Assamese | Belorusian | Bengali | Braj | Bulgarian | Burmese | Cantonese | Church-Slavic | Dogri | Georgian | Greek | Gujarati | Hebrew | Hindi | Japanese | Kannada | Khmer | Konkani | Korean | Lahnda | Lao | Macedonian | Maithili | Malayalam | Mandrin |Manipuri | Marathi | Mewari | Nepali | Oriya | Pahari | Pali | Panjabi | Persian | Prakrit | Pusto | Rajasthani | Russian | Sanskrit | Serb | Tagalog | Tamil | Telugu | Thai | Tibetan | Ukrainian | Urdu | Vietnamese | Yiddish ]
>
> Now, I hear you saying "that's ridiculous - I've never seen a GEDCOM in Navaho, Faroese, or Rajasthani, and never will", which is a very fair point.
> But DateTime::Format::Gedcom claims to parse GEDCOM dates; it doesn't say "some conditions apply."
>
> Therefore, I reluctantly and unhappily suggest that DateTime::Format::Gedcom should be a base class, from which DateTime::Format::Gedcom::English, DateTime::Format::Gedcom::French, DateTime::Format::Gedcom::Sanskrit [...] would derive.
>
> Yes, it's ghastly. The old joke about "surpasseth all understanding" = "understands all parsers" applies.

Nope - Not worried.

GEDCOM's definition of date_calendar_escape mercifully does not refer to
language_id (of which there are just 3 references in the doc).

I chose DateTime::Format::Natural so as to pass over to it as much work
as possible, with the aim of leaving myself with as little work as
possible. That's what CPAN is all about.

Nevertheless, Mike and Mike's comments make me think I will have to
re-work the code, along the lines of:

set_month_names($language, $array_of_month_stuff)

Where $language becomes recognizable when it appears in a
date_calendar_escape, and the arrayref is like:
['January', 'Jan', 31, ...], with 3 elements per month.
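
As a rough illustration of the "3 elements per month" layout, the flat list can be unpacked into one record per month. The sketch is in Python rather than Perl, and `Month` and `unpack_months` are hypothetical names, not anything in the module.

```python
from typing import NamedTuple


class Month(NamedTuple):
    name: str
    abbreviation: str
    days: int


def unpack_months(flat):
    """Turn [name, abbreviation, days, name, abbreviation, days, ...] into records."""
    if len(flat) % 3 != 0:
        raise ValueError("expected 3 elements per month")
    return [Month(*flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Only three months are shown to keep the example short.
months = unpack_months([
    "January", "Jan", 31,
    "February", "Feb", 28,
    "March", "Mar", 31,
])

# Index by upper-cased abbreviation, the form GEDCOM dates use (e.g. "1 JAN 2000").
by_abbreviation = {m.abbreviation.upper(): m for m in months}
```

A fixed number of days per month is of course a simplification: February in a leap year, or the variable Hebrew months, would need a rule rather than a constant.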

I still don't like the idea of parsing the dates myself, but I assume it
will come to that.

Ron Savage

Sep 15, 2011, 4:33:15 AM
to List Gedcom
Hi Mike

On Thu, 2011-09-15 at 15:43 +1000, Mike Hamilton wrote:
> Ron writes:
>
> > I think it'd be set_month_names($calendar, $array_ref_of_names), which would
> > store those names into the pre-existing hash of default month names per calendar.
>
> I'd prefer set_calendar(calendar) and set_month_names(language,month_names). Completely decouple the calendar from the month names.

I'll think about this. You've got me worried, if nothing else.

You're implying that you can think of a /realistic/ use case with 1
calendar and multiple languages.

Now I'm beginning to hope you're wrong :-)).

> > I'm astonished you think sub-classing is inelegant and clumsy
>
> Real-life example: I have French and German names in my tree, so I need French accents (acute, grave, etc) and, maybe one day, German umlauts. So I have to create new classes for French and German, and then some sort of multiple inheritance class FrenchGerman ?

No no. You've overlooked the precise details of my non-existent
implementation. Hahahahahaha.

The code will stockpile user options, be they calendar or language.

So, calls to set options can be endless in number. The code will just be
storing stuff in a hash.

Access to the hash is, as always, triggered by your usage of a
date_calendar_escape in the GEDCOM file itself.

By design, a single file can hence use any number of escapes.

> Non-real-life: tomorrow I receive a new tree in Swahili, which requires a FrenchGermanSwahili class. The day after, my long-lost Tibetan cousin emails with his tree. He has data in Tibetan, Thai, Manipuri, Konkani and Marathi, so do I create a new class named FrenchGermanSwahiliTibetanThaiManipuriKonkaniMarathi ?
>
> Yes, that's of course an extreme, contrived and absurd example.
>
> Please don't take any of my comments as being negative. DateTime::Format::Gedcom is already a worthwhile and useful module. GEDCOM parsers seem easy at first glance, but many have given up after facing the nitty-gritty bits, the "ifs and buts" and the edge cases !

I assuredly don't take them negatively.

I haven't given up yet, but I'm beginning to get your hint...

> Must sign off now, there are some angels dancing on the heads of pins that I have to count ...

What a coincidence. I computed that number earlier today. It's ...

Damn - there's not enough space in this margin for my proof, but it can
be approximated by $infinity/$zero.

Eugene van der Pijll

Sep 15, 2011, 6:45:19 AM
to Ron Savage, List Gedcom
Ron Savage schreef:
> o The previous author of Gedcom::Date added some Dutch words to his
> code, so I added the Dutch month names.

I would have preferred the title of "other author"; I'm still keeping
open the option of continuing to work on Gedcom::Date later. (In fact,
I've been working on a script to validate GEDCOM files, and that has
given me a bit of inspiration about future improvements.)

About the Dutch names: Gedcom::Date only uses these for output.
When parsing GEDCOM strings, only the restricted set of month
abbreviations in the GEDCOM standard are accepted.

Some remarks about DT::F::Gedcom:

* It doesn't follow the semantics for DateTime::Format::* modules:
parse_datetime() doesn't return a DateTime object, and it doesn't have
a format_datetime() function. This is understandable, because a GEDCOM
date string does not always correspond to an exact date, but I wonder
if it should be in this namespace.

(This is the reason that Gedcom::Date is not in the DateTime
namespace, even though it returns DateTime objects.)

* Not accepting years < 1000 is a bad thing, certainly if you accept
dates in the French calendar. Single digit years are very common in
that calendar. DateTime can handle 3-digit years, and even BC years; I
would expect the same from any DT parser module.

* What is the value of the "one_ambiguous" flag if it is set by "1 JAN
2000" (especially when "1/1/2000" isn't ambiguous either, and
"1/2/2000" is not allowed by the GEDCOM standard?)

* How does your module record the difference between "2000", "JAN 2000"
and "1 JAN 2000"?

* What is the benefit of using this module over Gedcom::Date? Or do you
have future plans that cannot be done with Gedcom::Date? Not that I
mind a bit of competition, of course.
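
One possible way to record the difference between "2000", "JAN 2000" and "1 JAN 2000" is to keep the missing parts as missing instead of defaulting them to 1. The sketch below is in Python, not Perl; `parse_simple_date` is a hypothetical helper that handles only plain Gregorian dates, and it accepts years of 1 to 4 digits, so years below 1000 are not rejected.

```python
import re

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Optional day, optional 3-letter month, mandatory year of 1 to 4 digits.
_DATE = re.compile(r"^(?:(?:(\d{1,2})\s+)?([A-Z]{3})\s+)?(\d{1,4})$")


def parse_simple_date(text):
    """Return the year, month and day, using None for parts that were not given."""
    match = _DATE.match(text.strip().upper())
    if match is None:
        raise ValueError("not a simple date: %r" % text)
    day, month, year = match.groups()
    if month is not None and month not in MONTHS:
        raise ValueError("unknown month: %r" % month)
    return {
        "year": int(year),
        "month": MONTHS.index(month) + 1 if month else None,
        "day": int(day) if day else None,
    }
```

The day is not range-checked against the month here; a real parser would need that, plus the approximate, range and period forms.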

Eugene

Mike Elston

Sep 15, 2011, 1:42:06 PM
to Ron Savage, List Gedcom
Hello all,

I have now had a chance for a first look at DateTime::Format::Gedcom

I applaud the intention, and we all owe a great debt to Ron for his
work on Perl Gedcom.

But in my humble opinion, this class shows a complete
misunderstanding both of how dates are often presented in GEDCOM
files (and especially in files that claim to be GEDCOM but may not be
strictly so), and of the idea of a GEDCOM date.

(1) You cannot separate the <DATE_CALENDAR_ESCAPE> from the
<DATE_CALENDAR> when parsing a GEDCOM file: the
<DATE_CALENDAR_ESCAPE> specifies which calendar the date is from
(nothing, as I've said before, to do with languages), and the GEDCOM
specification is that it defaults to @#DGREGORIAN@.
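
A minimal sketch of keeping the escape and the date together, with the Gregorian default applied when no escape is present. It is written in Python rather than Perl, and `split_calendar_escape` is a hypothetical name, not a function of any existing module.

```python
import re

DEFAULT_CALENDAR = "GREGORIAN"

# The character class allows an embedded space because, as far as I know,
# GEDCOM 5.5 writes the French escape as "@#DFRENCH R@".
_ESCAPE = re.compile(r"^@#D([A-Z ]+)@\s*")


def split_calendar_escape(value):
    """Return (calendar, rest_of_date); the calendar defaults to Gregorian."""
    value = value.strip()
    match = _ESCAPE.match(value)
    if match is None:
        return DEFAULT_CALENDAR, value
    return match.group(1), value[match.end():]
```

The returned calendar name, never a language, is what should select the set of month codes used to parse the rest of the value.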

(2) Although the GEDCOM specification pays lip-service to it, it has
no consistent way of differentiating between dates specified
according to the 'old-style' calendar (in which the year started on
Lady Day, March 25th) in use in countries such as England before they
adopted the Gregorian calendar, and dates specified according to the
'new-style' calendar where the year starts on January 1st.

For example, when reading a GEDCOM file presented, say, by the LDS's
own website familysearch.com, one has to remember that dates such as
christenings from English parish registers up to the middle of the
18th century, for example, are invariably written in the register as
(Julian) dates in old style, not (Gregorian) dates in the modern
style. The LDS's own transcribers, as far as my research has
indicated, generally (but not always) copied the dates as they were
entered, but many dates contributed to the IGI have been "converted"
to the new-style calendar. The IGI never uses (at least, I've never
seen it) the @#DJULIAN@ calendar escape, nor does it use the
'old-style/new-style' format for specifying dates. For example, it is
quite possible for an entry on the IGI from a transcribed parish
register to state that someone was christened on 20 FEB 1675, having
been born on 23 DEC 1675, and there is no inconsistency (since in the
old calendar in use at the time, 23 DEC preceded 20 FEB in the year
1675). The year began on 25 MAR 1675 and ended on 24 MAR 1675, which
is the day before 25 MAR 1676 -- in England (and most of its
colonies), the new year began on March 25th until the year 1752, which
began on 1 Jan 1752. Yet someone contributing a record to the
Ancestral File or to the IGI may have converted the date to the
new-style calendar, and report the christening as 20 FEB 1676.

Note that for this reason, 20 FEB 1675 will often be written by
genealogists as 20 FEB 1675/76, a format GEDCOM recognises, implying
that it means 20 FEB 1675 by the old calendar (when the year started
on 25 MAR), which we would now think of as 20 FEB 1676, because we
would treat the year 1676 as starting on 1 JAN (ie the day after 31
DEC 1675).
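
The old-style/new-style rule just described can be sketched as follows, assuming the English convention of a year starting on 25 MAR. The code is Python and the function names are hypothetical.

```python
def new_style_year(day, month, old_style_year):
    """Year number counted from 1 JAN, given a year number counted from 25 MAR."""
    before_lady_day = (month, day) < (3, 25)
    return old_style_year + 1 if before_lady_day else old_style_year


def dual_year(day, month, old_style_year):
    """Format the year as e.g. '1675/76' when the two styles disagree."""
    ns_year = new_style_year(day, month, old_style_year)
    if ns_year == old_style_year:
        return str(old_style_year)
    return "%d/%02d" % (old_style_year, ns_year % 100)


def sort_key(day, month, old_style_year):
    """Key that orders old-style dates chronologically."""
    return (new_style_year(day, month, old_style_year), month, day)


birth = sort_key(23, 12, 1675)        # 23 DEC 1675
christening = sort_key(20, 2, 1675)   # 20 FEB 1675, i.e. 20 FEB 1675/76
```

With this key the birth on 23 DEC 1675 sorts before the christening on 20 FEB 1675, which is the order in which the events actually happened.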

In other words, GEDCOM files produced by familysearch.com do not
strictly obey the GEDCOM date rules, which are themselves incomplete.

Confused already? Try this...

Another case: a date from a Scottish parish register for 1695 would
be entered according to the Gregorian calendar (which was in use in
Scotland by then, but not in England), so (at that time) 23 JUL 1695
in Scotland was not the same day as 23 JUL 1695 in England (which, if
my mental arithmetic is correct, was 2 AUG 1695 on the Gregorian
calendar). By 1752, when England changed to the Gregorian calendar, a
difference of 11 days had accumulated, which is why 2 SEP 1752 was
followed by 14 SEP 1752 in England to effect the correction, which
gave rise to the cry "give us back our 11 days" by many of the masses
who believed the government of the day had shortened their lives by
11 days!
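
Since this kind of arithmetic is easy to get wrong by a day, the accumulated difference can be checked mechanically from the century number. A small Python sketch (the function name is hypothetical):

```python
def julian_to_gregorian_offset(year, month):
    """Days to add to a Julian calendar date to get the Gregorian date.

    The difference grows by one day at the end of February in each century
    year that is a leap year in the Julian calendar but not in the Gregorian
    one (1700, 1800, 1900, ...), so January and February count with the
    previous year.
    """
    y = year if month > 2 else year - 1
    return y // 100 - y // 400 - 2


offset_1752 = julian_to_gregorian_offset(1752, 9)   # the "11 days"
```

The Julian leap day itself (29 FEB in a year such as 1700) is an edge case that this simple formula does not treat specially.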

In England, 1752 was also the first year which officially began on 1
JAN; thus 1751 was only 282 days long, lasting from 25 MAR 1751 to 31
DEC 1751. It is a matter of debate (and sometimes impossible to
decide) which calendar is being used on contemporary documents about
that time, as some sources started using the 'new' calendar year (1
Jan-31 Dec) before 1752, and some not until later.

However, strictly speaking, the 'old-style/new-style' calendar
dichotomy is separate from the 'Julian/Gregorian' calendar
distinction; the changes from one to t'other simply happen to have
been effected in England in the same year.

(To confuse matters further, the year in Saxon and Norman times began
on 25 December rather than 25 March or 1 January!)

If you're researching old French or Jewish documents, you may have a
much greater problem.

Suppose you know that someone was born on 14e Germinal in the year 3
by the French Republican calendar (ie "@#DFRENCH@14 GERM 3"), and
that they died on 17e Avril 1846 ("17 APR 1846"). How would you
calculate how old they were when they died? Surely this is something
DateTime::Format::Gedcom should be capable of doing for us?
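
To make the question concrete, here is one way the calculation could go, as a Python sketch rather than anything the module does. The table of year-start dates and the 4-letter month codes are stated to the best of my knowledge and should be checked against the GEDCOM standard and a calendar reference; all names are hypothetical.

```python
from datetime import date, timedelta

# Gregorian date of the first day (1 Vendémiaire) of each Republican year,
# as commonly tabulated. An assumption to verify before relying on it.
YEAR_START = {
    1: date(1792, 9, 22), 2: date(1793, 9, 22), 3: date(1794, 9, 22),
    4: date(1795, 9, 23), 5: date(1796, 9, 22), 6: date(1797, 9, 22),
    7: date(1798, 9, 22), 8: date(1799, 9, 23), 9: date(1800, 9, 23),
    10: date(1801, 9, 23), 11: date(1802, 9, 23), 12: date(1803, 9, 24),
    13: date(1804, 9, 23), 14: date(1805, 9, 23),
}

# Twelve 30-day months, then COMP for the 5 or 6 complementary days.
FRENCH_MONTHS = ["VEND", "BRUM", "FRIM", "NIVO", "PLUV", "VENT",
                 "GERM", "FLOR", "PRAI", "MESS", "THER", "FRUC", "COMP"]


def french_to_gregorian(day, month_code, year):
    """Convert a French Republican date to a (proleptic) Gregorian date."""
    index = FRENCH_MONTHS.index(month_code.upper())  # ValueError if unknown
    if year not in YEAR_START:
        raise ValueError("Republican year out of range: %r" % year)
    if not 1 <= day <= 30:
        raise ValueError("day out of range: %r" % day)
    # The complementary days are not range-checked further in this sketch.
    return YEAR_START[year] + timedelta(days=30 * index + day - 1)


def age_in_years(born, died):
    """Completed years between two Gregorian dates."""
    had_birthday = (died.month, died.day) >= (born.month, born.day)
    return died.year - born.year - (0 if had_birthday else 1)


born = french_to_gregorian(14, "GERM", 3)      # 1795-04-03
age = age_in_years(born, date(1846, 4, 17))    # 51
```
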

Or suppose a Jewish register states that a child was born on Kislev
23 5445 ("@#DHEBREW@23 KSL 5445"). How old were they when their
family emigrated to another country on 14th April 1734 (according to
the local records which were probably using the Julian calendar)? You
need to know what is the Julian date corresponding to Rosh-Hashana
(the first day of the Jewish year, Tishri 1) in the year 5445. And if
you want to do exact date calculations, Kislev is the 3rd month in
the Jewish calendar, but how many days were there in the preceding
month (Cheshvan) in that year? 29 or 30? (it's variable from one year
to the next.)

OK, maybe I'm being pedantic. The most important thing for most
researchers is to get the Gregorian calendar right, and the next most
important thing is to recognise the differences between Julian dates
and Gregorian dates, and between the old-style and new-style Julian/
Gregorian calendars.

A proper GEDCOM date class ought to be able to represent a date
specified according to the Julian or the Gregorian calendar (or the
French Republican or Hebrew calendars) as a standard internal date
(say, the number of days since some arbitrary epoch), and present it
according to any requested calendar or style.
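
A minimal sketch of that design, using the Julian Day Number as the internal day count. It is in Python, `DayNumber` is a hypothetical class name, and only the Julian and Gregorian calendars are covered; the integer formulas are the standard ones for converting between the Julian calendar and Julian Day Numbers.

```python
from datetime import date

# Julian Day Number minus Python's proleptic Gregorian ordinal.
_ORDINAL_OFFSET = 1721425


class DayNumber:
    """A date stored as a Julian Day Number, independent of any calendar."""

    def __init__(self, jdn):
        self.jdn = jdn

    @classmethod
    def from_julian(cls, year, month, day):
        a = (14 - month) // 12
        y = year + 4800 - a
        m = month + 12 * a - 3
        return cls(day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083)

    @classmethod
    def from_gregorian(cls, year, month, day):
        return cls(date(year, month, day).toordinal() + _ORDINAL_OFFSET)

    def to_gregorian(self):
        # datetime.date only covers years 1 to 9999.
        d = date.fromordinal(self.jdn - _ORDINAL_OFFSET)
        return (d.year, d.month, d.day)

    def to_julian(self):
        c = self.jdn + 32082
        d = (4 * c + 3) // 1461
        e = c - (1461 * d) // 4
        m = (5 * e + 2) // 153
        return (d - 4800 + m // 10,
                m + 3 - 12 * (m // 10),
                e - (153 * m + 2) // 5 + 1)


# England's last Julian day and first Gregorian day are consecutive day numbers.
last_julian = DayNumber.from_julian(1752, 9, 2)
first_gregorian = DayNumber.from_gregorian(1752, 9, 14)
```

Differences between two dates then become plain integer subtraction, whatever calendar each date was originally written in.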

I think that's enough for a start :-) I will take up the issues of
approximate dates, of date periods and ranges, of non-exact dates, of
month names in different languages, and the fundamental job of
parsing DATE lines in GEDCOM-style files, in subsequent emails. (Some
of these issues have already been raised by other users...)

/mike

(Note: much of the above was the result of my research when I was
writing an Objective-C date class GenDate for genealogical dates for
use in my own GEDCOM-compatible application).

Ron Savage

Sep 15, 2011, 6:24:33 PM
to List Gedcom
Hi Mike

On Thu, 2011-09-15 at 18:42 +0100, Mike Elston wrote:
> Hello all,
>
> I have now had a chance for a first look at DateTime::Format::Gedcom
>
> I applaud the intention, and we all owe a great debt to Ron for his
> work on Perl Gedcom.
>
> But in my humble opinion, this class shows a complete
> misunderstanding both of how dates are often presented in GEDCOM
> files (and especially in files that claim to be GEDCOM but may not be
> strictly so), and of the idea of a GEDCOM date.

[snip huge analysis of date issues]

It's true that the code is quite crude in its current handling of
dates.

It can be either continued to be worked on or abandoned. Do you think it
should be abandoned? If you think that, just say so :-).

Or perhaps should it not make any attempt to support anything other than
Gregorian dates, at least for some period of time? Is that worth doing?

I have plenty of time available to change how the code works, but only
if there is some point in doing so.

Ron Savage

Sep 15, 2011, 6:48:37 PM
to List Gedcom
Hi Eugene

On Thu, 2011-09-15 at 12:45 +0200, Eugene van der Pijll wrote:
> Ron Savage schreef:
> > o The previous author of Gedcom::Date added some Dutch words to his
> > code, so I added the Dutch month names.
>
> I would have preferred the title of "other author"; I'm still keeping

Do you mean changing this text "Thanx to Eugene van der Pijll, the
author of the Gedcom::Date::* modules." to add 'other', or something
else? I'm quite happy to change what the comments say.

> open the option of continuing to work on Gedcom::Date later. (In fact,
> I've been working on a script to validate GEDCOM files, and that has
> given me a bit of inspiration about future improvements.)

OK.

> About the Dutch names: Gedcom::Date only uses these for output.
> When parsing GEDCOM strings, only the restricted set of month
> abbreviations in the GEDCOM standard are accepted.

OK - I'll cut the Dutch words out. But as you can see from other emails,
and your own experience, calendar/language support is complex.

> Some remarks about DT::F::Gedcom:
>
> * It doesn't follow the semantics for DateTime::Format::* modules:
> parse_datetime() doesn't return a DateTime object, and it doesn't have
> a format_datetime() function. This is understandable, because a GEDCOM
> date string does not always correspond to an exact date, but I wonder
> if it should be in this namespace.
>
> (This is the reason that Gedcom::Date is not in the DateTime
> namespace, even though it returns DateTime objects.)

I feel this is a difficult decision, for the reasons you specify.

I think the basic problem is that the GEDCOM doc was not designed to fit
into the DateTime namespace, but the concept of parsing dates does,
especially given I decided to return DateTime objects.

It would (also) make sense to call it Genealogy::Gedcom::Date.

Anyone care to comment either way? Renaming it would stop a waste of
energy arguing about this.



> * Not accepting years < 1000 is a bad thing, certainly if you accept
> dates in the French calendar. Single digit years are very common in
> that calendar. DateTime can handle 3-digit years, and even BC years; I
> would expect the same from any DT parser module.

As I said in another reply, I pre-process the candidate date, and then
pass it to DateTime::Format::Natural, but the latter does not always
accept years < 1000.

> * What is the value of the "one_ambiguous" flag if it is set by "1 JAN
> 2000" (especially when "1/1/2000" isn't ambiguous either, and
> "1/2/2000" is not allowed by the GEDCOM standard?)

I overlooked that case.

> * How does your module record the difference between "2000", "JAN 2000"
> and "1 JAN 2000"?

It doesn't.

Is it worth extending the code to return info about those distinctions?

> * What is the benefit of using this module over Gedcom::Date? Or do you
> have future plans that cannot be done with Gedcom::Date? Not that I
> mind a bit of competition, of course.

It isn't necessarily superior.

It does aim to return more information per date, which is very important
to me.

Also, it helps me exercise my coding skills.

Eugene van der Pijll

Sep 15, 2011, 7:02:37 PM
to Ron Savage, List Gedcom
Ron Savage wrote:
> > > o The previous author of Gedcom::Date ...
>
> Do you mean changing this text "Thanx to Eugene van der Pijll, the
> author of the Gedcom::Date::* modules." to add 'other', or something
> else? I'm quite happy to change what the comments say.

No, I just meant that "the previous author" suggests that I'm not working
on Gedcom::Date any longer. In fact, this discussion has motivated me to
look at the modules again, and I've already added a number of things
(like support for Julian, French and Hebrew calendars).

> > * What is the benefit of using this module over Gedcom::Date? Or do you
> > have future plans that cannot be done with Gedcom::Date? Not that I
> > mind a bit of competition, of course.
>
> It isn't necessarily superior.
>
> It does aim to return more information per date, which is very important
> to me.

That implies that it's superior _for you_, which is enough reason for
this module to exist.

> Also, it helps me exercise my coding skills.

That too.

Eugene

Ron Savage

Sep 20, 2011, 12:35:04 AM
to List Gedcom
Hi Eugene

I'll release a simplified version of my module as
Genealogy::Gedcom::Date sometime soon.

On Thu, 2011-09-15 at 12:45 +0200, Eugene van der Pijll wrote:

> * How does your module record the difference between "2000", "JAN 2000"
> and "1 JAN 2000"?

How does your code distinguish between these cases?

Do you use one of the GEDCOM concepts About, Calculated, Estimated or
Interpreted?

Do people want to know that part of a date has been fabricated by the
code?

Stephen Pickles

Sep 20, 2011, 6:19:32 AM
to Ron Savage, List Gedcom
These are important questions.

IMHO, a Gedcom date parser should (1) distinguish between these cases,
(2) indicate to the caller whether the date has any special attributes
(EST, ABT, etc), (3) return a date in a normalised format using the
native GEDCOM syntax with no fabricated information. Any conversions
to one of the confusion of standard internet date/time formats could
be handled by a separate call, once it's known whether the date is
exact and complete (e.g. 1 JAN 2000).

The difference between these cases is fundamental to genealogy. We
should be trying to represent what we actually know. If I know only
that a child was born in the year 2000, I record "2000" in the GEDCOM
date. I don't want tools to fabricate a month and day.

To expand on (2). When I last studied the GEDCOM specification some
years ago, I was left with the impression that GEDCOM's attempt to
codify the various possible attributes that a date might have was a
little sloppy. I doubt that the syntax quite allows one to express
everything in a date that one might want (for example, I might know
that an ancestor was born on 4 JUL, yet be uncertain about the year,
but I don't think that "4 JUL ABT 1860" is allowed). I also have the
feeling that the GEDCOM spec's explanation of the semantics of these
attributes is not complete.

Add to this the issues of language (and what do you do if a language
comes along in which a month has the name "ABT"), and the calendar
(Gregorian, Julian, or even Shire Reckoning for any hobbit
genealogists out there), it's clear that comprehensively dealing with
all the possible attributes that a date could have will be far
from trivial. But I'd have to hope that a community effort would
eventually cover most of the important cases.

To expand on (3). If you're developing a tool to handle date
information in GEDCOM files from a variety of sources, the place you
need the most help is actually parsing the date. You want to know
whether the date is syntactically valid, whether it's approximate (and
there's various kinds of approximate), whether the day, month and/or
year are supplied. You'd like to know what language and calendar it's
in. You probably want to write it out again, normalised for
capitalisation, whitespace, language (and calendar, but that might be
hard).
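
As a rough illustration of the kind of parsing described above, here is
a minimal Perl sketch. It is not code from Gedcom.pm, Gedcom::Date or
DateTime::Format::Gedcom. The helper names parse_simple_date and
format_simple_date are made up for the example, and it assumes a single
Gregorian date with an optional ABT, CAL or EST qualifier. The point is
only that the parser records which parts were supplied and never fills
in the missing ones.

```perl
use strict;
use warnings;

# Month abbreviations GEDCOM allows for the Gregorian calendar.
my %MONTH = map { $_ => 1 } qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);

# Approximation qualifiers defined by GEDCOM 5.5.
my %QUALIFIER = map { $_ => 1 } qw(ABT CAL EST);

# Hypothetical helper: parse a simple GEDCOM date into its known parts.
# Returns a hash reference, or nothing if the string is not recognised.
sub parse_simple_date {
    my ($text) = @_;
    my @word = split ' ', uc $text;    # normalise case and whitespace

    my %date = ( qualifier => undef, day => undef, month => undef, year => undef );
    $date{qualifier} = shift @word if @word && $QUALIFIER{ $word[0] };

    # The year comes last and is the only mandatory component.
    return unless @word && $word[-1] =~ /^\d{1,4}$/;
    $date{year} = 0 + pop @word;

    # A month may precede the year, and a day may precede the month.
    if ( @word && $MONTH{ $word[-1] } ) {
        $date{month} = pop @word;
        $date{day} = 0 + pop @word if @word && $word[-1] =~ /^\d{1,2}$/;
    }
    return if @word;                   # leftover tokens: not a simple date

    return \%date;
}

# Hypothetical helper: write the date back out in GEDCOM syntax, using
# only the parts that were actually supplied.
sub format_simple_date {
    my ($date) = @_;
    return join ' ', grep { defined $_ } @{$date}{qw(qualifier day month year)};
}

for my $input ( '2000', 'jan  2000', '1 JAN 2000', 'abt 1860' ) {
    my $date = parse_simple_date($input);
    printf "%-12s => %s\n", $input, format_simple_date($date);
}
```

With this approach "JAN 2000" comes back as "JAN 2000" with the day left
undefined, and a string such as "4 JUL ABT 1860" is rejected rather than
guessed at.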

A couple of years ago, I wrote a little tool to extract all exact
dates of birth, marriage and death events from a GEDCOM file, and
write them out as a calendar file (ical format, RFC 2445). I was using
Paul Johnson's Gedcom package from CPAN. I started out using
Date::Manip to parse dates (partly because the Gedcom package was
already using it), but I ended up having to write my own parser. The
problem was that Date::Manip's parser would fabricate days ("JAN 2000"
would come back as "1 JAN 2000"). Paul Johnson's date normalisation
routine suffered from the same problem, because it too relied on
Date::Manip. A pity, because date normalisation would be very helpful
when you're comparing information about an event from two GEDCOM
files.

Thanks for starting an unusually interesting discussion.

Stephen

Paul Johnson

Sep 20, 2011, 7:05:46 AM
to Stephen Pickles, Ron Savage, List Gedcom
On Tue, Sep 20, 2011 at 11:19:32AM +0100, Stephen Pickles wrote:
> These are important questions.

[ Good observations and points. ]

> A couple of years ago, I wrote a little tool to extract all exact
> dates of birth, marriage and death events from a GEDCOM file, and
> write them out as a calendar file (ical format, RFC 2445). I was using
> Paul Johnson's Gedcom package from CPAN. I started out using
> Date::Manip to parse dates (partly because the Gedcom package was
> already using it), but I ended up having to write my own parser. The
> problem was that Date::Manip's parser would fabricate days ("JAN 2000"
> would come back as "1 JAN 2000"). Paul Johnson's date normalisation
> routine suffered from the same problem, because it too relied on
> Date::Manip. A pity, because date normalisation would be very helpful
> when you're comparing information about an event from two GEDCOM
> files.

I'm afraid I've not been keeping up with the messages here recently. My
excuse is that I got behind at YAPC and have been really busy with work since
then. But be that as it may, let me comment on this particular aspect.

What Stephen says about Gedcom.pm is all true; I really punted on the date
handling. There are two reasons for that:

1. It's hard. I didn't want to write yet another date handling package.
Date::Manip is overkill in almost all respects and yet, as Stephen notes,
it is still insufficient for genealogical use. And I just didn't have the
heart to dive into its own code.

2. As Stephen also notes, I didn't feel that the GEDCOM specification's
description of dates was sufficient anyway. So even if I, or someone
else, were to fully implement it, I didn't think it would be a full
solution.

And then there's the question of what are you going to do with the
dates anyway? Full, complete dates are clear(*), but what about all
the other possibilities, either allowed by the GEDCOM spec or not.
Most tools would have no idea how to handle "Between May and July
1678", let alone something like "Easter Sunday in either 1783 or 1785".
So I thought to leave dates as basically free-form fields, with the
option to use Date::Manip to normalise them as far as possible, if
required.

(*) I say clear, but what about times and time zones, or calendar
changes? And no doubt there are other complexities. Rarely is
anything clear-cut in genealogy.

> Thanks for starting an unusually interesting discussion.

Agreed.

--
Paul Johnson - pa...@pjcj.net
http://www.pjcj.net

Eugene van der Pijll

Sep 20, 2011, 2:40:19 PM
to Paul Johnson, Stephen Pickles, Ron Savage, List Gedcom
Paul Johnson wrote:

> 2. As Stephen also notes, I didn't feel that the GEDCOM specification's
> description of dates was sufficient anyway. So even if I, or someone
> else, were to fully implement it, I didn't think it would be a full
> solution.

There are several things in the GEDCOM specification that are definitely
missing: phrases like "FROM BEF 1820 TO ABT 1825" for example. It would
be interesting to develop a more complete GEDCOM-like date grammar.

> And then there's the question of what are you going to do with the
> dates anyway?

There are several things that a GEDCOM Date module (or a program) would
want to do with a date:

* Validate
* Date math / checking, such as "is date B more than 16 years after date
A?"
* Text output, e.g. for report writing ("ABT APR 1820" => "around April 1820").

All of these are still useful even with the more interesting GEDCOM date
formats.

For example, using my own module:

use Gedcom::Date;

my $birth = Gedcom::Date->parse("BET JUL 1820 AND JUL 1825");
my $marr = Gedcom::Date->parse("BEF 1834");

print "Too young at marriage\n"
    if $birth->clone->add( years => 16 ) > $marr;

You really want to be able to write such a validation rule, without
having to treat all different GEDCOM date formats in your code
explicitly.
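
To make the idea concrete without depending on any module, here is a
small self-contained Perl sketch of one way such a rule could work on
imprecise dates. It is an assumption for illustration only, not how
Gedcom::Date is implemented: each date is reduced to a hypothetical pair
of earliest/latest bounds counted in months, and the helper names months
and under_age_at are invented for the example.

```perl
use strict;
use warnings;

# Convert a year and month to a month count, so bounds compare as numbers.
sub months { my ( $year, $month ) = @_; return $year * 12 + ( $month - 1 ) }

# Hypothetical representation: earliest and latest possible month for each
# date; undef means "unbounded on that side".
my $birth    = { earliest => months( 1820, 7 ), latest => months( 1825, 7 ) };     # BET JUL 1820 AND JUL 1825
my $marriage = { earliest => undef,             latest => months( 1833, 12 ) };    # BEF 1834

# Was the person younger than $years at the event?
# Returns 'certainly', 'possibly' or 'no'.
sub under_age_at {
    my ( $born, $event, $years ) = @_;
    my $span = $years * 12;

    # Even the earliest possible birth leaves too little time.
    return 'certainly'
        if defined $born->{earliest}
        && defined $event->{latest}
        && $born->{earliest} + $span > $event->{latest};

    # Even the latest possible birth leaves enough time.
    return 'no'
        if defined $born->{latest}
        && defined $event->{earliest}
        && $born->{latest} + $span <= $event->{earliest};

    return 'possibly';
}

print under_age_at( $birth, $marriage, 16 ), "\n";    # prints "certainly"
```

The calling code never needs to know whether the input was a BET, BEF or
exact date. That knowledge is confined to whatever produces the bounds.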

> Most tools would have no idea how to handle "Between May and July
> 1678", let alone something like "Easter Sunday in either 1783 or 1785".

That most tools can't handle the first date is no reason not to try to
accept it in your own scripts. The second date is not really expressible
in GEDCOM, except as an (unparsable) date phrase.

> (*) I say clear, but what about times and time zones, or calendar
> changes? And no doubt there are other complexities. Rarely is
> anything clear-cut in genealogy.

GEDCOM was originally designed as an output scheme for genealogical
conclusions. Calendar changes should have been handled by the
genealogist or the program that created a GEDCOM file; when a date has
been outputted to a GEDCOM, it refers to a definite date in a known
calendar (either explicitly by a @#D...@ escape, or implicitly to the
Gregorian calendar).
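
As a small illustration of the explicit/implicit calendar rule, the
following Perl sketch separates a leading @#D...@ escape from the rest
of a date value and falls back to Gregorian when no escape is present.
The function name split_calendar is invented for this example, and the
sketch does not check that the calendar name is one the standard
defines.

```perl
use strict;
use warnings;

# Hypothetical helper: split a GEDCOM date value into the calendar name
# and the remaining date text. Without an escape, Gregorian is implied.
sub split_calendar {
    my ($value) = @_;

    # An explicit escape looks like "@#DJULIAN@" at the start of the value.
    if ( $value =~ /^\s*\@#D([^\@]+)\@\s*(.*)$/ ) {
        return ( uc $1, $2 );
    }

    $value =~ s/^\s+//;
    return ( 'GREGORIAN', $value );
}

for my $value ( '@#DJULIAN@ 5 MAR 1700', '@#DFRENCH R@ 3 VEND 2', '1 JAN 2000' ) {
    my ( $calendar, $rest ) = split_calendar($value);
    print "$calendar: $rest\n";
}
```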

Times are outside the scope of GEDCOM; the standard has defined no tags
for them.

So while times, time zones and calendar changes are problematical in
genealogy, they shouldn't be a problem when interpreting a valid GEDCOM
file.

Eugene

Eugene van der Pijll

Sep 20, 2011, 2:45:18 PM
to Ron Savage, List Gedcom
Ron Savage wrote:

> > * How does your module record the difference between "2000", "JAN 2000"
> > and "1 JAN 2000"?
>
> How does your code distinguish between these cases?

It remembers the parts of the date (dmy, or my, or just y) that are
known, and only uses those components. It fabricates the missing day or
month where necessary, but these are never returned to the outside
world.

> Do you use one of the GEDCOM concepts About, Calculated, Estimated or
> Interpreted?

Not at this moment. "1900" is equivalent to "BET 1 JAN 1900 AND 31 DEC
1900", and "BET 1897 AND 1903" is roughly equivalent to "ABT 1900", but I
haven't (yet) added a method to convert these GEDCOM date strings to
each other.
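
The first of those equivalences could be sketched roughly as follows.
This is an illustration in plain Perl, not a method of Gedcom::Date: the
function names as_range and days_in_month are invented, only Gregorian
dates are assumed, and the day number is not validated against the
month.

```perl
use strict;
use warnings;

my @MONTHS = qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);
my %MONTH_NUMBER = map { $MONTHS[$_] => $_ + 1 } 0 .. $#MONTHS;

# Number of days in a Gregorian month, allowing for leap years.
sub days_in_month {
    my ( $year, $month ) = @_;
    return 29
        if $month == 2 && ( $year % 4 == 0 && $year % 100 != 0 || $year % 400 == 0 );
    return ( 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 )[ $month - 1 ];
}

# Hypothetical helper: expand "2000", "JAN 2000" or "1 JAN 2000" into the
# "BET ... AND ..." string covering every day the date could mean.
# Returns nothing if the input is not one of those three shapes.
sub as_range {
    my ($text) = @_;
    my ( $day, $month, $year ) =
        uc($text) =~ /^\s*(?:(?:(\d{1,2})\s+)?([A-Z]{3})\s+)?(\d{1,4})\s*$/
        or return;
    return if defined $month && !$MONTH_NUMBER{$month};

    # Missing parts widen the range instead of being invented.
    my $first_month = defined $month ? $MONTH_NUMBER{$month} : 1;
    my $last_month  = defined $month ? $MONTH_NUMBER{$month} : 12;
    my $first_day   = defined $day   ? $day : 1;
    my $last_day    = defined $day   ? $day : days_in_month( $year, $last_month );

    return sprintf 'BET %d %s %d AND %d %s %d',
        $first_day, $MONTHS[ $first_month - 1 ], $year,
        $last_day,  $MONTHS[ $last_month - 1 ],  $year;
}

print as_range('1900'), "\n";        # BET 1 JAN 1900 AND 31 DEC 1900
print as_range('FEB 2000'), "\n";    # BET 1 FEB 2000 AND 29 FEB 2000
```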

Eugene

Ron Savage

Sep 20, 2011, 6:25:27 PM
to List Gedcom
Hi Eugene

On Tue, 2011-09-20 at 20:45 +0200, Eugene van der Pijll wrote:
> Ron Savage wrote:
> > > * How does your module record the difference between "2000", "JAN 2000"
> > > and "1 JAN 2000"?
> >
> > How does your code distinguish between these cases?
>
> It remembers the parts of the date (dmy, or my, or just y) that are
> known, and only uses those components. It fabricates the missing day or
> month where necessary, but these are never returned to the outside
> world.

Yes, thanx. I have seen that code. I thought you might have added
something else recently.

> > Do you use one of the GEDCOM concepts About, Calculated, Estimated or
> > Interpreted?
>
> Not at this moment. "1900" is equivalent to "BET 1 JAN 1900 AND 31 DEC
> 1900", and "BET 1897 AND 1903" is roughly equivalent to "ABT 1900", but I
> haven't (yet) added a method to convert these GEDCOM date strings to
> each other.

Interesting. It suggests many approximate dates could be reported as a
range of dates. More to think about...

Ron Savage

Sep 20, 2011, 6:35:49 PM
to List Gedcom
Hi Eugene

On Tue, 2011-09-20 at 20:40 +0200, Eugene van der Pijll wrote:
> Paul Johnson wrote:
> > 2. As Stephen also notes, I didn't feel that the GEDCOM specification's
> > description of dates was sufficient anyway. So even if I, or someone
> > else, were to fully implement it, I didn't think it would be a full
> > solution.
>
> There are several things in the GEDCOM specification that are definitely
> missing: phrases like "FROM BEF 1820 TO ABT 1825" for example. It would
> be interesting to develop a more complete GEDCOM-like date grammar.

There's always the problem of getting people to stick to any 'standard'.

> > And then there's the question of what are you going to do with the
> > dates anyway?
>
> There are several things that a GEDCOM Date module (or a program) would
> want to do with a date:
>
> * Validate
> * Date math / checking, such as "is date B more than 16 years after date
> A?"
> * Text output, e.g. for report writing ("ABT APR 1820" => "around April 1820").
>
> All of these are still useful even with the more interesting GEDCOM date
> formats.
>
> For example, using my own module:
>
> use Gedcom::Date;
>
> my $birth = Gedcom::Date->parse("BET JUL 1820 AND JUL 1825");
> my $marr = Gedcom::Date->parse("BEF 1834");
>
> print "Too young at marriage\n"
>     if $birth->clone->add( years => 16 ) > $marr;
>
> You really want to be able to write such a validation rule, without
> having to treat all different GEDCOM date formats in your code
> explicitly.
>
> > Most tools would have no idea how to handle "Between May and July
> > 1678", let alone something like "Easter Sunday in either 1783 or 1785".
>
> That most tools can't handle the first date is no reason not to try to
> accept it in your own scripts. The second date is not really expressible
> in GEDCOM, except as an (unparsable) date phrase.

Seems to me we're talking about 2 different things:

o What researchers record, which is exported as a GEDCOM date.

o What syntax a parser provides to give programmers access to the date
data.

The more latitude the former have, the more complexity the latter needs.

> > (*) I say clear, but what about times and time zones, or calendar
> > changes? And no doubt there are other complexities. Rarely is
> > anything clear-cut in genealogy.
>
> GEDCOM was originally designed as an output scheme for genealogical
> conclusions. Calendar changes should have been handled by the
> genealogist or the program that created a GEDCOM file; when a date has
> been outputted to a GEDCOM, it refers to a definite date in a known
> calendar (either explicitly by a @#D...@ escape, or implicitly to the
> Gregorian calendar).
>
> Times are outside the scope of GEDCOM; the standard has defined no tags
> for them.
>
> So while times, time zones and calendar changes are problematical in
> genealogy, they shouldn't be a problem when interpreting a valid GEDCOM
> file.
>
> Eugene
>

--
