
Gedcom files encoding


Jose Joao Dias de Almeida

Sep 5, 2012, 8:07:29 AM
to perl-...@perl.org
Dear Gedcom-ers,
I just started with Gedcom.pm and things are beginning to work!

But I have problems with files in Unicode.

When files are in UTF-8 with a BOM, it returns an error on the first line (the BOM).

If I remove the BOM and try again, apparently it does not pay
attention to "1 CHAR UTF-8"

Is there anything extra I need to specify in these cases?
Best regards,
J.Joao

0 HEAD
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG Portuguese
1 SOUR MYHERITAGE
...

Ron Savage

Sep 5, 2012, 6:25:41 PM
to perl-...@perl.org
Hi Jose
I just checked the source code of Gedcom.pm V 1.16, and the only
reference to utf8 is on line 390, where it is writing an XML file.

So, you're right, of course.

What do you think should happen?

Perhaps if the code detects 1 CHAR UTF-8 the input file should be closed
and re-opened in utf-8 mode, yes?


--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

Jose Joao Dias de Almeida

Sep 6, 2012, 12:06:25 PM
to Ron Savage, perl-...@perl.org


On 09/05/2012 11:25 PM, Ron Savage wrote:
> Perhaps if the code detects 1 CHAR UTF-8 the input file should be closed
> and re-opened in utf-8 mode, yes?

I think that would solve the problem.

Probably a simple
if(/1 CHAR UTF-8/){ binmode(...) }
would also work.

One extra thing:
Some Unicode files begin with a byte order mark (BOM) that signals
which Unicode encoding is in use.

Bytes         Encoding form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8

It would be nice if we could at least skip/ignore them (for example,
MyHeritage tools generate GEDCOM files with BOMs).

eg:

## remove BOM (assumes $ged holds the raw, undecoded bytes of the file)
$ged =~ s/^(\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\xFF\xFE|\xFE\xFF|\xEF\xBB\xBF)//;
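
If it is also useful to know which BOM was found, the same table can drive a small lookup instead of a single regex. This is only a sketch: split_bom() is a hypothetical helper name, not something Gedcom.pm provides, and it assumes the input is the raw, undecoded bytes of the file.

```perl
use strict;
use warnings;

# BOM signatures from the table above. The 4-byte marks come first so that
# UTF-32 little-endian (FF FE 00 00) is not mistaken for UTF-16 little-endian (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: takes raw bytes, returns
# (encoding name or undef, the bytes with any BOM removed).
sub split_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($mark, $name) = @$entry;
        if (substr($bytes, 0, length $mark) eq $mark) {
            return ($name, substr($bytes, length $mark));
        }
    }
    return (undef, $bytes);
}

my ($enc, $rest) = split_bom("\xEF\xBB\xBF0 HEAD\n");
print defined $enc ? "BOM: $enc\n" : "no BOM\n";
```

Returning the encoding name leaves the caller free to decide whether to decode, ignore, or reject the file.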

Best regards,
J.Joao

Ron Savage

Sep 6, 2012, 8:20:57 PM
to Jose Joao Dias de Almeida, perl-...@perl.org
Hi Jose
I believe binmode does not work after the file has been read. That is,
it must be executed immediately after opening the file, before the 1st read.

But see File::BOM. Gedcom needs to use that.

The string 1 CHAR UTF-8 obviously must be read in before knowing the
file is intended to be utf8, so File::BOM won't solve that problem.
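
One way to sidestep the ordering problem, sketched here under some assumptions, is to skip I/O layers entirely: read the file as raw bytes, look at the header, and decode afterwards. The names decode_gedcom() and slurp_raw() are hypothetical and not part of Gedcom.pm; the sketch handles only UTF-8 and relies on the header tags being plain ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes the raw bytes of a GEDCOM file and returns a
# character string if the header declares UTF-8, otherwise the bytes unchanged.
sub decode_gedcom {
    my ($bytes) = @_;

    # Drop a leading UTF-8 BOM, if any.
    $bytes =~ s/\A\xEF\xBB\xBF//;

    # The header tags are ASCII, so they can be matched before decoding.
    if ($bytes =~ /^1\s+CHAR\s+UTF-?8\s*$/m) {
        return decode('UTF-8', $bytes);
    }
    return $bytes;
}

# Hypothetical helper: read a whole file without any layer touching the bytes.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

my $sample = "\xEF\xBB\xBF0 HEAD\n1 CHAR UTF-8\n0 \@I1\@ INDI\n"
           . "1 NAME Jo\xC3\xA3o /Almeida/\n";
my $text = decode_gedcom($sample);
printf "%d bytes in, %d characters out\n", length($sample), length($text);
```

With a real file the call would be decode_gedcom(slurp_raw($path)). Slurping the whole file is a simplification; a parser that reads line by line would need a different arrangement.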



Michael Ionescu

Sep 7, 2012, 12:05:02 AM
to Ron Savage, Jose Joao Dias de Almeida, perl-...@perl.org
Hi guys,

to my understanding, current versions of perl have utf-8 support
activated by default to recognize utf-8 files and treat them
accordingly. Even variables containing utf-8 encoded characters are
internally marked and treated as such in matching operations etc. So I
believe reopening the file will not be necessary. There are loads of
webpages that actually refer to older perl versions that did not have
the current level of utf-8 support and therefore propagate all manner of
workarounds no longer necessary.

The BOM is permissible, but superfluous on utf-8 files. That is unless
they originated in a different encoding and are being altered and
returned to the sender. So unless Gedcom.pm intends to be able to export
encodings other than ASCII or utf-8, it should be perfectly safe to
simply delete/ignore the BOM when it is ASCII- or utf-8-encoded.

I personally wouldn't ignore it when encoded otherwise as that might
lead to problems later on in processing the actual data. In those cases
an error upon processing the BOM should be just what the doctor ordered.
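
As a rough illustration of that policy (drop a UTF-8 BOM quietly, refuse anything that looks like UTF-16 or UTF-32), something along these lines could work. check_bom() is a hypothetical name, not an existing Gedcom.pm function, and the input is assumed to be raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: removes a UTF-8 BOM, dies on a UTF-16 or UTF-32 BOM,
# and returns BOM-less input untouched.
sub check_bom {
    my ($bytes) = @_;
    return $bytes if $bytes =~ s/\A\xEF\xBB\xBF//;

    # FF FE covers both UTF-16 and UTF-32 little-endian.
    if ($bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE|\xFE\xFF)/) {
        die "Unsupported encoding: file starts with a UTF-16 or UTF-32 BOM\n";
    }
    return $bytes;
}

my $ok  = check_bom("\xEF\xBB\xBF0 HEAD\n");           # BOM removed
my $bad = eval { check_bom("\xFF\xFE0\x00 \x00") };    # dies inside eval
print "rejected: $@" unless defined $bad;
```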

I'm not an authority on utf-8, it's simply how I understand
http://www.perlmonks.org/?node_id=599720
https://en.wikipedia.org/wiki/Byte_Order_Mark
http://www.unicode.org/faq/utf_bom.html#22
applied to this situation.

If I should be wrong I would appreciate being corrected as I will be
spending some time on the processing of utf-8 gedcom files using
Gedcom.pm in the immediate future.

Michael

Ron Savage

Sep 7, 2012, 2:21:37 AM
to perl-...@perl.org
Hi Michael

On 07/09/12 14:05, Michael Ionescu wrote:
> Hi guys,
>
> to my understanding, current versions of perl have utf-8 support
> activated by default to recognize utf-8 files and treat them

By default? IMHO, not so. "It depends" might be a better response. See:

http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

which kicks off 46 articles by Tom Christiansen on utf8.
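
A quick way to see the "it depends" part is to read the same bytes with and without an explicit encoding layer. This is only a small demonstration, and it assumes Perl is run without -C, PERL_UNICODE, or a "use open" pragma that would change the default layers.

```perl
use strict;
use warnings;

# "abraço" encoded as UTF-8: the "ç" occupies two bytes (C3 A7).
my $bytes = "abra\xC3\xA7o";

# Without an encoding layer, Perl hands back the bytes as they are.
open my $raw, '<', \$bytes or die "Cannot open in-memory file: $!";
my $undecoded = <$raw>;
close $raw;

# With an explicit layer, the same bytes are decoded into characters.
open my $utf8, '<:encoding(UTF-8)', \$bytes or die "Cannot open in-memory file: $!";
my $decoded = <$utf8>;
close $utf8;

printf "no layer: %d, with layer: %d\n", length($undecoded), length($decoded);
# prints "no layer: 7, with layer: 6"
```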

Ron Savage

Sep 7, 2012, 9:51:06 PM
to Jose Joao Dias de Almeida, perl-...@perl.org
Hi Jose

On 07/09/12 23:30, Jose Joao Dias de Almeida wrote:
> Ron,
> any solution you came up is fine!

Thanks for the details on BOMs.

I've logged a bug report for Gedcom re utf8.

If Paul can't do the work right now, I hope he'll let us know, and I'll
see if he wants to hand maintenance of the module over to me.

A couple of minutes ago I logged a bug report for Tree::DAG_Node, and
offered to maintain that one too.

I already do this for a few other modules...

This list does not indicate the original authors:

https://metacpan.org/author/RSAVAGE

The 4 I did not write are:

DBIx::Tree
GraphViz
HTML::Entities::Interpolate
Set::Array