
reading and writing of utf-8 with marc::batch


Eric Lease Morgan

Mar 26, 2013, 4:22:03 PM
to perl...@perl.org

For the life of me I can't figure out how to do reading and writing of UTF-8 with MARC::Batch.

I have a UTF-8 encoded file of MARC records. Dumping the records and grepping for a particular string shows that the encoding is valid:

$ marcdump und.marc | grep Sainte-Face
und.marc
1000 records
2000 records
3000 records
4000 records
5000 records
6000 records
7000 records
8000 records
9000 records
10000 records
11000 records
12000 records
245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
610 20 _aArchiconfrérie de la Sainte-Face
13000 records
$

I then run a Perl script that simply reads each record and dumps it to STDOUT. Notice how I define both my input and output as UTF-8:

#!/shared/perl/current/bin/perl

# configure
use constant MARC => './und.marc';

# require
use strict;
use MARC::Batch;

# initialize
binmode ( MARC, ":utf8" );
my $batch = MARC::Batch->new( 'USMARC', MARC );
$batch->strict_off;
$batch->warnings_off;
binmode( STDOUT, ":utf8" );

# read & write
while ( my $marc = $batch->next ) { print $marc->as_usmarc }

# done
exit;

But my output is munged:

$ ./marc.pl > und.mrc
$ marcdump und.mrc | grep Sainte-Face
und.mrc
1000 records
2000 records
3000 records
4000 records
5000 records
6000 records
7000 records
8000 records
9000 records
10000 records
11000 records
12000 records
245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
610 _aArchiconfrérie de la Sainte-Face
13000 records
$

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame

574/631-8604



Paul Hoffman

Mar 26, 2013, 5:11:22 PM
to perl...@perl.org
On Tue, Mar 26, 2013 at 04:22:03PM -0400, Eric Lease Morgan wrote:
> For the life of me I can't figure out how to do reading and writing of
> UTF-8 with MARC::Batch.
>
> I have a UTF-8 encoded file of MARC records. Dumping the records and
> greping for a particular string illustrates the validity:
>
> $ marcdump und.marc | grep Sainte-Face

What is marcdump?

> 245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
> 610 20 _aArchiconfrérie de la Sainte-Face
> 13000 records
> $
>
> I then run a Perl script that simply reads each record and dumps it to
> STDOUT. Notice how I define both my input and output as UTF-8:

Try *not* calling binmode and see what happens. Or just call
binmode(MARC) without the ':utf8' layer.

> 245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
> 610 _aArchiconfrérie de la Sainte-Face
> 13000 records
> $

This looks like double-encoding:

00000000 6c 27 41 72 63 68 69 63 6f 6e 66 72 c3 83 c2 a9 |l'Archiconfr....|
00000010 72 69 65 |rie|

LATIN SMALL LETTER E WITH ACUTE is supposed to be c3 a9 (as it is in the
first marcdump output) not c3 83 c2 a9.
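
A minimal sketch of how that four-byte sequence can arise, assuming the UTF-8 bytes were treated as Latin-1 characters and then encoded as UTF-8 a second time somewhere along the way. The variable names are illustrative only, and this is not a claim about where in the pipeline it actually happens:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "e with acute" (U+00E9) as correctly encoded UTF-8: the two bytes c3 a9.
my $good = "\xc3\xa9";

# Simulate the mistake: treat each UTF-8 byte as a Latin-1 character,
# then encode those characters as UTF-8 a second time.
my $double = encode('UTF-8', decode('ISO-8859-1', $good));

# Undo it by reversing the steps: decode UTF-8, then map the resulting
# characters back to single bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));

printf "good:     %s\n", join ' ', unpack('(H2)*', $good);      # c3 a9
printf "double:   %s\n", join ' ', unpack('(H2)*', $double);    # c3 83 c2 a9
printf "repaired: %s\n", join ' ', unpack('(H2)*', $repaired);  # c3 a9
```

The repair step only makes sense for data that really has been encoded twice; applied to correct UTF-8 it would damage the text.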

Paul.

--
Paul Hoffman <nku...@nkuitse.com>

Timothy Prettyman

Mar 26, 2013, 5:35:45 PM
to Eric Lease Morgan, perl...@perl.org
Do your records have the utf8 encoding byte set in the LDR? (Byte 9 should
be 'a' for utf8).
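
One way to check this without going through MARC::Batch at all is to read the raw records and look at the tenth byte of each. This is a rough sketch; leader_09 and count_leader_09 are hypothetical helper names, and it assumes the file is plain ISO 2709 with the usual 0x1D record terminator:

```perl
use strict;
use warnings;

# Return the character coding scheme byte (Leader/09) of one raw MARC record.
sub leader_09 {
    my ($raw) = @_;
    return substr($raw, 9, 1);
}

# Tally the Leader/09 values found in a file of raw MARC records.
sub count_leader_09 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = "\x1d";    # MARC record terminator
    my %count;
    while (my $raw = <$fh>) {
        next if length($raw) < 24;    # skip trailing whitespace or fragments
        $count{ leader_09($raw) }++;
    }
    close $fh;
    return \%count;
}
```

Calling count_leader_09('und.marc') returns a hash reference keyed by the Leader/09 values seen, so a key of ' ' (blank) alongside 'a' would mean the file mixes records flagged as MARC-8 and records flagged as UTF-8.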

-Tim

Timothy Prettyman
University of Michigan Library/LIT

Leif Andersson

Mar 26, 2013, 5:57:01 PM
to Eric Lease Morgan, perl...@perl.org
Hi Eric,

my first guess would be your terminal is not utf8.
If you comment out
#binmode( STDOUT, ":utf8" );
and that does the trick, then you can start looking for how to change your terminal settings.
(And that can sometimes be a rather frustrating task, I'm afraid)

/Leif Andersson
Stockholm UL
________________________________________
From: Eric Lease Morgan [emo...@nd.edu]
Sent: 26 March 2013 21:22
To: perl...@perl.org
Subject: reading and writing of utf-8 with marc::batch

Jon Gorman

Mar 27, 2013, 10:01:37 AM
to Eric Lease Morgan, perl...@perl.org
Ok, I can't claim to be an expert, but from my own experience, I'd say
Paul is very likely right about double-encoding occurring. However,
the question ends up being where that happens, and in this case I
suspect how MARC::Batch will work could depend heavily on what version
of perl you're running and what version of MARC::Batch you're running.
Knowing those versions might help too (I'd try to be on a later version
of perl and the latest of MARC::Batch). It also depends on how you're
generating the MARC record, which isn't really clear to me.

It could also be the leaders or the terminal, as others have suggested.

One piece of advice is not to trust the terminal directly but pipe
into xxd. (And if possible, just try transforming the offending
record). Or use yaz-marcdump -v, which will also give the hex if I
remember correctly. (If it's c3 a9 in both cases, you know the
terminal is at fault)

Then try doing that without the binmode, w/ binmode :raw, etc.
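
If xxd isn't handy, roughly the same check can be done from Perl itself. This is only a sketch; show_bytes is a hypothetical helper name, and it simply renders every byte outside printable ASCII as <XX> so the terminal's own encoding never comes into play:

```perl
use strict;
use warnings;

# Render every byte outside printable ASCII as <XX>, so the result can be
# inspected on any terminal regardless of its encoding settings.
sub show_bytes {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/([^\x20-\x7e])/sprintf '<%02X>', ord $1/ge;
    return $out;
}

print show_bytes("Archiconfr\xc3\xa9rie"), "\n";          # Archiconfr<C3><A9>rie
print show_bytes("Archiconfr\xc3\x83\xc2\xa9rie"), "\n";  # Archiconfr<C3><83><C2><A9>rie
```

The first line is what correct UTF-8 looks like; the second is the double-encoded form.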

Jon Gorman

Galen Charlton

Mar 27, 2013, 10:38:57 AM
to Jon Gorman, Eric Lease Morgan, perl...@perl.org
Hi,

On Wed, Mar 27, 2013 at 7:01 AM, Jon Gorman <jonatha...@gmail.com> wrote:

> One piece of advice is not to trust the terminal directly but pipe
> into xxd. (And if possible, just try transforming the offending
> record). Or use yaz-marcdump -v, which will also give the hex if I
> remember correctly. (If it's c3 a9 in both cases, you know the
> terminal is at fault)
>

Another trick is to pipe the output through less with the LESSCHARSET
environment variable set to 'ascii'. Bytes whose value is less than 32 or
greater than 126 will be displayed as reverse-video hexadecimal numbers,
e.g.,

<subfield code="a">Garci<CC><81>a Ma<CC><81>rquez, Gabriel,</subfield>

Regards,

Galen
--
Galen Charlton
gmch...@gmail.com

Eric Lease Morgan

Mar 27, 2013, 1:26:15 PM
to perl...@perl.org

On Mar 26, 2013, at 5:57 PM, Leif Andersson <Leif.An...@sub.su.se> wrote:

> my first guess would be your terminal is not utf8.

While I'm not positive my terminal is doing UTF-8, I think it is. When I dump in the beginning the output to the terminal is correct. After I run my script the output to the same terminal is incorrect.

--
Eric Lease Morgan

Galen Charlton

Mar 27, 2013, 1:27:54 PM
to Eric Lease Morgan, perl...@perl.org
Hi Eric,

On Wed, Mar 27, 2013 at 10:26 AM, Eric Lease Morgan <emo...@nd.edu> wrote:

> While I'm not positive my terminal is doing UTF-8, I think it is. When I
> dump in the beginning the output to the terminal is correct. After I run my
> script the output to the same terminal is incorrect.
>

Would you be willing to put up a link to your MARC file? I'm willing to
take a quick look to see if I can reproduce the problem you're seeing.

Eric Lease Morgan

Mar 27, 2013, 2:20:25 PM
to perl...@perl.org

A number of people have alluded to the problem of double encoding, and I'm beginning to think this is true.

I have isolated a number of problem records. They all contain diacritics, but they do not have an "a" in position #9 of the leader -- http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file contains UTF-8 characters for me?

For these same records I have also added an "a" in position #9 and created a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc

Is it true that original.marc is not denoted correctly, but fixed.marc is denoted correctly?

--
Eric Morgan

Shelley Doljack

Mar 27, 2013, 2:16:49 PM
to Eric Lease Morgan, perl...@perl.org
Whenever I see characters like é, I consult this website http://www.i18nqa.com/debug/bug-utf-8-latin1.html to help me figure out what's going on. You might find it helpful too.

Shelley
--
Shelley Doljack
E-Resources Metadata Librarian
Metadata Department
Stanford University Libraries
sdol...@stanford.edu
650-725-0167

Galen Charlton

Mar 27, 2013, 2:28:02 PM
to Eric Lease Morgan, perl...@perl.org
Hi,

On Wed, Mar 27, 2013 at 11:20 AM, Eric Lease Morgan <emo...@nd.edu> wrote:

> I have isolated a number of problem records. They all contain diacritics,
> but they do not have an "a" in position #9 of the leader --
> http://dh.crc.nd.edu/tmp/original.marc Can someone verify that the file
> contains UTF-8 characters for me?
>

I've eyeballed it and confirm that the encoding of that file is UTF-8.

> For these same records I have also added an "a" in position #9 and created
> a similar file -- http://dh.crc.nd.edu/tmp/fixed.marc


I've looked this over as well.


> Is it true that original.marc is not denoted correctly, but fixed.marc is
> denoted correctly?
>

Yes. The Leader/09 must be set to 'a' if the character encoding in use is
UTF-8.

Eric Lease Morgan

Mar 27, 2013, 4:59:01 PM
to perl...@perl.org

On Mar 27, 2013, at 2:20 PM, Eric Lease Morgan <emo...@ND.EDU> wrote:

> A number of people have alluded to the problem of double encoding, and I'm beginning to think this is true.

When it calls as_usmarc, I think MARC::Batch tries to honor the value set in position #9 of the leader. In other words, if the leader is empty, then it tries to output records as MARC-8, and when the leader is a value of "a", it tries to encode the data as UTF-8.

If I employ binmode( OUTFILE, ":utf8"), and the output is already UTF-8, then double encoding happens.
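
Here is a small sketch of that effect in isolation, using in-memory filehandles instead of real files. write_with_layer is a hypothetical helper, and the example assumes (as I suspect, but have not confirmed) that as_usmarc hands back a string of already-encoded UTF-8 bytes:

```perl
use strict;
use warnings;

# Write the given bytes through a handle opened with the given PerlIO layer
# and return what would have landed on disk.
sub write_with_layer {
    my ($layer, $bytes) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "Cannot open in-memory file: $!";
    print {$fh} $bytes;
    close $fh;
    return $buffer;
}

my $e_acute = "\xc3\xa9";    # "e with acute" already encoded as UTF-8 bytes

printf "raw:  %s\n", unpack 'H*', write_with_layer(':raw',  $e_acute);
printf "utf8: %s\n", unpack 'H*', write_with_layer(':utf8', $e_acute);
```

With the :raw layer the two bytes pass through untouched. With :utf8 each byte is treated as a character and encoded again, which yields the c3 83 c2 a9 sequence Paul spotted.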

To test this theory, I fixed a number of records in my batch. Specifically, I inserted the letter "a" in position #9 of the leader. I then ran my processing file WITHOUT the employment of binmode, and my output was correct. For example, look at all the glorious characters in the following URL:

http://www.catholicresearch.net/vufind/Record/undmarc_001906501

--
Eric Lease Morgan
Hesburgh Libraries

Eric Lease Morgan

Mar 27, 2013, 5:11:26 PM
to perl...@perl.org

On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan <emo...@nd.edu> wrote:

> When it calls as_usmarc, I think MARC::Batch tries to honor the value set in position #9 of the leader. In other words, if the leader is empty, then it tries to output records as MARC-8, and when the leader is a value of "a", it tries to encode the data as UTF-8.

How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set?

Put another way, how can I determine whether or not position #9 of a given MARC leader is accurate? If position #9 is an "a", then how can I read the balance of the record to determine whether or not all the characters really and truly are UTF-8 encoded?

--
Eric "This Is Almost Too Much For Me" Morgan

Shelley Doljack

Mar 27, 2013, 5:52:43 PM
to Eric Lease Morgan, perl...@perl.org
I use MarcEdit to view records and check if the mnemonic form of a diacritic (e.g. {eacute}) appears or not and what the LDR/09 value is. That's the best way I've come up with so far. MarcEdit is pretty good at guessing what the character encoding is without relying on the LDR/09 value. I think there are some perl modules you could use that "guess" what the encoding is of a character but I've never used them. I'm interested in finding out other methods (preferably automated) for detecting wrong or mixed character encodings in a MARC record.

Shelley

----- Original Message -----
> From: "Eric Lease Morgan" <emo...@nd.edu>
> To: perl...@perl.org

Galen Charlton

Mar 28, 2013, 12:49:31 AM
to Eric Lease Morgan, perl...@perl.org
Hi,


On Wed, Mar 27, 2013 at 2:11 PM, Eric Lease Morgan <emo...@nd.edu> wrote:

> Put another way, how can I determine whether or not position #9 of a given
> MARC leader is accurate? If position #9 is an "a", then how can I read the
> balance of the record to determine whether or not all the characters really
> and truly are UTF-8 encoded?
>

The following program will read a file of MARC records from standard input
and classify each as either being valid UTF-8 or not.

___START____
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

binmode STDIN, ':bytes';

$/ = "\035"; # MARC record terminator
my $i = 0;
while (<>) {
    $i++;
    my $bytes = $_;
    # decode() croaks on malformed input when FB_CROAK is set
    eval {
        my $utf8str = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
    };
    if ($@) {
        print "Record $i definitely not valid UTF-8\n";
    } else {
        print "Record $i is valid UTF-8\n";
    }
}
___END____

Ashley Sanders

Mar 28, 2013, 5:18:14 AM
to Eric Lease Morgan, perl...@perl.org
Eric,

> How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set?

You can use a regex to check if a string is utf-8. There are various examples
floating around the internet. An example is the one here:

http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_, ^^, and ^] to the ASCII part
of the expression in the above page. (I think the w3c example is aimed at XML1.0
in which the MARC control characters are not allowed.)
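
As a rough sketch, the adapted expression might look like the following. The pattern follows the one on the W3C page with 0x1D, 0x1E and 0x1F added to the ASCII alternative; the names $UTF8_MARC and looks_like_utf8 are made up for illustration:

```perl
use strict;
use warnings;

# Byte-level pattern for well-formed UTF-8, with the MARC delimiters
# 0x1D, 0x1E and 0x1F allowed alongside printable ASCII.
my $UTF8_MARC = qr{
    \A (?:
        [\x09\x0A\x0D\x1D-\x1F\x20-\x7E]     # ASCII plus MARC delimiters
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1 to 3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4 to 15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )* \z
}x;

# Return 1 if the raw bytes are well-formed UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ $UTF8_MARC ? 1 : 0;
}
```

Note that a check like this only tells you the bytes are well-formed UTF-8. Double-encoded text (c3 83 c2 a9 for an e-acute) is still well-formed, so it passes too.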

Ashley.
--
Ashley Sanders a.sa...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of Manchester

Eric Lease Morgan

Mar 28, 2013, 2:49:01 PM
to perl...@perl.org

Thank you for all the input, and I think I have resolved my particular issue. Battle won. War still raging.

Using the script suggested by Galen as a starting point, I wrote the following hack outputting integers denoting MARC records containing non-UTF-8 characters, but the script output nothing; all the data in all of my records was encoded as UTF-8:

#!/usr/bin/perl

# require
use strict;
use Encode;

# initialize
binmode STDIN, ":bytes";
$/ = "\035";
my $i = 0;

# read STDIN
while ( <> ) {

    # increment
    $i++;

    # check validity; decode() croaks on malformed UTF-8, whereas
    # is_utf8() only returns true or false and never dies
    my $bytes = $_;
    eval { my $utf8str = Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); };

    # check for error
    if ( $@ ) { print "Record $i contains non-UTF-8 characters\n"; }

}

# done
exit;


Since all of the data in all of my records was UTF-8, then all of the leaders of all of the records need to have a value of "a" set in position #9 of the leader. So I wrote the following hack (circumventing MARC::Batch):

#!/usr/bin/perl

# require
use strict;

# initialize
binmode STDIN, ":bytes";
binmode STDOUT, ":bytes";
$/ = "\035";

# loop through the input
while ( <> ) {

    # do the work and output
    substr( $_, 9, 1 ) = "a";
    print $_;

}

# done
exit;


I then fed the output of my fix routine to my indexing routine, and all of my problems seemed to go away. GIGO?

I'm still not sure, but I think deep within MARC::Batch some sort of encoding is observed, honored, and output. And when the denoted encoding is not true and things like binmode( FILE, ":utf8" ) get called, output gets munged. Again, I'm not sure. It is almost exhausting.
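
For what it's worth, the two hacks above could be folded into one step, so the leader is only changed for records that really do decode as UTF-8. This is a sketch only; flag_if_utf8 is a hypothetical helper name, and it works on raw records rather than going through MARC::Batch:

```perl
use strict;
use warnings;
use Encode ();

# Set Leader/09 to "a" only when the whole record decodes as strict UTF-8.
# Returns the (possibly modified) record and a flag saying whether it was valid.
sub flag_if_utf8 {
    my ($raw) = @_;
    my $copy  = $raw;    # decode() may modify its argument when a check is given
    my $valid = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
    substr($raw, 9, 1) = 'a' if $valid && length($raw) >= 24;
    return ($raw, $valid);
}
```

It would be called once per record inside the same kind of loop as above (with $/ set to "\035" and both handles in :bytes mode), printing the returned record and perhaps logging the ones that come back invalid. Pure ASCII records also count as valid UTF-8 here, so they get flagged "a" as well.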


--
Eric Morgan
University of Notre Dame






Gianluca Drago

Nov 5, 2013, 2:37:51 AM
to perl...@perl.org