Trouble Writing MARC Records with PyMARC

TM

Feb 1, 2013, 2:50:22 PM
to pym...@googlegroups.com
I have a project where I have to pull out specific MARC records from a very large dump.

I thought I would start with a toy example, as I am very new to Python and even newer to pyMARC: just a simple record-by-record copy of the dump. Open read and write filehandles, create reader and writer objects from pyMARC, then loop over the input, reading each record and writing it back out.
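In rough outline it looks like this (not my exact script; the file names here are placeholders):

import pymarc

readfilehandle = open('dump.mrc', 'rb')
writefilehandle = open('copy.mrc', 'wb')

reader = pymarc.MARCReader(readfilehandle)
writer = pymarc.MARCWriter(writefilehandle)

# copy every record from the dump into the new file
for record in reader:
    writer.write(record)

readfilehandle.close()
writefilehandle.close()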

This blows up not even a thousand records in.
...
File "C:\Python26\Lib\site-packages\pymarc\writer.py", line 40, in write
self.file_handle.write(record.as_marc())
File "C:\Python26\Lib\site-packages\pymarc\record.py", line 330, in as_marc
field_data = field_data.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)


Okay, I do some searching and change pymarc.MARCReader(readfilehandle) to pymarc.MARCReader(readfilehandle, to_unicode=True).

This blows up after a little over 388,000 records with:
...
File "C:\Python26\Lib\site-packages\pymarc\reader.py", line 87, in next
utf8_handling=self.utf8_handling)
File "C:\Python26\Lib\site-packages\pymarc\record.py", line 113, in __init__
utf8_handling=utf8_handling)
File "C:\Python26\Lib\site-packages\pymarc\record.py", line 298, in decode_marc
data = data.decode('utf-8', utf8_handling)
File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-4: invalid data


Okay, I do some more searching and change pymarc.MARCReader(readfilehandle, to_unicode=True) to pymarc.MARCReader(readfilehandle, to_unicode=True, force_utf8=True). This blows up at the same spot, after 388,000 records, with an identical error.

I try pymarc.MARCReader(readfilehandle, to_unicode=True, force_utf8=True, utf8_handling='replace'). This blows up after 435,000 records with:
...
Traceback (most recent call last):
File "C:\data\primo\rewrite.py", line 24, in <module>
writer.write(record)
File "C:\Python26\Lib\site-packages\pymarc\writer.py", line 40, in write
self.file_handle.write(record.as_marc())
File "C:\Python26\Lib\site-packages\pymarc\record.py", line 359, in as_marc
return self.leader.encode('utf-8') + directory.encode('utf-8') + fields
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 18: ordinal not in range(128)



utf8_handling='ignore' blows up after 435,000 records, as well.

I have read that pyMARC isn't used to write records as much as it is used to read records. Is there something I have missed?

Thanks in advance for everyone's time.

Ed Summers

Feb 2, 2013, 3:57:36 AM
to pym...@googlegroups.com
Nice work trying out the various flags; it is incredibly frustrating,
right? If you can share a few (or all) of these records, I'd be
interested to see what character set they are coded as being in.
That's position 9 in the leader, normally.
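For example, a quick tally of what the records claim to be (just a
sketch; 'dump.mrc' is a placeholder for your file, and it assumes a
plain byte-level read gets through the whole dump):

import pymarc

counts = {}
reader = pymarc.MARCReader(open('dump.mrc', 'rb'))
for record in reader:
    code = record.leader[9]   # ' ' = MARC-8, 'a' = UCS/Unicode (UTF-8)
    counts[code] = counts.get(code, 0) + 1
print(counts)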

Can you try opening your writer filehandle so that it will expect
utf-8 instead of 'ascii'?

import codecs
import pymarc

# open the output file so it encodes text as utf-8 rather than ascii
fh = codecs.open("new-marc.dat", "w", "utf-8")
writer = pymarc.MARCWriter(fh)

In general pymarc is useful for getting data out of MARC into
something like a database or XML, and is lacking in some write
functionality, notably the ability to write MARC-8 encoded data back
out. What are you trying to do with this large dump of data, other
than experimenting?

//Ed

Doug Kingston

Feb 3, 2013, 2:14:44 AM
to pym...@googlegroups.com
I use pymarc a lot for writing MARC records created from a bespoke database. The MARC is subsequently imported into Koha (http://koha-community.org). Once I got the character coding right, I have not really had any problems. I believe we are using MARC-21.
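For what it's worth, the pattern is roughly the following (only a sketch; the field contents and file name are made up, and setting leader position 9 to 'a' is how I flag the records as UTF-8):

# -*- coding: utf-8 -*-
import pymarc

record = pymarc.Record()
record.add_field(
    pymarc.Field(
        tag='245',
        indicators=['0', '0'],
        subfields=['a', u'An example title'],
    )
)
# mark the record as Unicode in the leader (position 9: 'a' = UCS/Unicode)
record.leader = record.leader[:9] + 'a' + record.leader[10:]

writer = pymarc.MARCWriter(open('for-koha.mrc', 'wb'))
writer.write(record)
writer.close()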

-Doug-

Godmar Back

Feb 3, 2013, 10:16:31 AM
to pym...@googlegroups.com

There are various choices made in pymarc with respect to when to decode from utf8 to unicode and vice versa; see this mailing list and GitHub for discussion.

In short, it's likely that your records have some deficiency that prevents them from being properly handled. Here's the workaround I recommend.

Wrap the reading and writing of each record in individual try/except blocks (you need to call the 'next' method of the iterator yourself). For one thing, this lets you progress past an error. In addition, remember the offset in the underlying file where the erroneous record is located. Then reopen the file, seek to that offset, create a MARC reader, and read and write just the offending record to a separate file; a sketch is below.

There's one surefire way to extract any record: turn utf8 decoding off on read, and on write make sure the leader field doesn't force utf8 encoding (right now pymarc respects this field no matter what, wrongly so in my opinion). Then use a different program (hexdump, or Terry Reese's MarcEdit) to make sense of the offending record(s).
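Roughly like this (untested; the file names are placeholders, and it uses the Python 2 pymarc API visible in your tracebacks):

import pymarc

fh = open('dump.mrc', 'rb')
reader = pymarc.MARCReader(fh, to_unicode=True, utf8_handling='replace')
writer = pymarc.MARCWriter(open('good.mrc', 'wb'))

bad_offsets = []

while True:
    offset = fh.tell()              # where the next record starts
    try:
        record = reader.next()
    except StopIteration:
        break                       # end of file
    except Exception:
        bad_offsets.append(offset)  # could not read this record; note it, move on
        continue
    try:
        writer.write(record)
    except Exception:
        bad_offsets.append(offset)  # could not write this record; note it, move on

print(bad_offsets)

Each of those offsets can then be passed to fh.seek() with a fresh reader to pull just that record's bytes out into a separate file for inspection.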

 - Godmar


