Encoding issues of doom

Ruth Tillman

Jun 20, 2019, 1:19:34 PM

to pymarc Discussion

I'm trying to work through some very large MARC extracts from our catalog. I've been running into the most frustrating errors related to encoding. Using Python 3.

Right now I'm just printing titles as a way of playing whack-a-mole. I've tried messing around with the encoding flags but can't seem to find something that actually works.

Here's just a sample of what I'm getting from the MarcEdit character analysis tool:

Record # 0: ASCII, Confidence: 1
Record # 1: ASCII, Confidence: 1
Record # 2: ASCII, Confidence: 1
Record # 3: ASCII, Confidence: 1
Record # 4: UTF-8, Confidence: 0.505
Record # 5: ASCII, Confidence: 1
Record # 6: UTF-8, Confidence: 0.505
Record # 7: UTF-8, Confidence: 0.7525
Record # 8: UTF-8, Confidence: 0.505
Record # 9: ASCII, Confidence: 1
Record # 10: ASCII, Confidence: 1

...

Record # 538079: ASCII, Confidence: 1
Record # 538080: UTF-8, Confidence: 1
Record # 538081: ASCII, Confidence: 1
Record # 538082: UTF-8, Confidence: 0.505
Record # 538083: ASCII, Confidence: 1
Record # 538084: ASCII, Confidence: 1
Record # 538085: ASCII, Confidence: 1
Record # 538086: UTF-8, Confidence: 0.505
Record # 538087: UTF-8, Confidence: 0.505

Meanwhile, the code. When the script is:

from pymarc import MARCReader

# marcFile is the path to one of the extract files
with open(marcFile, "rb") as marc:
    reader = MARCReader(marc, to_unicode=False, utf8_handling='replace')
    for record in reader:
        print(record.title())

or to_unicode=False

or to_unicode=True, utf8_handling='replace'

or to_unicode=True, utf8_handling='ignore'

I eventually get:

for record in reader:
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\reader.py", line 101, in __next__
    utf8_handling=self.utf8_handling)
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 74, in __init__
    utf8_handling=utf8_handling)
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This shows up in different variations depending on the file, but almost always at some point. (I have a bunch of extract files, each about 500,000 records, and some are OK.)

If I use to_unicode=True, force_utf8=True

I get:

self.leader = marc[0:LEADER_LEN].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: ordinal not in range(128)

If I use plain old:

with open(marcFile, "rb") as marc:
    reader = MARCReader(marc, to_unicode=True)
    for record in reader:
        print(record.title())

I get:

for record in reader:
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\reader.py", line 101, in __next__
    utf8_handling=self.utf8_handling)
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 74, in __init__
    utf8_handling=utf8_handling)
File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 312, in decode_marc
    data = data.decode('utf-8', utf8_handling)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 3: invalid continuation byte

Unfortunately the files are enormous catalog dumps, so a) I can't attach them, and b) I can't really share 500,000 records without running it by someone.

Anyone have suggestions? Setting the catalog on fire is, unfortunately, not an option. I've been messing around a bit in MarcEdit with MarcBreaker, and now that I've done the character analysis, I was thinking of trying the character set conversion. Thoughts?

Ruth

Geoffrey Spear

Jun 20, 2019, 1:53:19 PM

to pym...@googlegroups.com

Unfortunately, I don't think pymarc currently has any good way to parse records with the specific problem in the first traceback, which looks like non-ascii subfield codes.
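For anyone wanting to find the affected records before handing the file to pymarc, here is a rough sketch that scans the raw bytes for a subfield delimiter (0x1F) followed by a non-ASCII byte. It assumes the file is small enough to read into memory and that records end with the standard record terminator (0x1D); find_bad_subfield_codes is a made-up helper name, not part of pymarc, and the sample bytes are invented.

```python
# Sketch only: locate subfield codes that are not plain ASCII.
# find_bad_subfield_codes is a hypothetical helper, not a pymarc function.
RECORD_TERMINATOR = b"\x1d"
SUBFIELD_DELIMITER = b"\x1f"


def find_bad_subfield_codes(data):
    """Return (record_index, offset_in_record, code_byte) for suspect codes."""
    problems = []
    for index, record in enumerate(data.split(RECORD_TERMINATOR)):
        pos = record.find(SUBFIELD_DELIMITER)
        while pos != -1:
            # Slice instead of index so a trailing delimiter gives b"".
            code = record[pos + 1:pos + 2]
            if code and code[0] > 0x7F:
                problems.append((index, pos + 1, code))
            pos = record.find(SUBFIELD_DELIMITER, pos + 1)
    return problems


# Invented sample: the second "record" has 0xC2 where a code should be.
good = b"hdr\x1e\x1faTitle\x1e\x1d"
bad = b"hdr\x1e\x1f\xc2\xb1Bad\x1e\x1d"
problems = find_bad_subfield_codes(good + bad)
print(problems)
```

With a real extract you would read the bytes with open(path, "rb") and pass them in; the record index should then help with looking the record up in MarcEdit.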

There's been some discussion of this on the issue tracker in the past, but I don't think anyone's come up with a reasonable way of fixing this sort of thing yet.

I think setting utf8_handling or using to_unicode=False should eliminate the second problem (or at least hide it and mangle your data :( )
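To make the "hide it and mangle your data" part concrete: the traceback shows utf8_handling being passed straight to bytes.decode as the error handler, so the effect can be tried in plain Python. This is only an illustration with an invented byte string, and the Latin-1 line is an assumption about what the data might be, not a diagnosis.

```python
# Sketch: what each error handler does with a byte that is not valid UTF-8.
raw = b"caf\xe8 x"  # invented sample; 0xe8 would be "è" if this were Latin-1

strict_error = None
try:
    raw.decode("utf-8", "strict")
except UnicodeDecodeError as err:
    strict_error = str(err)  # same kind of error as in the traceback

replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is silently dropped
as_latin1 = raw.decode("latin-1")          # only correct if the data really is Latin-1

print(strict_error)
print(repr(replaced), repr(ignored), repr(as_latin1))
```

Neither 'replace' nor 'ignore' recovers the original character, so both are better suited to getting through a file than to producing clean output.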

--
You received this message because you are subscribed to the Google Groups "pymarc Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pymarc+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pymarc/f02c6551-3cf3-4c65-b3df-63bbf2295538%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ruth Tillman

Jun 20, 2019, 2:10:28 PM

to pymarc Discussion

It's entirely possible that we have something like an ñ subfield, which I found and fixed in one of our 245s. Making our own catalog software back in the day was a heck of a trip.


Ruth Tillman

Jun 28, 2019, 10:42:15 AM

to pymarc Discussion

Update -- I was able to solve my problem by getting a much smaller cat dump of just the maps records, in which I was able to fix the few errors which occurred. Worked for this purpose, but I do wish I could run pymarc against the entire cat extract.

Geoffrey Spear

Jul 3, 2019, 9:50:37 AM

to pym...@googlegroups.com

https://github.com/Wooble/pymarc/tree/issue89 is an attempt to make this work, taking a rather crude approach: it just tries to apply Unicode normalization to the subfield code to pull out an ASCII letter. This may or may not fix your specific issues, depending on what exactly is wrong with various records in the file. It might also just fix the first exception you got and hit some other terrible encoding problems right after that :)
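The general idea of pulling an ASCII letter out through normalization can be sketched with the standard library alone. This is not the code in the branch, just an illustration of the technique; ascii_code_from is a hypothetical name, and it assumes the bytes passed in are the UTF-8 encoding of the one character sitting where the subfield code should be.

```python
import unicodedata


def ascii_code_from(raw):
    """Best-effort guess at an ASCII subfield code from a character's bytes.

    Hypothetical helper: decomposes the character (NFKD) and returns the
    first ASCII letter or digit found, or None if there is none.
    """
    text = raw.decode("utf-8", "replace")
    for char in unicodedata.normalize("NFKD", text):
        if ord(char) < 128 and char.isalnum():
            return char
    return None


print(ascii_code_from("ñ".encode("utf-8")))  # decomposes to "n" + combining tilde
print(ascii_code_from(b"a"))                 # already ASCII
print(ascii_code_from(b"\xc2\xb1"))          # "±" has no decomposition
```

Characters with no letter underneath them, like the last case, still come back with nothing, which fits the caveat that this may only fix some records.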
