I'm trying to work through some very large MARC extracts from our catalog using Python 3, and I keep running into frustrating encoding errors. Right now I'm just printing titles as a way of playing whack-a-mole; I've tried various combinations of the encoding flags but can't find one that actually works.
Here's just a sample of what I'm getting from the MarcEdit character analysis tool:
Record # 0: ASCII, Confidence: 1
Record # 1: ASCII, Confidence: 1
Record # 2: ASCII, Confidence: 1
Record # 3: ASCII, Confidence: 1
Record # 4: UTF-8, Confidence: 0.505
Record # 5: ASCII, Confidence: 1
Record # 6: UTF-8, Confidence: 0.505
Record # 7: UTF-8, Confidence: 0.7525
Record # 8: UTF-8, Confidence: 0.505
Record # 9: ASCII, Confidence: 1
Record # 10: ASCII, Confidence: 1
...
Record # 538079: ASCII, Confidence: 1
Record # 538080: UTF-8, Confidence: 1
Record # 538081: ASCII, Confidence: 1
Record # 538082: UTF-8, Confidence: 0.505
Record # 538083: ASCII, Confidence: 1
Record # 538084: ASCII, Confidence: 1
Record # 538085: ASCII, Confidence: 1
Record # 538086: UTF-8, Confidence: 0.505
Record # 538087: UTF-8, Confidence: 0.505
Meanwhile, here's the code I'm running:
with open(marcFile, "rb") as marc:
    reader = MARCReader(marc, to_unicode=False, utf8_handling='replace')
    for record in reader:
        print(record.title())
or to_unicode=False
or to_unicode=True, utf8_handling='replace'
or to_unicode=True, utf8_handling='ignore'
I eventually get:
    for record in reader:
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\reader.py", line 101, in __next__
    utf8_handling=self.utf8_handling)
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
The exact variation depends on the file, but the error almost always shows up at some point. (I have a bunch of extract files of about 500,000 records each, so some of them are fine.)
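One debugging idea I've been sketching, since pymarc dies partway through: split the raw file on the MARC record terminator (0x1D) and try a strict UTF-8 decode of each record, so I at least get the record numbers that are bad before handing anything to pymarc. Pure stdlib, untested against the real dumps:

```python
# Rough pre-scan sketch: find which raw records are not valid UTF-8.
# Record numbering here is just "position in the file", which may not
# line up exactly with MarcEdit's numbering.

MARC_RECORD_TERMINATOR = b"\x1d"

def find_bad_records(raw: bytes):
    """Return (index, error message) pairs for records that fail a strict UTF-8 decode."""
    bad = []
    for i, rec in enumerate(raw.split(MARC_RECORD_TERMINATOR)):
        if not rec:
            continue  # trailing empty chunk after the last terminator
        try:
            rec.decode("utf-8")
        except UnicodeDecodeError as e:
            bad.append((i, str(e)))
    return bad
```

I'd run it with something like `find_bad_records(open(marcFile, "rb").read())`, though at 500,000 records per file I might need to read in chunks instead of slurping the whole thing.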
If I use to_unicode=True, force_utf8=True, I get:
    self.leader = marc[0:LEADER_LEN].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: ordinal not in range(128)
If I use plain old:
with open(marcFile, "rb") as marc:
    reader = MARCReader(marc, to_unicode=True)
    for record in reader:
        print(record.title())
    for record in reader:
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\reader.py", line 101, in __next__
    utf8_handling=self.utf8_handling)
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "C:\Users\rkt6\AppData\Local\Programs\Python\Python36\lib\site-packages\pymarc\record.py", line 312, in decode_marc
    data = data.decode('utf-8', utf8_handling)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 3: invalid continuation byte
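Since force_utf8 blows up in the leader itself, I was also thinking of checking what each record claims to be: MARC21 leader position 9 is the character coding scheme flag ('a' means Unicode, blank means MARC-8), so records whose flag disagrees with their actual bytes would explain the mixed MarcEdit results. A rough sketch:

```python
# Sketch: report the encoding a record *claims* via its leader.
# This only reads the flag; it doesn't verify the record's actual bytes.

def claimed_encoding(record_bytes: bytes) -> str:
    """MARC21 leader/09: b"a" declares UCS/Unicode, blank declares MARC-8."""
    return "UTF-8" if record_bytes[9:10] == b"a" else "MARC-8"
```

Comparing this flag against a strict UTF-8 decode of the same record should separate "mislabeled MARC-8" records from genuinely corrupt ones.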
Unfortunately the files are enormous catalog dumps, so (a) I can't attach them, and (b) I can't really share 500,000 records without running it by someone.
Anyone have suggestions? Setting the catalog on fire is, unfortunately, not an option. I've been experimenting a bit in MarcEdit with MarcBreaker, and now that I've run the character analysis I was thinking of trying its character-set conversion. Thoughts?
Ruth