Hi
I am building tools for large sets of MARC records, and I often come across collections where anything from a couple of records up to a couple of million records fail with encoding errors.
With these amounts of data, you cannot manually tweak the reader with the force_utf8 and to_unicode options. What I was thinking was that you could, based on the current_chunk property, instantiate a new record with the altered UTF settings, using the options in the Record.__init__ method.
I can imagine there being something else going on with my test collection, but I wonder what you think and how you would propose I solve this. Here is a gist of my current code:
https://gist.github.com/fontanka16/b1825611ad5d5635376f879640cfbdd9
The core is this:
    for idx, marc_record in enumerate(reader, start=1):
        if marc_record:
            # DO stuff
        elif "'ascii' codec can't" in str(reader.current_exception):
            try:
                print(f"trying to handle {reader.current_exception}")
                rec = Record(
                    data=reader.current_chunk,
                    to_unicode=True,
                    force_utf8=True)
                print(rec.title())
            except Exception as ee:
                print(ee)
The result is that whatever settings I try on the record, the chunk seems to get treated the same, and the output from the above code becomes:
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
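The positions in these messages (22, 3, 28, 25) are all small, which would be consistent with the offending bytes sitting in the 24-byte leader or in the directory rather than in the field data. A minimal sketch for checking that on the raw chunk, using only the standard library. `find_non_ascii` is a hypothetical helper, not part of pymarc, and the idea that these structural regions are decoded as ASCII regardless of to_unicode and force_utf8 is an assumption, not something confirmed here:

```python
LEADER_LEN = 24          # a MARC 21 leader is always 24 bytes
END_OF_FIELD = b"\x1e"   # field terminator; it also closes the directory


def find_non_ascii(chunk: bytes) -> list:
    """List (region, position, byte) for non-ASCII bytes in leader/directory.

    Hypothetical diagnostic helper. Positions are relative to the start
    of each region, not to the start of the record.
    """
    # The directory runs from the end of the leader to the first terminator.
    dir_end = chunk.find(END_OF_FIELD, LEADER_LEN)
    if dir_end == -1:
        dir_end = len(chunk)  # no terminator: treat the rest as directory
    regions = {
        "leader": chunk[:LEADER_LEN],
        "directory": chunk[LEADER_LEN:dir_end],
    }
    hits = []
    for name, data in regions.items():
        for pos, byte in enumerate(data):
            if byte > 0x7F:  # anything above 0x7F is outside ASCII
                hits.append((name, pos, hex(byte)))
    return hits
```

Calling `find_non_ascii(reader.current_chunk)` inside the `elif` branch would show whether the reported byte sits in the leader or the directory. If it does, the record structure itself is damaged, and changing the text-encoding options alone may not be enough.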
This feels like something stupid, but I just cannot figure out what. Does anyone have a clue?