Fixing reading exceptions while reading from reader


Theodor

Nov 25, 2021, 10:34:48 AM

to pymarc Discussion

Hi
I am building tools for large sets of MARC records, and I often come across sets where anything from a couple of records up to a couple of million records raise encoding errors.
With these amounts of data, you cannot tweak the reader with force_utf8 and to_unicode. What I was thinking was that, based on the reader's current_chunk property, you could instantiate a new record with altered UTF settings, using the options in the Record.__init__ method.

I can imagine there being something else going on with my test collection, but I wonder what you think and how you would propose I solve this. Here is a gist of my current code: https://gist.github.com/fontanka16/b1825611ad5d5635376f879640cfbdd9

The core is this:

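from pymarc import Record

# `reader` is the MARCReader set up earlier (see the gist)
for idx, marc_record in enumerate(reader, start=1):
    if marc_record:
        # Do stuff with the successfully parsed record
        pass
    elif "'ascii' codec can't" in str(reader.current_exception):
        # Parsing failed: retry on the raw bytes with other settings
        try:
            print(f"trying to handle {reader.current_exception}")
            rec = Record(
                data=reader.current_chunk,
                to_unicode=True,
                force_utf8=True,
            )
            print(rec.title())
        except Exception as ee:
            print(ee)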

The result is that, whatever settings I try on the record, the chunk seems to be treated the same way, and the output from the above code is:

trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
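
To narrow this down, it might help to see where in the raw chunk the non-ASCII bytes actually sit. Below is a minimal sketch, assuming the standard MARC 21 layout (a 24-byte leader, with the base address of data stored in leader positions 12-16). locate_non_ascii is a hypothetical helper, not part of pymarc, and it only inspects the bytes; it does not repair anything:

```python
LEADER_LEN = 24  # a MARC 21 leader is always 24 bytes


def locate_non_ascii(chunk: bytes) -> list:
    """Return (offset, byte, region) for every non-ASCII byte in a raw record.

    region is "leader", "directory" or "data", judged from the base
    address of data stored in leader positions 12-16.
    """
    try:
        base_address = int(chunk[12:17])
    except ValueError:
        # Base address is unreadable (assumption: treat everything
        # after the leader as field data in that case).
        base_address = LEADER_LEN

    hits = []
    for offset, byte in enumerate(chunk):
        if byte < 0x80:
            continue  # plain ASCII, nothing to report
        if offset < LEADER_LEN:
            region = "leader"
        elif offset < base_address:
            region = "directory"
        else:
            region = "data"
        hits.append((offset, byte, region))
    return hits
```

Calling locate_non_ascii(reader.current_chunk) inside the elif branch would then show whether the offending bytes are in the leader, the directory, or the field data.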

This feels like something stupid; I just cannot figure out what. Does anyone have a clue?
