Fixing reading exceptions while reading from reader

Theodor

Nov 25, 2021, 10:34:48 AM
to pymarc Discussion
Hi
I am building tools for large sets of MARC records, and I often come across anything from a couple of records up to a couple of million records that raise encoding errors.
With these amounts of data, you cannot manually tweak the reader with the force_utf8 and to_unicode options. What I was thinking was that you could, based on the current_chunk property, instantiate a new record with altered UTF settings, using the options in the Record.__init__ method.

I can imagine there being something else going on with my test collection, but I wonder what you think and how you propose I solve this. Here is a gist of my current code: https://gist.github.com/fontanka16/b1825611ad5d5635376f879640cfbdd9

The core is this:
from pymarc import Record

# reader is a pymarc MARCReader created earlier
for idx, marc_record in enumerate(reader, start=1):
    if marc_record:
        pass  # DO stuff
    elif "'ascii' codec can't" in str(reader.current_exception):
        try:
            print(f"trying to handle {reader.current_exception}")
            # Re-parse the raw bytes of the failed record with explicit UTF-8 settings
            rec = Record(
                data=reader.current_chunk,
                to_unicode=True,
                force_utf8=True,
            )
            print(rec.title())
        except Exception as ee:
            print(ee)
The result is that whatever settings I try on the record, the chunk seems to get treated the same, and the output from the above code becomes:
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
'ascii' codec can't decode byte 0xcc in position 22: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
trying to handle 'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)
'ascii' codec can't decode byte 0xe2 in position 28: ordinal not in range(128)

This feels like something stupid, I just cannot figure out what.  Does anyone have a clue?
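
One thing that might help narrow this down: several of the reported positions (22, 3) fall inside the first 24 bytes of the chunk, which is where the MARC leader lives. The sketch below uses only the standard library to list non-ASCII bytes in the leader of a raw chunk and to reproduce the same error message. The helper names (non_ascii_positions, ascii_error) and the sample bytes are made up for illustration. Whether pymarc decodes the leader and directory as ASCII regardless of to_unicode and force_utf8 is an assumption here, to be checked against the pymarc source.

```python
from typing import List, Optional, Tuple

LEADER_LEN = 24  # a MARC 21 leader is always 24 bytes long


def non_ascii_positions(chunk: bytes, limit: int = LEADER_LEN) -> List[Tuple[int, int]]:
    """Return (position, byte value) for every non-ASCII byte in chunk[:limit]."""
    return [(pos, b) for pos, b in enumerate(chunk[:limit]) if b > 0x7F]


def ascii_error(chunk: bytes, limit: int = LEADER_LEN) -> Optional[str]:
    """Return the error message from decoding chunk[:limit] as ASCII, or None."""
    try:
        chunk[:limit].decode("ascii")
    except UnicodeDecodeError as err:
        return str(err)
    return None


# Made-up chunk: a plausible leader with a stray 0xcc byte at position 22.
sample = b"00714cam a2200205 a 45" + b"\xcc" + b"0" + b"rest of record"

print(non_ascii_positions(sample))
print(ascii_error(sample))
```

In the loop above, the same helpers could be called with reader.current_chunk. If the leader turns out to contain non-ASCII bytes, the chunk itself is probably damaged or misaligned, and changing the UTF-8 options on Record would not be expected to help.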
