Unknown encoding AttributeError: 'str' object has no attribute 'decode'


Marcus Scalpere

Mar 24, 2018, 3:32:21 PM
to pymarc Discussion
Hi, I'm trying to convert a MARC file (ISO 2709) to MARCXML, but without success. The file is probably encoded in cp1250 or cp1252, and it may also contain damaged data. I've followed the documentation and the advice here on the forum. Please help, this is driving me crazy. Thanks

import codecs
from io import BytesIO
from pymarc import MARCReader, XMLWriter

def convert_to_xml(iso_source: str):
    with codecs.open(iso_source, "rb", "cp1250") as sourceFile:
        reader = MARCReader(sourceFile, force_utf8=True, to_unicode=True)
        memory = BytesIO()
        writer = XMLWriter(memory)

        for record in reader:
            writer.write(record)
        writer.close(close_fh=False)
        return memory

NUV_2016_test.ISO

Dan Scott

Mar 25, 2018, 8:03:46 AM
to pym...@googlegroups.com
I don't believe pymarc knows how to deal with MARC files in encodings other than MARC8 or UTF8, per "pydoc pymarc.reader".

Python's codecs package isn't aware of a MARC record's leader and directory, so that approach will lead to corrupted records.
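To make that concrete, the leader and directory store byte lengths and offsets, and re-encoding single-byte cp1250 text as UTF-8 changes the byte counts those numbers describe. A quick illustration (the sample name is just a hypothetical example):

```python
# The MARC leader declares the record length in bytes, and directory
# entries hold byte offsets into the data. Re-encoding cp1250 text as
# UTF-8 changes byte counts, so those stored numbers no longer match.
text = "Dvořák"  # hypothetical sample containing accented characters
print(len(text.encode("cp1250")))  # 6 bytes: one byte per character
print(len(text.encode("utf8")))    # 8 bytes: ř and á take two bytes each
```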

You're better off using yaz-marcdump, which is both MARC-sensitive and aware of many different encodings, to convert the encoding of the file first:

yaz-marcdump -f cp1252 -t utf8 -o marc -l 9=97 NUV_2016_test.ISO > NUV_2016_test.yaz

That will convert your CP1252-encoded records to UTF8-encoded records with the correct leaders and directories for character lengths and offsets, and also (via "-l 9=97") set leader[09] to 'a' to tell processors to treat the data as UTF8.

Compare the result of that conversion to the corresponding codecs approach:

import codecs

def convert_encoding():
    # Naive whole-file transcode: decodes cp1252 and re-encodes as UTF-8
    # without touching the leader or directory of any record.
    iso_source = 'NUV_2016_test.ISO'
    utf_out = 'NUV_2016_test.utf8'
    with codecs.open(iso_source, mode="rb", encoding="cp1252") as sourceFile:
        data = sourceFile.read()
    with codecs.open(utf_out, mode="wb", encoding="utf8") as outFile:
        outFile.write(data)

cmp will tell you that the length declared in the leader of the first record differs between the yaz-marcdump converted file and the Python codecs-converted file, and the offsets for the fields will be incorrect in the Python version as well.
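As a sketch of what cmp is flagging, here's a toy length-prefixed "record" (not valid MARC beyond the length field) put through the same naive transcode:

```python
# The leader's first five bytes declare the record length as zero-padded
# ASCII digits. Re-encoding the body changes its byte count but not the
# declared value, so the two disagree afterwards. (Toy record only.)
body = "Dvořák".encode("cp1250")
record = b"%05d" % (5 + len(body)) + body   # leader declares 00011

# The codecs-style transcode: decode cp1250, re-encode as UTF-8.
new_body = body.decode("cp1250").encode("utf8")
converted = record[:5] + new_body

declared = int(converted[:5])   # still 11
actual = len(converted)         # now 13: the body grew by two bytes
print(declared, actual)
```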

Dan Scott

Mar 25, 2018, 8:19:52 AM
to pym...@googlegroups.com
I should note that https://github.com/edsu/pymarc/pull/86 looks like it could add the ability to support reading arbitrary file encodings to pymarc. Might be worth trying out if you have a large set of records to work with and want to keep it all in Python (and if it works for you, add a comment to the pull request!)

Edward Summers

Mar 25, 2018, 12:58:19 PM
to pym...@googlegroups.com


> On Mar 25, 2018, at 8:19 AM, Dan Scott <den...@GMAIL.COM> wrote:
>
> I should note that https://github.com/edsu/pymarc/pull/86 looks like it could add the ability to support reading arbitrary file encodings to pymarc. Might be worth trying out if you have a large set of records to work with and want to keep it all in Python (and if it works for you, add a comment to the pull request!)

Nice find Dan. This looks like a good one to merge in right?

//Ed

Dan Scott

Mar 26, 2018, 11:55:26 AM
to pym...@googlegroups.com
I guess I could give it a try myself :)