Write files in MARC-8/windows-1252

Patrick H

Jan 6, 2025, 11:27:07 AM
to pymarc Discussion
I've been using pymarc to experiment with doing authority control in Python by extracting controllable fields, identifying possible/likely matches in the LCNAF via their APIs, and then making edits to the bibs based on those matches. To do all this, I force utf-8 upon import so that the characters will play nice with the APIs. So far I'm able to do all of this searching and editing using pymarc and other libraries, but am hitting a snag when creating files for import to our ILS because it uses MARC-8.

I've been trying to work with the RawField class explicitly to see if I could force a conversion of the bytes while iterating through records (or Fields within a Record), but when I try to convert a Field to RawField I receive the error below (for context, recs[0] is the first record in a list I read in from a file that's been converted to utf-8 upon reading):

>>> RawField(recs[0]['245']).as_marc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\site-packages\pymarc\field.py", line 53, in __init__
    self.tag = f"{int(tag):03}"
                  ^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'Field'

I know there are ways of converting the records in MarcEdit after I save them, but I would prefer to incorporate this conversion into the code I've already written. Is there a way to retrieve a particular Field or Record object as bytes and use the built-in Python encode/decode methods to encode the data in a different character set?


Andrew Hankinson

Jan 6, 2025, 11:36:59 AM
to pym...@googlegroups.com
`RawField` is a subclass of `Field`. You're trying to pass a `Field` instance to the constructor, but that won't work: the constructor expects a tag as its first argument, given as a string or an int (e.g., "245"), not an existing field. That's why your code is failing.
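
The traceback above comes from the constructor normalizing its `tag` argument with `int(tag)`. A minimal stand-alone sketch of that one line shows why a tag string works and an arbitrary object does not. The helper name `normalize_tag` and the class `NotATag` are made up for illustration and are not part of pymarc:

```python
def normalize_tag(tag):
    """Mimic the `f"{int(tag):03}"` line shown in the traceback."""
    return f"{int(tag):03}"


class NotATag:
    """Hypothetical stand-in for any object that is not a str, bytes, or number."""


print(normalize_tag("245"))  # a tag string is accepted
print(normalize_tag(10))     # an int is zero-padded to three digits

try:
    normalize_tag(NotATag())
except TypeError as exc:
    # Same kind of failure as passing a Field instance where a tag is expected.
    print(f"TypeError: {exc}")
```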

The only thing that differentiates the two is the `as_marc` method. Internally they are otherwise exactly the same.


Do you have some sample data to share, or an example of the problems you're getting when you use `Field` directly?

-Andrew

Patrick H

Jan 6, 2025, 1:07:34 PM
to pymarc Discussion
Thanks for the response! Here's an example of the errors I get when I try to decode all the fields in a particular record from a sample file ('marc-8-test.mrc') that has already been converted to UTF-8 upon being read in. Since I'm getting the same error regardless of the encoding, I assume I'm just trying to decode the bytes at the wrong time, but I'm not sure how best to iterate through the fields in a record and change the encoding during or just before writing the data to a file. If sample data would be helpful, I can definitely send it!

>>> marc_8_recs = h.read_marc('marc-8-test.mrc') ## this function converts all incoming records to utf-8 using the attributes of MARCReader
>>> marc_8_recs[0]
<pymarc.record.Record object at 0x000001472661A610>
>>> [i.as_marc for i in marc_8_recs[0].fields]
[<bound method Field.as_marc of <pymarc.field.Field object at 0x0000014725799DA0>>, <bound method Field.as_marc of <pymarc.field.Field object at 0x0000014726454810>>, [...]]
>>> [i.decode('cp1252') for i in marc_8_recs[0].fields]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'Field' object has no attribute 'decode'
>>> [i.as_marc.decode('cp1252') for i in marc_8_recs[0].fields]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'function' object has no attribute 'decode'
>>> [i.as_marc.decode('ascii') for i in marc_8_recs[0].fields]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'function' object has no attribute 'decode'
>>> [i.as_marc.decode('utf-8') for i in marc_8_recs[0].fields]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'function' object has no attribute 'decode'

Andrew Hankinson

Jan 6, 2025, 2:09:27 PM
to pym...@googlegroups.com, pymarc Discussion
You’re missing the parentheses on the `as_marc()` method call.
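
To illustrate the difference, here is a small sketch using a made-up `FakeField` class (a stand-in for illustration only, not pymarc's `Field`). Without parentheses you get the method object itself, which has no `decode`; with parentheses you get the bytes the method returns:

```python
class FakeField:
    """Hypothetical stand-in for a field object; not pymarc's Field class."""

    def __init__(self, text):
        self.text = text

    def as_marc(self, encoding="utf-8"):
        # Return the field's text serialized as bytes in the given encoding.
        return self.text.encode(encoding)


field = FakeField("Example title")

method = field.as_marc            # no parentheses: the bound method itself
data = field.as_marc("utf-8")     # with parentheses: the bytes it returns

print(hasattr(method, "decode"))  # False, methods have no .decode()
print(isinstance(data, bytes))    # True
print(data.decode("utf-8"))       # Example title
```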

Patrick H

Jan 6, 2025, 3:00:08 PM
to pymarc Discussion
Ah okay, thanks for correcting that!  

I tried the following approach adding the parentheses, as below:

>>> new_rec = Record()
>>> for field in marc_8_recs[0].fields:
...     marc8_field = field.as_marc('cp1252')
...     new_rec.add_field(marc8_field)

Using this approach I was able to iterate through the list of sample Records and add them to a list (`marc_8_auths`). When I try to use the MARCWriter to write these modified records to a file for import, I get the following error:

>>> writer = MARCWriter(open('marc8-write-text.mrc','wb'))
>>> for auth in marc_8_auths:
...     writer.write(auth)
...

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Program Files\Python311\Lib\site-packages\pymarc\writer.py", line 124, in write
    self.file_handle.write(record.as_marc())
                           ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\site-packages\pymarc\record.py", line 446, in as_marc
    field_data = field.as_marc(encoding=encoding)
                 ^^^^^^^^^^^^^
AttributeError: 'bytes' object has no attribute 'as_marc'

I'm guessing this is because the writer is anticipating a Field object rather than bytes so maybe I screwed up by encoding them before writing. Is there a way to set the encoding when I call the MARCWriter? What's the best way to specify the encoding for a set of records prior to saving them to a file? 

Andrew Hankinson

Jan 7, 2025, 3:25:31 AM
to pym...@googlegroups.com
Well, looking at the code, I can't see that the `encoding` parameter on the RawField does anything other than raise a warning. All the other calls to `encode` are hardcoded to ascii.

So I think you're better off using just plain Field, since the `as_marc()` method there actually applies the encoding you specify.

I don't use the MARCWriter so I might be mistaken, but it looks like the writer uses the value in the leader to choose the encoding. That leaves you with only UTF-8 or iso8859-1, since those are the only options the `as_marc()` method on the record will use.
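
For reference, in MARC 21 the leader's position 09 declares the character coding scheme: a blank means MARC-8 and `a` means UCS/Unicode. The sketch below only inspects that flag in a leader string. The function name `declared_encoding` and the sample leaders are made up for illustration, and how pymarc maps the flag to a Python codec is as described above, not something this snippet verifies:

```python
def declared_encoding(leader):
    """Return the character coding scheme declared in leader position 09."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is exactly 24 characters long")
    flag = leader[9]
    if flag == "a":
        return "UCS/Unicode"
    if flag == " ":
        return "MARC-8"
    return f"unknown ({flag!r})"


# Two illustrative leaders that differ only at position 09.
unicode_leader = "00000nam a2200000 a 4500"
marc8_leader = "00000nam  2200000 a 4500"

print(declared_encoding(unicode_leader))  # UCS/Unicode
print(declared_encoding(marc8_leader))    # MARC-8
```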

As for your specific error: Yes, the writer is trying to write a field object, but you are passing it text (bytes), and the error is saying that this text has no method `as_marc()`. But other than that I don't know if I can help you any further! I haven't dealt with the MARC8 part of pymarc, but maybe somebody else here has?

-Andrew

Ed Summers

Jan 7, 2025, 10:30:19 AM
to pymarc Discussion
I'll admit I've lost the thread here. If you share the complete program and data it might be easier to follow along.

From what I've seen so far it looks like you are reading in MARC data as UTF-8 encoded, and then trying to convert each field to cp1252 (a Windows encoding) even though you said you wanted to convert it to MARC-8? Is the understanding here that they are somehow equivalent?
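
A quick way to see that the encodings are not interchangeable is to compare the bytes for a single accented letter. Python has no MARC-8 codec, so the MARC-8 bytes below are written out by hand, on the assumption that MARC-8 (ANSEL) represents the acute accent as a combining byte 0xE2 placed before the base letter. Treat this as a sketch, not a reference:

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

cp1252_bytes = text.encode("cp1252")  # one byte in Windows-1252
utf8_bytes = text.encode("utf-8")     # two bytes in UTF-8

# Assumption for illustration: MARC-8 stores the combining acute (0xE2)
# first, followed by the plain ASCII base letter "e".
marc8_bytes = b"\xe2e"

print(cp1252_bytes)                 # b'\xe9'
print(utf8_bytes)                   # b'\xc3\xa9'
print(cp1252_bytes == marc8_bytes)  # False

# The same MARC-8 bytes still decode "successfully" as cp1252, just to the
# wrong characters, which is how one can be mistaken for the other.
print(marc8_bytes.decode("cp1252"))
```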

We could consider adding an optional encoding parameter to pymarc.Record.as_marc() to override the logic that looks at the leader to determine the encoding. But as far as I know there is no native support for MARC-8 in Python?

Patrick H

Jan 7, 2025, 12:07:30 PM
to pymarc Discussion
Thanks for taking a look at this, Andrew and Ed. I shared the snippets above rather than the full code because it's a bit of a mess at the moment. The encoding issues occur only after I've done a bunch of parsing and editing of records with UTF-8 data from the APIs, when I need to convert the final, edited records to MARC-8 so that they can be ingested into our ILS. If sending the full file would be helpful, I'm happy to do so, but your summary of my question is accurate, Ed: I'd like to convert a set of records from utf-8 to MARC-8 and save them to a file for import.

I'm trying to convert them into ascii or cp1252 because MarcEdit's detection tools (we've used this to successfully convert files to MARC-8 in the past) indicate that MARC-8 has some overlap with those character sets. This is the first place I've worked where we have to use MARC-8 so I apologize if I've confused things through my own lack of understanding of encoding issues. I thought to try ascii or cp1252 because that's what MarcEdit uses in its own character detection tools (MarcEdit: Thinking about Charactersets and MARC) and there is native Python support for both. I was hoping to use Pymarc to take the data I have in utf-8 and convert it to an encoding our systems can handle, at least for records lacking those special characters, and maybe use the decoding exceptions raised through that process to identify records that require further editing.
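
The idea of flagging records that need further attention can be sketched without any encoding round trip, by checking which strings contain characters outside plain ASCII. This assumes that pure-ASCII text is safe for the target system. The function names and sample values are hypothetical, and in practice the strings would come from the fields of each record:

```python
def needs_marc8_review(text):
    """Return True if the text contains anything outside plain ASCII."""
    return not text.isascii()


def non_ascii_chars(text):
    """List the distinct non-ASCII characters in text, in order of appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen


# Hypothetical field values used only to exercise the helpers.
values = ["Moby Dick", "Les Mis\u00e9rables", "Dvo\u0159\u00e1k, Anton\u00edn"]

for value in values:
    if needs_marc8_review(value):
        print(f"review: {value!r} -> {non_ascii_chars(value)}")
    else:
        print(f"ascii-safe: {value!r}")
```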

I recognize this is a very niche (and unfortunate) need for our system to still be using this encoding, so if it's not feasible to encode Records as anything other than UTF-8 then I don't expect changes to PyMarc to accommodate that. I just thought perhaps the "as_marc()" or RawField tools in Pymarc might allow me to use the native support in Python to encode the bytes in one of those other encodings prior to saving the Records to a file. 

Ed Summers

Jan 7, 2025, 12:39:49 PM
to pym...@googlegroups.com


> On Jan 7, 2025, at 12:07 PM, Patrick H <pathar...@gmail.com> wrote:
>
> I recognize this is a very niche (and unfortunate) need for our system to still be using this encoding, so if it's not feasible to encode Records as anything other than UTF-8 then I don't expect changes to PyMarc to accommodate that. I just thought perhaps the "as_marc()" or RawField tools in Pymarc might allow me to use the native support in Python to encode the bytes in one of those other encodings prior to saving the Records to a file.

Thanks for this Patrick. Sadly, even after all these years of Unicode, it seems like it is not very niche to be working with an ILS that still only supports MARC-8 encoded records. Just out of curiosity what ILS are you working with? From your examples it looked like you were reading in UTF-8 encoded records? Were these exported from your ILS?

After reading the MarcEdit post you shared, it seemed like Terry was saying MARC-8 could be easily mistaken for cp1252 when trying to automatically determine the encoding of a MARC record, not that there was sufficient overlap to consider them equivalent? But I could be wrong about that.

If you want to experiment with having a parameter that would allow Record.as_marc('cp1252') I could try to add it to an experimental version of pymarc. My only worry with simply introducing it is that pymarc’s handling of encodings is already so complicated, and adding another knob will simply make it worse…

As an alternative, you might want to consider writing out your modified data as UTF-8 encoded with pymarc. Then you could use yaz-marcdump, which is part of the yaz toolkit [1] to convert your records to marc-8:

$ yaz-marcdump -f utf8 -t marc8 -o marc utf8-records.raw > marc8-records.raw

yaz has been around a long time, and has been heavily used over the years, so it should be as reliable as you can get for going backwards from UTF-8 to MARC-8. Maybe others have found different approaches to this, if so please chime in, it’s one of the most difficult areas to work with in pymarc.
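
If the conversion should run from the same Python script, one option is to call the command above through `subprocess`. This is only a sketch: it assumes yaz-marcdump is installed and on the PATH, uses exactly the flags from the command above, and the function names `build_yaz_command` and `convert_to_marc8` are made up for illustration:

```python
import shutil
import subprocess


def build_yaz_command(source):
    """Build the yaz-marcdump argument list from the command quoted above."""
    return ["yaz-marcdump", "-f", "utf8", "-t", "marc8", "-o", "marc", source]


def convert_to_marc8(source, destination):
    """Run yaz-marcdump and write its standard output to destination."""
    if shutil.which("yaz-marcdump") is None:
        raise FileNotFoundError("yaz-marcdump was not found on PATH")
    with open(destination, "wb") as out:
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(build_yaz_command(source), stdout=out, check=True)


# Show the command that would be run for the file names used above.
print(build_yaz_command("utf8-records.raw"))
```

Calling `convert_to_marc8("utf8-records.raw", "marc8-records.raw")` after the UTF-8 file has been written would then mirror the shell redirect.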

//Ed

[1] https://www.indexdata.com/resources/software/yaz/

Patrick H

Jan 7, 2025, 1:22:26 PM
to pymarc Discussion
Thanks so much for this tip Ed! I haven't worked with yaz before but will look into it. It seems like I was misreading Terry's post so I appreciate your clarification, as it saved me going down the wrong path on my own. The character conversion tool in MarcEdit identifies MARC-8 as either ascii or cp1252 so I was treating them as similar, if not entirely equivalent, but I think that was a misreading on my part. The previous libraries I worked at had used utf-8 so I never had to worry much about encoding issues in other programming projects. 

We use Horizon for our ILS. In case it's of interest, I'm working on a tool to do some basic authority control tasks in Python now that Marcive has shut down. The records I've been using for testing/development are from either our ILS or govdocs records from OCLC, both of which were encoded in MARC-8. I have a function that uses the encoding tools in the MARCReader to force a conversion to utf-8 at the time of reading so that I can extract controllable fields, perform searches against various LC APIs, and then make any required edits to the records based on searches and save an authority record for import. I've been able to parse the results of the APIs, make edits to Records/Fields based on the results from the LCNAF, and grab an authority record in MARC-21 format from the marcxml available from LC but was having to convert the records in MarcEdit afterwards and was hoping to remove that step if possible. It sounds like converting them after the records have been saved makes more sense than doing anything in Pymarc and that would definitely work for us. The code for the above is still quite messy and currently only capable of handling fairly obvious matches, but if it's of interest I'd be happy to share. 

I don't blame you at all for not wanting to mess with any of the encoding issues. The features to get things into utf-8 and parsing xml have worked splendidly in this project, so thanks to you and all others who have contributed to them! 