unicode issues...

Heidi Frank

Jul 22, 2015, 4:33:04 PM

to pymarc Discussion

Hi everyone,

So each time I encounter unicode issues, I really don't know how to troubleshoot and debug...

my python script iterates through a folder of MARCXML files, uses the pymarc.parse_xml_to_array(file_path) function, then iterates through that array writing each of the records out to an .mrc file (using a variable created by the pymarc.MARCWriter() function)

this part of the script looks like this:

-----------------------------------------------

marcRecsOut_orig_recs = pymarc.MARCWriter(file(aco_globals.batch_folder+'/'+batch_name+'_0_orig_recs.mrc', 'w'))

marcxml_dir = aco_globals.batch_folder+'/marcxml_in'

for filename in os.listdir(marcxml_dir):
    file_path = os.path.join(marcxml_dir,filename)
    if os.path.isfile(file_path):
        if file_path[-3:]=='xml':
            marc_xml_array = pymarc.parse_xml_to_array(file_path)
            for rec in marc_xml_array:
                rec = aco_functions.pad_008(rec)
                rec_001 = rec.get_fields('001')[0]
                print rec_001
                marcRecsOut_orig_recs.write(rec)

marcRecsOut_orig_recs.close()

-----------------------------------------------

I'm encountering an error for this line in the code:

marcRecsOut_orig_recs.write(rec)

resulting in this output:

=001 001356825
Traceback (most recent call last):
  File "aco-1-xml2mrc-oclc-nums.py", line 57, in <module>
    marcRecsOut_orig_recs.write(rec)
  File "/Library/Python/2.7/site-packages/pymarc/writer.py", line 40, in write
    self.file_handle.write(record.as_marc())
  File "/Library/Python/2.7/site-packages/pymarc/record.py", line 341, in as_marc
    field_data = field.as_marc(encoding=encoding)
  File "/Library/Python/2.7/site-packages/pymarc/field.py", line 205, in as_marc
    return (marc + END_OF_FIELD).encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1e63' in position 14: ordinal not in range(256)

That unicode character - \u1e63 - is for a lowercase s with a dot below it (see: http://www.fileformat.info/info/unicode/char/1e63/index.htm), which first appears in the 110 MARC field of the attached MARCXML file. I've used this exact python script numerous times on files of MARC records having Arabic script as well as some Persian and/or Turkish, etc, and it's hard to believe the script has no unicode issues with all the other non-Latin characters but has trouble with this one...

So I'm just trying to understand what's happening here... btw, I'm on Python 2.7, on a Mac.

Does anyone have any insight how to troubleshoot this error?

any help appreciated!

heidi

NNU_001356825_marcxml.xml

Mark A. Matienzo

Jul 22, 2015, 5:58:14 PM

to pym...@googlegroups.com

Hi Heidi,

It looks like your script may not be setting the encoding properly for the file handle associated with `marcRecsOut_orig_recs`. If you run the following, can you tell me what you receive back?

```

import sys; sys.stdout.encoding

```

This error is particularly telling:

```

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1e63' in position 14: ordinal not in range(256)

```

... because it says that it's trying to encode the data as `latin-1`, rather than `utf-8`. You may want to try changing the first line of the section of the script you shared to the following (with `import codecs` at the top of the script):

```

marcRecsOut_orig_recs = pymarc.MARCWriter(codecs.open(aco_globals.batch_folder+'/'+batch_name+'_0_orig_recs.mrc', 'w', 'utf-8'))

```
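For anyone reading along, here is a minimal sketch of what that traceback is complaining about. It uses no pymarc at all, just the character from the error message, and the variable names are made up for illustration. The `latin-1` codec only covers code points 0-255, so U+1E63 has no representation in it, while UTF-8 can encode it as three bytes.

```python
# U+1E63 is the "s with dot below" named in the traceback.
ch = u'\u1e63'

# latin-1 only covers code points 0-255, so this character cannot be encoded.
try:
    ch.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent any Unicode code point; this one takes three bytes.
utf8_bytes = ch.encode('utf-8')

print(latin1_ok)
print(len(utf8_bytes))
```

This only shows that the codec choice matters; it does not by itself say where pymarc picks `latin-1` from.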

Mark


Heidi P Frank

Jul 22, 2015, 9:54:35 PM

to pym...@googlegroups.com

aah, I will try that change to the MARCWriter parameters - didn't know you could do that (though I also still don't fully understand encoding/decoding, so not saying much... :)

thanks for the tip!

heidi

Heidi Frank
Electronic Resources & Special Formats Cataloger
New York University Libraries
Knowledge Access & Resources Management Services
20 Cooper Square, 3rd Floor
New York, NY 10003
212-998-2499 (office)
212-995-4366 (fax)
hf...@nyu.edu
Skype: hfrank71


Heidi Frank

Jul 23, 2015, 8:01:17 AM

to pymarc Discussion, mark.m...@gmail.com

Hi Mark,

So your suggested change in the code - to add the codecs.open(... 'utf-8') - *did* work for that error I originally encountered, but now, I'm getting the following ascii error for the 1st record being processed:

=001 000397895
Traceback (most recent call last):
  File "aco-1-xml2mrc-oclc-nums.py", line 57, in <module>
    marcRecsOut_orig_recs.write(rec)
  File "/Library/Python/2.7/site-packages/pymarc/writer.py", line 40, in write
    self.file_handle.write(record.as_marc())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 688, in write
    return self.writer.write(data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 422: ordinal not in range(128)

I tried to google and find out what the "0xca" character is - possibly the E with circumflex diacritic? (http://lwp.interglacial.com/appf_01.htm) - but that character doesn't appear in the offending record (there are lots of other non-Roman characters though, since these are Arabic language records) - I've attached the MARCXML record that is generating the error when trying to write it to the output .mrc file.

any ideas?

(this is where I start to go down rabbit holes I think... :)

heidi

NNU_000397895_marcxml.xml

Heidi Frank

Jul 23, 2015, 8:36:05 AM

to pymarc Discussion, mark.m...@gmail.com, hf...@nyu.edu

and one odd thing to note - I've just used the same exact scripts and workflow to process a different batch of 185 records - pretty much all are containing non-Roman Arabic characters - and did not encounter any unicode issues... (and I was using the original code without the "codecs.open ... 'utf-8'" addition)

so confused...

Godmar Back

Jul 23, 2015, 3:28:53 PM

to pym...@googlegroups.com

On Thu, Jul 23, 2015 at 8:01 AM, Heidi Frank <hf...@nyu.edu> wrote:

> Hi Mark,
> So your suggested change in the code - to add the codecs.open(... 'utf-8') - *did* work for that error I originally encountered, but now, I'm getting the following ascii error for the 1st record being processed:
>
> =001 000397895
> Traceback (most recent call last):
>   File "aco-1-xml2mrc-oclc-nums.py", line 57, in <module>
>     marcRecsOut_orig_recs.write(rec)
>   File "/Library/Python/2.7/site-packages/pymarc/writer.py", line 40, in write
>     self.file_handle.write(record.as_marc())
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 688, in write
>     return self.writer.write(data)
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 422: ordinal not in range(128)
>
> I tried to google and find out what the "0xca" character is - possibly the E with circumflex diacritic? (http://lwp.interglacial.com/appf_01.htm) - but that character doesn't appear in the offending record (there are lots of other non-Roman characters though, since these are Arabic language records) - I've attached the MARCXML record that is generating the error when trying to write it to the output .mrc file.

Nope, it's this one: http://www.fileformat.info/info/unicode/char/2bb/index.htm

It appears right before 'Abd' in this line: <subfield code="a">Ibn ʻAbd Rabbih, Aḥmad ibn Muḥammad,</subfield>

> any ideas?

The record's field (here the 100a field), in the way you're reading it, is stored as an 8-bit string (type 'str'). It should be of type 'unicode'.

The codec Mark suggested attempts to encode this str object as utf8. To do that, it first needs to be decoded, here using the system default encoding (ascii).

Since this 8-bit string in fact contains utf8-encoded Unicode, treating it as ASCII will fail. If you want to use a codec wrapper, you need to decode the record such that each field is of unicode type (there's a flag for that).
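A rough sketch of that decode step, written so it runs without pymarc: it takes the start of the 100 $a text, encodes it to UTF-8 bytes (what the 8-bit string would hold), and then performs explicitly the ASCII decode that Python 2 does implicitly when a byte string is handed to a codecs writer. The variable names are made up for illustration.

```python
# U+02BB is the character that appears right before 'Abd' in the 100 $a.
text = u'Ibn \u02bbAbd Rabbih'

# The bytes an 8-bit string holding this field would contain (UTF-8 encoded).
raw = text.encode('utf-8')

# 'Ibn ' takes four bytes, so index 4 is the lead byte of U+02BB.
first_bad_byte = bytearray(raw)[4]

# Decoding those bytes as ASCII fails at the first byte above 0x7f.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were actually written in works fine.
roundtrip = raw.decode('utf-8')

print(hex(first_bad_byte))
print(error)
```

The 0xca in the traceback is therefore not a character on its own but the first of the two UTF-8 bytes for U+02BB, which is why looking it up in a Latin-1 table points at the wrong character.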

Now what I don't see is why it is str rather than unicode. The parse_xml functions of pymarc ensure that all subfields' contents are stored as unicode.

Perhaps your pad_008 function accidentally stores a str there?

If not, post the entire code again.

- Godmar

Ed Summers

Jul 23, 2015, 4:08:28 PM

to pym...@googlegroups.com

Hi Heidi,

It looks like the record you supplied has a space in leader position 9 which indicates that the record uses MARC-8 encoding. For better or worse (mostly the latter here) this causes pymarc to use latin-1 when serializing the pymarc.Record. If you want it to serialize as Unicode (UTF-8) try changing the leader 9 to ‘a’. Here’s a somewhat simplified version of your script that should work on the record you sent.

import pymarc

out = pymarc.MARCWriter(open('out.dat', 'wb'))
for rec in pymarc.parse_xml_to_array('marc.xml'):
    # set leader position 9 to 'a' so the record is serialized as UTF-8
    rec.leader = rec.leader[0:9] + 'a' + rec.leader[10:]
    out.write(rec)
out.close()

Unfortunately changing a character in a string isn’t as easy as you’d think. Or perhaps there’s a better way? At any rate, this encoding stuff is really thorny, especially in pymarc where there’s this weird thing called MARC-8 that nobody else uses anymore. Make sure that whatever system you have downstream from your munging programs can work with unicode/utf8 MARC records. I suspect it probably does, but you never know. pymarc doesn’t convert back to MARC8 for philosophical reasons…
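The leader tweak can be tried on a plain string without pymarc. This is only a sketch: `mark_leader_utf8` and `codec_for_leader` are hypothetical helpers, not pymarc functions, and the second one merely mirrors the behaviour described above (position 9 set to 'a' means UTF-8, anything else falls back to latin-1). The leader value is made up.

```python
def mark_leader_utf8(leader):
    """Return a copy of a 24-character MARC leader with position 9 set to 'a'."""
    # Strings are immutable, so build a new one from the pieces around position 9.
    return leader[:9] + 'a' + leader[10:]


def codec_for_leader(leader):
    """Hypothetical helper: pick a codec from leader position 9."""
    return 'utf-8' if leader[9] == 'a' else 'latin-1'


# Made-up leader with a blank in position 9, i.e. flagged as MARC-8.
leader = '00000nam  2200000 a 4500'
fixed = mark_leader_utf8(leader)

print(codec_for_leader(leader))
print(codec_for_leader(fixed))

# With the flag set, the character from the first traceback encodes cleanly.
encoded = u'\u1e63'.encode(codec_for_leader(fixed))
```

Slicing keeps the leader at 24 characters, which matters because every position in it has a fixed meaning.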

//Ed


Heidi Frank

Jul 27, 2015, 5:00:11 PM

to pymarc Discussion, e...@pobox.com

Hi Godmar and Ed,

Thank you both for your suggestions! Ed, yes, I found that the record causing the first unicode error (versus the second ascii error) was coded as MARC8, not UTF8 - and changing that LDR byte fixed it.

Thanks so much for your time!

heidi

Ed Summers

Jul 27, 2015, 5:08:57 PM

to pym...@googlegroups.com

> On Jul 27, 2015, at 5:00 PM, Heidi Frank <hf...@nyu.edu> wrote:
>
> Ed, yes, I found that the record causing the first unicode error (versus the second ascii error) was coded as MARC8, not UTF8 - and changing that LDR byte fixed it.

Whew, I’m glad you got it working! What a headache this MARC stuff is eh?

//Ed

