Hi everyone,
Every time I run into Unicode issues, I really don't know how to troubleshoot and debug them.
My Python script iterates through a folder of MARCXML files, parses each one with the pymarc.parse_xml_to_array(file_path) function, then iterates through the resulting array and writes each record out to an .mrc file (using a writer object created by the pymarc.MARCWriter() function).
This part of the script looks like this:
-----------------------------------------------
marcRecsOut_orig_recs = pymarc.MARCWriter(file(aco_globals.batch_folder+'/'+batch_name+'_0_orig_recs.mrc', 'w'))
marcxml_dir = aco_globals.batch_folder+'/marcxml_in'
for filename in os.listdir(marcxml_dir):
    file_path = os.path.join(marcxml_dir,filename)
    if os.path.isfile(file_path):
        if file_path[-3:]=='xml':
            marc_xml_array = pymarc.parse_xml_to_array(file_path)
            for rec in marc_xml_array:
                rec = aco_functions.pad_008(rec)
                rec_001 = rec.get_fields('001')[0]
                print rec_001
                marcRecsOut_orig_recs.write(rec)
marcRecsOut_orig_recs.close()
-----------------------------------------------
I'm encountering an error for this line in the code:
marcRecsOut_orig_recs.write(rec)
resulting in this output:
=001 001356825
Traceback (most recent call last):
  File "aco-1-xml2mrc-oclc-nums.py", line 57, in <module>
    marcRecsOut_orig_recs.write(rec)
  File "/Library/Python/2.7/site-packages/pymarc/writer.py", line 40, in write
    self.file_handle.write(record.as_marc())
  File "/Library/Python/2.7/site-packages/pymarc/record.py", line 341, in as_marc
    field_data = field.as_marc(encoding=encoding)
  File "/Library/Python/2.7/site-packages/pymarc/field.py", line 205, in as_marc
    return (marc + END_OF_FIELD).encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1e63' in position 14: ordinal not in range(256)
That Unicode character, \u1e63, is a lowercase s with a dot below it (see: http://www.fileformat.info/info/unicode/char/1e63/index.htm). It first appears in the 110 MARC field of the attached MARCXML file. I've used this exact Python script numerous times on files of MARC records containing Arabic script, as well as some Persian and/or Turkish, so it's hard to believe the script handles all those other non-Latin characters without Unicode issues but has trouble with this one.
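For what it's worth, the failing step can be reproduced outside pymarc. The traceback shows the field being encoded as latin-1, which only covers code points 0 through 255, so U+1E63 cannot be represented in it. Below is a minimal sketch using only the standard library; find_unencodable and the sample string are hypothetical, made up for illustration, and are not part of pymarc or taken from the attached record.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: list the characters in a unicode string that a given
# codec cannot represent. Written to run under Python 2.7 and Python 3.

def find_unencodable(text, encoding):
    """Return (position, character) pairs that `encoding` cannot encode.

    `text` is expected to be a unicode string (u'...' in Python 2).
    """
    problems = []
    for position, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((position, char))
    return problems


# Hypothetical field value containing U+1E63 (s with dot below)
# and U+1E6D (t with dot below); not copied from the real record.
sample = u'Mu\u1e63\u1e6dafa'

latin1_problems = find_unencodable(sample, 'latin-1')
utf8_problems = find_unencodable(sample, 'utf-8')

print(repr(latin1_problems))  # the two dotted letters are reported
print(repr(utf8_problems))    # nothing is reported for UTF-8
```

Arabic-script characters are also outside latin-1, so this helper would report them as well. That suggests the earlier batches may not have gone through the latin-1 path at all, although the traceback alone does not say why this record does.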
So I'm just trying to understand what's happening here. By the way, I'm on Python 2.7, on a Mac.
Does anyone have any insight into how to troubleshoot this error?
Any help appreciated!
heidi