unicode error (again)

Heidi Frank

Aug 16, 2017, 2:28:25 PM
to pymarc Discussion
Hi all,
here I am again with a unicode error that I've no idea how to figure out...

I work on a project dealing with Arabic script records and have occasionally come across unicode errors where characters in the records can't be decoded.  I admittedly don't fully understand Unicode, but in the past the errors had very basic fixes, such as just changing the code in byte 09 of the LDR field, etc.
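
For context, leader byte 09 is the character coding scheme flag: a blank declares MARC-8 and "a" declares UCS/Unicode. Below is a minimal Python 3 sketch of checking and flipping that flag on a raw record. The function names and the sample leader are made up for illustration, and flipping the flag only changes what the record declares; it does not convert the data itself.

```python
def declared_encoding(record_bytes):
    """Report what leader position 09 declares for a raw MARC record."""
    flag = record_bytes[9:10]
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


def set_unicode_flag(record_bytes):
    """Return a copy of the record with leader position 09 set to 'a'."""
    return record_bytes[:9] + b"a" + record_bytes[10:]


# Hypothetical 24-byte leader with a blank in position 09.
sample_leader = b"00714nam  2200205 a 4500"
print(declared_encoding(sample_leader))
print(declared_encoding(set_unicode_flag(sample_leader)))
```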

However, this time I've not been able to figure out what's happening.  I have a step where I run the file of records through some MarcEdit tasks that delete the existing 001, 003 and 035 fields and copy the 999 $i information into new 001 and 003 fields.  When first opening the original .mrc file of records in ME to convert it to an editable .mrk file, I *do* check the option to convert to UTF-8; not sure if that matters.  I then run the modified file through my Python script.  I've pared the script down to the bare bones where the error occurs and am running this code for testing:
-------------------------------------------------------------
#!/usr/bin/python

import pymarc
from pymarc import Record, Field

marcRecsIn_updates_orig = pymarc.MARCReader(file('LeBAU_20170605_3_updates_all_orig.mrc'), to_unicode=True, force_utf8=True)
marcRecsIn_updates_mod = pymarc.MARCReader(file('LeBAU_20170605_3_updates_all_mod2.mrc'), to_unicode=True, force_utf8=True)

print 'Original updated records: '
for mrc_rec in marcRecsIn_updates_orig:
    mrc_rec_001 = mrc_rec.get_fields('001')[0]
    print mrc_rec_001

print 'Modified updated records: '
for mrc_rec in marcRecsIn_updates_mod:
    mrc_rec_001 = mrc_rec.get_fields('001')[0]
    print mrc_rec_001
-------------------------------------------------------------

The error occurs on line 15, the "for" loop that reads through each record in the file of modified records.  The original file (before the ME edits) runs fine and outputs all the 001 fields to the screen, but the modified file throws these errors:
-------------------------------------------------------------

172-16-24-144:ACO Staff$ python aco-test-unicode.py
Original updated records:
=001  ocm00000NEW\
=001  ocm00000NEW\
...
=001  ocm00000NEW\
=001  ocm00000NEW\
=001  ocm00000NEW\
Modified updated records:
Traceback (most recent call last):
  File "aco-test-unicode.py", line 15, in <module>
    for mrc_rec in marcRecsIn_updates_mod:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six.py", line 414, in next
    return type(self).__next__(self)
  File "/Library/Python/2.7/site-packages/pymarc/reader.py", line 101, in __next__
    utf8_handling=self.utf8_handling)
  File "/Library/Python/2.7/site-packages/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/Library/Python/2.7/site-packages/pymarc/record.py", line 288, in decode_marc
    subs[0] = subs[0].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
172-16-24-144:ACO Staff$

-------------------------------------------------------------

I've attached both the original file of records and the ME modified version, as well as the test script.

If anyone has a chance to look at it and has any ideas, please let me know.   PS: I'm using Python 2.7...

thanks in advance!!
heidi

aco-test-unicode.py
LeBAU_20170605_3_updates_all_orig.mrc
LeBAU_20170605_3_updates_all_mod2.mrc

Mike

Aug 18, 2017, 10:02:49 PM
to pymarc Discussion
MarcEdit has inserted extra Unicode characters [1] after the field indicators in these records.
I've created a pull request on GitHub that handles this case [2].

[1] http://www.fileformat.info/info/unicode/char/202a/index.htm
[2] https://github.com/edsu/pymarc/pull/106
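
For anyone following along, here is a minimal Python 3 sketch of why the traceback complains about byte 0xe2 at position 2. It assumes a field that starts with two ASCII indicators followed directly by U+202A; the indicator values "10" are made up for illustration and are not taken from the attached records.

```python
# U+202A LEFT-TO-RIGHT EMBEDDING, the character linked in [1].
LRE = "\u202a"

# Hypothetical indicator chunk: two ASCII indicators followed by the stray mark,
# encoded the way it would sit in a UTF-8 MARC record.
indicator_chunk = ("10" + LRE).encode("utf-8")
print(indicator_chunk)  # b'10\xe2\x80\xaa'

# Decoding that chunk as ASCII fails on the first byte of the mark.
decode_error = None
try:
    indicator_chunk.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```

The first two bytes decode fine as ASCII, so the failure lands on the third byte (index 2), which is the 0xe2 that starts the three-byte UTF-8 sequence for U+202A.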

Regards,
Mikhail

Ed Summers

Aug 19, 2017, 5:51:50 AM
to pym...@googlegroups.com
Mikhail,

Thanks for digging into this. As I've commented on the ticket I wonder if Terry might want to fix this in MarcEdit...or are the records technically legit?

I guess following Postel's Law [1] pymarc should probably be liberal in what it accepts. But I wonder if MarcEdit might want to be a bit more conservative in what it does...

//Ed

[1] https://en.wikipedia.org/wiki/Robustness_principle

Terry Reese

Aug 19, 2017, 12:25:28 PM
to pymarc Discussion
I can provide some information as to what is happening -- though, I'd be curious to know a bit more about the process you are using to process the data.

I posted about this a bit ago, but MarcEdit does embed 0x202e, or characters marking the start and end of right to left processing.  This is done specifically because of instances like this: http://blog.reeset.net/archives/2103.  Within right to left languages, there is a range of characters in the general Latin set that will render as right to left if placed alongside a right to left value.  One of the ones causing specific problems right now is the $0 combination.  The blog post shows how this data renders, which is incorrect, but is also correct (in terms of rendering logic).  The 0x202e character forces correct rendering.  To do this, MarcEdit uses the following function to read the bits on a character to determine if it has a right to left marker:

        private string Right2Left(string toLoad)
        {
            char cLMS = (char)0x202A;//(char)0x200E;
            char cRMS = (char)0x202B; //(char)0x200F;
            bool bRight2Left = false;
            System.Text.StringBuilder ctoLoad = new System.Text.StringBuilder();
            //foreach (char c in toLoad)
            //{
            for (var i = 0; i < toLoad.Length; i += char.IsSurrogatePair(toLoad, i) ? 2 : 1)
            {
                var c = char.ConvertToUtf32(toLoad, i);
                int hasRandALCat = 0;
                if (c >= 0x5BE && c <= 0x10B7F)
                {
                    if (c <= 0x85E)
                    {
                        if (c == 0x5BE) hasRandALCat = 1;
                        else if (c == 0x5C0) hasRandALCat = 1;
                        else if (c == 0x5C3) hasRandALCat = 1;
                        else if (c == 0x5C6) hasRandALCat = 1;
                        else if (0x5D0 <= c && c <= 0x5EA) hasRandALCat = 1;
                        else if (0x5F0 <= c && c <= 0x5F4) hasRandALCat = 1;
                        else if (c == 0x608) hasRandALCat = 1;
                        else if (c == 0x60B) hasRandALCat = 1;
                        else if (c == 0x60D) hasRandALCat = 1;
                        else if (c == 0x61B) hasRandALCat = 1;
                        else if (0x61E <= c && c <= 0x64A) hasRandALCat = 1;
                        else if (0x66D <= c && c <= 0x66F) hasRandALCat = 1;
                        else if (0x671 <= c && c <= 0x6D5) hasRandALCat = 1;
                        else if (0x6E5 <= c && c <= 0x6E6) hasRandALCat = 1;
                        else if (0x6EE <= c && c <= 0x6EF) hasRandALCat = 1;
                        else if (0x6FA <= c && c <= 0x70D) hasRandALCat = 1;
                        else if (c == 0x710) hasRandALCat = 1;
                        else if (0x712 <= c && c <= 0x72F) hasRandALCat = 1;
                        else if (0x74D <= c && c <= 0x7A5) hasRandALCat = 1;
                        else if (c == 0x7B1) hasRandALCat = 1;
                        else if (0x7C0 <= c && c <= 0x7EA) hasRandALCat = 1;
                        else if (0x7F4 <= c && c <= 0x7F5) hasRandALCat = 1;
                        else if (c == 0x7FA) hasRandALCat = 1;
                        else if (0x800 <= c && c <= 0x815) hasRandALCat = 1;
                        else if (c == 0x81A) hasRandALCat = 1;
                        else if (c == 0x824) hasRandALCat = 1;
                        else if (c == 0x828) hasRandALCat = 1;
                        else if (0x830 <= c && c <= 0x83E) hasRandALCat = 1;
                        else if (0x840 <= c && c <= 0x858) hasRandALCat = 1;
                        else if (c == 0x85E) hasRandALCat = 1;
                    }
                    else if (c == 0x200F) hasRandALCat = 1;
                    else if (c >= 0xFB1D)
                    {
                        if (c == 0xFB1D) hasRandALCat = 1;
                        else if (0xFB1F <= c && c <= 0xFB28) hasRandALCat = 1;
                        else if (0xFB2A <= c && c <= 0xFB36) hasRandALCat = 1;
                        else if (0xFB38 <= c && c <= 0xFB3C) hasRandALCat = 1;
                        else if (c == 0xFB3E) hasRandALCat = 1;
                        else if (0xFB40 <= c && c <= 0xFB41) hasRandALCat = 1;
                        else if (0xFB43 <= c && c <= 0xFB44) hasRandALCat = 1;
                        else if (0xFB46 <= c && c <= 0xFBC1) hasRandALCat = 1;
                        else if (0xFBD3 <= c && c <= 0xFD3D) hasRandALCat = 1;
                        else if (0xFD50 <= c && c <= 0xFD8F) hasRandALCat = 1;
                        else if (0xFD92 <= c && c <= 0xFDC7) hasRandALCat = 1;
                        else if (0xFDF0 <= c && c <= 0xFDFC) hasRandALCat = 1;
                        else if (0xFE70 <= c && c <= 0xFE74) hasRandALCat = 1;
                        else if (0xFE76 <= c && c <= 0xFEFC) hasRandALCat = 1;
                        else if (0x10800 <= c && c <= 0x10805) hasRandALCat = 1;
                        else if (c == 0x10808) hasRandALCat = 1;
                        else if (0x1080A <= c && c <= 0x10835) hasRandALCat = 1;
                        else if (0x10837 <= c && c <= 0x10838) hasRandALCat = 1;
                        else if (c == 0x1083C) hasRandALCat = 1;
                        else if (0x1083F <= c && c <= 0x10855) hasRandALCat = 1;
                        else if (0x10857 <= c && c <= 0x1085F) hasRandALCat = 1;
                        else if (0x10900 <= c && c <= 0x1091B) hasRandALCat = 1;
                        else if (0x10920 <= c && c <= 0x10939) hasRandALCat = 1;
                        else if (c == 0x1093F) hasRandALCat = 1;
                        else if (c == 0x10A00) hasRandALCat = 1;
                        else if (0x10A10 <= c && c <= 0x10A13) hasRandALCat = 1;
                        else if (0x10A15 <= c && c <= 0x10A17) hasRandALCat = 1;
                        else if (0x10A19 <= c && c <= 0x10A33) hasRandALCat = 1;
                        else if (0x10A40 <= c && c <= 0x10A47) hasRandALCat = 1;
                        else if (0x10A50 <= c && c <= 0x10A58) hasRandALCat = 1;
                        else if (0x10A60 <= c && c <= 0x10A7F) hasRandALCat = 1;
                        else if (0x10B00 <= c && c <= 0x10B35) hasRandALCat = 1;
                        else if (0x10B40 <= c && c <= 0x10B55) hasRandALCat = 1;
                        else if (0x10B58 <= c && c <= 0x10B72) hasRandALCat = 1;
                        else if (0x10B78 <= c && c <= 0x10B7F) hasRandALCat = 1;
                    }
                }
                if (hasRandALCat == 1)
                {
                    bRight2Left = true;
                }
                if ((toLoad[i] == '$' ||
                     toLoad[i] == '\r' ||
                     toLoad[i] == '\n') && bRight2Left == true)
                {
                    ctoLoad.Append(cLMS);
                    bRight2Left = false;
                    pbHasRight2Left = true;
                }
                ctoLoad.Append(toLoad[i]);
            }
            return ctoLoad.ToString();
        }


This function tells MarcEdit when to insert explicit markers to render data as left to right.  Previously, the tool would look for the "$" character (as it was a special value), but that was causing the validator some trouble.  To correct this, I've updated the function to look at new lines as well (I can do this because I know what the mnemonic format should look like).
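
The same idea can be sketched in a few lines of Python 3. This is an illustration, not a port of MarcEdit: it assumes Python's `unicodedata.bidirectional` classes "R" and "AL" are an acceptable stand-in for the hard-coded ranges above, and `mark_right_to_left` is a hypothetical name.

```python
import unicodedata

# U+202A LEFT-TO-RIGHT EMBEDDING, the value assigned to cLMS above.
LRE = "\u202a"


def mark_right_to_left(text):
    """Insert LRE before '$', CR or LF whenever right-to-left text precedes it."""
    out = []
    seen_rtl = False
    for ch in text:
        # "R" and "AL" are the strong right-to-left bidi classes.
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            seen_rtl = True
        if ch in ("$", "\r", "\n") and seen_rtl:
            out.append(LRE)
            seen_rtl = False
        out.append(ch)
    return "".join(out)


# Mnemonic-style line with an Arabic subfield followed by a Latin one.
line = "=245  10$a\u0643\u062a\u0627\u0628$c x\n"
print(repr(mark_right_to_left(line)))
```

Iterating over a Python 3 string yields whole code points, so no surrogate-pair handling is needed here.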

This data will only show up when working with the editor.  When you save the mnemonic file, the data is filtered away.  Previously, this happened like this:

foreach (string tmp_line in objRich.Lines)
{
    if (pbHasRight2Left == true)
    {
        //clear the clms
        string ptmp_line = tmp_line.Replace(((char)0x200E).ToString(), "").Replace(((char)0x200F).ToString(), "");
        writer_buffer_string.AppendLine(ptmp_line);
    }
    else
    {
        writer_buffer_string.AppendLine(tmp_line);
    }
}

The function would look for the pbHasRight2Left bit (which is set by the right to left function) to identify when to remove the embedded value.  I've updated the code (which I'm posting this afternoon) to add an additional check:

foreach (string tmp_line in objRich.Lines)
{
    if (pbHasRight2Left == true ||
        (tmp_line.IndexOf((char)0x202e) > -1))
    {
        //clear the clms
        string ptmp_line = tmp_line.Replace(((char)0x200E).ToString(), "").Replace(((char)0x200F).ToString(), "");
        writer_buffer_string.AppendLine(ptmp_line);
    }
    else
    {
        writer_buffer_string.AppendLine(tmp_line);
    }
}

to try and make sure that this data is filtered.  I've never seen a case where this data hasn't been filtered when working with MarcEdit's processing.  If you are finding a case where it's slipped by, that would be a surprise -- but the updated filter should make sure that doesn't occur in the future. 

So, that was all a long way of saying, these bytes are embedded in the mnemonic format (and have to be) per the notes above.  However, this data should never be in the binary MARC file, because anytime that the data is saved in MarcEdit, the data is filtered away.  The byte embedding only occurs and is visible in the data file that is rendered into the editor of the tool.
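
For readers who want to apply a similar filter on their own mnemonic files, here is a minimal Python 3 sketch. It is not MarcEdit's code. The set of code points is an assumption: it covers the marks removed in the snippets above (U+200E, U+200F) plus the embedding and override controls U+202A through U+202E. `strip_bidi_controls` is a hypothetical name.

```python
# Directional controls to drop (an assumed list, see the note above):
# LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

# str.translate deletes every code point that maps to None.
_STRIP_TABLE = {ord(ch): None for ch in BIDI_CONTROLS}


def strip_bidi_controls(line):
    """Return the line with directional control characters removed."""
    return line.translate(_STRIP_TABLE)


# Mnemonic-style line with a stray U+202A between indicators and first subfield.
dirty = "=245  10\u202a$a\u0643\u062a\u0627\u0628"
print(repr(strip_bidi_controls(dirty)))
```

Removing these characters from the text form before compiling is safe in a way that deleting bytes from a binary record is not, because the directory lengths in the binary file are computed after the filter runs.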

I've not added this filter into the binary MARC processor (it shouldn't be there) -- so, I guess what I'd recommend...the updated flavors of MarcEdit with the changes noted above will show up in a couple hours (or sooner).  You should reprocess your MARC data (tasks, etc.), and see if you are still having problems.  You may be able to open the mod mrc file in the editor and then resave/compile it -- but I'd do it from scratch because as I say, there should be no way for this data to slip into the binary marc file format given that the data is stripped when saving (which happens prior to processing).

--tr

Mike

Aug 20, 2017, 8:00:59 AM
to pymarc Discussion
AFAICS MARC21 allows any characters in data as long as the encoding is correct.
That means it is perfectly legal to have LTR and RTL markers embedded in field/subfield data,
and there is no need to filter them out when converting into MARC21 format using Unicode encoding.
The problem in this case is that extra U+202A characters were inserted _between_ the indicators and the first
subfield delimiter, so they are not part of the subfield data.
I'd guess that they slipped in when the indicators were converted into MARC.
On the other hand, indicators should always be in ASCII according to the MARC21 format and
should always consist of two lowercase alphabetic, numeric or blank characters.
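
To check a file for this condition without pymarc, something like the following Python 3 sketch could work. It is not pymarc's code and not the change in the pull request. It reads one raw ISO 2709 record and reports data fields whose first subfield delimiter is not directly after the two indicator bytes. `fields_with_stray_bytes` and `build_record` are hypothetical helpers, and the records are built in memory purely for the demonstration.

```python
FIELD_END = b"\x1e"    # field terminator
SUBFIELD = b"\x1f"     # subfield delimiter
RECORD_END = b"\x1d"   # record terminator


def fields_with_stray_bytes(record):
    """Return tags of data fields with extra bytes after the two indicators."""
    base = int(record[12:17])              # base address of data (leader 12-16)
    directory = record[24:base - 1]        # 12-byte entries, minus terminator
    bad = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = record[base + start:base + start + length]
        if tag.startswith("00"):
            continue                       # control fields have no indicators
        if data.find(SUBFIELD) != 2:       # delimiter must follow 2 indicators
            bad.append(tag)
    return bad


def build_record(fields):
    """Assemble a tiny demo record from (tag, body_bytes) pairs."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FIELD_END
        directory += ("%s%04d%05d" % (tag, len(body), len(data))).encode("ascii")
        data += body
    base = 24 + len(directory) + 1
    total = base + len(data) + 1
    leader = ("%05dnam a22%05d a 4500" % (total, base)).encode("ascii")
    return leader + directory + FIELD_END + data + RECORD_END


good = build_record([("001", b"ocm00000NEW"),
                     ("245", b"10" + SUBFIELD + b"aTitle")])
bad = build_record([("001", b"ocm00000NEW"),
                    ("245", b"10" + "\u202a".encode("utf-8") + SUBFIELD + b"aTitle")])
print(fields_with_stray_bytes(good), fields_with_stray_bytes(bad))
```

The check only looks at where the first delimiter sits; it does not validate the indicator values themselves.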

Regards,
Mikhail

Heidi P Frank

Aug 30, 2017, 4:19:23 PM
to pym...@googlegroups.com
Hi all,
apologies for my delayed reply, have been out of office unexpectedly.

As Terry suggested, I updated my ME version and then re-ran the original file of records through my task list, then compiled back to .mrc, and now I no longer get the unicode error - yaay!!

Thank you Terry and everyone for such detailed explanations about what was happening - all makes sense.

Terry, not sure if it matters at this point, but I've attached the simple task list I was running on my file of Arabic script records.   I would save the modified file as a new .mrk file and then compile that back to a binary MARC .mrc file.  I'd run that modified and compiled .mrc file through the python/pymarc script, and that's when I'd get the unicode error (which didn't happen before the modifications).   Also, if it helps, I've attached 3 .mrc files: 1) the original file of records; 2) the BAD modified file of records BEFORE updating the ME version; and 3) the GOOD modified file of records AFTER updating the ME version.

anyway, all is running smoothly again and I can't thank y'all enough for helping me figure this out!
best,
heidi

Heidi Frank
Electronic Resources & Special Formats Cataloger
New York University Libraries
Knowledge Access & Resources Management Services
20 Cooper Square, 3rd Floor
New York, NY  10003
212-998-2499 (office)
212-995-4366 (fax)
hf...@nyu.edu
Skype: hfrank71

aub-001s.task
LeBAU_20170605_3_updates-orig.mrc
LeBAU_20170605_3_updates-mod-BAD.mrc
LeBAU_20170605_3_updates-mod-GOOD.mrc

Ed Summers

Sep 2, 2017, 12:36:50 PM
to pym...@googlegroups.com
Thanks for the update, Heidi! I'm glad Terry was able to fix your problem. I think it makes sense to merge Mikhail's change [1] so that pymarc can read records created by older versions of MarcEdit. But I added a test using the record you supplied and it is throwing off some warnings that I don't quite understand.

//Ed

[1] https://github.com/edsu/pymarc/pull/106

Terry Reese

Sep 2, 2017, 4:57:29 PM
to pymarc Discussion
Just a very quick FYI, Ed: these records would only have been created during a window of almost 2 weeks.  That's it -- the change was made because the initial right to left code I'd put together missed a specific use case.  When I updated it to break on subfields -- that's when these bytes would have shown up.  My guess is that you won't run into these records often.

--tr

Ed Summers

Sep 2, 2017, 5:32:05 PM
to pym...@googlegroups.com
Ok, that's really good to know Terry. Hmmm, maybe we should just leave pymarc alone then.

By the way, this whole interaction, from reporting the issue with a test program and data, to doing the research to figure out the problem, to proposing a fix, and adjusting an upstream project, all with constructive open conversation, really highlights the strengths of the open source community. Y'all are the best.

Happy Labor Day!

//Ed

Heidi P Frank

Sep 6, 2017, 10:33:22 AM
to pym...@googlegroups.com
Yes, I have to say I'm extremely grateful for this discussion list, which has helped me resolve a lot of issues.  Thank you everyone for all you do!!!

Heidi
