That was some nice detective work there. At the very least I think
pymarc should throw a better exception to make it easier to see what's
going on in the future.
I am not an expert on MARC-8, so I don't really know if it is "legal"
or not. You are the first person to notice this issue with pymarc. But
you may also be the first person who took the time to investigate the
problem and write about it. I wonder if posting your question about
MARC-8 to the MARC discussion list [1] might be useful?
//Ed
[1] http://listserv.loc.gov/listarch/marc.html
2012/3/21 Godmar Back <god...@gmail.com>:
I'm neither an expert on CJK cataloging nor Innovative, but if I
recall correctly the curly-bracket sequence is used to record the
exact sequence of bytes within a record. It is primarily used within
the cataloging interface and is parsed against a map file to tell
the server what a particular type of client can display (see [0]).
Additionally, EACC's coding structure is that planes, sections, and
positions are each numbered from 21 to 7E (see pages 58-59 of [1]).
I know it doesn't help resolve your issue with pymarc directly, but
given that range it appears that the EACC sequence within the brackets
in this record is invalid.
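
For illustration, the range rule above can be sketched in a few lines of Python. This is only a sketch under assumptions: the function names are made up, and it assumes the curly-bracket notation holds six hex digits (three bytes), which may not match every Innovative export.

```python
import re

# Assumption: a bracketed code looks like {213034}, i.e. six hex digits.
_BRACKETED = re.compile(r"\{([0-9A-Fa-f]{6})\}")


def is_valid_eacc(seq: bytes) -> bool:
    """True if seq is three bytes (plane, section, position), each in 0x21-0x7E."""
    return len(seq) == 3 and all(0x21 <= b <= 0x7E for b in seq)


def bracketed_eacc_codes(text: str) -> list:
    """Return the three-byte codes found in curly-bracket notation in text."""
    return [bytes.fromhex(digits) for digits in _BRACKETED.findall(text)]


def invalid_bracketed_codes(text: str) -> list:
    """Return only the bracketed codes that fall outside the 21-7E range."""
    return [code for code in bracketed_eacc_codes(text) if not is_valid_eacc(code)]
```

Running invalid_bracketed_codes() over the field text from a suspect record would list any bracketed code with a byte outside 21 to 7E.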
Mark
[0] http://innovativeusers.org/list/archives/2000/msg02216.html
[1] http://www.oclc.org/support/documentation/connexion/client/international/internationalcataloging.pdf
Would you be willing to open a ticket with the MARC record that
triggers the problem attached? I think it would be useful to at least
get pymarc reporting that it is a character encoding error instead of
throwing an IndexError.
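
Roughly what I have in mind, as a sketch only. Marc8DecodeError and decode_with_clear_error are hypothetical names, not anything in pymarc today, and decode stands in for whatever routine does the MARC-8 conversion.

```python
class Marc8DecodeError(ValueError):
    """Hypothetical exception type for illustration; not part of pymarc."""


def decode_with_clear_error(decode, data):
    """Call decode(data), reporting an IndexError as a character encoding error."""
    try:
        return decode(data)
    except IndexError as exc:
        # Keep the original IndexError as the cause so the traceback still
        # shows where the decoder ran off the end of its input or tables.
        raise Marc8DecodeError(
            "character encoding error while decoding MARC-8 data: %r" % (data,)
        ) from exc
```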
//Ed
Try the patch at https://gist.github.com/2159552? If it works for you, we can maybe write a better test and check it in.
On Thu, Mar 22, 2012 at 2:09 PM, Aaron Lav <aaron...@gmail.com> wrote:
First of all, we don't know (or do we now?) that that's legal/ok to work around.
Second, please don't put this on the common case (fast) path. Let it throw an exception, and catch the error, then examine it. This will also allow chaining of different methods for fixing/working around errors like that.
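
Something along these lines is what I mean, as a rough sketch. The names are made up for illustration, and strict_decode stands in for the existing decoder. The fast path is a single call; the fallbacks only run after an error.

```python
def decode_with_fallbacks(data, strict_decode, fallbacks=()):
    """Try strict_decode first; on failure, try each fallback in order.

    Each fallback is called as fallback(data, error) and either returns a
    result or raises to pass the problem on to the next one in the chain.
    """
    try:
        # Common case: no extra checks on the fast path.
        return strict_decode(data)
    except (ValueError, LookupError) as error:
        for fallback in fallbacks:
            try:
                return fallback(data, error)
            except (ValueError, LookupError):
                continue  # this workaround did not apply; try the next
        # Nothing could handle it: re-raise the original error unchanged.
        raise
```

For example, with strict_decode = lambda d: d.decode("ascii") and one fallback lambda d, err: d.decode("latin-1"), plain ASCII input never touches the fallback at all.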
I think it's a bit clearer as I suggested, since my intuition is that most fixes/workarounds will need to apply the MARC-8 parsing logic anyway. If not, then it might be better to throw an exception with as much info as possible (e.g., "'{' first char of multibyte EACC sequence"), since that clearly indicates the nature of the problem and which workaround is most likely to apply.
//Ed