{ASCII} within EACC makes pymarc decoder throw exception

Godmar Back

Mar 21, 2012, 12:24:02 AM

to pym...@googlegroups.com

Hi, on the topic of ill-encoded records:

An 880$a subfield, which contains a Japanese title in EACC, contains:

1b 24 31 21 50 56 4b 37 6f 69 24 4e 21 51 31 21 47 34 69 24 4e 21 30 70 21 51 2b 7b 36 39 32 34 66 36 7d 1b 28 42

which III interprets as

米国の統治の仕組{6924f6}

In other words, they embed the ASCII string {6924f6} inside an EACC string: the string consists of 24-bit EACC characters, but then ASCII enclosed in { } appears before the next ESC sequence.

Pymarc's decoder barfs because, of course, 7b 36 39 is not a valid EACC char.

I note that this eventually triggers an IndexError (a general exception type) since it'll keep skipping the 24-bit EACC chars until it considers 28 42 XX where XX is out of bounds (rather than, say, skipping to the next ESC).
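The misalignment described above can be reproduced with a few lines of Python. This is only a simplified model of a decoder that consumes fixed 3-byte groups after the leading escape and never looks for a later ESC; it is not pymarc's actual code, and `naive_groups` is a name made up for this sketch.

```python
# Byte dump of the 880$a subfield quoted earlier in this thread.
DUMP = (
    "1b 24 31 21 50 56 4b 37 6f 69 24 4e 21 51 31 21 47 34 69 24 4e "
    "21 30 70 21 51 2b 7b 36 39 32 34 66 36 7d 1b 28 42"
)
ESC_TO_EACC = b"\x1b$1"  # ESC $ 1: switch to the 3-byte EACC set


def naive_groups(data):
    """Chop everything after the leading escape into 3-byte groups.

    Later escape sequences are deliberately ignored, which is the
    behaviour being modelled here.
    """
    body = data[len(ESC_TO_EACC):]
    return [body[i:i + 3] for i in range(0, len(body), 3)]


groups = naive_groups(bytes.fromhex(DUMP))
for group in groups:
    print(group.hex(), len(group))
```

The first eight groups are the real EACC characters. The next three swallow `{6924f6}` together with the ESC byte, so the final group holds only `28 42`, and a decoder that indexes its third byte runs off the end of the data.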

My question: is the above legal, and if not, frequent enough to warrant a work-around in pymarc's decoder?

- Godmar

Ed Summers

Mar 22, 2012, 7:33:19 AM

to pym...@googlegroups.com

Hi Godmar,

That was some nice detective work there. At the very least I think
pymarc should throw a better exception to make it easier to see what's
going on in the future.

I am not an expert on MARC-8, so I don't really know if it is "legal"
or not. You are the first person to notice this issue with pymarc. But
you may also be the first person who took the time to investigate the
problem and write about it. I wonder if posting your question about
MARC-8 to the MARC discussion list [1] might be useful?

//Ed

[1] http://listserv.loc.gov/listarch/marc.html

Mark A. Matienzo

Mar 22, 2012, 10:26:39 AM

to pym...@googlegroups.com

Godmar,

I'm neither an expert on CJK cataloging nor Innovative, but if I
recall correctly the curly-bracket sequence is used to record the
exact sequence of bytes within a record. It is primarily used within
the cataloging interface and is parsed against a map file to tell
the server what a particular type of client can display (see [0]).

Additionally, EACC's coding structure is that planes, sections, and
positions are each numbered from 21 to 7E (see pages 58-59 of [1]).

I know it doesn't help resolve your issue with pymarc directly but it
appears that the EACC sequence within the brackets in this record is
hence invalid.
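As a rough illustration of the range rule, the check below tests whether each of the three bytes of a code lies in 21 to 7E. The function name `in_eacc_range` is invented for this sketch, and passing the check only means a code is structurally possible, not that a character is actually assigned to it.

```python
def in_eacc_range(hex_code):
    """Return True if hex_code is 3 bytes, each within 0x21..0x7E."""
    raw = bytes.fromhex(hex_code)
    return len(raw) == 3 and all(0x21 <= b <= 0x7E for b in raw)


print(in_eacc_range("69244e"))  # a code taken from the record's byte dump
print(in_eacc_range("6924f6"))  # the code inside the curly brackets
```

The bracketed code fails because its last byte, f6, is above 7E.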

Mark

[0] http://innovativeusers.org/list/archives/2000/msg02216.html
[1] http://www.oclc.org/support/documentation/connexion/client/international/internationalcataloging.pdf

Ed Summers

Mar 22, 2012, 10:48:19 AM

to pym...@googlegroups.com

Hi Godmar,

Would you be willing to open a ticket with the MARC record that
triggers the problem attached? I think it would be useful to at least
get pymarc reporting that it is a character encoding error instead of
throwing an IndexError.

//Ed

Aaron Lav

Mar 22, 2012, 2:09:31 PM

to pym...@googlegroups.com

Try the patch at https://gist.github.com/2159552? If it works for you, we can maybe write a better test and check it in.

Godmar Back

Mar 22, 2012, 3:38:54 PM

to pym...@googlegroups.com

Sure, here you go: https://github.com/edsu/pymarc/pull/26

- Godmar

Godmar Back

Mar 22, 2012, 3:41:08 PM

to pym...@googlegroups.com

On Thu, Mar 22, 2012 at 2:09 PM, Aaron Lav <aaron...@gmail.com> wrote:

> Try the patch at https://gist.github.com/2159552? If it works for you, we can maybe write a better test and check it in.

First of all, we don't know (or do we now?) that that's legal/ok to work around.

Second, please don't put this on the common case (fast) path. Let it throw an exception, and catch the error, then examine it. This will also allow chaining of different methods for fixing/working around errors like that.

- Godmar

Aaron Lav

Mar 22, 2012, 10:29:24 PM

to pym...@googlegroups.com

On Thursday, March 22, 2012 2:41:08 PM UTC-5, Godmar Back wrote:

> On Thu, Mar 22, 2012 at 2:09 PM, Aaron Lav <aaron...@gmail.com> wrote:
>
>> Try the patch at https://gist.github.com/2159552? If it works for you, we can maybe write a better test and check it in.
>
> First of all, we don't know (or do we now?) that that's legal/ok to work around.

My reading of http://www.loc.gov/marc/specifications/speccharmarc8.html and http://en.wikipedia.org/wiki/ISO/IEC_2022 is that it's illegal. I don't see any provision for switching from a multibyte to single-byte encoding in the LC-called-out character sets other than through an escape code, which '{' does not qualify as.

But as the patch says, no EACC character starts with '{', so the workaround is fine. I checked that before writing the patch with

map(chr, set([(0xFF0000 & k) >> 16 for k in marc8_mapping.CHARSET_31.keys()])), which returned

['!', '"', '#', "'", '(', ')', '-', '.', '/', '3', '4', '5', '9', ':', ';', '?', 'E', 'F', 'G', 'K', 'L', 'M', 'Q', 'R', 'i', 'o', 'p']
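For readers without pymarc at hand, the same mask-and-shift check can be sketched against a toy dictionary. `TOY_CHARSET_31` below is a made-up stand-in holding only three keys taken from the byte dump earlier in this thread; the real `marc8_mapping.CHARSET_31` is far larger, and its values are irrelevant to this check, so they are left as `None`.

```python
# Hypothetical stand-in for marc8_mapping.CHARSET_31: integer keys are
# 24-bit EACC codes; values are not needed for this check.
TOY_CHARSET_31 = {0x215056: None, 0x4B376F: None, 0x69244E: None}


def lead_bytes(mapping):
    """Collect the distinct first bytes of all 24-bit keys, as characters."""
    return sorted({chr((key & 0xFF0000) >> 16) for key in mapping})


leads = lead_bytes(TOY_CHARSET_31)
print(leads)
print("{" in leads)
```

With the full table, the same expression produces the list shown above, which does not contain '{'.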

It is, of course, possible that the EACC numbering space will be expanded in the future, so a final version of the workaround might be optional.

I agree that it'd be better if III were to only generate valid MARC, but absent that, ...

> Second, please don't put this on the common case (fast) path. Let it throw an exception, and catch the error, then examine it. This will also allow chaining of different methods for fixing/working around errors like that.

I think it's a bit clearer as I suggested, since my intuition is that most fixes/workarounds will need to apply the MARC-8 parsing logic anyway. If not, then it might be better to throw an exception with as much info as possible (eg "'{' first char of multibyte EACC sequence"), since that clearly indicates the nature of the problem and what workaround is most likely.
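A minimal sketch of what such a descriptive error could look like follows. Both `EaccDecodeError` and `read_eacc_group` are hypothetical names invented here, not part of pymarc; the point is only that a bounds check and a lead-byte check can replace a bare IndexError with a message that names the problem.

```python
class EaccDecodeError(ValueError):
    """Hypothetical exception type for malformed EACC data."""


def read_eacc_group(data, pos):
    """Return the 3-byte EACC group at pos, or raise a descriptive error."""
    group = data[pos:pos + 3]
    if len(group) < 3:
        raise EaccDecodeError(
            "truncated EACC sequence at offset %d: %r" % (pos, group)
        )
    if group[:1] == b"{":
        raise EaccDecodeError(
            "'{' first char of multibyte EACC sequence at offset %d" % pos
        )
    return group


try:
    read_eacc_group(b"\x21\x50\x56{69", 3)
except EaccDecodeError as err:
    print(err)
```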

But if this kind of malformation isn't that frequent, then maybe this isn't that important.

Aaron (as...@pobox.com)

Godmar Back

Mar 22, 2012, 10:55:37 PM

to pym...@googlegroups.com

I had to read your email twice, but now I understand what's going on.

The cataloger who tried to encode this record tried to encode the title

米国の統治の仕組み

where the last character み is a "Hiragana Letter MI", with EACC encoding 69245f.

But instead of typing {69245f}, they must have typed {6924f6}, which doesn't correspond to any valid EACC char, so III just stuck {6924f6} in there.

In fact, Worldcat shows the full title with the Letter MI character at the end: http://www.worldcat.org/title/beikoku-no-tochi-no-shikumi/oclc/503001208

OCLC seems to have fixed the record, but many of the libraries holding this item did not. For instance, VT, UVA, UNC all carry the broken record.

- Godmar

Godmar Back

Mar 22, 2012, 11:02:23 PM

to pym...@googlegroups.com

On Thu, Mar 22, 2012 at 10:29 PM, Aaron Lav <aaron...@gmail.com> wrote:

>> Second, please don't put this on the common case (fast) path. Let it throw an exception, and catch the error, then examine it. This will also allow chaining of different methods for fixing/working around errors like that.

> I think it's a bit clearer as I suggested, since my intuition is that most fixes/workarounds will need to apply the MARC-8 parsing logic anyway. If not, then it might be better to throw an exception with as much info as possible (eg "'{' first char of multibyte EACC sequence"), since that clearly indicates the nature of the problem and what workaround is most likely.

pymarc isn't fast as it is, and burdening the parsing logic with uncommon cases might be unwise. Here's what I think we should try:

    try:
        unicode = parse_fast_common_case(marc8)
    except UnicodeDecodeError as ude:
        if marc8[ude.start] == "{":
            # check that there's { hex x 6 } etc.
            # remainder = the {xxxxxx} text plus the decoded rest (not shown)
            unicode = parse_fast_common_case(marc8[:ude.start]) + remainder
        # add other infrequent work-arounds here

etc. It could be made more elegant and extensible.

But placing it in the common path I would not.
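The pattern proposed above can be sketched as runnable Python 3. Everything here is an assumption made for illustration: `TOY_EACC` is a small table inferred by lining up the byte dump in this thread with III's rendering of the title, and `decode_strict` and `decode_with_fallback` are invented names rather than pymarc functions. Escape sequences are left out to keep the sketch short.

```python
import re

# Toy stand-in for an EACC table, inferred from this thread's record;
# it is not pymarc's real mapping.
TOY_EACC = {
    b"\x21\x50\x56": "米", b"\x4b\x37\x6f": "国", b"\x69\x24\x4e": "の",
    b"\x21\x51\x31": "統", b"\x21\x47\x34": "治", b"\x21\x30\x70": "仕",
    b"\x21\x51\x2b": "組",
}
PLACEHOLDER = re.compile(rb"\{[0-9A-Fa-f]{6}\}")


def decode_strict(eacc):
    """Fast path: 3-byte groups only; raise on anything unmapped."""
    out = []
    for i in range(0, len(eacc), 3):
        group = eacc[i:i + 3]
        if group not in TOY_EACC:
            raise UnicodeDecodeError(
                "eacc-toy", eacc, i, i + len(group), "unmapped EACC group"
            )
        out.append(TOY_EACC[group])
    return "".join(out)


def decode_with_fallback(eacc):
    """Try the fast path; examine the failure only if it raises."""
    try:
        return decode_strict(eacc)
    except UnicodeDecodeError as err:
        match = PLACEHOLDER.match(eacc, err.start)
        if match is None:
            raise  # not the brace case; leave it to other workarounds
        # Keep the placeholder as literal ASCII and resume after it.
        return (
            decode_strict(eacc[:err.start])
            + match.group().decode("ascii")
            + decode_with_fallback(eacc[match.end():])
        )


record = bytes.fromhex(
    "215056 4b376f 69244e 215131 214734 69244e 213070 21512b"
    " 7b3639323466367d"
)
print(decode_with_fallback(record))
```

Well-formed input never enters the `except` branch, so the extra regex work is paid only for records that already failed. Further workarounds could be chained where the bare `raise` sits.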

- Godmar

Ed Summers

Mar 23, 2012, 4:44:02 PM

to pym...@googlegroups.com

Let's definitely get this documented in an Issue so it doesn't get lost.

//Ed
