indexer error: index out of range

Gabriel Sean Farrell

Jan 17, 2008, 7:12:53 PM
to facba...@googlegroups.com
So after spending some time with subsets of the catalog, I decided to
dump all of Drexel's meager ~450,000 records into Solr for FBO/Helios.
I ran 'sh batchIndexer.sh ~/catdump/bibs_2008-01-17.dat' and got just
past 130,000 records when it ground to a halt with the following
error:

could not parse pubdate from <<[188-]>> for pubdaterange
. . . . . . . . . . . . . . . . . . . . . . . .
Traceback (innermost last):
  File "indexerDriver.py", line 182, in ?
  File "indexerDriver.py", line 121, in processFile
  File "/home/gsf/svn/fac-back-opac/trunk/indexer/indexer.py", line 149, in __init__
  File "/home/gsf/svn/fac-back-opac/trunk/indexer/processors.py", line 119, in pubdaterangeProcessor
IndexError: index out of range: 205

Mark mentioned he had gotten index errors on encoding issues in the
past, so I converted the whole thing from MARC8 to UTF8 with
yaz-marcdump, but ended up getting the same error. Obviously, a
record is screwed up somewhere. Any ideas on how to clean it up?

This might be easier to solve if I break up my MARC dump into smaller
chunks. I'll do that tomorrow if I can't figure out another way
around it.

Gabriel

Dan Scott

Jan 17, 2008, 7:51:55 PM
to facba...@googlegroups.com

Looks like a logic error in processors.py, actually. The relevant code is here:

            count = 0
            dateranges = range(0,2050,10)
            for i in dateranges:
                if int(resultOn[0]) >= dateranges[count] and int(resultOn[0]) < dateranges[count + 1]:
                    return "%s-%s" % (dateranges[count],(dateranges[count + 1]-1))
                count += 1

The for loop iterates to the end of the dateranges range, but then compares dateranges[count +1] which doesn't exist. For example:

>>> x = range(0,2050,10)
>>> len(x)
205
>>> print x[205]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range

... so we should guard against that, at the very least, with a minor variation like:

            dateranges = range(0,2050,10)
            # stop at the last valid index so dateranges[count] always exists
            for count in range(1, len(dateranges)):
                if int(resultOn[0]) >= dateranges[count - 1] and int(resultOn[0]) < dateranges[count]:
                    return "%s-%s" % (dateranges[count - 1],(dateranges[count]-1))
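
A different way to sidestep the indexing question is to compute the decade arithmetically instead of walking a list. This is only a sketch, not what processors.py does: the function name decade_range and the first/last bounds are made up for illustration, and the upper bound of 2050 is an assumption loosely mirroring range(0,2050,10).

```python
def decade_range(year, first=0, last=2050):
    """Return the decade containing year as 'start-end', or None.

    first and last are hypothetical bounds, chosen here to resemble
    range(0, 2050, 10); they are not taken from processors.py.
    """
    try:
        year = int(year)
    except (TypeError, ValueError):
        # Unparseable input such as "[188-]" yields no decade.
        return None
    if year < first or year >= last:
        return None
    start = (year // 10) * 10  # round down to the start of the decade
    return "%s-%s" % (start, start + 9)
```

Because there is no list lookup, there is no index to run past; a date outside the bounds simply yields None and the caller can decide what to do with it.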

The other, probably more significant, problem is that the date being parsed is obviously not of this earth. I'm not seeing the reason for that immediately, but boy-howdy would it be nice to have some unit tests for this, so if you can track down that 130,000th record, that would be great. Perhaps you could catch the exception and dump the offending MARC record, or at least the field, when it occurs?
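
Catching the exception and dumping the offending record could look roughly like the following. It is a generic sketch rather than the indexer's actual loop: process_all, the process callable, and the log argument are hypothetical names, and the records are stand-ins for whatever the MARC reader yields.

```python
import sys
import traceback


def process_all(records, process, log=sys.stderr):
    """Apply process to each record; log and skip any record that raises.

    Returns (results, failures), where failures holds (position, record)
    pairs so the bad records can be inspected or written out later.
    """
    results = []
    failures = []
    for position, record in enumerate(records):
        try:
            results.append(process(record))
        except Exception:
            # Keep going, but remember which record broke and why.
            failures.append((position, record))
            log.write("record %d failed: %r\n" % (position, record))
            traceback.print_exc(file=log)
    return results, failures
```

The failures list doubles as raw material for unit tests, since each entry is a concrete input that made the processor fall over.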

Talking, rather than actually doing, I remain...

--
Dan Scott
Laurentian University

Gabriel Sean Farrell

Jan 17, 2008, 8:25:15 PM
to facba...@googlegroups.com

Hey, at least you bothered to dig into the code, which is probably
better than banging one's head on the keyboard.

Tests would be nice. If I'm gonna figure some out, however, I'm not
going to want to wait so long for 130,000 to come around, so I'll
redump the MARC tomorrow in chunks of 50,000. Then I'll mess with
that code you mentioned and see if I can get some output that says
what's going on.
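
If redumping from the catalog is a hassle, an existing dump can also be cut into chunks directly. The sketch below assumes a binary MARC (ISO 2709) file in which the byte 0x1D occurs only as the end-of-record terminator, which holds for well-formed records but may not for a damaged one. The name split_marc and the default chunk size are illustrative, and the whole dump is held in memory.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_marc(data, chunk_size=50000):
    """Split raw MARC bytes into chunks of at most chunk_size records.

    Assumes 0x1D appears only as the record terminator. Trailing
    whitespace-only fragments after the last record are dropped.
    """
    records = [
        piece + RECORD_TERMINATOR
        for piece in data.split(RECORD_TERMINATOR)
        if piece.strip()
    ]
    return [
        b"".join(records[i:i + chunk_size])
        for i in range(0, len(records), chunk_size)
    ]
```

Each returned chunk can be written to its own file in binary mode and fed to batchIndexer.sh separately, which narrows a failure down to one chunk instead of the whole dump.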

Gabriel

tbw

Jan 22, 2008, 1:56:28 PM
to FacBackOPAC
I ran into the same problem both with the off-by-one error in the
date parsing and in wanting to have the indexer just log any problem
records and go on to the next record.

I took a look at the MARC4J code and realized I don't know enough
about exception handling in Java to modify the MARC4J code.
Apparently implementing a MARC file parser that logs and skips "bad"
records is being discussed on the MARC4J list:
http://marc4j.tigris.org/servlets/ReadMsg?listName=users&msgNo=86

I ended up breaking the MARC record loading into smaller files and
then running any file containing a problem record through Terry
Reese's MarcEdit MARCValidator
(http://oregonstate.edu/~reeset/marcedit/html/index.php). You can set
it to remove invalid records
(MarcEditor|Tools|Validate MARC Files). That let me index all the
"valid" records and gave me a file of "bad" records.

Tom

Gabriel Sean Farrell

Feb 4, 2008, 11:49:18 AM
to facba...@googlegroups.com
On Jan 22, 2008 1:56 PM, tbw <tburt...@gmail.com> wrote:
>
> I ran into the same problem both with the off-by-one error in the
> date parsing and in wanting to have the indexer just log any problem
> records and go on to the next record.
>
> I took a look at the MARC4J code and realized I don't know enough
> about exception handling in Java to modify the MARC4J code.
> Apparently implementing a MARC file parser that logs and skips "bad"
> records is being discussed on the MARC4J list:
> http://marc4j.tigris.org/servlets/ReadMsg?listName=users&msgNo=86

I've started a "pymarc-indexer" branch to attempt to parse the MARC
with pymarc instead of marc4j. I've got the directory structure set
up, but not a lot of code yet. More on that in the next couple days.

> I ended up breaking the MARC record loading into smaller files and
> then running any file containing a problem record through Terry
> Reese's MarcEdit MARCValidator
> (http://oregonstate.edu/~reeset/marcedit/html/index.php). You can set
> it to remove invalid records
> (MarcEditor|Tools|Validate MARC Files). That let me index all the
> "valid" records and gave me a file of "bad" records.

MarcEdit is by all accounts awesome. I've been meaning to introduce
our catalogers to it if they don't know about it already (I suspect
they do).

Gabriel

ps And thanks to Dan Scott for committing that fix to processors.py.
That's what I call talking *and* doing.
