Skipping ahead in MARC Reader

Elijah Terrell

Oct 29, 2012, 3:34:58 PM

to pym...@googlegroups.com

Hi. I'm new to Python, so I apologize if the answer to this is obvious to people who know the conventions.

I've got a large file of MARC records. What I'm working on right now is importing them into a database. This is mostly for experimentation purposes; I'm not trying anything serious yet. So I set up a loop to iterate over all the records. But wouldn't you know it, some of the entries are "special": badly formatted or just missing information. Every time I come across one of these gems, I either dump it or add contingencies to deal with any like it.

My issue is that, now that I'm a few hundred thousand records into my file, it takes a while to get back to where I was, because I'm reading through the file sequentially. That introduces a delay into the debug cycle which is a bit frustrating. Is there a way to just skip ahead a specific number of records?

At first I thought I could do something I've done with other iterators like:

for record in reader[400000:]:

but it says the reader object is not subscriptable.

Right now I'm using a second loop to get where I want to go like:

count = 0
for record in reader:
    count = count + 1
    if count >= 400000:
        break

It's faster than my main loop but still takes a few minutes. It seems like if the reader could just jump over the length of a record that would go faster than creating a record object based on it, etc, but I can't figure out if there's a way to do that.
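
A minimal sketch of that idea, not a pymarc feature: in the MARC 21 transmission format the first five bytes of every record hold its total length as ASCII digits, so a file opened in binary mode can be advanced record by record without parsing anything. The helper name skip_records is hypothetical, and the sketch assumes every length field is intact; a corrupt one will raise ValueError or leave the handle in the middle of a record.

```python
import io


def skip_records(fh, n):
    """Advance a binary file handle past up to n MARC records.

    Returns the number of records actually skipped.
    """
    skipped = 0
    while skipped < n:
        head = fh.read(5)          # record length, e.g. b"00714"
        if len(head) < 5:          # end of file (or a truncated tail)
            break
        length = int(head.decode("ascii"))
        # The 5 bytes just read are part of the record, so jump the rest.
        fh.seek(length - 5, io.SEEK_CUR)
        skipped += 1
    return skipped
```

After skip_records(fh, 400000), handing the same fh to pymarc.MARCReader should resume iteration at the next record, since the reader consumes the handle from its current position.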

Gabriel Farrell

Oct 29, 2012, 9:27:08 PM

Due to memory limits and to help with the debug cycle, I usually split
my record dumps into sets of 50,000 or so records. Would that be
possible with the ones you're dealing with?

Godmar Back

Oct 29, 2012, 9:36:58 PM

You could index the file:

import pymarc

# mfile is the MARC file, already opened in binary mode
mw = pymarc.MARCReader(mfile)
pos = mfile.tell()
cnt = 0
# save (cnt, pos)
for record in mw:
    cnt += 1
    pos = mfile.tell()
    # save (cnt, pos)

Then you can jump to record number cnt by seeking the file to the saved pos before you start reading.

You could store the (cnt, pos) pairs in a tab-separated file (or pickle them).
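
A minimal sketch of the tab-separated variant, using only the standard library. The function names save_index and load_index are hypothetical, and the pairs are assumed to be the (cnt, pos) integers collected by the loop above.

```python
import csv


def save_index(pairs, path):
    """Write (cnt, pos) pairs to a tab-separated file, one pair per line."""
    with open(path, "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(pairs)


def load_index(path):
    """Read the file back into a dict mapping record count to file offset."""
    with open(path, newline="") as src:
        return {int(cnt): int(pos)
                for cnt, pos in csv.reader(src, delimiter="\t")}
```

With the index loaded as offsets, mfile.seek(offsets[400000]) positions the file at the saved offset, and a MARCReader created on mfile afterwards should start reading from there.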

- Godmar

Elijah Terrell

Oct 30, 2012, 12:52:19 AM

On Monday, October 29, 2012 9:36:59 PM UTC-4, Godmar Back wrote:

You could index the file:

Good to know. I had no idea things like that would work in conjunction with something like MARCReader. More study is required.

I'll also look into splitting things up as Mr Farrell suggested next time I set out to process my whole collection.

For the benefit of anyone who has the same problem, here is what I came up with this afternoon: I went over the whole file, and whenever my loop found something it didn't like or hit an exception, it dumped that record out to a separate file. That way I ended up with all my miscreants in one place.
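
One way that pattern could look, as a hedged sketch rather than the code actually used. The names split_rejects, handle and reject_file are hypothetical; records are treated as raw bytes, and the catch-all except is deliberate because the goal is triage, not error handling.

```python
def split_rejects(raw_records, handle, reject_file):
    """Run handle() on each raw record; copy any record that fails to reject_file.

    Returns the number of rejected records.
    """
    rejected = 0
    for raw in raw_records:
        try:
            handle(raw)              # e.g. parse and insert into the database
        except Exception:
            reject_file.write(raw)   # keep the troublemaker for later study
            rejected += 1
    return rejected
```

The reject file then holds only the problem records, so the debug loop can be rerun against it instead of the full collection.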

Thanks for the help, guys.

Godmar Back

Oct 30, 2012, 11:55:38 AM

On Tue, Oct 30, 2012 at 12:52 AM, Elijah Terrell <elite...@gmail.com> wrote:

Good to know. I had no idea things like that would work in conjunction with something like MARCReader. More study is required.

You can also create indices keyed by OCLC number, ISBN, ISSN, etc. in this way. Just create multimaps mapping those identifiers to file offsets, then save them to a file. As an example, for some local MARC record files that were ~2GB in size and took 5 minutes to read, reading the index takes only a couple of seconds.

Here's the code if you're interested:

http://libx.lib.vt.edu/services/bootcamp/marctools.py

http://libx.lib.vt.edu/services/bootcamp/indexmarc.py
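
This is not the code from the scripts linked above; it is a minimal sketch of the multimap idea under stated assumptions. The function names are hypothetical, identifiers are assumed to be strings already extracted from each record, and JSON is used for storage purely for illustration (pickle would work as well).

```python
import json
from collections import defaultdict


def build_multimap(pairs):
    """Group file offsets by identifier.

    One identifier may appear on several records, so each key maps to a list.
    """
    index = defaultdict(list)
    for identifier, offset in pairs:
        index[identifier].append(offset)
    return dict(index)


def save_multimap(index, path):
    """Write the identifier-to-offsets mapping to a JSON file."""
    with open(path, "w") as out:
        json.dump(index, out)


def load_multimap(path):
    """Load the mapping back; keys are strings, values are lists of offsets."""
    with open(path) as src:
        return json.load(src)
```

Looking up an identifier then yields every offset to seek to, without reading the full MARC file first.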

- Godmar
