pymarc alternative to marc4j?


Tom Burton-West

Jan 11, 2008, 5:45:26 PM
to facba...@googlegroups.com
Hello,

We just started working with fac-back-opac for a test of a special project and being a Python newbie, I was wondering whether it would be possible to use pymarc instead of marc4j?  That way debugging and development could all be done in python instead of the mixed java/python environment of jython. 

http://pypi.python.org/pypi/pymarc


Tom

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Services
University of Michigan Library

Dan Scott

Jan 11, 2008, 8:51:47 PM
to facba...@googlegroups.com
On 11/01/2008, Tom Burton-West <tburt...@gmail.com> wrote:
> Hello,
>
> We just started working with fac-back-opac for a test of a special project and being a Python newbie, I was wondering whether it would be possible to use pymarc instead of marc4j?  That way debugging and development could all be done in python instead of the mixed java/python environment of jython.
>
> http://pypi.python.org/pypi/pymarc

When Mark Matienzo added marc8 -> utf-8 conversion capabilities to pymarc a few months back, I think most of us salivated over exactly the possibility that you raise. So far, however, no-one has dared to step into the realm of actually reimplementing the indexer with pymarc.

One possible concern with this approach is that marc4j is blazing fast; I'm not sure whether pymarc has been benchmarked in comparison. It would be great if someone were to take the first steps towards a pymarc-based indexer to compare its performance.
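A first comparison could be as simple as timing how fast a reader yields records. The sketch below is only a minimal harness, not a rigorous benchmark: `records_per_second` is a hypothetical helper name, and the pymarc usage in the comment assumes pymarc is installed and that `records.mrc` (a made-up filename) holds binary MARC records.

```python
import time


def records_per_second(records):
    """Consume an iterable of records and return (count, records per second)."""
    start = time.perf_counter()
    count = 0
    for _ in records:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


# Hypothetical usage with pymarc (assumes pymarc is installed and that
# "records.mrc" is a file of binary MARC records):
#
#   from pymarc import MARCReader
#   with open("records.mrc", "rb") as fh:
#       count, rate = records_per_second(MARCReader(fh))
#       print(count, "records,", round(rate), "records/sec")
```

Running the same file through the existing marc4j-based indexer and comparing the two rates would give a rough idea of the gap, keeping in mind that parsing alone is not the whole indexing cost.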

The other possibility that has been raised is to simply change the requirement for the source records to be MARC21XML to begin with, then implement the indexer using XSL or a SAX parser. That would place the burden of getting to MARC21XML on the user, but it would also mean that they could use any tool they prefer to get there -- MarcEdit, marc4j, MARC::Record, File_MARC, yaz-marcdump. Heck, they could even transform records from other formats (DC, etc) into MARC21XML.
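To make the SAX idea concrete, here is a rough sketch that streams MARC21XML and emits a Solr `<add>` document. It only pulls the 001 control field and 245$a, and the Solr field names `id` and `title` are assumptions for illustration, not the project's actual schema; the class and function names are hypothetical as well.

```python
import io
import xml.sax
import xml.etree.ElementTree as ET
from xml.sax.handler import ContentHandler, feature_namespaces

MARC_NS = "http://www.loc.gov/MARC21/slim"


class MarcXmlHandler(ContentHandler):
    """Collect 001 (as 'id') and 245$a (as 'title') from each record."""

    def __init__(self):
        super().__init__()
        self.docs = []
        self._doc = None      # dict for the record being read
        self._tag = None      # tag of the enclosing datafield
        self._target = None   # output field currently being captured
        self._buf = []

    def startElementNS(self, name, qname, attrs):
        ns, local = name
        if ns != MARC_NS:
            return
        if local == "record":
            self._doc = {}
        elif local == "controlfield" and attrs.get((None, "tag")) == "001":
            self._target = "id"
        elif local == "datafield":
            self._tag = attrs.get((None, "tag"))
        elif (local == "subfield" and self._tag == "245"
              and attrs.get((None, "code")) == "a"):
            self._target = "title"

    def characters(self, content):
        if self._target is not None:
            self._buf.append(content)

    def endElementNS(self, name, qname):
        ns, local = name
        if ns != MARC_NS:
            return
        if local in ("controlfield", "subfield") and self._target is not None:
            if self._doc is not None:
                self._doc[self._target] = "".join(self._buf).strip()
            self._target = None
            self._buf = []
        elif local == "datafield":
            self._tag = None
        elif local == "record" and self._doc is not None:
            self.docs.append(self._doc)
            self._doc = None


def parse_marcxml(source):
    """Parse a file-like object of MARC21XML into a list of dicts."""
    handler = MarcXmlHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.docs


def to_solr_add(docs):
    """Serialize dicts as a Solr <add> update message (field names assumed)."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for field, value in d.items():
            ET.SubElement(doc, "field", {"name": field}).text = value
    return ET.tostring(add, encoding="unicode")


# Tiny made-up record, just to exercise the code.
SAMPLE = (b'<collection xmlns="http://www.loc.gov/MARC21/slim"><record>'
          b'<controlfield tag="001">12345</controlfield>'
          b'<datafield tag="245" ind1="1" ind2="0">'
          b'<subfield code="a">Example title</subfield>'
          b'<subfield code="c">Someone</subfield>'
          b'</datafield></record></collection>')

print(to_solr_add(parse_marcxml(io.BytesIO(SAMPLE))))
```

Because SAX streams the input, memory use stays flat regardless of file size, which matters for large record dumps. A real indexer would map many more fields and handle repeated subfields.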

--
Dan Scott
Laurentian University

Mark A. Matienzo

Jan 12, 2008, 11:53:27 AM
to facba...@googlegroups.com
One of the reasons I started getting involved with FBO development was
to start working on a branch to see if pymarc integration made sense
for the indexer. There hasn't been a whole lot of talk about it since
(mostly since I've been busy with work, I've just moved, etc.). I'm
definitely still interested, though. Maybe we can start figuring out
some guidelines to light this fire under my butt. :)

Mark

Gabriel Sean Farrell

Jan 17, 2008, 1:25:38 PM
to FacBackOPAC
On Jan 11, 8:51 pm, "Dan Scott" <deni...@gmail.com> wrote:
I had a similar reaction when looking at the marc4j and all those
Python libraries in indexer/. "Can't we simplify this a bit?", I
thought. Going straight from MARC with Python seems to be the obvious
route, but in my unscientific testing I found PyMARC to be pretty darn
slow in the tens-of-thousands-of-records range. It was also eating up
all my memory, however, and I think Ed Summers patched it to fix that,
so it may be faster now.

The other route would be to do what Dan said: tell people to convert
it to MARCXML (I prefer yaz-marcdump), then provide an XSLT to convert
that into the Solr Update Schema. I've been looking at VuFind's XSLT
for that purpose. We'd need to rework the PHP scripts it calls into
Python, or else try to get the XSLT to do everything itself, but that
could be a safe and simple solution. The funny thing is I heard
mention at the ALA meeting last weekend that VuFind has a Java indexer
that drops the MARC straight into the index a lot faster than all of
this XML rigmarole. Kinda defeats the point of the RESTful,
any-language-you-like Solr philosophy, but hey, whatever works.

Oh, and Dan, I think it makes more sense to convert from DC, etc. into
the Update Schema. Convert into MARCXML? You must be joking. On a
more serious note, I might prefer the XML route because I'm hoping to
pull DC records from our IR (DSpace). I think it would be easier to
have an XSLT for that than to parse it with Python. Or maybe it
wouldn't. I for one would rather write Python than XSL.

One more thought: if you do use PyMARC, consider Solr's CSV format for
updating records. Probably faster to produce in Python and faster for
Solr to process than XML is. If you get anything working, talk to us
about committing it to trunk!
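For the CSV idea, a minimal sketch using only the standard library might look like this. It assumes the records have already been reduced to dicts (for example by a pymarc-based extractor) and that Solr's CSV update handler reads field names from a header row; `docs_to_csv` and the field names are hypothetical.

```python
import csv
import io


def docs_to_csv(docs, fieldnames):
    """Render a list of dicts as CSV text with a header row of field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        extrasaction="ignore",   # drop keys that are not in the schema
        lineterminator="\n",
    )
    writer.writeheader()
    for doc in docs:
        # Missing keys are written as empty strings (DictWriter default).
        writer.writerow(doc)
    return buf.getvalue()


# Made-up example documents; field names are assumptions.
example = docs_to_csv(
    [{"id": "1", "title": "A, B"}, {"id": "2"}],
    ["id", "title"],
)
print(example)
```

The `csv` module takes care of quoting values that contain commas or quotes, which titles often do. Multi-valued fields would need an agreed separator on the Solr side, which this sketch does not cover.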