On Jan 11, 8:51 pm, "Dan Scott" <deni...@gmail.com> wrote:
I had a similar reaction when looking at marc4j and all those
Python libraries in indexer/. "Can't we simplify this a bit?", I
thought. Going straight from MARC with Python seems to be the obvious
route, but in my unscientific testing I found PyMARC to be pretty darn
slow in the tens-of-thousands-of-records range. It was also eating up
all my memory, though I think Ed Summers has since patched it to fix
that, so it may be faster now.
The other route would be to do what Dan said: tell people to convert
it to MARCXML (I prefer yaz-marcdump), then provide an XSLT to convert
that into the Solr Update Schema. I've been looking at VuFind's XSLT
for that purpose. We'd need to rework the PHP scripts it calls into
Python, or else try to get the XSLT to do everything itself, but that
could be a safe and simple solution. The funny thing is I heard
mention at the ALA meeting last weekend that VuFind has a Java indexer
that drops the MARC straight into the index a lot faster than all of
this XML rigmarole. Kinda defeats the point of the RESTful,
any-language-you-like Solr philosophy, but hey, whatever works.
Oh, and Dan, I think it makes more sense to convert from DC, etc. into
the Update Schema. Convert into MARCXML? You must be joking. On a
more serious note, I might prefer the XML route because I'm hoping to
pull DC records from our IR (DSpace). I think it would be easier to
have an XSLT for that than to parse it with Python. Or maybe it
wouldn't. I for one would rather write Python than XSL.
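
For comparison, here is a rough sketch of what the pure-Python route for DC
might look like, using only the standard library. It assumes a simple record
with elements in the Dublin Core namespace. The FIELD_MAP names and the sample
record are made up for illustration; the real Solr field names depend on
whatever schema we settle on, and DSpace's actual export format may differ.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from DC element names to Solr field names.
# The real names depend on the Solr schema in use.
FIELD_MAP = {
    "identifier": "id",
    "title": "title",
    "creator": "author",
    "subject": "subject",
}


def dc_to_solr_add(dc_xml):
    """Turn one DC record (XML string) into a Solr <add> update message."""
    record = ET.fromstring(dc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for elem in record.iter():
        # Only look at elements in the Dublin Core namespace.
        if not elem.tag.startswith("{" + DC_NS + "}"):
            continue
        local_name = elem.tag.split("}", 1)[1]
        solr_name = FIELD_MAP.get(local_name)
        text = (elem.text or "").strip()
        if solr_name and text:
            field = ET.SubElement(doc, "field", name=solr_name)
            field.text = text
    return ET.tostring(add, encoding="unicode")


# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>example-1</dc:identifier>
  <dc:title>An Example Thesis</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Libraries</dc:subject>
  <dc:subject>Indexing</dc:subject>
  <dc:rights>ignored, not in FIELD_MAP</dc:rights>
</record>"""

if __name__ == "__main__":
    print(dc_to_solr_add(SAMPLE))
```

Repeated elements such as dc:subject simply become repeated fields, which
would only work if the corresponding Solr field is multi-valued.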
One more thought: if you do use PyMARC, consider Solr's CSV format for
updating records. Probably faster to produce in Python and faster for
Solr to process than XML is. If you get anything working, talk to us
about committing it to trunk!
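
To make the CSV idea a bit more concrete, here is a minimal sketch using only
the standard csv module. It assumes each record has already been reduced to a
plain dict (by PyMARC or anything else); that extraction step is not shown.
The field names in FIELDS are hypothetical and would have to match the Solr
schema.

```python
import csv
import io

# Hypothetical Solr field names; the real ones depend on the schema.
FIELDS = ["id", "title", "author"]


def records_to_csv(records, fields=FIELDS):
    """Serialize an iterable of dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    # First line names the fields for the rows that follow.
    writer.writeheader()
    for rec in records:
        # Missing fields become empty strings; extra keys are dropped.
        writer.writerow({name: rec.get(name, "") for name in fields})
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up records, standing in for whatever the MARC parser yields.
    sample = [
        {"id": "1", "title": "Moby Dick, or The Whale", "author": "Melville, Herman"},
        {"id": "2", "title": "Walden", "extra": "not indexed"},
    ]
    print(records_to_csv(sample), end="")
```

The csv module takes care of quoting values that contain commas, which matters
for titles and inverted author names. The resulting text would then be posted
to Solr's CSV update handler; the exact URL and parameters depend on the Solr
version and configuration.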