Fwd: [solrmarc-tech] Version 2.3 of SolrMarc released

Jonathan Rochkind

Oct 10, 2011, 6:04:12 PM

to blacklight-...@googlegroups.com

Blacklight includes a copy of SolrMarc in its own distribution. Updating BL to use the new version of SolrMarc could be as simple as replacing the distro'd SolrMarc.jar with a new one (although one would want to test it).

This alone should make the distro'd BL SolrMarc work with Solr 3.x, if I understand things right.

We may also want to change the distro'd default config to use the StreamingUpdateSolrServer with javabin for more efficient updates, although at present the default distro'd config actually uses "direct write to disk" indexing, which is probably fastest of all (when it works and is appropriate). On the one hand, I'd favor changing the default distro'd config to use the more conventional and reliable HTTP connection to Solr for updates (with javabin and StreamingUpdateSolrServer). On the other hand, a Solr server would have to be running for that to work, and presently we have a rake task you can run whether or not the Solr you are indexing to is actually running. So it's unclear how to proceed with that change, and we probably won't do it in the immediate future.

But whenever someone finds time to do it and briefly test it, they should probably go ahead and update the distro'd SolrMarc.jar to be 2.3, so it'll work with Solr 3.x.

-------- Original Message --------

Subject:	[solrmarc-tech] Version 2.3 of SolrMarc released
Date:	Mon, 10 Oct 2011 17:53:36 -0400
From:	Robert Haschart <rh...@virginia.edu>
Reply-To:	<solrma...@googlegroups.com>
To:	solrma...@googlegroups.com <solrma...@googlegroups.com>
CC:	<vufind-...@lists.sourceforge.net>, <blacklight-...@googlegroups.com>

Release Notes for version 2.3 of SolrMarc

The main new features included in this release are described below:

1) Support for Running with Solr 3.1

Although SolrMarc was designed to be agnostic w.r.t. the version of Solr that it is running against, changes in Solr starting with version 3.1 caused this to no longer be the case. Specifically, changes in how a single core is loaded for local Solr mode and changes in the javabin protocol for remote Solr mode made SolrMarc not function with Solr 3.x.

This version of SolrMarc includes a custom version of the Solrj library that is backward compatible, so that SolrMarc can talk to Solr servers of version 3.x or version 1.4 using either xml or javabin communication, choosing the correct version of javabin for the version of the server it is talking to.

Additionally, when communicating remotely, the StreamingUpdateSolrServer class can be used, which will chunk together record adds (which should speed up adding records to a remote Solr server). For optimization purposes, the use of StreamingUpdateSolrServer and of javabin communication will now be the default.

2) New Custom Indexing Function 'Mixin' Architecture

Previously you could either use the standard custom functions that are provided as a part of the class SolrIndexer, or you could define your own subclass that extended SolrIndexer and define custom functions of your own there. The drawback was that as you added more custom functions, the class defining them could grow quite large and unwieldy, and if someone else created a new custom function it would be hard for you to find and hard for you to add to your implementation. Furthermore, if you first developed a Beanshell script to implement a custom indexing function and subsequently wanted to migrate that function to a compiled method, the process could be a little tricky. With the Mixin architecture, that process should be easier.

How Does it Work?

You can see the class org.solrmarc.index.GetFormatMixin for an example. You define a new class that extends the class org.solrmarc.index.SolrIndexerMixin, and in this new class you define one or more public functions that return either a String or a Set<String> and take as parameters a Record object and zero or more Strings:

    /**
     * Return the content type and media types, plus electronic, for this record
     *
     * @param Record   - MARC Record
     * @return Set of Strings of content types and media types
     */
    public Set<String> getContentTypesAndMediaTypes(final Record record)
    {
        Set<String> formats = getContentTypes(record);
        formats.addAll( getMediaTypes(record));
        formats = addOnlineTypes(record, formats);
        return(formats);
    }

and then in your indexing specification you can access the above method like this:

content_type_s = custom(org.solrmarc.index.GetFormatMixin), getContentTypesAndMediaTypes

where you specify the class the method is in, in parentheses following the custom keyword (similar to what is done for Beanshell scripts).

The main benefit of this new architecture is increased modularity. You can group methods dealing with the format of items together in one file and place methods that handle call number processing in another. Furthermore if someone else defines a group of functions that you find useful you can take that source, or the compiled class or even the entire jar they've created, and include it in your configuration, and reference the new indexing functions as shown above, and reindex the affected records.
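To make the lookup concrete, here is a minimal, self-contained sketch of how a specification line of the form `field = custom(Class), method` could be resolved with plain Java reflection. It is only an illustration under stated assumptions and not SolrMarc's actual implementation: `FakeRecord`, `DemoFormatMixin` and `MixinSpecDemo` are hypothetical stand-ins (a real mixin would extend org.solrmarc.index.SolrIndexerMixin and take a marc4j Record).

```java
import java.lang.reflect.Method;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a MARC record; the real type would be org.marc4j.marc.Record.
class FakeRecord {
    final String leader;
    FakeRecord(String leader) { this.leader = leader; }
}

// Hypothetical mixin: a small class that only holds format-related indexing methods.
// In SolrMarc it would extend org.solrmarc.index.SolrIndexerMixin.
class DemoFormatMixin {
    public Set<String> getContentTypes(FakeRecord record) {
        Set<String> types = new LinkedHashSet<>();
        // Leader position 6 is the MARC "type of record" code; 'a' means language material.
        types.add(record.leader.charAt(6) == 'a' ? "Book" : "Other");
        return types;
    }
}

class MixinSpecDemo {
    // Matches lines such as: content_type_s = custom(some.pkg.SomeMixin), someMethod
    private static final Pattern SPEC =
        Pattern.compile("\\s*(\\w+)\\s*=\\s*custom\\(([\\w.$]+)\\)\\s*,\\s*(\\w+)\\s*");

    // Parses the spec line, instantiates the named class and calls the named method.
    static Set<String> evaluate(String specLine, FakeRecord record) {
        Matcher m = SPEC.matcher(specLine);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized spec: " + specLine);
        }
        try {
            Class<?> mixinClass = Class.forName(m.group(2));
            Object mixin = mixinClass.getDeclaredConstructor().newInstance();
            Method method = mixinClass.getMethod(m.group(3), FakeRecord.class);
            @SuppressWarnings("unchecked")
            Set<String> result = (Set<String>) method.invoke(mixin, record);
            return result;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + m.group(3), e);
        }
    }

    public static void main(String[] args) {
        FakeRecord record = new FakeRecord("00000nam a2200000 a 4500");
        String spec = "content_type_s = custom(" + DemoFormatMixin.class.getName()
                + "), getContentTypes";
        System.out.println(evaluate(spec, record));
    }
}
```

Because the class name travels in the specification line, adding someone else's mixin in this sketch only requires having the class on the classpath; nothing in the dispatcher has to change.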

3) New Example 'Mixin' Custom Indexing Functions

The bulk of the code in this example was submitted by David Walker at calstate.edu. It attempts to apply all of the arcane rules defined by LOC for how the content type of an item is defined, as well as the rules for how the media type of an item is defined.

Quoting David Walker:

But 'format' here actually encompasses (at least) two different concepts.  RDA does a good job of delineating these, in my opinion, so I'm going to borrow its terminology and talk about "content types" and "media types."

Content type is 'format' in terms of the nature of the contents.  We can talk about Huckleberry Finn as a "book," the New York Times as a "newspaper," the Empire Strikes Back as a "movie" and Abbey Road as a "music recording."  These are all content types.

Media type is 'format' in terms of the physical medium or carrier of the item.  Huckleberry Finn might be available in "print" or as an "ebook."  Your library likely has old issues of the New York Times in "microfilm" as well as access "online."  The Empire Strikes Back was released on "Laser Disc," "VHS," "DVD," and now "Blu-Ray."  And you might have Abbey Road on LP, tape, CD, and so on.  These are all media types.

The main methods it defines are:
getContentType               - get content type as described above
getMediaType                 - get media type as described above
getPrimaryContentType        - get a single 'best' content type for an item
getPrimaryContentTypePlusOnline - get primary content type, and if item is available online add in appropriate additional types
getContentTypesAndMediaTypes - get content type and media type as described above, and if item is available online add in appropriate additional types

4) Support for reading and writing marc records in JSON

In the past either Binary Marc or MarcXML could be stored in the solr index by specifying either:

marc_display = FullRecordAsMARC
or
marc_display = FullRecordAsXML

However, both options have drawbacks. Binary Marc records contain the binary character codes 0x1d, 0x1e and 0x1f, which are invalid in XML and which are sometimes translated to the character entities &#x1D; &#x1E; and &#x1F;, which then have to be retranslated client side before the binary Marc can be processed. Furthermore, binary Marc can only be 99999 bytes long before it is invalid. Larger records can be created, but these out-of-spec oversized records are often handled differently by different Marc tools.

MarcXML has the drawback of being significantly larger than the binary Marc representation of the same record, as well as usually slower to process since the XML has to be parsed.
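The two binary Marc problems described above can be checked mechanically. The following is a rough sketch, not part of SolrMarc or marc4j: the class `MarcStorageCheck` and its method names are made up for illustration. It counts characters that fall outside the XML 1.0 character range (which includes the three Marc delimiters) and tests a record against the 99999-byte limit.

```java
class MarcStorageCheck {
    // The three binary Marc delimiter characters.
    static final char RECORD_TERMINATOR = 0x1D;
    static final char FIELD_TERMINATOR = 0x1E;
    static final char SUBFIELD_DELIMITER = 0x1F;

    // Largest record length that binary Marc can represent.
    static final int MAX_BINARY_MARC_LENGTH = 99999;

    // True if the code point is a legal XML 1.0 character:
    // tab, LF, CR, or anything from 0x20 up, excluding surrogates, 0xFFFE and 0xFFFF.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Counts the characters that would have to be escaped or translated
    // before the raw record text could be embedded in an XML document.
    static long countXmlInvalid(String rawMarc) {
        return rawMarc.codePoints().filter(cp -> !isXml10Char(cp)).count();
    }

    // True if the encoded record is within the binary Marc size limit.
    static boolean fitsBinaryMarc(byte[] encodedRecord) {
        return encodedRecord.length <= MAX_BINARY_MARC_LENGTH;
    }

    public static void main(String[] args) {
        // A tiny made-up fragment: one subfield followed by field and record terminators.
        String fragment = SUBFIELD_DELIMITER + "aHuckleberry Finn"
                + FIELD_TERMINATOR + RECORD_TERMINATOR;
        System.out.println("XML-invalid characters: " + countXmlInvalid(fragment));
        System.out.println("Fits binary Marc: " + fitsBinaryMarc(new byte[100000]));
    }
}
```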

There is now a third (and fourth) option: encoding the Marc record in JSON and storing that in the solr record:

marc_display = FullRecordAsJSON

Which will encode the Marc record using the Marc-in-JSON scheme, and add the resulting encoded string to the solr index
(see http://dilettantes.code4lib.org/blog/category/marc-in-json/)

or

marc_display = FullRecordAsJSON2

Which will encode the Marc record using the Marc-JSON scheme, and add the resulting encoded string to the solr index
(see http://www.oclc.org/developer/content/marc-json-draft-2010-03-11)
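For readers who have not seen the first of these two schemes, the sketch below hand-builds a tiny record in the general shape of the Marc-in-JSON proposal as I understand it: a "leader" string plus a "fields" array, where control fields map a tag to a string and data fields map a tag to an object holding "subfields", "ind1" and "ind2". This is an illustration only, not the serializer SolrMarc or marc4j uses. `MarcInJsonSketch` and its helpers are hypothetical, the escaping is deliberately minimal, and using a Map means repeated subfield codes are not supported here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MarcInJsonSketch {
    // Minimal JSON string quoting: handles only backslash and double quote.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Control field (for example 001): {"001":"value"}
    static String controlField(String tag, String value) {
        return "{" + quote(tag) + ":" + quote(value) + "}";
    }

    // Data field: {"245":{"subfields":[{"a":"..."}],"ind1":"1","ind2":"0"}}
    static String dataField(String tag, char ind1, char ind2, Map<String, String> subfields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{").append(quote(tag)).append(":{\"subfields\":[");
        boolean first = true;
        for (Map.Entry<String, String> e : subfields.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append("{").append(quote(e.getKey())).append(":")
              .append(quote(e.getValue())).append("}");
            first = false;
        }
        sb.append("],\"ind1\":").append(quote(String.valueOf(ind1)))
          .append(",\"ind2\":").append(quote(String.valueOf(ind2)))
          .append("}}");
        return sb.toString();
    }

    // Whole record: {"leader":"...","fields":[...]}
    static String record(String leader, String... fields) {
        return "{\"leader\":" + quote(leader)
             + ",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        Map<String, String> title = new LinkedHashMap<>();
        title.put("a", "Adventures of Huckleberry Finn");
        System.out.println(record("00000nam a2200000 a 4500",
                controlField("001", "demo0001"),
                dataField("245", '1', '0', title)));
    }
}
```

Note that nothing in the encoded string is a control character, which is why this form avoids the XML problems of binary Marc while staying more compact than MarcXML.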

5) Includes updated version of Marc4j library

The SolrMarc release includes a version of marc4j (labelled marc4j-2.5.beta.jar) that is essentially what will be released shortly as marc4j.2.5.jar

-Robert Haschart

