Blacklight includes a copy of SolrMarc in its own distribution.
Updating BL to use the new version of SolrMarc could be as simple as
replacing the distro'd SolrMarc.jar with a new one (although one
would want to test it).
This alone should make the distro'd BL SolrMarc work with Solr 3.x,
if I understand things right.
We may also want to change the distro'd default config to use the
StreamingUpdateServer with javabin for more efficient updates.
Although at present the default distro'd config actually uses
"direct write to disk" indexing, which is probably fastest of all
(when it works and is appropriate). On the one hand, I'd favor
changing default distro'd config to use the more conventional and
reliable HTTP connection to Solr for updates (with javabin and
StreamingUpdateServer) -- however the trick there is that a Solr
server would have to be running for it to work, and presently we
have a rake task you can run whether or not you actually have the
Solr you are indexing to running. So it's unclear how to proceed with
that change, and we probably won't do it in the immediate future.
But whenever someone finds time to do it and briefly test it, they
should probably go ahead and update the distro'd SolrMarc.jar to be
2.3, so it'll work with Solr 3.x.
-------- Original Message --------
Release Notes for version 2.3 of SolrMarc
The main new features included in this release are described below:
1) Support for Running with Solr 3.1
Although SolrMarc was designed to be agnostic w.r.t. the version of
Solr that it is running against, changes in Solr starting with
version 3.1 caused this to no longer be the case. Specifically,
changes in how a single core was to be loaded for local solr mode and
changes in the javabin protocol for remote solr mode made SolrMarc
not function with Solr 3.x.
This version of SolrMarc includes a custom version of the Solrj
library that is backward compatible, so that SolrMarc can talk to
Solr servers of version 3.x or version 1.4 using either xml or
javabin communication, choosing the correct version of javabin for
the version of server it is talking to.
Additionally, when communicating remotely, the
StreamingUpdateSolrServer class can be used, which will chunk
together record adds (which should speed up adding records to a
remote Solr Server). Furthermore, for optimization purposes, the use
of the StreamingUpdateSolrServer and the use of javabin communication
will now be the default.
2) New Custom Indexing Function 'Mixin' Architecture
Previously you could either use the standard custom functions that
are provided as a part of the class SolrIndexer, or you could define
your own subclass that extended SolrIndexer, and define custom
functions of your own there. The drawback was that as you added more
custom functions, your class defining these custom functions could
grow quite large and unwieldy, and if someone else created a new
custom function it would be hard for you to find and hard for you to
add to your implementation. Furthermore, if you first developed a
Beanshell script to implement the custom indexing function and
subsequently wanted to migrate that function to a compiled method,
the process could be a little tricky. With the Mixin architecture,
that process should be easier.
How Does it Work?
You can see the class org.solrmarc.index.GetFormatMixin for an
example. You define a new class that extends the class
org.solrmarc.index.SolrIndexerMixin, and in this new class you define
one or more public functions that return either a String or a
Set<String> and take as parameters a Record object and zero or more
Strings:
    /**
     * Return the content type and media types, plus electronic, for this record
     *
     * @param record - MARC Record
     * @return Set of Strings of content types and media types
     */
    public Set<String> getContentTypesAndMediaTypes(final Record record)
    {
        Set<String> formats = getContentTypes(record);
        formats.addAll(getMediaTypes(record));
        formats = addOnlineTypes(record, formats);
        return(formats);
    }
and then in your indexing specification you can access the above
method like this:
content_type_s = custom(org.solrmarc.index.GetFormatMixin), getContentTypesAndMediaTypes
where you specify the class the method is in, in parentheses
following the custom keyword (similar to what is done for beanshell
scripts).
The main benefit of this new architecture is increased modularity.
You can group methods dealing with the format of items together in
one file and place methods that handle call number processing in
another. Furthermore, if someone else defines a group of functions
that you find useful, you can take that source, or the compiled class
or even the entire jar they've created, and include it in your
configuration, and reference the new indexing functions as shown
above, and reindex the affected records.
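To make the shape of a mixin concrete, here is a minimal sketch of
what such a class might look like. It is only an illustration: so
that it compiles on its own, Record and SolrIndexerMixin are declared
here as bare stand-ins for org.marc4j.marc.Record and
org.solrmarc.index.SolrIndexerMixin, and the class ExampleFormatMixin,
the accessor leaderText() and both methods are hypothetical names,
not part of SolrMarc or marc4j.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for org.marc4j.marc.Record (assumption: the real interface is
// much richer; leaderText() is a hypothetical accessor for this sketch).
interface Record {
    String leaderText();
}

// Stand-in for org.solrmarc.index.SolrIndexerMixin, so the sketch is
// self-contained. In a real setup you would extend the SolrMarc class.
abstract class SolrIndexerMixin {
}

// A mixin groups related custom indexing functions in one class.
class ExampleFormatMixin extends SolrIndexerMixin {

    // Returns a single String and takes only the Record.
    // Leader position 06 holds the MARC 21 "type of record" code.
    public String getRecordTypeCode(final Record record) {
        String leader = record.leaderText();
        return leader.length() > 6 ? String.valueOf(leader.charAt(6)) : "";
    }

    // Returns a Set<String> and takes the Record plus one String parameter.
    public Set<String> getPrefixedRecordType(final Record record, String prefix) {
        Set<String> result = new LinkedHashSet<String>();
        result.add(prefix + getRecordTypeCode(record));
        return result;
    }
}
```

By analogy with the content_type_s line above, an index specification
would presumably reference the first method as
custom(ExampleFormatMixin), getRecordTypeCode, using whatever fully
qualified class name you give your own mixin.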
3) New Example 'Mixin' Custom Indexing Functions
The bulk of the code in this example was submitted by David Walker at
calstate.edu. It attempts to apply all of the arcane rules defined by
LOC for how the content type of an item is defined, as well as the
rules for how the media type of an item is defined.
Quoting David Walker:
But 'format' here actually encompassing (at least) two different concepts. RDA does a good job of delineating these, in my opinion, so I'm going to borrow it's terminology and talk about "content types" and "media types."
Content type is 'format' in terms of the nature of the contents. We can talk about Huckleberry Finn as a "book," the New York Times as a "newspaper," the Empire Strikes Back as a "movie" and Abbey Road as a "music recording." These are all content types.
Media type is 'format' in terms of the physical medium or carrier of the item. Huckleberry Finn might be available in "print" or as an "ebook." Your library likely has old issues of the New York Times in "microfilm" as well as access "online." The Empire Strikes Back was released on "Laser Disc," "VHS," "DVD," and now "Blu-Ray." And you might have Abbey Road on LP, tape, CD, and so on. These are all media types.
The main methods it defines are:
getContentType - get content type as described above
getMediaType - get media type as described above
getPrimaryContentType - get a single 'best' content type for an item
getPrimaryContentTypePlusOnline - get primary content type, and if item is available online add in appropriate additional types
getContentTypesAndMediaTypes - get content type and media type as described above, and if item is available online add in appropriate additional types
4) Support for reading and writing marc records in JSON
In the past either Binary Marc or MarcXML could be stored in the solr
index by specifying either:
marc_display = FullRecordAsMARC
or
marc_display = FullRecordAsXML
However, both options have drawbacks. A Binary Marc record contains
the binary character codes 0x1d, 0x1e and 0x1f, which are invalid in
XML and are sometimes translated to the character entities &#x1D;
&#x1E; and &#x1F;, which then have to be retranslated client side
before the binary Marc can be processed. Furthermore, binary Marc can
only be 99999 bytes long before it is invalid. Larger records can be
created, but these out-of-spec oversized records are often handled
differently by different Marc tools.
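The two binary Marc problems described above can be sketched in a few
lines of Java. This is only an illustration of the client-side work
involved, not code from SolrMarc or marc4j: the class
BinaryMarcChecks and both of its methods are hypothetical names, and
the sketch assumes the entities arrive in their full XML form
(&#x1D; &#x1E; &#x1F;, in either upper or lower case).

```java
// Hypothetical helper, for illustration only (not part of SolrMarc or marc4j).
final class BinaryMarcChecks {

    // The leader stores the record length in five decimal digits,
    // so 99999 bytes is the largest length it can express.
    static final int MAX_RECORD_LENGTH = 99999;

    // Turn the XML character entities back into the raw MARC delimiters:
    // 0x1d (record terminator), 0x1e (field terminator),
    // 0x1f (subfield delimiter).
    static String restoreDelimiters(String stored) {
        return stored
                .replaceAll("(?i)&#x1d;", "\u001d")
                .replaceAll("(?i)&#x1e;", "\u001e")
                .replaceAll("(?i)&#x1f;", "\u001f");
    }

    // True when the record, measured in bytes, stays within the limit.
    static boolean fitsBinaryMarc(byte[] rawRecord) {
        return rawRecord.length <= MAX_RECORD_LENGTH;
    }
}
```

A client would call restoreDelimiters on the stored field value before
handing it to a binary Marc parser, and an indexer could use
fitsBinaryMarc to decide whether a record is safe to store as binary
Marc at all.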
MarcXML has the drawback of being significantly larger than the
binary Marc representation of the same record, as well as usually
slower to process since the XML has to be parsed.
There is now a third (and fourth) option: encoding the Marc record in
JSON and storing that in the solr record, by specifying either:
marc_display = FullRecordAsJSON
which will encode the Marc record using the Marc-in-JSON scheme and
add the resulting encoded string to the solr index
(see http://dilettantes.code4lib.org/blog/category/marc-in-json/)
or
marc_display = FullRecordAsJSON2
which will encode the Marc record using the Marc-JSON scheme and
add the resulting encoded string to the solr index
(see http://www.oclc.org/developer/content/marc-json-draft-2010-03-11)
5) Includes updated version of Marc4j library
The SolrMarc release includes a version of marc4j (labelled
marc4j-2.5.beta.jar) that is essentially what will be released
shortly as marc4j.2.5.jar.
-Robert Haschart