Upcoming release of SolrMarc

Robert Haschart

Sep 28, 2011, 3:43:26 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
In the near future I plan to release the next version of SolrMarc. This release will include several important new features, which are listed below.
This message is also a call for votes for additional new features, either to be included in the release or to be added to a list for future development.

Already included new features:
  • Support for running with solr 3.1
    • Includes a backward-compatible solrj library that can talk to a solr server of version 3.x or version 1.4 using either xml or javabin
    • When communicating remotely, the StreamingUpdateSolrServer class can be used, which will chunk together record adds
    • Avoids class conflicts between the Unicode normalizer routines in its included normalizer.jar and any normalizer routines in an icu4j jar in use by the solr server
  • New custom indexing function mixin architecture
    • Instead of needing to derive a custom class from SolrIndexer and add all of the additional custom indexing methods there, you can now have several custom mixin classes, each derived from the new SolrIndexerMixin class, with one or more custom indexing methods being defined in each additional mixin class
    • A suite of improved getformat methods implemented via this mixin architecture (from David Walker at calstate.edu)
  • Support for reading and writing marc records in JSON including the ability to specify that the json encoding of the record be stored in the index rather than the raw binary Marc or Marc-XML
  • Improved classloading in one-jar wrapper code to avoid the need for temporary jar files, and to avoid an obscure class reference problem
  • Routines for retrieving bibliographic records from the HathiTrust Bibliographic API (described at http://www.hathitrust.org/bib_api ), which returns JSON data that contains, as one of its members, a MarcXML-encoded bibliographic record.
  • Tests for all of the above.


At approximately the same time, I plan to put together a new release of Marc4j.

It already includes the following features:
  • support for reading/writing both MarcJSON  and  MarcInJSON formats
  • support for non-Marc standard character encodings such as Big5 (assuming the correct java libraries are installed)
  • ability to look for and delete extraneous characters that occur between Marc records in a file of Marc records.  (for instance if CR LF is inserted between records.)
  • support for illegally long marc binary records
Before releasing it I may address the following issues:
  • Currently the default behavior is to sort the fields inside a Marc record. Many people have expressed dissatisfaction with this behavior. Unless strong opinions are voiced requesting that this continue to be the default behavior (including compelling reasons), I may change this default.
  • Updating the return types of methods from old-style Java raw types (e.g. List) to the newer parameterized types (e.g. List<VariableField>)
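
To illustrate what the return-type change in the last item would mean for calling code, here is a minimal Java sketch. It is not taken from Marc4j: the VariableField class below is a stripped-down stand-in that only carries a tag, and the method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class GenericsSketch {
    // Hypothetical stand-in for marc4j's VariableField, reduced to a tag.
    static class VariableField {
        private final String tag;
        VariableField(String tag) { this.tag = tag; }
        String getTag() { return tag; }
    }

    // Old style: a raw List says nothing about what it contains.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List rawFields() {
        List fields = new ArrayList();
        fields.add(new VariableField("245"));
        return fields;
    }

    // Newer style: the element type is part of the method signature.
    static List<VariableField> typedFields() {
        List<VariableField> fields = new ArrayList<VariableField>();
        fields.add(new VariableField("245"));
        return fields;
    }

    static String firstTagRaw() {
        // The cast is required and is only checked at run time.
        VariableField field = (VariableField) rawFields().get(0);
        return field.getTag();
    }

    static String firstTagTyped() {
        // No cast needed; the compiler verifies the element type.
        return typedFields().get(0).getTag();
    }

    public static void main(String[] args) {
        System.out.println(firstTagRaw());
        System.out.println(firstTagTyped());
    }
}
```

Callers written against the raw type keep compiling after such a change (with unchecked warnings), which is why it is usually a low-risk update.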


Demian Katz

Sep 28, 2011, 4:12:21 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com

This is great news – sounds like an exciting release!  When’s the cut-off for getting new features into the trunk?  I’d like to get the VuFind example up to date with the current VuFind trunk (though it’s not the end of the world if there isn’t time for that – there are only relatively insignificant changes).

The mixin functionality sounds interesting.  How does this affect the building of the examples?  Is any new documentation needed to accompany the feature?

thanks,
Demian

--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To post to this group, send email to solrma...@googlegroups.com.
To unsubscribe from this group, send email to solrmarc-tec...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/solrmarc-tech?hl=en.

Jonathan Rochkind

Sep 28, 2011, 4:19:13 PM
to blacklight-...@googlegroups.com, Robert Haschart, solrma...@googlegroups.com
Awesome, very excited about some of those changes.

If you could include documentation somewhere for "A suite of improved getformat methods implemented via this mixin architecture (from David Walker at calstate.edu)", that would be helpful -- sounds like something I'd like to use.

As far as the reading/writing JSON, which of the several methods that have been proposed for structuring MARC in JSON are used?
--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.

Robert Haschart

Sep 28, 2011, 5:18:11 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
My hope is to be able to release by the end of the week. I'd love to help upgrade the VuFind example to reflect the current trunk. In exchange, you can help test it against a version 3.x solr server. It should be possible to point it at a solr 3.x war and a solr 3.x config and have it just work, either directly or over http.

The mixin functionality doesn't affect anything that you do currently. It merely provides a different (and IMHO better) method of adding in new custom functions, and it makes sharing new bits of custom functionality between different sites easier. New documentation is needed to accompany the feature, and it still needs to be written.

-Bob Haschart

Robert Haschart

Sep 28, 2011, 5:21:52 PM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Jonathan Rochkind wrote:
As far as the reading/writing JSON, which of the several methods that have been proposed for structuring MARC in JSON are used?

It uses the functionality that I added to the soon-to-be-released marc4j, and can read either the MarcJSON format or the MarcInJSON format.
Additionally, by default it will write the MarcInJSON format, but it can be asked to write MarcJSON as well.

-Bob Haschart

Demian Katz

Sep 29, 2011, 8:58:22 AM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
I don't think there's actually too much that needs to be done to update the VuFind example -- my plan is to make sure the config files match up with the VuFind trunk and to port a couple of new BeanShell scripts over to Java. I should have that done by the end of the day today (though don't let me hold you up if you need to move forward sooner than that). For now, VuFind is still using Solr 1.4.1, so that's what I'll be testing against during my development today. I don't think I'll have time to upgrade VuFind to Solr 3.3 until next week, but I'll definitely let you know how things work out once I get that far.

- Demian

Robert Haschart

Oct 10, 2011, 5:53:36 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
Release Notes for version 2.3 of SolrMarc

The main new features included in this release are described below:

1)  Support for Running with Solr 3.1

Although SolrMarc was designed to be agnostic w.r.t. the version of Solr that it is running against, changes in Solr starting with version 3.1 caused this to no longer be the case. Specifically, changes in how a single core is loaded in local solr mode, and changes in the javabin protocol used in remote solr mode, meant that SolrMarc did not function with Solr 3.x.

This version of SolrMarc includes a custom version of the Solrj library that is backward compatible, so that SolrMarc can talk to Solr servers of version 3.x or version 1.4 using either xml or javabin communication, choosing the correct version of javabin for the version of the server it is talking to.

Additionally, when communicating remotely, the StreamingUpdateSolrServer class can be used, which will chunk together record adds (which should speed up adding records to a remote Solr server). Furthermore, for optimization purposes, the use of StreamingUpdateSolrServer and of javabin communication is now the default.

2)  New Custom Indexing Function 'Mixin' Architecture

Previously you could either use the standard custom functions that are provided as a part of the class SolrIndexer, or you could define your own subclass that extended SolrIndexer and define custom functions of your own there. The drawback was that, as you added more custom functions, the class defining them could grow quite large and unwieldy, and if someone else created a new custom function it would be hard for you to find and hard for you to add to your implementation. Furthermore, if you first developed a Beanshell script to implement a custom indexing function and subsequently wanted to migrate that function to a compiled method, the process could be a little tricky. With the Mixin architecture, that process should be easier.

How Does it Work?

You can see the class org.solrmarc.index.GetFormatMixin for an example. You define a new class that extends the class org.solrmarc.index.SolrIndexerMixin, and in this new class you define one or more public functions that return either a String or a Set<String> and take as parameters a Record object and zero or more Strings. For example:

    /**
     * Return the content types and media types, plus electronic, for this record
     *
     * @param record   -  MARC Record
     * @return Set of Strings of content types and media types
     */
    public Set<String> getContentTypesAndMediaTypes(final Record record)
    {
        // Start from the content types, then merge in the media types
        Set<String> formats = getContentTypes(record);
        formats.addAll(getMediaTypes(record));
        // Add the online/electronic types where the record calls for them
        formats = addOnlineTypes(record, formats);
        return formats;
    }

and then in your indexing specification you can access the above method like this:

content_type_s = custom(org.solrmarc.index.GetFormatMixin), getContentTypesAndMediaTypes

where you specify the class the method is in, in parentheses following the custom keyword (similar to what is done for Beanshell scripts).

The main benefit of this new architecture is increased modularity.  You can group methods dealing with the format of items together in one file and place methods that handle call number processing in another.  Furthermore if someone else defines a group of functions that you find useful you can take that source, or the compiled class or even the entire jar they've created, and include it in your configuration, and reference the new indexing functions as shown above, and reindex the affected records.  
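
As a rough, self-contained illustration of the shape of a mixin class, here is a sketch in Java. It does not use the real SolrMarc or marc4j classes: Record and SolrIndexerMixin below are minimal stand-ins so the example compiles on its own, and LeaderTypeMixin and getRecordTypeLabels are hypothetical names invented for the example. The only rule it relies on is the one described above: a public method that takes a Record and returns a Set<String>.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class MixinSketch {
    // Stand-in for the marc4j Record type: only what this sketch needs.
    interface Record {
        String getLeader();
    }

    // Stand-in for org.solrmarc.index.SolrIndexerMixin (the real class is not reproduced here).
    static abstract class SolrIndexerMixin { }

    // Hypothetical mixin grouping methods that deal with the record type.
    static class LeaderTypeMixin extends SolrIndexerMixin {
        // Public, takes a Record, returns a Set<String>.
        public Set<String> getRecordTypeLabels(final Record record) {
            Set<String> labels = new LinkedHashSet<String>();
            String leader = record.getLeader();
            // Leader position 06 holds the MARC 21 "type of record" code.
            char type = (leader != null && leader.length() > 6) ? leader.charAt(6) : ' ';
            switch (type) {
                case 'a': labels.add("Language material"); break;
                case 'j': labels.add("Musical sound recording"); break;
                default:  labels.add("Other"); break;
            }
            return labels;
        }
    }

    // Helper that builds a stand-in Record from a leader string.
    static Record recordWithLeader(final String leader) {
        return new Record() {
            public String getLeader() { return leader; }
        };
    }

    public static void main(String[] args) {
        Record record = recordWithLeader("00000nam a2200000 a 4500");
        System.out.println(new LeaderTypeMixin().getRecordTypeLabels(record));
    }
}
```

If a class like this were built against the real SolrIndexerMixin, it would presumably be referenced from the indexing specification in the same way as the GetFormatMixin example, with its own fully qualified class name inside the parentheses after custom.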

3)  New Example 'Mixin'  Custom Indexing Functions

The bulk of the code in this example was submitted by David Walker at calstate.edu. It attempts to apply all of the arcane rules defined by LOC for how the content type of an item is defined, as well as the rules for how the media type of an item is defined.
Quoting David Walker:
But 'format' here actually encompasses (at least) two different concepts.  RDA does a good job of delineating these, in my opinion, so I'm going to borrow its terminology and talk about "content types" and "media types."

Content type is 'format' in terms of the nature of the contents.  We can talk about Huckleberry Finn as a "book," the New York Times as a "newspaper," the Empire Strikes Back as a "movie" and Abbey Road as a "music recording."  These are all content types.

Media type is 'format' in terms of the physical medium or carrier of the item.  Huckleberry Finn might be available in "print" or as an "ebook."  Your library likely has old issues of the New York Times in "microfilm" as well as access "online."  The Empire Strikes Back was released on "Laser Disc," "VHS," "DVD," and now "Blu-Ray."  And you might have Abbey Road on LP, tape, CD, and so on.  These are all media types.

The main methods it defines are:
getContentType                  - get content type as described above
getMediaType                    - get media type as described above
getPrimaryContentType           - get a single 'best' content type for an item
getPrimaryContentTypePlusOnline - get primary content type, and if the item is available online, add in appropriate additional types
getContentTypesAndMediaTypes    - get content type and media type as described above, and if the item is available online, add in appropriate additional types

4) Support for reading and writing marc records in JSON

In the past either Binary Marc or MarcXML could be stored in the solr index by specifying either:

marc_display = FullRecordAsMARC
or
marc_display = FullRecordAsXML

However, both options have drawbacks. A Binary Marc record contains the binary character codes 0x1d, 0x1e and 0x1f, which are invalid in XML and which are sometimes translated into the character entities &#x1D;, &#x1E; and &#x1F;, which then have to be translated back on the client side before the binary Marc can be processed. Furthermore, binary Marc can only be 99999 bytes long before it is invalid. Larger records can be created, but these out-of-spec oversized records are often handled differently by different Marc tools.

MarcXML has the drawback of being significantly larger than the binary Marc representation of the same record, as well as usually slower to process since the XML has to be parsed. 
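
To make the two binary Marc drawbacks concrete, here is a small Java sketch of what a client might have to do. It is not SolrMarc or marc4j code; the class and method names are invented for the example, and it assumes the control characters arrive as standard XML numeric character references (&#x1D; and so on, in either letter case).

```java
import java.nio.charset.StandardCharsets;

class BinaryMarcSketch {
    // Control characters used inside a binary Marc record.
    static final char RECORD_TERMINATOR = 0x1D;
    static final char FIELD_TERMINATOR = 0x1E;
    static final char SUBFIELD_DELIMITER = 0x1F;

    // The record length is stored in five decimal digits, hence the limit.
    static final int MAX_RECORD_LENGTH = 99999;

    // Turn the escaped entities back into the raw control characters.
    static String restoreControlChars(String escaped) {
        return escaped
            .replaceAll("(?i)&#x1D;", String.valueOf(RECORD_TERMINATOR))
            .replaceAll("(?i)&#x1E;", String.valueOf(FIELD_TERMINATOR))
            .replaceAll("(?i)&#x1F;", String.valueOf(SUBFIELD_DELIMITER));
    }

    // True if the record, encoded as UTF-8, stays within the binary Marc limit.
    static boolean fitsInBinaryMarc(String record) {
        return record.getBytes(StandardCharsets.UTF_8).length <= MAX_RECORD_LENGTH;
    }

    public static void main(String[] args) {
        String restored = restoreControlChars("&#x1F;aTitle&#x1E;&#x1D;");
        // Print lengths rather than the raw control characters.
        System.out.println(restored.length());
        System.out.println(fitsInBinaryMarc(restored));
    }
}
```

Storing the record as JSON sidesteps both steps, since the JSON encodings have neither the control characters nor the five-digit length field.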

There is now a third (and fourth) option: encoding the Marc record in JSON and storing that in the solr record.

marc_display = FullRecordAsJSON

This will encode the Marc record using the Marc-in-JSON scheme, and add the resulting encoded string to the solr index
(see http://dilettantes.code4lib.org/blog/category/marc-in-json/)

or

marc_display = FullRecordAsJSON2

This will encode the Marc record using the Marc-JSON scheme, and add the resulting encoded string to the solr index
(see http://www.oclc.org/developer/content/marc-json-draft-2010-03-11)
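
For readers who have not seen the first of these encodings, the following Java sketch hand-builds a tiny record in what I understand to be the Marc-in-JSON layout (a "leader" string plus a "fields" array, where each field is a one-key object). It is only meant to show the shape of the output. It is not how marc4j produces it, the class and method names are made up, and the escaping is deliberately minimal (quotes and backslashes only).

```java
class MarcInJsonSketch {
    // Minimal JSON string quoting: handles backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Control field, e.g. {"001":"12345"}
    static String controlField(String tag, String value) {
        return "{" + quote(tag) + ":" + quote(value) + "}";
    }

    // Data field with indicators and subfields given as code/value pairs.
    static String dataField(String tag, char ind1, char ind2, String... codeValuePairs) {
        StringBuilder sb = new StringBuilder();
        sb.append("{").append(quote(tag)).append(":{");
        sb.append("\"ind1\":").append(quote(String.valueOf(ind1))).append(",");
        sb.append("\"ind2\":").append(quote(String.valueOf(ind2))).append(",");
        sb.append("\"subfields\":[");
        for (int i = 0; i + 1 < codeValuePairs.length; i += 2) {
            if (i > 0) sb.append(",");
            sb.append("{").append(quote(codeValuePairs[i])).append(":")
              .append(quote(codeValuePairs[i + 1])).append("}");
        }
        return sb.append("]}}").toString();
    }

    // Whole record: leader plus the already-encoded fields.
    static String record(String leader, String... fields) {
        return "{\"leader\":" + quote(leader) + ",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        System.out.println(record("00000nam a2200000 a 4500",
            controlField("001", "12345"),
            dataField("245", '1', '0', "a", "Huckleberry Finn")));
    }
}
```

In practice you would let SolrMarc or marc4j do the encoding; a sketch like this is mainly useful for recognising the stored value when you look at it in the index.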


5) Includes updated version of Marc4j library 

The SolrMarc release includes a version of marc4j (labelled marc4j-2.5.beta.jar) that is essentially what will be released shortly as marc4j.2.5.jar.



-Robert Haschart


Demian Katz

Oct 11, 2011, 8:21:47 AM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com

This is a very exciting release – thanks for all your hard work on it!  I’ll work on incorporating this into VuFind in the next week or two, and I’ll report back when the work is done.

- Demian

Robert Haschart

Oct 14, 2011, 3:16:29 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
The recently released SolrMarc 2.3 has a serious problem: commits to a
local solr index set the expungeDeletes flag, which causes a segment
merge that can be nearly as expensive as an index optimize.
Furthermore, changes in the defaults for certain configuration
properties cause the above behavior to be chosen by default. At UVA the
processing time for our nightly updates jumped from about 30 minutes (of
which about 20 minutes is the index optimize) to about 2 hr 30 minutes.

The error has been fixed and an updated version has been released.
If you have recently downloaded a copy of SolrMarc version 2.3, discard
it and download a copy of the updated release, SolrMarc version 2.3.1.


-Robert Haschart


Demian Katz

Oct 14, 2011, 3:30:56 PM
to solrma...@googlegroups.com, vufind-...@lists.sourceforge.net, blacklight-...@googlegroups.com
I've updated the VuFind trunk to v2.3.1 -- thanks for the quick fix!

- Demian
________________________________________
From: solrma...@googlegroups.com [solrma...@googlegroups.com] On Behalf Of Robert Haschart [rh...@virginia.edu]
Sent: Friday, October 14, 2011 3:16 PM

Subject: [solrmarc-tech] Version 2.3.1 of SolrMarc released


-Robert Haschart

