Marc21 binary, Solr 4 stored field?


Jonathan Rochkind

May 14, 2013, 12:10:32 PM
to blacklight-...@googlegroups.com
Is anyone storing Marc21 binary in a Solr stored field?

I am investigating upgrading my Solr 1.4 to a modern Solr 4.3.

In Solr 1.4, somehow the binary marc Just Worked. It was stored in the
index, and it came back in such a way that Blacklight (3.5.0) fetched a
ruby string from Solr that had correct binary bytes in it, that was then
turned into correct Marc by ruby Marc::Record.

Upgrading to Solr 4.3, the stored field is coming back with certain
control characters escaped, like, for instance #30; #31; (This isn't
quite XML escaping because it lacks the ampersand, right? I'm not sure
exactly what escaping it is).

I can theoretically transform these back to correct binary bytes
client-side. Although I'm not sure what will happen with _literal_ eg
'#30;'s in marc values. But I can work on it.

But I'm not sure exactly what changed, or what the correct place to fix
this is. There's lots of moving parts. Solr configuration, the indexing
step, then the Blacklight app, which includes (in my app)
Rsolr/RSolr::Ext as well as blacklight code, etc.

I'm curious if anyone else has dealt with this before, if it's been
fixed in more recent versions of BL and if so how, etc. To keep me from
re-inventing the wheel if it doesn't need re-inventing. At this point I
will not be upgrading past BL 3.5.0 (one thing at a time), but if there
are solutions already in future BL's, I can look at em to copy their
logic etc.

Thanks for any ideas!

Jonathan

Justin Coyne

May 14, 2013, 12:17:02 PM
to Blacklight Development
I'm guessing this is due to ruby forcing the byte stream returned from solr into a UTF-8 string. Can you tell if the values look correct in the index? Are they just corrupted when pulled out with RSolr?

-Justin


--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blacklight-development+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



Jonathan Rochkind

May 14, 2013, 12:22:15 PM
to blacklight-...@googlegroups.com, Justin Coyne
I am still trying to figure out exactly what's going on; there are lots
of moving parts.

There isn't necessarily a clear answer to whether they are "correct in the
index", since the old version of my app (and most any other BL app using
default logic and Solr 1.4) that worked with Solr 1.4 was doing
something not actually documented to work with Solr in the first place.
I'm not sure what "correct in the index" even is, I just know it used to
somehow work.

So, yeah, there's some debugging to be done to get to the bottom of it.

But something must have changed in either Solr's index or Solr's
response with regard to the binary data stuffed in a way Solr does not
promise to support in the first place into a Solr stored field, heh --
nothing else has changed in my app. My old 'working' app based on Solr
1.4 was also using ruby 1.9. I have not changed my app at all, I have
only changed Solr. So it seems unlikely it's a ruby encoding issue,
since the ruby parts are unchanged (and I'm not getting any "illegal
bytes for UTF8 errors").
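
The encoding question can be checked in isolation. The sketch below is plain Ruby with no Solr involved, and the sample bytes are made up for illustration. It suggests that the MARC delimiters 0x1E and 0x1F are single-byte ASCII control characters, so tagging such a string as UTF-8 leaves the bytes untouched and does not by itself produce text like "#30;". (`String#b` needs Ruby 2.0 or later.)

```ruby
# Hypothetical sample: a fragment of binary MARC containing a subfield
# delimiter (0x1F) and a field terminator (0x1E).
raw = "245\x1Fa Title\x1E".b            # ASCII-8BIT (binary) string

# Re-tag the same bytes as UTF-8, as a client library might do.
as_utf8 = raw.dup.force_encoding("UTF-8")

same_bytes = as_utf8.bytes == raw.bytes   # force_encoding never rewrites bytes
valid_utf8 = as_utf8.valid_encoding?      # control chars below 0x80 are valid UTF-8
has_escape = as_utf8.include?("#30;")     # no textual escape appears

puts "same bytes: #{same_bytes}, valid UTF-8: #{valid_utf8}, escaped: #{has_escape}"
```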

Mostly I'm just curious if anyone knows if this has already been dealt
with in later versions of BL than the one I'm using, or if anyone has
already dealt with this in their local app.

Anyone using binary Marc21 data in a Solr stored field, with SolrMarc
indexing, and Solr 4.x?

Jonathan Rochkind

May 14, 2013, 1:24:38 PM
to blacklight-...@googlegroups.com, Justin Coyne
My _suspicion_ at the moment is that SolrMarc was/is written to 'take
advantage' of a bug in Solr 1.4 that is not present in future versions.
It looks like SolrMarc _might_ be 'escaping' binary data itself, in a
way that Solr 1.4 erroneously or in an undocumented way 'unescapes' on
the way out.

But I'm not sure about this, still engaging in detective work.

For myself, for now, the easiest thing might be to simply write some
code at the app level to 'unescape' those #xx; escape codes. It may
accidentally unescape some _literal_ "#xx;" too, depending on what
SolrMarc does with those exactly. That might be good enough for now.

Long term, once I have Solr upgraded, I would like to find time to
switch away from SolrMarc to another indexing solution, based on Bill
Dueber's work in jruby -- I just don't understand Java enough to work
effectively with SolrMarc sourcecode. If/when I have an indexing
solution I understand better, I also would like to switch away from
storing binary Marc21, to storing marc-in-json instead.

Still curious to hear from anyone that is using the combo of BL,
SolrMarc, Solr 4.x, and binary Marc21, however.



Jonathan Rochkind

May 14, 2013, 4:53:16 PM
to blacklight-...@googlegroups.com, Justin Coyne
Weirdly, in both Solr 1.4 and 4.3, if I look directly at the Solr
response, binary Marc control codes are represented as #30; and #31;.

(Apparently this is escaping that SolrMarc does).

But when my code is pointed at Solr 1.4, if I ask the
Blacklight::SolrDocument for the stored field value, it comes back with
actual binary control values.

If my code is pointed at Solr 4.3, if I ask the Blacklight::SolrDocument
for the stored field value, it comes back still with #30; in it.

In both cases, I have the _exact_ same ruby code, same version of BL, etc.

I have no idea what's going on. I'm not sure how it gets _unescaped_
when pointing at Solr 1.4 (something in RSolr? In BL?), or why this
stops working when it's pointed at Solr 4.3.

The whole thing is very confusing, I had no idea SolrMarc was even doing
this weird as heck escaping.

While I hypothetically could debug it, and figure out what's going on...
I'm not that excited about debugging RSolr-related stuff that more
recent Blacklights don't even use. I'll probably just put a local
hack of some kind into my own app: override the MarcExtension load_marc
method to forcibly "unescape" the #30; and #31; codes myself.
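
A minimal sketch of what such a local hack might look like, assuming the only escapes that matter are the three MARC structural control characters (#29; record terminator, #30; field terminator, #31; subfield delimiter). The method name `unescape_marc_controls` is made up for illustration and is not part of Blacklight or RSolr. As noted earlier in the thread, it will also rewrite any _literal_ "#30;" that happens to appear in a MARC value. (`String#b` needs Ruby 2.0 or later.)

```ruby
# Matches only the escaped forms of the MARC control characters
# 0x1D (29), 0x1E (30) and 0x1F (31).
MARC_CONTROL_ESCAPE = /#(29|30|31);/

# Hypothetical helper: turn "#30;"-style escapes back into the raw
# control bytes that binary MARC expects.
def unescape_marc_controls(raw)
  # Work on a binary copy so odd byte sequences in the record cannot
  # raise encoding errors during the substitution.
  bytes = raw.b.gsub(MARC_CONTROL_ESCAPE) { Regexp.last_match(1).to_i.chr }
  # Restore the encoding label the caller handed in.
  bytes.force_encoding(raw.encoding)
end

# Intended use (not run here): pass the result to the MARC parser, e.g.
#   MARC::Record.new_from_marc(unescape_marc_controls(stored_value))
```

Restricting the pattern to 29, 30 and 31 keeps the substitution narrow, so other "#nn;" text in a record is left alone.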

Robert J. Haschart

May 14, 2013, 6:02:46 PM
to blacklight-...@googlegroups.com, Jonathan Rochkind, Justin Coyne
It is not SolrMarc that is doing the escaping. It is something in the SolrJ
code or in the Solr code itself. SolrMarc has special code to try to force
Solr/SolrJ to _not_ do this escaping. If you are doing direct index access
(which I think I remember you are not) the SolrMarc special code makes
binary writes work. If you are doing HTTP access and the server can
accept binary writes (which 4.x should all do by default), the special case
code is not used and is not needed.

Your test where you looked directly at the Solr response was almost right.
You have to specify a different writer type in the request (wt=json
should work); otherwise the XML writer does escaping if the data hasn't
already been escaped.
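
One way to run that comparison is sketched below, under assumptions: the base URL, the document id, and the field name `marc_display` are placeholders, and both helper methods are made up for illustration. Only the standard Solr request parameters `q`, `fl` and `wt` are used. The first helper builds the request URL for a given writer type; the second reports whether a returned value holds real control bytes or the "#30;"-style escapes.

```ruby
require "uri"

# Build a Solr select URL that fetches one stored field of one document
# using the given response writer (wt). All arguments are placeholders.
def solr_select_uri(base_url, id, field, wt)
  uri = URI.parse("#{base_url.chomp('/')}/select")
  uri.query = URI.encode_www_form(q: "id:#{id}", fl: field, wt: wt)
  uri
end

# Classify a stored value: real MARC control bytes, textual escapes, or neither.
def marc_delimiter_form(value)
  bytes = value.b
  return :binary if bytes.match?(/[\x1D-\x1F]/)
  return :escaped if bytes.match?(/#(?:29|30|31);/)
  :none
end

# Print one URL per writer type so the two raw responses can be compared.
%w[xml json].each do |wt|
  puts solr_select_uri("http://localhost:8983/solr", "some_id", "marc_display", wt)
end
```

Fetching both URLs and running `marc_delimiter_form` on the field value from each response should show whether the escaping comes from the response writer or is already in the stored data.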

Because the signature of some Solr methods that special SolrMarc code relied
on changed as of Solr 4.0, I recently needed to disable binary writes in the
direct index access so that Demain and VuFind could work with Solr 4.x. It
may be that I also disabled the binary writes via HTTP when talking to Solr
4.x at the same time. I'll check on this tomorrow.

-Bob

Dan Davis

Aug 28, 2013, 3:10:09 PM
to blacklight-...@googlegroups.com, Justin Coyne

On Tuesday, May 14, 2013 1:24:38 PM UTC-4, Jonathan Rochkind wrote:
> Long term, once I have Solr upgraded, I would like to find time to
> switch away from SolrMarc to another indexing solution, based on Bill
> Dueber's work in jruby -- I just don't understand Java enough to work
> effectively with SolrMarc sourcecode. If/when I have an indexing
> solution I understand better, I also would like to switch away from
> storing binary Marc21, to storing marc-in-json instead.
>
> Still curious to hear from anyone that is using the combo of BL,
> SolrMarc, Solr 4.x, and binary Marc21, however.

I have had some problems using FullRecordAsMarc with Solr 3.6.1 - my problems are likely different, however.   Are you using SolrMarc to update the Lucene index directly, or are you sending the FullRecordAsMARC over the wire to Solr?   The latter did not work for me and I haven't yet tried the former.

For your long-term plan, one solution might be to use the marc4j library (on which SolrMarc depends) to convert to MARCXML, and then use XSLT/XPath to get the fields you want, using either the XsltUpdateRequestHandler or the DataImportHandler with XSL. It is important to tell marc4j whether the MARC has MARC8 encoding or something else (via a Java property setting). This all presupposes you are more comfortable with XSLT and XPath than Java. My intuition is that running an interpreted language in Java (XSLT) will take longer than running Java (SolrMarc), and so indexing will not be as fast, but your mileage may vary.



