HTTP Posting raw MARC to Solr - control characters

Stephen Meyer

Jan 20, 2010, 11:42:38 AM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Apologies for cross posting. Also this might be more appropriate for a
Solr list, but I thought I'd check if this is just a simple stupid thing
I am doing before trying to explain MARC outside of libraries...

I have been trying to get a multithreaded implementation of our Solr
index build script working, per Erik's suggestions. One problem I run
into is that Solr returns errors on the raw MARC streams that I HTTP
POST in Solr's add doc XML format.

Since (as we all know) MARC uses the control characters 0x1d, 0x1e and
0x1f to separate records, fields and subfields, the raw MARC stream
contains characters that are not valid in XML (they are valid UTF-8, but
XML 1.0 does not allow them). So I try to use the appropriate character
references according to the XML spec
(http://www.w3.org/TR/2008/REC-xml-20081126/#sec-references) and escape
them (e.g. &#x1e; or &#30;), but Solr still complains.
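For anyone who wants to reproduce this outside Solr, here is a minimal sketch in Python. It assumes the standard library's expat-based parser behaves like Woodstox on this point, which seems reasonable since both enforce the XML 1.0 character rules. The helper name `parses_as_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(doc: str) -> bool:
    """Return True if the string is well-formed XML 1.0, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A character reference to TAB (0x09) is legal in XML 1.0.
tab_ref_ok = parses_as_xml('<field name="raw_marc">a&#x9;b</field>')

# A character reference to the MARC field terminator (0x1e) is rejected,
# even though it is written as an escape rather than as a raw byte.
marc_ref_ok = parses_as_xml('<field name="raw_marc">a&#x1e;b</field>')

# The raw, unescaped control character is rejected as well.
marc_raw_ok = parses_as_xml('<field name="raw_marc">a\x1eb</field>')

print(tab_ref_ok, marc_ref_ok, marc_raw_ok)
```

If this behaves the same way for you, the escaping itself is probably not the issue: the parser refuses the character whichever way it is written.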

I know that the raw streams can eventually make their way to the Lucene
index because using an EmbeddedSolrServer works just fine when building
Solr Document objects.

A simple sample XML w/ character references is below, along with the
Solr log stack trace. Is the problem here simply that the Woodstox XML
library doesn't like the control characters, or am I not actually
escaping them properly?

Thanks,
-Steve

p.s. We don't have numbers to contribute to the indexing time discussion
yet because we are waiting on production server VMs to be set up. In a
test environment with older Solaris hardware we index 8 million bib
records in about 6 hours using an embedded Solr server w/ 2GB of memory
allocated. But this hardware is crazy slow when extracting the raw bib
records from our Voyager catalog, so we are hoping we can speed this up
on different hardware.

<add>
<doc>
<field name="id">1234</field>
<field name="raw_marc">00641nam a2200229Ia
45x0001001200000005001700012008004100029010001700070035001600087040001300103049000900116090002600125100002800151245005200179250001200231260006300243300002100306500001100327546001500338852004600353994001200399&#x1e;ocm52825831&#x1e;20030811140653.0&#x1e;030811s1977
ii 000 f tel d&#x1e; &#x1f;a 78900503 &#x1e;
&#x1f;a(WU)6065136&#x1e; &#x1f;aGZM&#x1f;cGZM&#x1e; &#x1f;aGZMA&#x1e;
&#x1f;aPL4780.9.J53&#x1f;bP7 1977&#x1e;1 &#x1f;aJhānsīrāṇi,
Es.&#x1e;10&#x1f;aPrakr̥ti dāhaṃ /&#x1f;cYas. Jhānsīrāṇi.&#x1e;
&#x1f;a1st ed.&#x1e; &#x1f;aVijayavāḍa :&#x1f;bVijayaśrī Pabliṣiṅg
Haus,&#x1f;c1977.&#x1e; &#x1f;a199 p. ;&#x1f;c20 cm.&#x1e;
&#x1f;aNovel.&#x1e; &#x1f;aIn
Telugu.&#x1e;00&#x1f;aWU&#x1f;bm,stk&#x1f;hPL4780.9 J53&#x1f;iP7
1977&#x1f;x7069850&#x1e; &#x1f;aX0&#x1f;bGZM&#x1e;&#x1d;</field>
</doc>
</add>

SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x1e) not a valid XML character
 at [row,col {unknown-source}]: [4,259]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x1e) not a valid XML character
 at [row,col {unknown-source}]: [4,259]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2400)
    at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2346)
    at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1205)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4677)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 26 more

--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
312F Memorial Library
728 State St.
Madison, WI 53706

sme...@library.wisc.edu
608-265-2844 (ph)


"Just don't let the human factor fail to be a factor at all."
- Andrew Bird, "Tables and Chairs"

Brad Dewar

Jan 21, 2010, 9:11:21 AM
to solrma...@googlegroups.com, blacklight-...@googlegroups.com

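XML predefines only 5 named entities (the ones for the characters that make up the XML syntax: < > & ' "). Numeric character references (&#29;, &#30;, &#31;) are a different mechanism and need no DTD declaration, but XML 1.0 only allows them to expand to legal XML characters, and below 0x20 that means just tab, line feed and carriage return. So 0x1d, 0x1e and 0x1f can't appear in Solr's add doc XML in any form, escaped or not, which is what the Woodstox error is telling you.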

VuFind's approach to this (in RC1 at least) was to use its own escape mechanism to store the characters in Solr, and str_replace them back on its own when it reads a binary MARC record. E.g. it escapes 0x1d with '#29#' (or something like that -- I don't remember exactly -- anything but '&#29;' really). Not very elegant or efficient, and a little error-prone, but very easy to do.
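A rough sketch of that kind of placeholder escaping, in Python rather than PHP. The tokens '#29#', '#30#' and '#31#' and the function names are hypothetical, chosen only to match the description above; this is not VuFind's actual code. The error-prone part is visible here: a record that already contains the literal text '#30#' would be altered on the way back out.

```python
# Hypothetical placeholder tokens for the three MARC control characters.
# Assumption: these are illustrative, not the tokens VuFind really uses.
MARC_CONTROL_TOKENS = {
    "\x1d": "#29#",  # record terminator
    "\x1e": "#30#",  # field terminator
    "\x1f": "#31#",  # subfield delimiter
}


def escape_marc(raw: str) -> str:
    """Replace MARC control characters with XML-safe placeholder tokens."""
    for char, token in MARC_CONTROL_TOKENS.items():
        raw = raw.replace(char, token)
    return raw


def unescape_marc(stored: str) -> str:
    """Restore the control characters after reading the value back from Solr."""
    for char, token in MARC_CONTROL_TOKENS.items():
        stored = stored.replace(token, char)
    return stored


sample = "\x1eocm52825831\x1e  \x1faGZM\x1fcGZM\x1e\x1d"
escaped = escape_marc(sample)
restored = unescape_marc(escaped)
print(escaped)
```

The escaped string still needs the usual XML escaping of & and < before it goes into the add doc.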

Plenty of other ways around it, too. For example, you could GET/POST data to and from Solr in a format other than XML. Solr offers plenty of output writers, and more than a few input formats as well.
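As one illustration of the non-XML route, here is a hedged sketch that builds a JSON update body in Python. It assumes a Solr version whose update handler accepts JSON documents, which you would need to check for your install. The field names `id` and `raw_marc` are taken from the sample in the original post, and the function name `build_json_update` is made up. JSON can carry these characters because it escapes them as \u001e and so on, and a JSON parser turns them back into the real characters.

```python
import json


def build_json_update(doc_id: str, raw_marc: str) -> str:
    """Serialize one document as a JSON array of documents.

    json.dumps escapes control characters below 0x20 as \\uXXXX sequences,
    so the MARC delimiters survive without any custom escaping.
    """
    return json.dumps([{"id": doc_id, "raw_marc": raw_marc}])


raw = "\x1eocm52825831\x1e  \x1faGZM\x1fcGZM\x1e\x1d"
body = build_json_update("1234", raw)

# Round trip: parsing the body gives back the original control characters.
decoded = json.loads(body)[0]["raw_marc"]
print(body)
```

The body would then be POSTed with a Content-Type of application/json instead of text/xml. The exact update URL depends on your Solr setup, so it is left out here.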

Brad