I have been trying to get a multithreaded implementation of our Solr
index build script working, per Erik's suggestions. One problem I have
run into is that Solr returns errors on the raw MARC streams that I HTTP
POST in Solr's add-doc XML format.

Since (as we all know) MARC uses the control characters 0x1d, 0x1e and
0x1f to separate records, fields and subfields, the raw MARC stream
contains characters that are not valid in XML. So I tried to escape them
using the appropriate character references according to the XML spec
(http://www.w3.org/TR/2008/REC-xml-20081126/#sec-references), in either
hex or decimal form (e.g. &#x1e; or &#30;), but Solr still complains.
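
For reference, a minimal sketch of where these three code points stand.
It checks them against the Char production of XML 1.0 (section 2.2); the
same spec's "Legal Character" constraint requires a character reference
to resolve to a character matching Char, so escaping alone may not be
enough. The names is_xml10_char and MARC_DELIMITERS are made up for this
illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# The three MARC delimiters mentioned above (hypothetical constant name).
MARC_DELIMITERS = {
    "record terminator": 0x1D,
    "field terminator": 0x1E,
    "subfield delimiter": 0x1F,
}

for name, cp in MARC_DELIMITERS.items():
    utf8_bytes = chr(cp).encode("utf-8")  # encodes without error
    print(f"0x{cp:02x} ({name}): UTF-8 bytes {utf8_bytes!r}, "
          f"XML 1.0 Char: {is_xml10_char(cp)}")
```

Each delimiter encodes to a single UTF-8 byte without complaint, so the
encoding itself is not what the check objects to; it is the XML 1.0
character range that excludes them.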

I know that the raw streams can eventually make their way into the
Lucene index, because building Solr Document objects with an
EmbeddedSolrServer works just fine.

A simple sample XML document with character references is below, along
with the stack trace from the Solr log. Is the problem here simply that
the Woodstox XML library doesn't like the control characters, or am I
not actually escaping them properly?
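
One way to narrow this down is to feed a similar payload to an unrelated
parser. The sketch below uses Python's standard xml.etree.ElementTree
(backed by expat), not Woodstox, so it only shows how another XML 1.0
parser treats the same references. The field content is a shortened,
made-up stand-in for a MARC record, and try_parse is a hypothetical
helper.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text: str) -> str:
    """Parse xml_text and report whether the parser accepted it."""
    try:
        ET.fromstring(xml_text)
        return "ok"
    except ET.ParseError as exc:
        return f"rejected: {exc}"


samples = {
    "hex reference to 0x1e": '<field name="raw_marc">a&#x1e;b</field>',
    "decimal reference to 0x1e": '<field name="raw_marc">a&#30;b</field>',
    "raw 0x1e character": '<field name="raw_marc">a\x1eb</field>',
    # Tab is one of the few control characters XML 1.0 allows.
    "hex reference to tab": '<field name="raw_marc">a&#x9;b</field>',
}

for label, xml_text in samples.items():
    print(f"{label}: {try_parse(xml_text)}")
```

If a second parser also refuses the 0x1e references while accepting the
tab reference, that points at the XML 1.0 character rules rather than at
Woodstox or at the escaping syntax.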
Thanks,
-Steve

P.S. We don't have numbers to contribute to the indexing time discussion
yet because we are waiting on production server VMs to be set up. In a
test environment with older Solaris hardware, we index 8 million bib
records in about 6 hours using an embedded Solr server with 2GB of
memory allocated. But this hardware is crazy slow at extracting the raw
bib records from our Voyager catalog, so we are hoping we can speed this
up on different hardware.

<add>
<doc>
<field name="id">1234</field>
<field name="raw_marc">00641nam a2200229Ia
45x0001001200000005001700012008004100029010001700070035001600087040001300103049000900116090002600125100002800151245005200179250001200231260006300243300002100306500001100327546001500338852004600353994001200399ocm5282583120030811140653.0030811s1977
ii 000 f tel d a 78900503 
a(WU)6065136 aGZMcGZM aGZMA
aPL4780.9.J53bP7 19771 aJhānsīrāṇi,
Es.10aPrakr̥ti dāhaṃ /cYas. Jhānsīrāṇi.
a1st ed. aVijayavāḍa :bVijayaśrī Pabliṣiṅg
Haus,c1977. a199 p. ;c20 cm.
aNovel. aIn
Telugu.00aWUbm,stkhPL4780.9 J53iP7
1977x7069850 aX0bGZM</field>
</doc>
</add>

SEVERE: [com.ctc.wstx.exc.WstxLazyException]
com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x1e) not a valid XML character
 at [row,col {unknown-source}]: [4,259]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:279)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:138)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x1e) not a valid XML character
 at [row,col {unknown-source}]: [4,259]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2400)
    at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2346)
    at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1205)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4677)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 26 more

--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
312F Memorial Library
728 State St.
Madison, WI 53706
sme...@library.wisc.edu
608-265-2844 (ph)
"Just don't let the human factor fail to be a factor at all."
- Andrew Bird, "Tables and Chairs"