Solr ExtractingRequestHandler


Alexander

Jul 15, 2011, 11:03:08 AM

to ActiveFedora / Ruby + Fedora Commons

Hello.

I tried to file this as an issue at https://jira.duraspace.org/browse/HYDRUS,
but I don't have access, so I'm posting it here.

Apache Solr has the ExtractingRequestHandler
(http://wiki.apache.org/solr/ExtractingRequestHandler), which uses Tika to
extract and index text from .pdf and .doc files. When I run

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@/path/somePDF.pdf"

against the example Solr core from Apache
(http://www.apache.org/dyn/closer.cgi/lucene/solr/), the PDF content gets
indexed in Solr. It works.

But when I run the same command against the Hydra-Jetty server, I get:

<result status="1">com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-' (code 45) in prolog; expected '<'
at [row,col {unknown-source}]: [1,1]
at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2047)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98)
at org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:134)
at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:87)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1098)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</result>
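
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the frames show the request being handled by SolrUpdateServlet and the XML update loader, not by an extracting handler, so the multipart body sent by curl -F would have been parsed as XML. A multipart/form-data body starts with "--" followed by the boundary, which would match the '-' (code 45) reported at row 1, column 1. The sketch below builds such a body by hand to show its first byte; the helper name, boundary string, and payload are hypothetical.

```python
def build_multipart_body(field_name, filename, payload, boundary):
    """Build a minimal single-part multipart/form-data body as bytes."""
    delimiter = b"--" + boundary.encode("ascii")
    disposition = 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
        field_name,
        filename,
    )
    parts = [
        delimiter,                                  # opening boundary line
        disposition.encode("utf-8"),                # part headers
        b"Content-Type: application/octet-stream",
        b"",                                        # blank line ends the headers
        payload,                                    # file content
        delimiter + b"--",                          # closing boundary line
        b"",
    ]
    return b"\r\n".join(parts)


# Hypothetical boundary and placeholder payload, for illustration only.
body = build_multipart_body(
    "myfile", "somePDF.pdf", b"%PDF-1.4 placeholder", "hypothetical-boundary-123"
)
first_char = chr(body[0])
print(first_char, ord(first_char))  # prints: - 45
```

If that reading is right, the upload never reached Tika at all, and the thing to check is how /update/extract is routed in the Hydra-Jetty Solr configuration.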

Also, if I place the Hydra config files (solrconfig.xml or schema.xml) in
the example Solr from Apache, I get:
with solrconfig.xml => Missing solr core name in path (but Hydra's
Solr is also single core)
with schema.xml => ERROR:unknown field 'ignored_meta'
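
The "unknown field 'ignored_meta'" message suggests that the extracting handler mapped a piece of Tika metadata to a field name that the schema in use does not declare. As far as I know, the example Solr configuration prefixes unmapped fields with "ignored_" and its example schema declares a matching "ignored_*" dynamic field; whether Hydra's schema lacks that declaration is an assumption here, not something the thread confirms. A minimal sketch of how one might check a schema for a given field name follows. The function name and both schema snippets are hypothetical.

```python
import fnmatch
import xml.etree.ElementTree as ET


def field_is_declared(schema_xml, field_name):
    """Return True if field_name matches a <field> or <dynamicField> in the schema."""
    root = ET.fromstring(schema_xml)
    # Exact match against explicitly declared fields.
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return True
    # Dynamic fields use a leading or trailing "*" wildcard; fnmatch accepts
    # a superset of that syntax, which is good enough for a quick check.
    for dyn in root.iter("dynamicField"):
        if fnmatch.fnmatchcase(field_name, dyn.get("name", "")):
            return True
    return False


# Hypothetical schema without any ignored_* declaration.
SCHEMA_WITHOUT_IGNORED = """
<schema name="hypothetical" version="1.2">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  </fields>
</schema>
"""

# The same schema with an ignored_* dynamic field added.
SCHEMA_WITH_IGNORED = SCHEMA_WITHOUT_IGNORED.replace(
    "</fields>",
    '<dynamicField name="ignored_*" type="ignored" multiValued="true"/></fields>',
)

print(field_is_declared(SCHEMA_WITHOUT_IGNORED, "ignored_meta"))  # prints: False
print(field_is_declared(SCHEMA_WITH_IGNORED, "ignored_meta"))     # prints: True
```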

So what is the main source of these errors with ExtractingRequestHandler? Why
doesn't it work with Hydra, and how can I get files indexed?

Or can you suggest another way to index files via Solr in an
ActiveFedora-based repository?

Matt Zumwalt

Jul 15, 2011, 2:52:23 PM

to hydra...@googlegroups.com, John Scofield, active...@googlegroups.com

Alex,

I'm re-posting this on the hydra-tech list since it pertains to hydra-jetty, not active-fedora.

A couple of possible reasons for this error:

* hydra-jetty might be using a version of Solr that doesn't support the functionality you want

* the solrconfig.xml that's in hydra-jetty by default might need to be changed to support the ExtractingRequestHandler
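
To test the second possibility, one could look for a requestHandler entry in solrconfig.xml whose class is the ExtractingRequestHandler. The sketch below does that for a config passed in as a string. The function name and the sample config are hypothetical, and the sample does not reproduce hydra-jetty's actual file.

```python
import xml.etree.ElementTree as ET


def extract_handler_names(solrconfig_xml):
    """Return the names of request handlers backed by ExtractingRequestHandler."""
    root = ET.fromstring(solrconfig_xml)
    return [
        handler.get("name")
        for handler in root.iter("requestHandler")
        # Match on the class-name suffix so both the short "solr." form and
        # the fully qualified package name are accepted.
        if handler.get("class", "").endswith("ExtractingRequestHandler")
    ]


# Hypothetical config fragment, for illustration only.
SAMPLE_CONFIG = """
<config>
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"/>
  <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler"
                  startup="lazy"/>
</config>
"""

print(extract_handler_names(SAMPLE_CONFIG))  # prints: ['/update/extract']
```

An empty list for the real file would point at the configuration. A non-empty list would leave the Solr version, or the extraction libraries available to it, as the next thing to check.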

Matt Zumwalt

MediaShelf, LLC

http://www.yourmediashelf.com

--
You received this message because you are subscribed to the Google Groups "ActiveFedora / Ruby + Fedora Commons" group.
To post to this group, send email to active...@googlegroups.com.
To unsubscribe from this group, send email to active-fedor...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/active-fedora?hl=en.
