Hello.
I tried to create this issue at
https://jira.duraspace.org/browse/HYDRUS,
but I have no access, so I am posting it here.
Apache Solr has the ExtractingRequestHandler
http://wiki.apache.org/solr/ExtractingRequestHandler
which uses Tika to extract and index text from .pdf and .doc files.
When I call
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@/path/somePDF.pdf"
on the example Solr core from Apache
(http://www.apache.org/dyn/closer.cgi/lucene/solr/),
the content of the PDF file is indexed in Solr. It works.
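For reference, the request URL from the curl call can also be assembled programmatically. This is only a sketch: `build_extract_url` is a hypothetical helper name, and the base URL and document id are simply the values from the curl command above.

```python
from urllib.parse import urlencode


def build_extract_url(base, doc_id, commit=True):
    """Build the /update/extract URL used in the curl call (hypothetical helper)."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    # urlencode joins the parameters with '&' and escapes reserved characters.
    return base.rstrip("/") + "/update/extract?" + urlencode(params)


url = build_extract_url("http://localhost:8983/solr", "doc1")
print(url)
```

The file itself still has to be sent as a multipart/form-data POST body, which is what curl's -F option does.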
But when I run the same curl command against the Hydra-Jetty server, I get:
<result status="1">com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '-' (code 45) in prolog; expected '<'
 at [row,col {unknown-source}]: [1,1]
    at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2047)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98)
    at org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:134)
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:87)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1098)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</result>
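One observation about the trace, offered as a guess rather than a diagnosis: curl's -F option sends a multipart/form-data body, and such a body begins with two hyphens followed by the boundary string. The trace passes through SolrUpdateServlet and XmlUpdateRequestHandler.doLegacyUpdate, which hand the body to an XML parser, so a '-' (code 45) at row 1, column 1 would be consistent with the multipart body being read as XML. The sketch below only builds such a body by hand to show its first byte; the helper name `build_multipart_body`, the boundary value, and the placeholder file bytes are all made up for illustration.

```python
def build_multipart_body(field, filename, payload, boundary):
    """Assemble a minimal multipart/form-data body (illustrative helper)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


# Placeholder bytes stand in for the real PDF content.
body = build_multipart_body("myfile", "somePDF.pdf", b"%PDF-1.4 ...", "exampleboundary")
first = body[:1]
print(first, ord(first))  # first byte of the body and its character code
```

If this reading is right, the request on Hydra-Jetty is not reaching an ExtractingRequestHandler at all, but that is an assumption I have not verified.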
Also, if I place the Hydra config files (solrconfig.xml or schema.xml) in
the example Solr from Apache, I get:
with solrconfig.xml => Missing solr core name in path (but Hydra's
Solr is also single core)
with schema.xml => ERROR:unknown field 'ignored_meta'
So what is the main source of these errors for ExtractingRequestHandler? Why
won't it work with Hydra? And how can I get files indexed?
Or can you suggest another way to index files via Solr in an
active-fedora based repository?