In our DSpace instance (DSpace 5.6 with Mirage 2), we've encountered an issue with Items that fail to appear when searching/browsing a Community or Collection, despite being present in the repository and appearing when you go directly to the Item's URL.
All of the affected Items contain PDFs of Microsoft Outlook mail folders. Rebuilding the index with dspace index-discovery produces the following error for the affected Items:
2016-10-24 13:19:05,144 ERROR org.dspace.discovery.SolrServiceImpl @ Error while writing item to discovery index: 123456789/13074
message:org.apache.tika.exception.TikaException: Failed to parse an email message
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Failed to parse an email message
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.dspace.discovery.SolrServiceImpl.writeDocument(SolrServiceImpl.java:738)
    at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:1419)
    at org.dspace.discovery.SolrServiceImpl.indexContent(SolrServiceImpl.java:225)
    at org.dspace.discovery.SolrServiceImpl.updateIndex(SolrServiceImpl.java:405)
    at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
Testing with freshly uploaded Items confirms that the problem is with the text file produced by filter-media, not with the PDF files themselves. A freshly uploaded Item will index properly and appear in the search/browse interface, but will disappear from the index and the interface (with a TikaException in dspace.log) once dspace filter-media is run. If I delete the *.txt file produced by filter-media and update the index with dspace index-discovery, the Item will index properly and become visible again.
I'm not familiar with Apache Tika, but my guess is that Tika is attempting to parse the extracted text file as if it were an MBOX file.
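One way to test that guess would be to check whether the extracted text files begin with lines that look like mail headers. The following is only a rough sketch: the header patterns and the function name `looks_like_mail` are my own assumptions for illustration, not Tika's actual detection rules.

```python
import re

# Assumed patterns: an mbox "From " separator line or common mail headers.
# Illustrative only; these are NOT Tika's real detection rules.
MAIL_START = re.compile(
    r"^(From |From:|To:|Subject:|Date:|Sent:|Received:|Message-ID:)"
)

def looks_like_mail(text, max_lines=5):
    """Return True if one of the first non-blank lines starts like a mail header."""
    checked = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not count toward max_lines
        if MAIL_START.match(stripped):
            return True
        checked += 1
        if checked >= max_lines:
            break
    return False
```

Running this over the *.txt bitstreams of affected and unaffected Items should show whether the failures line up with text that opens like an email. A plain sentence starting with "From " would also be flagged, so treat a match as a hint rather than proof.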
As a temporary workaround, I've deleted the text files from the affected Items and created an exclusion list for filter-media. A significant portion of our collections is in this format, though, so this is unwieldy as a long-term solution: it prevents us from full-text searching the contents of those Items. Can anyone suggest a fix or a better workaround for this problem?
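For anyone who needs the same stopgap, here is a minimal sketch of how the exclusion list could be built from dspace.log instead of by hand. It assumes the error line has the same format as the excerpt above. The function names are hypothetical, and it collects every Item that failed Discovery indexing, not only the Tika failures.

```python
import re

# Matches the Discovery error line from the log excerpt above and
# captures the Item handle at the end of the line.
ERROR_LINE = re.compile(
    r"ERROR\s+org\.dspace\.discovery\.SolrServiceImpl\s+@\s+"
    r"Error while writing item to discovery index:\s*(\S+)"
)

def failed_handles(log_lines):
    """Return the unique handles that failed indexing, in first-seen order."""
    seen = []
    for line in log_lines:
        match = ERROR_LINE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

def skip_list(log_lines):
    """Join the handles with commas, one per failed Item."""
    return ",".join(failed_handles(log_lines))
```

Calling `skip_list(open("dspace.log"))` would give a comma-separated string of handles, which is the form that filter-media's skip option (-s) takes, as far as I understand it.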
Nicholas Webb
Digital Archivist
Icahn School of Medicine at Mount Sinai
Box 1102 - One Gustave L. Levy Place
New York, NY 10029-6574
(o) 212-241-7239
(f) 212-241-7864
(e) nichol...@mssm.edu