Filter-media on PDFs exported from Outlook causes a TikaException error and prevents Items from indexing (DSpace 5.6)

Webb, Nicholas

Oct 28, 2016, 3:27:34 PM
to dspac...@googlegroups.com
In our DSpace instance (DSpace 5.6 with Mirage 2), we've encountered an issue with Items that fail to appear when searching/browsing a Community or Collection, despite being present in the repository and appearing when you go directly to the Item's URL.

All of the affected Items contain PDFs of Microsoft Outlook mail folders. Rebuilding the index with dspace index-discovery produces the following error for the affected Items:

2016-10-24 13:19:05,144 ERROR org.dspace.discovery.SolrServiceImpl @ Error while writing item to discovery index: 123456789/13074

message:org.apache.tika.exception.TikaException: Failed to parse an email message
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Failed to parse an email message
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.dspace.discovery.SolrServiceImpl.writeDocument(SolrServiceImpl.java:738)
at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:1419)
at org.dspace.discovery.SolrServiceImpl.indexContent(SolrServiceImpl.java:225)
at org.dspace.discovery.SolrServiceImpl.updateIndex(SolrServiceImpl.java:405)
at org.dspace.discovery.IndexClient.main(IndexClient.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)

Testing with freshly uploaded Items confirms that the problem is with the text file produced by filter-media, not with the PDF files themselves. A freshly uploaded Item will index properly and appear in the search/browse interface, but will disappear from the index and the interface (with a TikaException in dspace.log) once dspace filter-media is run. If I delete the *.txt file produced by filter-media and update the index with dspace index-discovery, the Item will index properly and become visible again.

I'm not familiar with Apache Tika, but my guess is that Tika may be attempting to parse the extracted text file as if it were an email message or MBOX file.
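
One way to probe that guess without involving Tika at all is to check whether the extracted text begins with lines that look like mail headers, as a PDF printed from Outlook may well do. The sketch below is a rough, standard-library-only heuristic in Python. The function name, the header list, and the thresholds are assumptions made for illustration; they are not Tika's actual detection rules.

```python
import re

# Header names that commonly open a printed e-mail. This list is an
# assumption for illustration, not taken from Tika's MIME database.
HEADER_RE = re.compile(r"^(From|Sent|To|Cc|Subject|Date):(\s|$)", re.IGNORECASE)

def looks_like_email(text, window=10, threshold=2):
    """Return True if at least `threshold` of the first `window`
    non-blank lines start with an e-mail style header."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    hits = sum(1 for ln in lines[:window] if HEADER_RE.match(ln))
    return hits >= threshold
```

Running this over the *.txt bitstreams of affected and unaffected Items would show whether the failures line up with header-like opening lines. A match would support the guess but would not prove that Tika's type detection is the cause.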

As a temporary workaround, I've deleted the text files from the affected Items and created an exclusion list for filter-media. Because a significant portion of our collections are in this format, this is unwieldy as a long-term solution, and it prevents us from full-text searching their contents. Can anyone suggest a fix or a better workaround for this problem?
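
For anyone maintaining a similar exclusion list, the failing handles can be pulled out of dspace.log rather than collected by hand. This is a minimal Python sketch, assuming the log lines look like the excerpt quoted above. The function name `failed_handles` and the `lookahead` window are hypothetical choices for illustration and are not part of DSpace.

```python
import re

# Matches the SolrServiceImpl error line shown in the log excerpt and
# captures the Item handle at the end of the line.
ERROR_RE = re.compile(r"Error while writing item to discovery index: (\S+)")

def failed_handles(log_lines, marker="TikaException", lookahead=3):
    """Collect handles whose indexing error is followed, within
    `lookahead` lines, by a line containing `marker`."""
    lines = list(log_lines)
    handles = []
    for i, line in enumerate(lines):
        match = ERROR_RE.search(line)
        if not match:
            continue
        following = lines[i + 1 : i + 1 + lookahead]
        handle = match.group(1)
        # Keep first-seen order and skip handles already recorded.
        if any(marker in nxt for nxt in following) and handle not in handles:
            handles.append(handle)
    return handles
```

The function accepts any iterable of lines, so an open file object for dspace.log can be passed directly. It returns each affected handle once, in the order first seen.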

Nicholas Webb
Digital Archivist

Icahn School of Medicine at Mount Sinai
Box 1102 - One Gustave L. Levy Place
New York, NY 10029-6574

(o) 212-241-7239
(f) 212-241-7864
(e) nichol...@mssm.edu


Terry Brady

Oct 28, 2016, 8:03:36 PM
to Webb, Nicholas, dspac...@googlegroups.com
I do not have any immediate advice to offer, but this sounds like an issue that should be captured in Jira: https://jira.duraspace.org/projects/DS/issues

Terry

--
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
425-298-5498 (Seattle, WA)