How do I create an exclusion list for filter-media?


Kerry Bouchard

May 5, 2021, 4:19:54 PM
to DSpace Technical Support
We are running into the problem described here: http://dspace.2283337.n4.nabble.com/Filter-media-on-PDFs-exported-from-Outlook-causes-a-TikaException-error-and-prevents-Items-from-inde-td4683489.html , where the *.pdf.txt files output by the PDF Text Extractor media filter for a couple of PDFs in our repository cause indexing to fail, not just for the PDF full text but for all the associated metadata as well. (In our case, the PDFs were not output from Microsoft Outlook mail folders, but I'm seeing the same "org.apache.tika.exception.TikaException: Failed to parse an email message" in the DSpace log file.)

The posting at the URL above refers to a workaround that involves creating an exclusion list for filter-media, but I can't find any documentation on how to create one. Can someone point me to it?

Thanks, Kerry

Sean Kalynuk

May 5, 2021, 4:42:04 PM
to Kerry Bouchard, DSpace Technical Support

Hi Kerry,

There is a Skip mode option (-s) for the filter-media command:

https://wiki.lyrasis.org/display/DSDOC6x/Mediafilters+for+Transforming+DSpace+Content#MediafiltersforTransformingDSpaceContent-Executing(viaCommandLine)

--
Sean
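
For reference, a minimal sketch of how Skip mode might be used as an exclusion list. It assumes, as the DSpace 6.x page linked above describes, that -s takes a comma-separated list of identifiers (handles) to skip. The file name filter-skiplist.txt, the handles in it, the /dspace install path, and the build_skip_list helper are placeholders for illustration and do not come from this thread.

```shell
#!/bin/sh
# Hypothetical skip-list file: one handle per line, e.g.
#   123456789/9
#   123456789/100
SKIP_FILE="${SKIP_FILE:-filter-skiplist.txt}"

# Assumed install location; adjust to your own [dspace] directory.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# Join the non-blank lines of a file into one comma-separated string,
# which is the form the -s option is assumed to expect.
build_skip_list() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Run filter-media only when both the skip list and the dspace launcher exist.
if [ -f "$SKIP_FILE" ] && [ -x "$DSPACE_BIN" ]; then
    "$DSPACE_BIN" filter-media -s "$(build_skip_list "$SKIP_FILE")"
fi
```

With the two example handles above in filter-skiplist.txt, this would run filter-media -s 123456789/9,123456789/100. Keeping the handles in a file makes it easier to add problem items as they turn up in the log.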

 

Bouchard, Kerry

May 5, 2021, 6:08:55 PM
to Sean Kalynuk, DSpace Technical Support

 

Thank you!

-Kerry
