Unknown media filter specified - "PDF Text Extractor"

David Brian Holt

Dec 13, 2016, 1:54:48 PM
to DSpace Technical Support
After solving my installation issue, I have been uploading some OCR-ed PDFs and Word documents to test full-text search. I edited my crontab file to include all my daily/weekly scripts (including index-discovery and filter-media), but when I try to run "dspace filter-media" I get an error message saying that "PDF Text Extractor" is an unknown media filter. I haven't edited the relevant section of my dspace.cfg file. Can anyone offer advice on what's causing this? I really want to use DSpace for this project because it has Solr built in for full-text search.

Here is the relevant section from my dspace.cfg file:

#### Media Filter / Format Filter plugins (through PluginService) ####
# Media/Format Filters help to full-text index content or
# perform automated format conversions


#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = PDF Text Extractor
filter.plugins = HTML Text Extractor
filter.plugins = Word Text Extractor
filter.plugins = Excel Text Extractor
filter.plugins = PowerPoint Text Extractor
filter.plugins = JPEG Thumbnail
filter.plugins = PDFBox JPEG Thumbnail




# [To enable Branded Preview]: uncomment and insert the following into the plugin list
#                Branded Preview JPEG, \


# [To enable ImageMagick Thumbnail]:
#    remove "JPEG Thumbnail" from the plugin list
#    uncomment and insert the following line into the plugin list
ImageMagick Image Thumbnail, ImageMagick PDF Thumbnail, \


#Assign 'human-understandable' names to each filter
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.PDFFilter = PDF Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.WordFilter = Word Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ExcelFilter = Excel Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.PowerPointFilter = PowerPoint Text Extractor
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.PDFBoxThumbnail = PDFBox JPEG Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter = ImageMagick Image Thumbnail
plugin.named.org.dspace.app.mediafilter.FormatFilter = org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter = ImageMagick PDF Thumbnail


#Configure each filter's input format(s)
filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.HTMLFilter.inputFormats = HTML, Text
filter.org.dspace.app.mediafilter.WordFilter.inputFormats = Microsoft Word
filter.org.dspace.app.mediafilter.PowerPointFilter.inputFormats = Microsoft Powerpoint, Microsoft Powerpoint XML
filter.org.dspace.app.mediafilter.JPEGFilter.inputFormats = BMP, GIF, JPEG, image/png
filter.org.dspace.app.mediafilter.BrandedPreviewJPEGFilter.inputFormats = BMP, GIF, JPEG, image/png
filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.ExcelFilter.inputFormats = Microsoft Excel, Microsoft Excel XML
filter.org.dspace.app.mediafilter.PDFBoxThumbnail.inputFormats = Adobe PDF


#Publicly accessible thumbnails of restricted content.
#List the MediaFilter name's that would get publicly accessible permissions
#Any media filters not listed will instead inherit the permissions of the parent bitstream
#filter.org.dspace.app.mediafilter.publicPermission = JPEGFilter


#Custom settings for PDFFilter
# If true, all PDF extractions are written to temp files as they are indexed...this
# is slower, but helps ensure that PDFBox software DSpace uses doesn't eat up
# all your memory
#pdffilter.largepdfs = true
# If true, PDFs which still result in an Out of Memory error from PDFBox
# are skipped over...these problematic PDFs will never be indexed until
# memory usage can be decreased in the PDFBox software
#pdffilter.skiponmemoryexception = true


# Custom settings for ImageMagick Thumbnail Filters
# ImageMagick and GhostScript must be installed on the server, set the path to ImageMagick and GhostScript executable
#   http://www.imagemagick.org/
#   http://www.ghostscript.com/
# Note: thumbnail.maxwidth and thumbnail.maxheight are used to set Thumbnail dimensions
# org.dspace.app.mediafilter.ImageMagickThumbnailFilter.ProcessStarter = /usr/bin
#
# bitstreams generated by this process will contain the following description and may be overwritten
# org.dspace.app.mediafilter.ImageMagickThumbnailFilter.bitstreamDescription = IM Thumbnail
#
# bitstream descriptions that do not conform to the following regular expression will not be overwritten
# org.dspace.app.mediafilter.ImageMagickThumbnailFilter.replaceRegex = ^Generated Thumbnail$
#
# While PDFs may contain transparent spaces, JPEG cannot. As DSpace use JPEG
# for the generated thumbnails, PDF containing transparent spaces may lead
# to problems. To solve this the exported PDF page is flatten before it is
# resized and stored as JPEG. You can switch this behavior off by setting the
# next property false, if necessary for any reasons.
# org.dspace.app.mediafilter.ImageMagickThumbnailFilter.flatten = true

Bill T

Dec 13, 2016, 4:50:27 PM
to DSpace Technical Support
David,

One quick guess:

Try commenting out this line again:

ImageMagick Image Thumbnail, ImageMagick PDF Thumbnail, \

and add

filter.plugins = ImageMagick Image Thumbnail
filter.plugins = ImageMagick PDF Thumbnail

at the tail end of the other filter.plugins lines just above.

Rebuild and restart Tomcat, and see if that helps.
Bill
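
A stray uncommented line like the one quoted above is easy to miss in a long dspace.cfg. As a rough sketch only (this is not a DSpace tool; the function name `find_suspect_lines` and its heuristic are assumptions made for illustration), a few lines of Python can flag uncommented lines that have no `=` separator and are not the continuation of a previous line ending in a backslash:

```python
def find_suspect_lines(text):
    """Return (line_number, line) pairs for uncommented lines that have no
    '=' separator and do not continue a previous backslash-terminated line.

    Heuristic only: it does not reproduce the parser DSpace actually uses.
    """
    suspects = []
    continued = False  # True when the previous logical line ended with "\"
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if continued:
            # Tail of a multi-line value, so no '=' is expected here.
            continued = line.endswith("\\")
            continue
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if "=" not in line:
            suspects.append((number, line))
        continued = line.endswith("\\")
    return suspects


# Shortened sample modelled on the excerpt in this thread.
sample = """\
filter.plugins = PDF Text Extractor
#                Branded Preview JPEG, \\
ImageMagick Image Thumbnail, ImageMagick PDF Thumbnail, \\

filter.plugins = JPEG Thumbnail
"""

for number, line in find_suspect_lines(sample):
    print(f"line {number}: {line}")
```

Run against the real file by reading it into a string first; any line it reports is worth checking by hand, since the heuristic can produce false positives.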

David Brian Holt

Dec 13, 2016, 6:48:13 PM
to DSpace Technical Support
I rebuilt and tried filter-media again and it worked!!  :)

BTW, the little logos in the header/footer with Mirage2 are broken for some reason.  Any idea how to fix that?

Thank you!

Bill T

Dec 14, 2016, 10:18:16 AM
to DSpace Technical Support
It's hard to say, but in general I would:

1. View the page source to see where the server expects to find them.
2. Check the DSpace webapps directory to see where they really are.
3. Tweak the xsl (probably page-structure.xsl) to find them, or move the images.

Cheers!
Bill
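
Steps 1 and 2 above can be partly automated. The following is a minimal sketch, assuming you have saved the page source to a string and know where the webapp is deployed on disk; the names `ImageSrcCollector` and `missing_images`, the `context_path` parameter, and the example paths are all hypothetical and not part of DSpace:

```python
from html.parser import HTMLParser
from pathlib import Path


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html, webapp_root, context_path=""):
    """Return the <img> src values in html with no matching file under webapp_root.

    context_path is the URL prefix the webapp is served under (an assumption
    about your deployment); it is stripped before looking on disk.
    """
    collector = ImageSrcCollector()
    collector.feed(html)
    missing = []
    for src in collector.sources:
        if src.startswith(("http://", "https://", "//", "data:")):
            continue  # external or inline images are out of scope
        relative = src.split("?", 1)[0]  # drop any query string
        if context_path and relative.startswith(context_path):
            relative = relative[len(context_path):]
        if not (Path(webapp_root) / relative.lstrip("/")).is_file():
            missing.append(src)
    return missing


# Hypothetical example: replace the HTML and the path with your own.
page = '<img src="/xmlui/images/logo.png"/>'
print(missing_images(page, "/path/to/webapps/xmlui", "/xmlui"))
```

Anything the function reports is a candidate for step 3: either the image needs to be moved, or the path in the XSL needs adjusting.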

Sidoroff, Ilja

Dec 15, 2016, 5:44:10 AM
to David Brian Holt, dspac...@googlegroups.com
I don't know if this is relevant in your case, but I had a Tomcat + nginx setup with DSpace, and I had configured nginx to serve static assets (such as images). When I updated the DSpace installation, the SELinux security contexts weren't updated (or were reset), and I got exactly this behaviour with Mirage 2.

Probably not your problem, but I thought to mention it just in case.

Ilja Sidoroff