Media Filter


Róbert Bodnár

Mar 25, 2016, 4:33:48 AM
to dspac...@googlegroups.com
Hello everyone!

I'm trying to run the MediaFilter on our collection and I see two errors. Could anyone help me figure out what they are and how they can be corrected?
1. This one is new; I had not seen it before:
ERROR filtering, skipping bitstream:

        Item Handle: 123456789/62232
        Bundle Name: ORIGINAL
        File Size: 819579
        Checksum: fbccdd816c3df1e2b7dde9c2c10239f8 (MD5)
        Asset Store: 0
java.lang.NullPointerException
java.lang.NullPointerException
The PDF itself downloads fine and has no apparent problems.

2.
ERROR filtering, skipping bitstream:

        Item Handle: 123456789/48307
        Bundle Name: ORIGINAL
        File Size: 21761717
        Checksum: 5e2c8fd4e61fa7b9c2b0065ea55c9b30 (MD5)
        Asset Store: 0
java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
        at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:737)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:561)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:511)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:479)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:414)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:333)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:225)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:77)
I think this second one has something to do with the PDFs themselves, maybe their quality or something similar. Has anyone experienced these errors, and are there solutions for them?

Thank you very much!
B. Róbert

Tim Donohue

Mar 30, 2016, 10:33:28 AM
to dspac...@googlegroups.com
Hi B. Róbert,

In both of those situations, it sounds like the PDF's full text simply cannot be extracted (for indexing). This does NOT necessarily mean the PDF is corrupt, however. Generally speaking, there are three types of PDFs:

1. PDFs which are generated from an electronic document (e.g. Word or similar). They always include the full text.
2. PDFs which are scanned documents (from paper) and OCR'd. They usually include the full text (but it may depend on whether the OCR process was successful)
3. PDFs which are scanned documents (from paper) and NOT yet OCR'd.  They *never* include the full text.

In the case of #3, DSpace will not be able to extract the full text, as the PDF is essentially an image file. DSpace doesn't have any OCR capabilities built in.

In the case of #2, DSpace may or may not be able to extract the full text (but usually should succeed). It depends on whether the OCR process was able to successfully embed the full text into the PDF.

In the case of #1, DSpace should always be able to extract the full text. However, it's worth noting that we use a third-party library (PDFBox) to perform this extraction. It's possible that this library has bugs (we've hit them in the past) with specific PDFs. We do our best to keep that library up to date in later versions of DSpace, so as you upgrade DSpace, the PDFBox library is often upgraded too (which may lessen the likelihood of errors from the MediaFilter).

(From past experience, I will say that there often are some PDFs which PDFBox just has a hard time extracting text from.  In the past, it was occasionally related to the size of PDFs...large ones had more issues than small ones.)

All in all, these MediaFilter errors do NOT affect DSpace's performance, and do not mean that the PDFs are corrupted. Your DSpace will still work fine, and the Item metadata will be searchable, and the PDF files can still be downloaded. But those specific PDFs will not be full-text searchable. 
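To see which items are affected (and so will not be full-text searchable), the MediaFilter output can be scanned for failed bitstreams. Below is a minimal Python sketch, assuming the output was saved as text and follows the format quoted earlier in this thread. The names `parse_failures` and `has_pdfbox_frame` are hypothetical helpers, not part of DSpace, and the parsing is only as reliable as that assumed format.

```python
import re

# Header line printed before each failed bitstream
# (assumed from the output quoted in this thread).
ERROR_MARKER = "ERROR filtering, skipping bitstream:"

# Indented "Name: value" lines that follow the header.
FIELD_RE = re.compile(
    r"^\s*(Item Handle|Bundle Name|File Size|Checksum|Asset Store):\s*(.+?)\s*$"
)


def parse_failures(log_text):
    """Return one dict per failed bitstream found in the MediaFilter output."""
    failures = []
    current = None
    for line in log_text.splitlines():
        if ERROR_MARKER in line:
            # Start a new failure record.
            current = {"trace": []}
            failures.append(current)
            continue
        if current is None:
            continue  # Ignore everything before the first error header.
        match = FIELD_RE.match(line)
        stripped = line.strip()
        if match:
            current[match.group(1)] = match.group(2)
        elif stripped.startswith("at ") or "Exception" in stripped:
            # Keep exception names and stack frames; skip other output.
            current["trace"].append(stripped)
    return failures


def has_pdfbox_frame(failure):
    """True if the recorded stack trace passes through PDFBox classes."""
    return any("org.apache.pdfbox" in frame for frame in failure["trace"])


SAMPLE = """ERROR filtering, skipping bitstream:

        Item Handle: 123456789/62232
        Bundle Name: ORIGINAL
        File Size: 819579
        Checksum: fbccdd816c3df1e2b7dde9c2c10239f8 (MD5)
        Asset Store: 0
java.lang.NullPointerException
java.lang.NullPointerException
ERROR filtering, skipping bitstream:

        Item Handle: 123456789/48307
        Bundle Name: ORIGINAL
        File Size: 21761717
        Checksum: 5e2c8fd4e61fa7b9c2b0065ea55c9b30 (MD5)
        Asset Store: 0
java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:100)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
"""

for failure in parse_failures(SAMPLE):
    print(failure["Item Handle"], failure["File Size"], has_pdfbox_frame(failure))
```

A trace that passes through `org.apache.pdfbox` (as in the second error above) suggests the failure occurred inside the extraction library, but it does not by itself say which of the three PDF types the file belongs to.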

Let us know if you have further questions!

- Tim

-- 
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org