[Dspace-tech] Indexing of scanned PDFs

Michael White

Aug 25, 2015, 10:48:14 AM

to dspac...@lists.sourceforge.net

Hi,

Our graphics unit is experimenting with scanning old theses for inclusion in our repository. They have just uploaded the first scanned thesis (http://hdl.handle.net/1893/340), but DSpace doesn't appear to be indexing the thesis text. [The thesis was uploaded yesterday and index-all runs nightly as a cron job - I've also just run index-all from the command line to make sure.]

I'm not involved in the digitisation side, so I'm not 100% sure what they are doing (and the person that did it is off on holiday now so I can't ask them), but the PDF file appears to contain the content both as scanned images (for accurate reproduction), and embedded OCR'd text (for searching, accessibility etc). Even though the displayed page is obviously an image, it is possible to select text and copy and paste it (although I can see obvious OCR errors in the pasted text) and also search the PDF file directly from Acrobat . . .

Has anyone come across this type of PDF file before (or is there something more subtle going on here that I've missed)? If the PDF file does indeed also contain the OCR'd text, any idea how to get DSpace to index it? If not, is there any advice I should be giving to the folk doing the digitisation in order to enable them to produce more DSpace friendly PDFs?

Thanks as ever,

Mike

Michael White
eLearning Developer
Centre for eLearning Development (CeLD)
S7, The Library
University of Stirling
Stirling SCOTLAND
FK9 4LA

Email: michae...@stir.ac.uk
Tel: +44 (0) 1786 466877
Fax: +44 (0) 1786 466880

http://www.is.stir.ac.uk/celd/

Dorothea Salo

Aug 25, 2015, 10:48:16 AM

to dspac...@lists.sourceforge.net

You didn't say what version of DSpace you're running (and honestly,
I'm not completely sure this was fixed in 1.5 -- anybody know?),
but... one thing that may be happening is that the filter-media cron
job is dying. Since it's written without error-recovery, it stops dead
at the first file it thinks it should be able to handle but can't.

Run it from the command-line and see if it errors out. If I'm right,
there's no obvious workaround I'm aware of, though somebody (Tim?) may
have hacked one.

Dorothea

--
Dorothea Salo ds...@library.wisc.edu
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493

Graham Triggs

Aug 25, 2015, 10:48:17 AM

to Dorothea Salo, dspac...@lists.sourceforge.net

Dorothea Salo wrote:
> You didn't say what version of DSpace you're running (and honestly,
> I'm not completely sure this was fixed in 1.5 -- anybody know?),
> but... one thing that may be happening is that the filter-media cron
> job is dying. Since it's written without error-recovery, it stops dead
> at the first file it thinks it should be able to handle but can't.
>
> Run it from the command-line and see if it errors out. If I'm right,
> there's no obvious workaround I'm aware of, though somebody (Tim?) may
> have hacked one.
>
> Dorothea
>

The filter-media in 1.5 is a bit more robust. If it hits an Exception
when dealing with one file, it will attempt to clean itself up a bit and
carry on with the next one.

In the cases where PDF extraction is failing due to a PDFBox bug, this
is usually good enough for it to finish the filtering normally
(excluding the file that caused the problem).

However, I can't guarantee that will be enough in this case. But then
judging by Mike's message, it's possible that filter-media wasn't even
run at all. (only index-all is mentioned)

G

Tim Donohue

Aug 25, 2015, 10:48:19 AM

to Graham Triggs, dspac...@lists.sourceforge.net, Dorothea Salo, michae...@stir.ac.uk

A bit more info (but similar answer to Graham's)

It's hard to tell what exactly is going on here. By default, the PDFBox
software which DSpace uses to index PDFs should be able to index a PDF
which has embedded OCR text (it's worked for us in this way). However,
there are admittedly bugs with this underlying PDFBox software that
folks have run into in the past (myself included)

Michael, you may want to check a few things:

(1) You need to make sure that you are running the 'filter-media' script
each night. This is what extracts the full text from PDF, Word and HTML
files so that it can be indexed.
(2) If you are running 'filter-media', you may want to set up your cron
job to write its output to a log file, so you can see what errors may be
occurring. Something similar to this:
[dspace]/bin/filter-media > [dspace]/log/filter.log 2>&1

If you are running filter-media, that log file should be able to tell
you what is erroring out. If you don't understand the error message,
you can send it to dspace-tech and we can try and help you debug it.
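
The nightly sequence described above (run filter-media, capture everything it prints, then re-index) can be sketched as a small Python wrapper. This is only a sketch under assumptions: the helper names `run_steps` and `nightly_steps` are hypothetical, the location of `index-all` under `[dspace]/bin` is assumed rather than stated in this thread, and whether filter-media reports failure through its exit code in your version is something to verify against the log.

```python
import subprocess
from pathlib import Path


def nightly_steps(dspace_dir):
    """Commands for the nightly job: extract text first, then re-index."""
    return [
        [dspace_dir + "/bin/filter-media"],  # full-text extraction (PDF, Word, HTML)
        [dspace_dir + "/bin/index-all"],     # assumed location; rebuilds the search index
    ]


def run_steps(commands, log_path):
    """Run each command in order, appending stdout and stderr to log_path.

    Stops at the first command that exits non-zero and returns its exit
    code; returns 0 if every command succeeded.
    """
    with Path(log_path).open("a") as log:
        for cmd in commands:
            log.write("=== running: " + " ".join(cmd) + "\n")
            log.flush()  # keep our header ahead of the child's output
            result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
            if result.returncode != 0:
                log.write("=== exited with code %d\n" % result.returncode)
                return result.returncode
    return 0
```

A cron job could then call `run_steps(nightly_steps("/dspace"), "/dspace/log/filter.log")`, with `/dspace` replaced by the real install directory. Stopping at the first failing step is a design choice here; it makes a dead filter-media run visible instead of letting the index step quietly succeed on its own.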

Finally, as Graham mentioned, there are a few common errors with that
PDFBox software which we've now got workarounds for in DSpace 1.5.
Namely these two configs:

pdffilter.largepdfs = true
(If true, it writes larger PDFs to a temp file as it indexes them...this
is slower, but helps ensure that PDFBox software doesn't eat up all your
memory)

pdffilter.skiponmemoryexception=true
(If true, it skips any PDFs which still result in an Out of Memory error
from PDFBox...these PDFs just will never be indexed until the PDFBox
software we are using fixes some of its memory usage problems)

BTW...Graham, those two 'pdffilter' settings didn't make it into the
DSpace 1.5 dspace.cfg file! We need to push those into the 1.5.1
bug-fix release!

Hope that helps!

- Tim

--

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: tdon...@uiuc.edu
web: http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax: (217) 244-7764
========================================

Michael White

Aug 25, 2015, 10:48:20 AM

to Tim Donohue, Graham Triggs, Dorothea Salo, dspac...@lists.sourceforge.net

Thanks guys,

(I'm on v1.4.1 here for our main repository).

You're right - I only tried index-all from the command line earlier when I
was trying to figure out why this wasn't working - apologies, an example
of brain freeze!! I had a quiet "D'oh" moment when someone mentioned
filter-media :-)

I tried filter-media from the command line and it did indeed bomb out
fairly early on due to a protected PDF/Bouncy Castle type error, which is
presumably why the cron filter-media wasn't doing its job.

I dropped the Bouncy Castle PDF jars into the lib directory (copied over
from a v1.4.2 repo I'm also running), re-ran filter-media and that seems
to have done the trick - my PDF has now been filtered and indexed and
can be searched from within DSpace :-).

Interestingly I did still get a couple of errors, but these didn't stop
the filter-media process as was the case previously (I don't know if
this is because of the new jars or if these are less serious errors than
the one that previously caused filter-media to bomb out) - just for
reference, these are the errors I'm seeing:

ERROR filtering, skipping bitstream #364
java.util.NoSuchElementException
java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(AbstractList.java:426)
    at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:150)
    at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:97)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:234)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:185)

ERROR filtering, skipping bitstream #169 java.io.IOException: Error decrypting document, details: Error: The supplied password does not match either the owner or user password in the document.
java.io.IOException: Error decrypting document, details: Error: The supplied password does not match either the owner or user password in the document.
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:234)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:185)
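
For anyone wanting to pull the skipped files out of a longer filter-media log, the "ERROR filtering, skipping bitstream #N" marker lines quoted above can be collected with a few lines of Python. This is a rough sketch that assumes the log uses exactly the marker wording shown in this message; the function name `skipped_bitstreams` is hypothetical, and other DSpace versions may word the message differently.

```python
import re

# Matches the marker line as it appears in the output quoted above,
# e.g. "ERROR filtering, skipping bitstream #364".
SKIP_LINE = re.compile(r"ERROR filtering, skipping bitstream #(\d+)")


def skipped_bitstreams(log_text):
    """Return {bitstream_id: first line of the exception reported for it}."""
    skipped = {}
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = SKIP_LINE.search(line)
        if not match:
            continue
        bitstream_id = int(match.group(1))
        # The exception may start on the marker line itself or on the next line.
        reason = line[match.end():].strip()
        if not reason and i + 1 < len(lines):
            reason = lines[i + 1].strip()
        # Keep only the first report for each bitstream.
        skipped.setdefault(bitstream_id, reason)
    return skipped
```

Reading the log file and passing its text to `skipped_bitstreams` gives a bitstream-id-to-reason mapping, which makes it easier to tell password-protected PDFs apart from Word extraction failures.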

Thanks again for all the useful advice and pointers, and for helping me
to sort this out (and getting me past my brain freeze!).

Cheers,
