Thanks guys,
(I'm on v1.4.1 here for our main repository).
You're right - I only tried index-all from the command line earlier when I
was trying to figure out why this wasn't working - apologies, an example
of brain freeze!! I had a quiet "D'oh" moment when someone mentioned
filter-media :-)
I tried filter-media from the command line and it did indeed bomb out
fairly early on due to a protected PDF/Bouncy Castle type error, which is
presumably why the cron filter-media wasn't doing its job.
I dropped the Bouncy Castle PDF jars into the lib directory (copied over
from a v1.4.2 repo I'm also running), re-ran filter-media and that seems
to have done the trick - my PDF has now been filtered and indexed and
can be searched from within DSpace :-).
Interestingly I did still get a couple of errors, but these didn't stop
the filter-media process as was the case previously (I don't know if
this is because of the new jars or if these are less serious errors than
the one that previously caused filter-media to bomb out) - just for
reference, these are the errors I'm seeing:
ERROR filtering, skipping bitstream #364
java.util.NoSuchElementException
java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(AbstractList.java:426)
    at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:150)
    at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:97)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:234)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:185)
ERROR filtering, skipping bitstream #169 java.io.IOException: Error decrypting document, details: Error: The supplied password does not match either the owner or user password in the document.
java.io.IOException: Error decrypting document, details: Error: The supplied password does not match either the owner or user password in the document.
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:110)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:155)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:327)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:296)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:266)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:234)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:185)
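For anyone reading along: the "skipping bitstream" messages suggest that each bitstream is filtered inside its own try/catch, so one bad file is logged and the run carries on with the next. Below is a minimal sketch of that pattern only. The class, interface and method names are hypothetical and this is not DSpace's actual MediaFilterManager code; the bitstream IDs are just the two from the traces above plus two made-up ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

public class SkipOnErrorSketch {

    /** Hypothetical stand-in for a text-extraction filter (not a DSpace type). */
    public interface Extractor {
        String extract(int bitstreamId) throws Exception;
    }

    /**
     * Runs the extractor over every ID. A failure on one ID is reported and
     * recorded in 'skipped', and the loop continues with the next ID.
     * Returns the IDs that were filtered successfully.
     */
    public static List<Integer> filterAll(List<Integer> ids,
                                          Extractor extractor,
                                          List<Integer> skipped) {
        List<Integer> filtered = new ArrayList<>();
        for (int id : ids) {
            try {
                extractor.extract(id);
                filtered.add(id);
            } catch (Exception e) {
                // Report the problem, then move on instead of aborting the run.
                System.out.println("ERROR filtering, skipping bitstream #" + id + " " + e);
                skipped.add(id);
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // Simulated failures for the two bitstream IDs seen in the traces.
        Extractor extractor = id -> {
            if (id == 364) throw new NoSuchElementException();
            if (id == 169) throw new java.io.IOException("Error decrypting document");
            return "extracted text";
        };
        List<Integer> skipped = new ArrayList<>();
        List<Integer> filtered =
                filterAll(Arrays.asList(100, 364, 169, 200), extractor, skipped);
        System.out.println("filtered=" + filtered + " skipped=" + skipped);
    }
}
```

An exception thrown outside such a per-item catch (for example, a missing class at load time) would still stop the whole run, which may be the difference between the earlier failure and these two.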
Thanks again for all the useful advice and pointers, and for helping me
to sort this out (and getting me past my brain freeze!).
Cheers,