XPDF support for filtering PDFs for text extraction/search.
-----------------------------------------------------------
Key: DS-183
URL: http://jira.dspace.org/jira/browse/DS-183
Project: DSpace 1.x
Issue Type: Improvement
Components: DSpace API
Affects Versions: 1.5.1, 1.5.2
Environment: Unix and Linux
Reporter: Mark Diggory
See the original description here:
https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984
Here is a pair of media filters that process PDF files with the
XPDF suite (see http://www.foolabs.com/xpdf/ ), replacing the
one based on PDFBox. Each filter invokes an external command, which
must be configured. They have been tested on Unix, and the concept
ought to work on Windows (and certainly on Mac OS X).
XPDF2Text is a replacement for the existing PDF media filter. It
extracts text using the pdftotext program. I've observed it to be
about 3 times as fast as PDFBox, and much more reliable.
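For readers unfamiliar with the approach, here is a rough sketch of how a filter might call pdftotext as an external command. It is not the XPDF2Text source: the class name, the method names and the xpdf.pdftotext system property are hypothetical, and it assumes pdftotext is installed and writes to standard output when the output file is given as "-".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PdfToTextSketch {

    // Build the argument list: pdftotext [options] PDF-file text-file.
    // A text-file of "-" asks pdftotext to write to standard output.
    static List<String> buildCommand(String executable, Path pdf) {
        return Arrays.asList(executable, "-q", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Run the external command and return whatever it printed as text.
    static String extractText(String executable, Path pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(executable, pdf));
        // Send stderr to the parent process so it cannot mix into the text.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = in.readAllBytes(); // requires Java 9 or later
        }
        int status = process.waitFor();
        if (status != 0) {
            throw new IOException("pdftotext exited with status " + status);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfToTextSketch file.pdf");
            return;
        }
        // Hypothetical setting; a real filter would read the path from its configuration.
        String executable = System.getProperty("xpdf.pdftotext", "pdftotext");
        System.out.print(extractText(executable, Paths.get(args[0])));
    }
}
```

Keeping the command construction in its own method makes the configured path easy to swap per platform, which is where Unix and Windows installations are most likely to differ.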
XPDF2Thumbnail creates a thumbnail image for the first page of
the PDF. This is especially effective for 3D PDF renderings of
engineering models, but works fine for any document.
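As an illustration only, the first page could be rendered with pdftoppm, another program in the XPDF suite. This is an assumption: the description above does not say which command XPDF2Thumbnail runs. The class name, method name and the 72 dpi resolution are hypothetical. Note that pdftoppm writes PPM files, which the standard javax.imageio readers do not handle, so scaling the result to a thumbnail would need a further image library.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PdfFirstPageSketch {

    // Build the argument list: pdftoppm [options] PDF-file PPM-root.
    // -f 1 -l 1 limits rendering to the first page; -r sets the resolution in dpi.
    static List<String> buildCommand(String executable, Path pdf, Path outputRoot, int dpi) {
        return Arrays.asList(executable, "-q",
                "-f", "1", "-l", "1",
                "-r", Integer.toString(dpi),
                pdf.toString(), outputRoot.toString());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfFirstPageSketch file.pdf output-root");
            return;
        }
        // Assumes pdftoppm is on the PATH; a real filter would use a configured path.
        List<String> command =
                buildCommand("pdftoppm", Paths.get(args[0]), Paths.get(args[1]), 72);
        Process process = new ProcessBuilder(command).inheritIO().start();
        int status = process.waitFor();
        if (status != 0) {
            throw new java.io.IOException("pdftoppm exited with status " + status);
        }
    }
}
```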
See the instructions in xpdf-filters.html to install the filters.
The thumbnail filter needs an additional image library, but
the text extractor doesn't need anything else.
This code has been tested with DSpace 1.5.1.