[Dspace-devel] [DSpace-JIRA] Created: (DS-183) XPDF support for filtering PDFs for text extraction/search.

Mark Diggory (JIRA)

Aug 19, 2015, 1:41:28 PM
to dspace...@lists.sourceforge.net
XPDF support for filtering PDFs for text extraction/search.
-----------------------------------------------------------

Key: DS-183
URL: http://jira.dspace.org/jira/browse/DS-183
Project: DSpace 1.x
Issue Type: Improvement
Components: DSpace API
Affects Versions: 1.5.1, 1.5.2
Environment: Unix and Linux
Reporter: Mark Diggory


See original description here...

https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

Here is a pair of media filters that process PDF files with the
XPDF suite (see http://www.foolabs.com/xpdf/ ), replacing the
one based on PDFBox. Each invokes an external command, which
must be configured. The code has been tested on Unix; the concept
ought to work on Windows (and certainly on Mac OS X).

XPDF2Text is a replacement for the existing PDF media filter. It
creates extracted text using the pdftotext program. I've observed that
it is about 3 times as fast as PDFBox, and much more reliable.
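
To illustrate the idea of delegating extraction to an external pdftotext process, here is a minimal sketch. It is not the code from XPDFFilters.patch; the class name PdfToTextSketch, its method names, and the way the command path is passed in are all assumptions made for illustration. It relies only on the standard pdftotext options -q (suppress messages) and -enc (output encoding), and on "-" as the output file to write text to standard output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real XPDF2Text filter may be structured differently.
class PdfToTextSketch {

    // Builds the argument list for pdftotext. The command path is supplied
    // by the caller, since the external command must be configured.
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        // "-q" suppresses messages, "-enc UTF-8" selects the output encoding,
        // and the trailing "-" sends the extracted text to standard output.
        return Arrays.asList(pdftotextPath, "-q", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns the extracted text as a String.
    static String extractText(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Pass the child's stderr through so it cannot fill up and block.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }

        int status = process.waitFor();
        if (status != 0) {
            throw new IOException("pdftotext exited with status " + status);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A caller would use something like PdfToTextSketch.extractText("/usr/bin/pdftotext", "paper.pdf"), where both paths are placeholders. Reading standard output fully before calling waitFor() avoids a deadlock when the extracted text is larger than the pipe buffer.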

XPDF2Thumbnail creates a thumbnail image for the first page of
the PDF. This is especially effective for 3D PDF renderings of
engineering models, but works fine for any document.
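
As a rough sketch of how a first-page rendering could be obtained from the XPDF suite, the example below shells out to pdftoppm with -f 1 -l 1 to limit rendering to page 1. This is an assumption about the approach, not the code from the patch: the class PdfFirstPageSketch, its methods, and the choice of pdftoppm are hypothetical. pdftoppm writes PPM files, a format the standard Java image APIs do not read by default, which may be why the thumbnail filter needs an additional image library.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real XPDF2Thumbnail filter may work differently.
class PdfFirstPageSketch {

    // Builds the argument list for pdftoppm. "-f 1 -l 1" restricts rendering
    // to the first page; outputRoot is the prefix of the PPM file it writes.
    static List<String> buildCommand(String pdftoppmPath, String pdfPath, String outputRoot) {
        return Arrays.asList(pdftoppmPath, "-q", "-f", "1", "-l", "1", pdfPath, outputRoot);
    }

    // Runs pdftoppm and returns its exit status (0 means success).
    static int renderFirstPage(String pdftoppmPath, String pdfPath, String outputRoot)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdftoppmPath, pdfPath, outputRoot))
                .inheritIO() // share stdout/stderr with the parent so nothing blocks
                .start();
        return process.waitFor();
    }
}
```

The resulting PPM file would still have to be scaled down and converted to the thumbnail format DSpace expects; that step is left out here.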

See the instructions in xpdf-filters.html to install the filters.
The thumbnail filter needs an additional image library, but
the text extractor doesn't need anything else.

This code has been tested with DSpace 1.5.1.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



Mark Diggory (JIRA)

Aug 19, 2015, 1:41:28 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Diggory updated DS-183:
----------------------------

Attachment: xpdf-filters.html

Adjusted Documentation.

Mark Diggory (JIRA)

Aug 19, 2015, 1:41:38 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10260#action_10260 ]

Mark Diggory commented on DS-183:
---------------------------------

Committed patch (but not documentation).

Bradley McLean (JIRA)

Aug 19, 2015, 1:41:38 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bradley McLean updated DS-183:
------------------------------

Attachment: xpdf-filters.xml

Conversion of the above HTML to an XML DocBook fragment suitable for inclusion in the DSpace manual.

Mark Diggory (JIRA)

Aug 19, 2015, 1:41:38 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Diggory reassigned DS-183:
-------------------------------

Assignee: Mark Diggory

Mark Diggory (JIRA)

Aug 19, 2015, 1:42:09 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Diggory resolved DS-183.
-----------------------------

Resolution: Fixed
Fix Version/s: 1.5.2

Brad committed the docs, which are now available in the 1.5.x branch. Closing.

keith johnson (JIRA)

Aug 19, 2015, 3:25:46 PM
to dspace...@lists.sourceforge.net

[ http://jira.dspace.org/jira/browse/DS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11255#action_11255 ]

keith johnson commented on DS-183:
----------------------------------

Installation following xpdf-filters.html works for certain formats and not others.
Conversion from PDF to HTML can work, but certain editing abilities are removed.
See http://www.uk-mobile-phone.com
The thumbnail image does not load properly for such 3D documents and renders
a red cross rather than the desired imagery.

> XPDF support for filtering PDFs for text extraction/search.
> -----------------------------------------------------------
>
> Key: DS-183
> URL: http://jira.dspace.org/jira/browse/DS-183
> Project: DSpace 1.x
> Issue Type: Improvement
> Components: DSpace API
> Affects Versions: 1.5.1, 1.5.2
> Environment: Unix and Linux
> Reporter: Mark Diggory
> Assignee: Mark Diggory
> Fix For: 1.5.2
>
> Attachments: xpdf-filters.html, xpdf-filters.xml, XPDFFilters.patch