[Dspace-tech] XPDF Thumbnail Preview issue


Christian Völker

Aug 26, 2015, 1:14:40 PM
to dspace-tech
Hello,

I am trying to get the XPDF-based PDF thumbnail creation working. It works fine in my DSpace 4.1 test instance.

The feature was already available in DSpace 1.8.2, which is still our production release. Instead of waiting until the new version is production-ready, I am installing the features step by step in the production environment.


On the production machine, I get this error:

esxh-15:/srv/dspace# bin/dspace filter-media -i 2339/4318 -v
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.HTMLFilter
org.dspace.app.mediafilter.HTMLFilter
Full Filter Name: org.dspace.app.mediafilter.WordFilter
org.dspace.app.mediafilter.WordFilter
Full Filter Name: org.dspace.app.mediafilter.JPEGFilter
org.dspace.app.mediafilter.JPEGFilter
Full Filter Name: org.dspace.app.mediafilter.XPDF2Text
org.dspace.app.mediafilter.XPDF2Text
Full Filter Name: org.dspace.app.mediafilter.XPDF2Thumbnail
org.dspace.app.mediafilter.XPDF2Thumbnail
Full Filter Name: org.dspace.app.mediafilter.PowerPointFilter
org.dspace.app.mediafilter.PowerPointFilter
SKIPPED: bitstream 27442 (item: 2339/4318) because 'Limmerstraße.pdf.txt' already exists
ERROR filtering, skipping bitstream:

Item Handle: 2339/4318
Bundle Name: ORIGINAL
File Size: 2667225
Checksum: 3db0096cb62b6d595c1e4bb77f6833d0 (MD5)
Asset Store: 0
javax.imageio.IIOException: Can't read input file!
javax.imageio.IIOException: Can't read input file!
at javax.imageio.ImageIO.read(ImageIO.java:1291)
at org.dspace.app.mediafilter.XPDF2Thumbnail.getDestinationStream(XPDF2Thumbnail.java:244)
at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:737)
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:561)
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:511)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:479)
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:353)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:183)
Updating search index:


Note that the text extraction took place in an earlier run of filter-media, so the message "Can't read input file!" is not very credible. Also, the method in which the exception was thrown was XPDF2Thumbnail.getDestinationStream, which suggests that the issue may lie not with the input file but with creating the output file.


In 2012, Osama Alkadi reported a similar issue and solved it by updating the pdftoppm tool. On Debian and Ubuntu, the required tools are contained in the package poppler-utils. I have installed version 0.18.4 on both the test and the production machine. Here is the output:

esxh-15:/srv/dspace# pdftoppm -v
pdftoppm version 0.18.4
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC

The version numbering seems to have changed in unexpected ways, since Osama Alkadi said that he updated from 3.0 to 3.0.2. For the moment, this does not help much.

All other components involved are also the same on both machines. jai_imageio is version 1.1 and jai_core is 1.1.3.


As the file is hard to find in the assetstore, I downloaded it using the browser, copied it back to the server with scp and converted it manually using pdftoppm -jpeg inputfile.pdf outputname. It worked.

I exported the item containing the file using the AIP packager, transferred it to the test server running DSpace 4.1, imported it and ran filter-media there. It worked fine.

I compared the installation instructions of DSpace 4.1 and 1.8.2 and could not find a significant difference regarding the XPDF feature. The mvn package and ant update commands did not show any irregularities.

File permissions in the assetstore did not show any problems. On both machines, DSpace is run as the daemon user tomcat7. In both cases, I run Tomcat 7, albeit in slightly different versions. But Tomcat is not involved in running a command-line tool like bin/dspace filter-media anyway.

So far, I have not found a clue as to where to look for the cause. If anybody has an idea, I'd be grateful.

Bye, Christian


SUZUKI Keiji

Aug 26, 2015, 1:14:41 PM
to dspace-tech
Hi Christian,

This error occurred because ImageIO could not read the file generated
by the pdftoppm command. I think you should check whether
pdftoppm generates a correct file. To do this, I recommend the following two steps.

1) Set the logging level to DEBUG and rerun.
2) Comment out lines 253 to 256 in XPDF2Thumbnail.java
     temporarily, rebuild and run.

With step 1, you can see the actual command executed by DSpace
and the path name of the generated file, and check that these are correct.

With step 2, you can retain the generated file to check its content
and its mode.
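
To illustrate why the message can be misleading: the JDK's ImageIO.read(File) reports a missing or unreadable file with exactly the text "Can't read input file!", so the "input file" in the stack trace may well be the .ppm that pdftoppm was supposed to write, not the PDF. A minimal sketch, not taken from DSpace; the class name MissingPpmDemo, the method readFailureMessage and the file name are made up for the example:

```java
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class MissingPpmDemo {
    /**
     * Returns the message of the exception ImageIO throws for the given file,
     * or null if the read did not throw.
     */
    static String readFailureMessage(File f) {
        try {
            ImageIO.read(f);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical name in the style of the thread; assumed not to exist.
        File expected = new File(System.getProperty("java.io.tmpdir"),
                "prevu-demo-out-000001.ppm");
        System.out.println(expected + " exists=" + expected.exists());
        System.out.println("ImageIO says: " + readFailureMessage(expected));
    }
}
```

If the DEBUG log shows the expected output path with exists=false, this is the most likely source of the exception.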

Regards,
Keiji Suzuki



------------------------------------------------------------------------------
HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
Find What Matters Most in Your Big Data with HPCC Systems
Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
Leverages Graph Analysis for Fast Processing & Easy Data Exploration
http://p.sf.net/sfu/hpccsystems
_______________________________________________
DSpace-tech mailing list
DSpac...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette



--
鈴木敬二@江別市

Christian Völker

Aug 26, 2015, 1:16:05 PM
to SUZUKI Keiji, dspace-tech
Hello,

On 13.06.2014 at 04:47, SUZUKI Keiji <zu...@mbc.ocn.ne.jp> wrote:

> 1) Set the logging level to DEBUG and rerun.

Should have done so before. Thank you for the heads-up.

You were perfectly right. But then, the result leaves me a bit clueless for now:

> esxh-15:/srv/dspace> tail -n 10 log/dspace.log.2014-06-15
> 2014-06-15 12:45:17,812 DEBUG org.dspace.content.BitstreamFormat @ anonymous::find_bitstream_format:bitstream_format_id=2
> 2014-06-15 12:45:17,812 DEBUG org.dspace.storage.rdbms.DatabaseManager @ Running query "SELECT * FROM fileextension WHERE bitstream_format_id= ? " with parameters: 2
> 2014-06-15 12:45:17,851 DEBUG org.dspace.storage.rdbms.DatabaseManager @ Running query "select * from bitstream where bitstream_id = ? " with parameters: 27442
> 2014-06-15 12:45:17,852 DEBUG org.dspace.storage.bitstore.BitstreamStorageManager @ Local filename for 87066288396181747611585923333395102959 is /srv/dspace/assetstore/87/06/62/87066288396181747611585923333395102959
> 2014-06-15 12:45:17,865 INFO net.sf.ehcache.util.UpdateChecker @ New update(s) found: 2.4.7 [http://www.terracotta.org/confluence/display/release/Release+Notes+Ehcache+Core+2.4]
> 2014-06-15 12:45:17,919 DEBUG org.dspace.app.mediafilter.XPDF2Thumbnail @ DPI: pdfinfo method got dpi=75 for max dim=759 (points, 1/72")
> 2014-06-15 12:45:17,920 DEBUG org.dspace.app.mediafilter.XPDF2Thumbnail @ Running xpdf command: [/usr/bin/pdftoppm, -q, -f, 1, -l, 1, -r, 75, /tmp/DSfilt2327548125683453130.pdf, /tmp/prevu8591868713129272046out]
> 2014-06-15 12:45:18,357 DEBUG org.dspace.app.mediafilter.XPDF2Thumbnail @ PDFTOPPM output is: /tmp/prevu8591868713129272046out-000001.ppm, exists=false
> 2014-06-15 12:45:18,420 ERROR org.dspace.app.mediafilter.XPDF2Thumbnail @ Unable to delete file
> 2014-06-15 12:45:18,421 DEBUG org.dspace.storage.rdbms.DatabaseManager @ Running query "SELECT bundle.* FROM bundle, bundle2bitstream WHERE bundle.bundle_id=bundle2bitstream.bundle_id AND bundle2bitstream.bitstream_id= ? " with parameters: 27442
> esxh-15:/srv/dspace> ls -l /tmp
> insgesamt 1272
> drwx------ 2 amanda backup 4096 Jun 15 11:27 amanda
> drwxr-xr-x 2 root root 4096 Jun 15 12:17 hsperfdata_root
> drwxr-xr-x 2 tomcat7 tomcat7 4096 Jun 15 12:45 hsperfdata_tomcat7
> -rw-r--r-- 1 tomcat7 tomcat7 1281435 Jun 15 12:45 prevu8591868713129272046out-1.ppm
> drwxr-xr-x 2 tomcat7 root 4096 Jun 15 12:12 tomcat7-tomcat7-tmp
> drwx------ 2 root root 4096 Jun 15 11:26 vmware-root
> esxh-15:/srv/dspace>

This means that the numbering scheme pdftoppm uses when writing image files for several pages differs from what the XPDF plugin expects. If I got it right, the plugin tells pdftoppm to do this:

/usr/bin/pdftoppm -q -f 1 -l 1 -r 75 /tmp/DSfilt2327548125683453130.pdf /tmp/prevu8591868713129272046out

It expects to find the resulting file here:

/tmp/prevu8591868713129272046out-000001.ppm

However, the file gets written here:

/tmp/prevu8591868713129272046out-1.ppm

Everything is fine regarding file permissions, and the file is in the expected directory /tmp; the only difference is that the plugin expects six digits where pdftoppm writes a single digit.

This raises several questions. Why does the filter write a .ppm file rather than a .jpg file using the -jpeg option of pdftoppm, and when does the actual conversion happen? The task of the filter is always to produce a thumbnail image of the first page, so it would seem much more logical and robust to me to use the -singlefile option of pdftoppm, which does not add anything to the output name besides the file extension. Instead, first page -f and last page -l are both set to 1. But then, I would not need to bother if everything worked fine.

Where does this six digit rule get set?

During my tests I had produced thousands of files starting with /tmp/prevu*. Most of them ended in -1.ppm, but some of them in -01.ppm. Mysterious.

I will try to reproduce the same fault on my test system, which works fine for now, just to understand where the differences are.

For now, I won't try the second suggestion to recompile with the source code commented out, because I guess I have already found the issue; I just don't understand it yet.

Thanks for your support. Further suggestions welcome.

Bye, Christian


SUZUKI Keiji

Aug 26, 2015, 1:16:06 PM
to Christian Völker, dspace-tech
Hi Christian,

2014-06-15 20:18 GMT+09:00 Christian Völker <C.Vo...@gmx.net>:


> This means that the numbering scheme pdftoppm uses when writing image files for several pages differs from what the XPDF plugin expects. If I got it right, the plugin tells pdftoppm to do this:
>
> /usr/bin/pdftoppm -q -f 1 -l 1 -r 75 /tmp/DSfilt2327548125683453130.pdf /tmp/prevu8591868713129272046out
>
> It expects to find the resulting file here:
>
> /tmp/prevu8591868713129272046out-000001.ppm
>
> However, the file gets written here:
>
> /tmp/prevu8591868713129272046out-1.ppm
>
> Everything is fine regarding file permissions, and the file is in the expected directory /tmp; the only difference is that the plugin expects six digits where pdftoppm writes a single digit.
>
> This raises several questions. Why does the filter write a .ppm file rather than a .jpg file using the -jpeg option of pdftoppm, and when does the actual conversion happen? The task of the filter is always to produce a thumbnail image of the first page, so it would seem much more logical and robust to me to use the -singlefile option of pdftoppm, which does not add anything to the output name besides the file extension. Instead, first page -f and last page -l are both set to 1. But then, I would not need to bother if everything worked fine.
>
> Where does this six digit rule get set?

My version of pdftoppm is different from yours, and it writes the
output ppm file with a six-digit sequence number.

  dspace@www:~$ pdftoppm -v
  pdftoppm version 3.02
  Copyright 1996-2007 Glyph & Cog, LLC

I have also confirmed that the version of pdftoppm in the "poppler-utils" package of
Ubuntu 12.04 LTS server (64-bit version) is the same as yours, and that this version
of pdftoppm writes an output file with a single digit.

I use Ubuntu 12.04 LTS server (32-bit version). I can't remember how
I installed my version, but there is an "xpdf-utils" package that is not available for the 64-bit OS.
I may have installed my version from this package.

In any case, I think there are two options.

1) Install version 3.02 of pdftoppm in some way,
2) Edit line 237 of XPDF2Thumbnail.java and rebuild DSpace

from 

File outf = new File(outPrefix+"-000001.ppm");

to

File outf = new File(outPrefix+"-1.ppm");
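
Since both -1.ppm and -01.ppm names were seen in /tmp, a hardcoded suffix may still miss some files. A minimal sketch of a more tolerant lookup that accepts any amount of zero padding for page 1; this is not the DSpace code, and the names PpmOutputLocator and findFirstPage are hypothetical. An anonymous FilenameFilter is used instead of a lambda so that it should also compile on older Java versions:

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.regex.Pattern;

final class PpmOutputLocator {
    /**
     * Looks for the page-1 image written for outPrefix, accepting any zero
     * padding ("-1.ppm", "-01.ppm", "-000001.ppm"). Returns null if none exists.
     */
    static File findFirstPage(String outPrefix) {
        File prefix = new File(outPrefix).getAbsoluteFile();
        File dir = prefix.getParentFile();
        if (dir == null) {
            return null;
        }
        // Quote the prefix so that its characters are matched literally.
        final Pattern name =
                Pattern.compile(Pattern.quote(prefix.getName()) + "-0*1\\.ppm");
        File[] hits = dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File d, String n) {
                return name.matcher(n).matches();
            }
        });
        return (hits == null || hits.length == 0) ? null : hits[0];
    }
}
```

In XPDF2Thumbnail.java the result would take the place of outf, with a null check before the file is handed to ImageIO.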

Hope this helps you.

Regards,
Keiji Suzuki

 





--
鈴木敬二@江別市

SUZUKI Keiji

Aug 26, 2015, 1:16:24 PM
to dspace-tech
Hi Christian,

In my last post I suggested editing the file name in XPDF2Thumbnail.java, but Poppler's
version of pdftoppm seems to vary the length of the sequence number with the number of
pages in the original PDF. This problem has already been fixed as of DSpace 3.0. You can see
this fix at the following URL.


Sorry, I had checked only DSpace 1.8.2 and not the current version.
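
As a rough illustration of the varying length, here is a sketch that pads the page number to as many digits as the total page count has. This rule is an assumption based on the file names seen in this thread (-1.ppm and -01.ppm); it has not been checked against the Poppler source, and the names PopplerPpmName and expectedName are hypothetical:

```java
import java.util.Locale;

final class PopplerPpmName {
    /**
     * Builds the output name assumed for a page: the page number is
     * zero-padded to the number of digits in totalPages.
     */
    static String expectedName(String outPrefix, int page, int totalPages) {
        int width = String.valueOf(totalPages).length();
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%s-%0" + width + "d.ppm", outPrefix, page);
    }
}
```

Under this assumption, a PDF with fewer than ten pages gives a -1.ppm name for the first page and a PDF with 10 to 99 pages gives -01.ppm.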

Regards,
Keiji Suzuki