[Dspace-tech] filter-media and out of memory error.


Jose Blanco

Aug 24, 2015, 4:09:55 PM
to dspac...@lists.sourceforge.net

Recently I imported about 7,000 items into our repository, and ever since then the media-filter has been giving us an out-of-memory error. I’ve increased the memory allocated when dsrun is executed:

java -Xmx768m -classpath $FULLPATH "$@"

But I’m still getting the error. I’m also getting an unusual number of errors about PDF files being left open. Here is a snippet of the error log. Any help would be greatly appreciated.

 

$FinalizerThread.run(Finalizer.java:160)
java.lang.Throwable: Warning: You did not close the PDF Document
      at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:384)
      at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
      at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
      at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)
java.lang.Throwable: Warning: You did not close the PDF Document
      at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:384)
      at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
      at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
      at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)
ERROR filtering, skipping bitstream #159877 java.io.IOException: You do not have permission to extract text
java.io.IOException: You do not have permission to extract text
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:140)
      at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:99)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:106)
      at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:162)
      at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:287)
      at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:250)
      at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:224)
      at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:195)
java.lang.Throwable: Warning: You did not close the PDF Document
      at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:384)
      at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
      at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
      at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)
java.lang.Throwable: Warning: You did not close the PDF Document
      at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:384)
      at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
      at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
      at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
      at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)
Exception in thread "main" java.lang.OutOfMemoryError

Scott Yeadon

Aug 24, 2015, 4:10:00 PM
to bla...@umich.edu, dspac...@lists.sourceforge.net
Jose,

If you're using DSpace 1.4, you can break the filter-media job down to run
over individual communities, collections and items. Running a set of
separate jobs rather than one over the entire repository should stop you
running out of memory.

WRT your PDF issue, maybe try updating the PDFBox.jar file to the latest
version, that might resolve some of the PDF messages you're getting (see
http://sourceforge.net/tracker/index.php?func=detail&aid=1553991&group_id=19984&atid=319984)

Scott.

>Date: Tue, 12 Sep 2006 16:51:38 -0400
>From: "Jose Blanco" <bla...@umich.edu>
>Subject: [Dspace-tech] filter-media and out of memory error.
>To: <dspac...@lists.sourceforge.net>
>Message-ID: <E1GNFDe-...@mail.sourceforge.net>
>Content-Type: text/plain; charset="us-ascii"

Jose Blanco

Aug 24, 2015, 4:10:03 PM
to Scott Yeadon, dspac...@lists.sourceforge.net
Scott:

It may be a couple of weeks before I get around to completely upgrading to
1.4. Do you think that getting the latest version PDFBox.jar might fix the
memory problem? Is it running out of memory because of all these PDF
errors, or because there are too many items to index?

Thanks!
Jose

birong ho

Aug 24, 2015, 4:10:04 PM
to scott....@anu.edu.au, dspac...@lists.sourceforge.net
Hi, Scott,

Here at Eastern Michigan University, we just upgraded our development
instance to 1.4.

We thought upgrading to 1.4 would solve our problem with filter-media, but
it did not.

Can you provide any further insight into how to run filter-media on a
specific collection? Are there any docs that I might have missed?

Thank you very much.

Birong Ho.
> _______________________________________________
> DSpace-tech mailing list
> DSpac...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>


Jose Blanco

Aug 24, 2015, 4:10:05 PM
to birong ho, scott....@anu.edu.au, dspac...@lists.sourceforge.net
Birong:

Do you get a lot of the PDF errors when you run filter-media?

birong ho

Aug 24, 2015, 4:10:07 PM
to Jose Blanco, dspac...@lists.sourceforge.net, scott....@anu.edu.au
No,

mine just hangs ...

Here is the message:

Applying Media Filters
SKIPPED: bitstream 62 because 'thes_hon_05_LittleJ_1.pdf.txt' already exists
SKIPPED: bitstream 65 because 'thes_hon_05_GilbertK_1.pdf.txt' already exists

Scott Yeadon

Aug 24, 2015, 4:10:14 PM
to birong ho, dspac...@lists.sourceforge.net
Hi Birong,

Our cron job now works on particular collections and communities, rather
than trying to run over the whole repository at once.

If you use the -i <handle> option you can specify a community,
collection or item to run filter-media across. We also use the -n option
to save time. If you check the 1.4 docs there's a MediaFilter section in
the "Application Layer" section in the index. Click on that and it
should explain all.
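
The per-collection approach could be scripted roughly as follows. This is only a sketch, written in Java to match the rest of the thread: the install path /dspace, the script location bin/filter-media, the two handles, and the class and method names are all assumptions made for illustration. Only the -i <handle> and -n options come from the message above. The program is a dry run that builds and prints one command per handle without executing anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one filter-media command per handle (dry run).
class FilterMediaBatch {

    // Builds the command line for a single run restricted to one handle.
    static List<String> buildCommand(String dspaceDir, String handle) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(dspaceDir + "/bin/filter-media"); // assumed script location
        cmd.add("-i");                            // restrict the run to one community, collection or item
        cmd.add(handle);
        cmd.add("-n");                            // the option mentioned above as saving time
        return cmd;
    }

    public static void main(String[] args) {
        // Made-up handles; replace with the handles of your own collections.
        List<String> handles = Arrays.asList("123456789/2", "123456789/3");
        for (String handle : handles) {
            // Print the command instead of running it.
            System.out.println(String.join(" ", buildCommand("/dspace", handle)));
        }
    }
}
```

To actually run each job in its own JVM, one option would be to pass each list to java.lang.ProcessBuilder and wait for the process to finish before starting the next, so that memory is released between collections.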

You might also try the new PDFBox.jar file as suggested in an earlier
email, see if that helps. If you can narrow down the object causing the
hanging maybe by using the -i option (assuming it's an object having a
media-filter crisis and not some other issue) that might help as well.
(Also check the dspace logs)

Scott.

Scott Yeadon

Aug 24, 2015, 4:10:15 PM
to Jose Blanco, dspac...@lists.sourceforge.net
Hi Jose,

It's worth a try. All you need to do is download the new jar and re-run
filter-media, so it shouldn't take long to find out!

Scott.

Gary Browne

Aug 24, 2015, 4:10:16 PM
to dspac...@lists.sourceforge.net
Hi All

I just tried updating to 0.7.2 PDFBox.jar which I believe is the latest
version. I'm running DSpace 1.3.2 in production. I was also getting the
following errors with the media filter script:
"You did not close the PDF"
"You do not have permission to extract text"

Neither of these was fixed by updating to the 0.7.2 PDFBox.jar.
However, I took one of the offending PDF docs and put it in our DSpace
1.4 development version, and it was parsed with no problem.

So perhaps upgrading to 1.4 is the simplest solution...?

Regards
Gary


Gary Browne
Development Programmer
Library IT Services
University of Sydney
Australia
ph: 61-2-9351 5946

Mark H. Wood

Aug 24, 2015, 4:10:24 PM
to dspac...@lists.sourceforge.net
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 14 Sep 2006, Gary Browne wrote:
> I just tried updating to 0.7.2 PDFBox.jar which I believe is the latest
> version. I'm running DSpace 1.3.2 in production. I was also getting the
> following errors with the media filter script:
> "You did not close the PDF"

I think that the code which calls PDFBox has to handle that. If so, then
a newer PDFBox won't fix it. (It's been a while since I looked at the
code, so perhaps I misremember. Once I grew annoyed enough to track down
the problem but, alas! not annoyed enough to fix it -- yet.)
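
A minimal sketch of what "the calling code has to handle that" could look like, assuming the warning appears because the document is never closed when text extraction throws. TrackedDocument is a hypothetical stand-in written for this example and is not a PDFBox class; with the real library the same finally block would call the document's own close method.

```java
import java.io.IOException;

// Hypothetical stand-in for a PDF document handle (NOT the PDFBox API).
class TrackedDocument {
    private boolean open = true;
    private final boolean extractionAllowed;

    TrackedDocument(boolean extractionAllowed) {
        this.extractionAllowed = extractionAllowed;
    }

    // Mimics text extraction that fails for protected documents.
    String extractText() throws IOException {
        if (!extractionAllowed) {
            throw new IOException("You do not have permission to extract text");
        }
        return "extracted text";
    }

    void close() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }
}

class CloseInFinally {

    // Returns the extracted text, or null if extraction failed.
    // The document is closed on both paths because close() sits in finally.
    static String extractAndClose(TrackedDocument doc) {
        try {
            return doc.extractText();
        } catch (IOException e) {
            System.err.println("ERROR filtering, skipping: " + e.getMessage());
            return null;
        } finally {
            doc.close();
        }
    }

    public static void main(String[] args) {
        TrackedDocument locked = new TrackedDocument(false);
        String text = extractAndClose(locked);
        System.out.println("text=" + text + ", stillOpen=" + locked.isOpen());
    }
}
```

The point of the pattern is that a failed extraction no longer leaves the document for the finalizer to complain about.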

> "You do not have permission to extract text"

This should not be "fixed" in PDFBox or any other code. PDFBox is being a
good citizen and obeying the author's choice of permission flags, set
inside the PDF. The proper fix for this message is to get the author to
unlock text extraction and send you the updated copy.

However, the newer version *did* eliminate other messages about things
produced by bleeding-edge PDF writer tools that older versions of PDFBox
didn't understand. I think it's worth having anyway. In fact I'd rather
that DSpace didn't come with so many foreign packages included -- I'd
rather fetch the latest from the original source, and know what version
I've got.

- --
Mark H. Wood, Lead System Programmer mw...@IUPUI.Edu
Typically when a software vendor says that a product is "intuitive" he
means the exact opposite.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.4 (GNU/Linux)
Comment: pgpenvelope 2.10.2 - http://pgpenvelope.sourceforge.net/

iD8DBQFFCU8Cs/NR4JuTKG8RAuAsAJ4ofOteefNP1nRQd1WWCwMAUSml8ACbBs1A
nIcOWJ3PEon+gFYTe7rbDeI=
=ZimM
-----END PGP SIGNATURE-----

Jose Blanco

Aug 24, 2015, 4:10:31 PM
to Scott Yeadon, dspac...@lists.sourceforge.net
Is this where I should download the latest PDFBox.jar from?

http://dspace.cvs.sourceforge.net/dspace/dspace/lib/PDFBox.jar?view=log

Thanks!

Claudia Jürgen

Aug 24, 2015, 4:10:32 PM
to Jose Blanco, dspac...@lists.sourceforge.net
Hi Jose,

You can get the latest PDFBox.jar from:

http://sourceforge.net/project/showfiles.php?group_id=78314

Unzip it; PDFBox.jar is in the lib directory.

Claudia



Jose Blanco

Aug 24, 2015, 4:10:53 PM
to Claudia Jürgen, dspac...@lists.sourceforge.net
BTW, this latest version seems to have fixed the out of memory error also.

Thanks!