[Dspace-tech] OutOfMemory errors during large PDF indexing

Tim Donohue

Aug 24, 2015, 5:18:56 PM
to dspace-tech
All,

I'm curious if anyone out there has run into strange OutOfMemory errors
while full-text indexing larger (>10MB) PDF files in DSpace.

It usually appears as either:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

OR

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I've located the main "problem" PDF in our DSpace instance:
http://hdl.handle.net/2142/2050

I've also done a large amount of searching/testing based on
recommendations from various sites. In particular, I've done a memory
dump using JHat
(http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
it looks like the problem may reside with a potential memory leak in the
3rd party PDFBox tool used by DSpace 1.4.2. (In particular, it *looks*
like PDFBox is attempting to load most/all of the textual content into a
giant HashMap)

Here are the latest settings I've been testing with:

RHEL 4
Java 1.6.0_02
Postgres 8.1.9
DSpace 1.4.2

We also have the following JAVA_OPTS settings in place for our JVM:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(We initially had Xmx and Xms at 512MB, but I bumped it up and we're
still getting the OutOfMemory exception at 1GB!)

Anyone have any hints/tips or JVM settings to share? I personally don't
see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
the JHat analysis seemed to be pointing to PDFBox.

- Tim

P.S. an example of the full error stack trace is below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(Unknown Source)
at java.util.HashMap.addEntry(Unknown Source)
at java.util.HashMap.put(Unknown Source)
at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)

Jayan Chirayath Kurian

Aug 24, 2015, 5:19:02 PM
to Tim Donohue, dspac...@lists.sourceforge.net
Hi Tim,

We faced similar errors here while trying out full-text indexing on
DSpace 1.4.1 / Windows 2003 Standard Edition, with roughly 100,000
records. The errors went away once we raised the heap size in dsrun.bat
to 1000m, in the line that reads: java -Xmx256m -classpath ........
http://repositorydev.ntu.edu.sg

Jayan
------------------------------------------------------------------------
-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
DSpace-tech mailing list
DSpac...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech

Mark Diggory

Aug 24, 2015, 5:19:04 PM
to Jayan Chirayath Kurian, dspac...@lists.sourceforge.net, Tim Donohue
We should consider adding more sane defaults. Most machines that
DSpace runs on have well over 1GB of memory available, and it's
important to remember that this is a maximum heap size, which is not
taken unless required. I think setting dsrun and the other command-line
scripts to 512m (half of 1GB) would eliminate most outlying cases
where PDF docs need to be held in memory.

-Mark Diggory
~~~~~~~~~~~~~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology



Tim Donohue

Aug 24, 2015, 5:19:05 PM
to Mark Diggory, dspac...@lists.sourceforge.net, Jayan Chirayath Kurian
Jayan & Mark,

Thanks for the suggestions. But, our problem is that we're currently
running Java & dsrun using:

JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8

(I've modified our local dsrun script to read from the JAVA_OPTS
environment variable).
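
One way to confirm that a modified dsrun really passes JAVA_OPTS through to the JVM is to launch a tiny class with the same options and print what the runtime reports. The following is a minimal sketch using only the standard library; the class name HeapCheck is made up for illustration and is not part of DSpace.

```java
// Hypothetical helper, not part of DSpace: prints the heap limits the JVM actually received.
class HeapCheck {
    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() roughly reflects -Xmx; some collectors report slightly less.
        System.out.println("max heap   (MB): " + toMegabytes(rt.maxMemory()));
        // totalMemory() is the heap currently reserved; with -Xms equal to -Xmx it should be close to the maximum.
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free heap  (MB): " + toMegabytes(rt.freeMemory()));
    }
}
```

If the reported maximum is far below 1024 when launched the same way as the indexing job, the options are probably not reaching the JVM.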

So, even setting a maximum heap size of 1GB, we don't seem to be able to
full text index a 15MB PDF without encountering "OutOfMemory: Java heap
space" errors. Strange, I know. My current theory is that there may be
a memory leak in the PDFBox tools. I'm still working on a definite
diagnosis though. If no one else out there has noticed this with DSpace
1.4.2, then I guess it's possible there's something in our local
settings (or customizations of DSpace) which could be causing this issue.

- Tim

--

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: tdon...@uiuc.edu
web: http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax: (217) 244-7764
========================================

Mark Diggory

Aug 24, 2015, 5:19:07 PM
to Tim Donohue, dspac...@lists.sourceforge.net, Jayan Chirayath Kurian
I would then also recommend getting the latest PDFBox and
replacing the jar in your lib directory.

http://sourceforge.net/project/showfiles.php?group_id=78314&package_id=79377

Jimmy Zhang

Aug 24, 2015, 5:19:11 PM
to Tim Donohue, dspac...@lists.sourceforge.net
PDFBox's job is to extract the full text of the PDF file, so I wonder whether the problem has to do with the PDF file itself. Do you mean that any PDF larger than 10MB can cause the problem, or only that one file?

--
Website: www.drepository.com


Mark Diggory

Aug 24, 2015, 5:19:12 PM
to Jimmy Zhang, dspac...@lists.sourceforge.net, Tim Donohue
Yes, I recall some issues in the past which we addressed by upgrading that jar to the latest. DSpace 1.4.2/1.5 should have the latest PDFBox jar (0.7.3) while DSpace 1.4.1 has an older version.

Unfortunately, PDFBox is a full-featured PDF editor and not just a text extractor, so it pulls large portions of the PDF into memory when processing it. If we could find a more stream-based text extractor for PDF files, the memory footprint of FilterMedia would be much closer to fixed.
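
To illustrate the difference between the two approaches, here is a minimal sketch that counts words in a text source twice: once by buffering everything first, and once by working through a fixed-size buffer. Plain text stands in for PDF content, no PDFBox API is involved, and the class and method names are invented for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustration only: plain text stands in for PDF content; no PDFBox API is used.
class StreamingSketch {
    /** Buffers the entire input before processing, so memory use grows with document size. */
    static int countWordsInMemory(Reader in) throws IOException {
        StringBuilder all = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            all.append(buf, 0, n);
        }
        String trimmed = all.toString().trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Processes one fixed-size buffer at a time, so memory use does not depend on document size. */
    static int countWordsStreaming(Reader in) throws IOException {
        char[] buf = new char[4096];
        int words = 0;
        boolean inWord = false; // carried across buffer boundaries
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                boolean ws = Character.isWhitespace(buf[i]);
                if (!ws && !inWord) {
                    words++; // first character of a new word
                }
                inWord = !ws;
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String text = "full text extracted from a scanned thesis";
        System.out.println("in memory: " + countWordsInMemory(new StringReader(text)));
        System.out.println("streaming: " + countWordsStreaming(new StringReader(text)));
    }
}
```

Both methods give the same count; only the first one needs heap proportional to the input, which is the behaviour a stream-based extractor would avoid.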

-Mark

Dan Scott

Aug 24, 2015, 5:19:13 PM
to Mark Diggory, dspac...@lists.sourceforge.net, Tim Donohue, Jimmy Zhang
I should note to the list that the latest development version of
PDFBox claims to have solved an Out of Memory Exception error. Sounds
familiar :)

I had suggested to Tim privately that maybe he could test it out and let us
know if it resolves the problem:

http://www.pdfbox.org/changes.html#version_0.7.4-dev
Dan Scott
Laurentian University

Tim Donohue

Aug 24, 2015, 5:20:04 PM
to Dan Scott, dspac...@lists.sourceforge.net, Mark Diggory, Jimmy Zhang
Just to follow up briefly:

Thanks for all the great suggestions from everyone! I've tried the
latest development version of PDFBox today and unfortunately that didn't
seem to resolve anything :(

I've also noticed that it *seems* to be related to the size of the PDFs.
We just received a bulk load into our DSpace of about 100+ PDFs, and
I've now run into about 3-4 which cause the OutOfMemory errors (all of
which are between 11MB and 15MB). The only other thing in common is
that all of these PDFs were initially image-based, and were OCRed before
ingesting them into DSpace (not sure if that could be "confusing" PDFBox)
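
Since only a handful of files in a batch trigger the error, and the stack trace earlier in the thread shows it propagating all the way up to MediaFilterManager.main, one possible stopgap is to isolate each file so that a failure is recorded and the run continues. The sketch below is hypothetical: BatchRunner and TextExtractor are invented names, not DSpace or PDFBox classes, and the out-of-memory condition is simulated rather than real. Catching OutOfMemoryError is not guaranteed to leave the JVM in a healthy state, so treat this as a workaround idea only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: BatchRunner and TextExtractor are invented names, not DSpace or PDFBox classes.
class BatchRunner {
    /** Stand-in for whatever performs text extraction on one file. */
    interface TextExtractor {
        String extract(String fileName);
    }

    /** Runs the extractor on every file and returns the names of the files that ran out of memory. */
    static List<String> runAll(List<String> fileNames, TextExtractor extractor) {
        List<String> failed = new ArrayList<String>();
        for (String name : fileNames) {
            try {
                extractor.extract(name);
            } catch (OutOfMemoryError e) {
                // Objects allocated for this file become unreachable once extract() unwinds,
                // so the next file can usually proceed. This is not guaranteed in general.
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Simulated extractor: throws for one file name instead of really exhausting the heap.
        TextExtractor simulated = new TextExtractor() {
            public String extract(String fileName) {
                if (fileName.startsWith("scanned")) {
                    throw new OutOfMemoryError("simulated: Java heap space");
                }
                return "text of " + fileName;
            }
        };
        List<String> failed = runAll(Arrays.asList("a.pdf", "scanned-thesis.pdf", "b.pdf"), simulated);
        System.out.println("skipped: " + failed);
    }
}
```

The returned list gives a record of the problem files, which could then be reported upstream or retried separately with a larger heap.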

In any case, I've logged a bug with PDFBox on SourceForge and referenced
a few of the PDFs which have these issues. I'm hoping they'll be able
to help debug it :)

http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

I'll post back to this thread once there is a resolution to the issue in
case others run across this problem as well.

- Tim