All,
I'm curious if anyone out there has run into strange OutOfMemory errors
while full-text indexing larger (>10MB) PDF files in DSpace.
It usually appears as either:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
OR
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I've located the main "problem" PDF in our DSpace instance:
http://hdl.handle.net/2142/2050
I've also done a lot of searching and testing based on
recommendations from various sites. In particular, I've analyzed a heap
dump using jhat
(http://java.sun.com/javase/6/docs/technotes/tools/share/jhat.html), and
it looks like the problem may be a memory leak in the third-party
PDFBox library used by DSpace 1.4.2. (In particular, it *looks* like
PDFBox is attempting to load most or all of the textual content into a
giant HashMap.)
Here are the latest settings I've been testing with:
RHEL 4
Java 1.6.0_02
Postgres 8.1.9
DSpace 1.4.2
We also have the following JAVA_OPTS settings in place for our JVM:
JAVA_OPTS=-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
(We initially had Xmx and Xms at 512MB, but I bumped them up and we're
still getting the OutOfMemory exception at 1GB!)
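One thing worth ruling out: the stack trace comes from a command-line
run (MediaFilterManager.main), and JAVA_OPTS is normally read by the
servlet container's startup scripts, not by the java launcher itself.
So the filter job may be running with a smaller heap than the settings
above suggest. A minimal sketch for checking this is below. HeapCheck
is a hypothetical class name, not part of DSpace, and the idea is to
launch it the same way the filter job is launched and compare the
reported maximum with the -Xmx value.

```java
public class HeapCheck {

    /** Maximum heap this JVM will attempt to use, in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // maxMemory() reflects the effective -Xmx limit of this process.
        System.out.println("max heap:   " + maxHeapMb() + " MB");
        // totalMemory() is the heap currently reserved; freeMemory() is
        // the unused part of that reserved heap.
        System.out.println("total heap: " + (rt.totalMemory() / mb) + " MB");
        System.out.println("free heap:  " + (rt.freeMemory() / mb) + " MB");
    }
}
```

The reported maximum can be somewhat lower than the -Xmx value, but if
it is nowhere near 1024 MB, the options are probably not reaching the
JVM that runs the media filter.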
Anyone have any hints/tips or JVM settings to share? I personally don't
see why PDFBox would need so much JVM memory to parse a 15MB PDF. But,
the JHat analysis seemed to be pointing to PDFBox.
- Tim
P.S. An example of the full stack trace is below:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(Unknown Source)
    at java.util.HashMap.addEntry(Unknown Source)
    at java.util.HashMap.put(Unknown Source)
    at org.fontbox.cmap.CMap.addMapping(CMap.java:132)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:153)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
    at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:391)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:342)