OutOfMemory when extracting text from pdf


javierf...@gmail.com

Feb 9, 2015, 2:03:57 PM
to hippo-c...@googlegroups.com
Hi,

We are getting an OutOfMemory error when Hippo tries to extract the content of a pdf. We do not really need to index that document. Is there a way to disable text extraction for particular documents?

This is not a cms upload, but a new document created in HST using WorkflowPersistenceManagerImpl.

Thanks!

This is the stack we get:

[INFO] [talledLocalContainer] 10.02.2015 01:49:00 WARN  jackrabbit-pool-3 [LazyTextExtractorField$ParsingTask.run:181] Failed to extract text from a binary property
[INFO] [talledLocalContainer] java.lang.OutOfMemoryError: Java heap space
[INFO] [talledLocalContainer] at java.util.Arrays.copyOf(Arrays.java:2271)
[INFO] [talledLocalContainer] at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
[INFO] [talledLocalContainer] at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:102)
[INFO] [talledLocalContainer] at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
[INFO] [talledLocalContainer] at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
[INFO] [talledLocalContainer] at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
[INFO] [talledLocalContainer] at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
[INFO] [talledLocalContainer] at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
[INFO] [talledLocalContainer] at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
[INFO] [talledLocalContainer] at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
[INFO] [talledLocalContainer] at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
[INFO] [talledLocalContainer] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
[INFO] [talledLocalContainer] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
[INFO] [talledLocalContainer] at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:192)
[INFO] [talledLocalContainer] at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:175)
[INFO] [talledLocalContainer] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[INFO] [talledLocalContainer] at java.util.concurrent.FutureTask.run(FutureTask.java:262)
[INFO] [talledLocalContainer] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
[INFO] [talledLocalContainer] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[INFO] [talledLocalContainer] at java.lang.Thread.run(Thread.java:745)

Ard Schrijvers

Feb 9, 2015, 6:11:17 PM
to hippo-c...@googlegroups.com
On Mon, Feb 9, 2015 at 8:03 PM, <javierf...@gmail.com> wrote:
> Hi,
>
> We are getting an OutOfMemory error when Hippo tries to extract the content
> of a pdf. We do not really need to index that document, is there a way to
> disable text extraction for particular documents?

Per document it might be hard, but for all pdf binaries it is possible.

>
> This is not a cms upload, but a new document created in HST using
> WorkflowPersistenceManagerImpl.

Pdf extraction takes quite some memory, certainly when the pdf is big.
Of course, it might also be that memory is already low due to another
problem (a memory leak); in that case the pdf indexing merely triggers
the OOM but is not the real underlying cause.

How much memory does the application have and how large was the pdf?

Regards Ard



--
Hippo Netherlands, Oosteinde 11, 1017 WT Amsterdam, Netherlands
Hippo USA, Inc. - 745 Atlantic Ave, Eighth Floor, Boston MA 02111,
United States of America.

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

javierf...@gmail.com

Feb 10, 2015, 5:27:40 AM
to hippo-c...@googlegroups.com
Thanks Ard,

The PDF is not too big, 3.8 MB. Fortunately, the indexing seems to happen on a different thread, so even if it fails the server remains stable.

Anyway, I think it would be great to have the option to disable text extraction per file. 

Ard Schrijvers

Feb 11, 2015, 3:52:51 AM
to hippo-c...@googlegroups.com
Hey,

On Tue, Feb 10, 2015 at 11:27 AM, <javierf...@gmail.com> wrote:
> Thanks Ard,
>
> The PDF is not too big, 3.8M. Fortunately it seems the indexing happens on a
> different thread so even if it fails, the server is stable.

And how much memory do the applications have? Indexing a 3.8 MB pdf
does require quite some memory: not so much for the Lucene indexing
part, but the pdf text extraction is cpu and memory intensive.

>
> Anyway, I think it would be great to have the option to disable text
> extraction per file.

You have two options to play with:

1) If you store a pdf in the repository, then *every* cluster node
will do the indexing (and thus the text extraction) separately. The
text extraction needed to index the pdf is the cpu and memory
intensive part, and it takes place on every cluster node; a pdf of
100 MB can for this reason take down or halt an entire cluster. When
uploading a pdf via the cms, we avoid this problem by doing the
extraction in the cms and storing the extracted text in the hippo:text
binary property. When a hippo:text property is available, the
repository does not do its own pdf text extraction but indexes the
hippo:text instead. As a result, extraction is done on a single
cluster node, and indexing the hippo:text with Lucene does not require
much cpu or memory. You could do this text extraction yourself when
uploading a pdf; that way, only a single cluster node gets hit with
the expensive task.

2) If you see that some pdf is too large, you can choose to skip
indexing it completely by adding the mixin 'hippo:skipindex' to the
hippo:resource node. I have never experimented with using this to skip
indexing hippo:resource nodes, but I think it should work.
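The hippo:text approach from option 1 boils down to "extract once at upload time, and let the indexer reuse the stored text". A minimal runnable sketch of that flow, using a plain Map in place of the JCR hippo:resource node and a stub in place of the Tika/PDFBox extraction (only the property names hippo:text and jcr:data are taken from this thread; everything else is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: extract the pdf text once, at upload time, and store it next to
// the binary. The indexer then prefers the stored text over re-running the
// expensive extraction on every cluster node.
public class ExtractOnceSketch {
    static final AtomicInteger expensiveExtractions = new AtomicInteger();

    // Stand-in for the cpu/memory-intensive Tika/PDFBox extraction.
    static String extractPdfText(byte[] pdfBytes) {
        expensiveExtractions.incrementAndGet();
        return "text of " + pdfBytes.length + " bytes";
    }

    // Upload path: store the binary AND the pre-extracted text.
    static void upload(Map<String, Object> resourceNode, byte[] pdfBytes) {
        resourceNode.put("jcr:data", pdfBytes);
        resourceNode.put("hippo:text", extractPdfText(pdfBytes));
    }

    // Indexer path (runs on every cluster node): prefer the stored text.
    static String textForIndexing(Map<String, Object> resourceNode) {
        Object stored = resourceNode.get("hippo:text");
        if (stored != null) {
            return (String) stored;  // cheap: no pdf parsing needed
        }
        return extractPdfText((byte[]) resourceNode.get("jcr:data"));
    }

    public static void main(String[] args) {
        Map<String, Object> node = new HashMap<>();
        upload(node, new byte[] {1, 2, 3});
        // Three cluster nodes index the same document...
        for (int i = 0; i < 3; i++) {
            textForIndexing(node);
        }
        // ...but the expensive extraction ran exactly once, at upload time.
        System.out.println("extractions=" + expensiveExtractions.get());
    }
}
```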

Hope this info helps

Regards Ard

Koen van der Weijden

Feb 11, 2015, 5:18:36 AM
to hippo-c...@googlegroups.com

Hi,


In a project I had a similar problem, not with pdf files but with large Excel documents. This was a 7.8 project; I am not sure how to do this in 7.9.

The solution was to exclude Excel documents from indexing, so Excel documents are not indexed anymore. In repository.xml there is the SearchIndex section with a param named 'textFilterClasses'; removing the class org.apache.jackrabbit.extractor.MsExcelTextExtractor from it stops the indexing of Excel documents. I think you can do something similar with the PdfTextExtractor class to disable pdf indexing.
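For reference, that part of repository.xml might look roughly like this; the extractor list shown here is illustrative and version-dependent, so check the full list in your own repository.xml rather than copying this fragment verbatim:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Remove the extractor classes you do not want: e.g. drop
       MsExcelTextExtractor to stop indexing Excel binaries, or
       PdfTextExtractor to stop indexing pdf binaries. -->
  <param name="textFilterClasses"
         value="org.apache.jackrabbit.extractor.PlainTextExtractor,
                org.apache.jackrabbit.extractor.MsExcelTextExtractor,
                org.apache.jackrabbit.extractor.PdfTextExtractor"/>
</SearchIndex>
```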


-Koen

Ard Schrijvers

Feb 11, 2015, 5:36:24 AM
to hippo-c...@googlegroups.com
On Wed, Feb 11, 2015 at 11:18 AM, Koen van der Weijden
<k.vande...@onehippo.com> wrote:
> Hi,
>
>
> In a project I had a similar problem, not with pdf files but with large
> Excel documents. This was a 7.8 project not sure how to do this in 7.9.
>
>
> The solution was to exclude Excel documents from indexing, so Excel
> documents are not indexed anymore. In the repository.xml there is the
> SearchIndex section, with ‘param name’ ‘textFilterClasses’, removing the
> class org.apache.jackrabbit.extractor.MsExcelTextExtractor stops indexing
> Excel documents. I think you can do something similar for the
> pdfTextExtractor class and disabling pdf indexing.

Yes, that is how you can disable all pdf indexing, but not on a per-pdf basis.

Regards Ard

Woonsan Ko

Feb 11, 2015, 10:03:01 AM
to hippo-c...@googlegroups.com
On 2/9/15 6:11 PM, Ard Schrijvers wrote:
> On Mon, Feb 9, 2015 at 8:03 PM, <javierf...@gmail.com> wrote:
>> Hi,
>>
>> We are getting an OutOfMemory error when Hippo tries to extract the content
>> of a pdf. We do not really need to index that document, is there a way to
>> disable text extraction for particular documents?
>
> Per document it might be hard, but for all pdf binaries it is possible.
>
>>
>> This is not a cms upload, but a new document created in HST using
>> WorkflowPersistenceManagerImpl.
>
> Pdf extraction takes quite some memory, certainly when it is big, but
> of course, it might also be that memory is already low due to another
> problem (a memory leak), hence, the pdf indexing might just trigger
> the OOM but not be the real underlying cause.

If I understood the stack trace and source correctly, the Jackrabbit
search indexer executes text extraction tasks asynchronously using a
thread (pool) executor (RepositoryContext#getExecutor()).
So, of course, just one pdf file, say < 10 MB, wouldn't by itself be
the cause of the OOME, but multiple pooled tasks like that waiting in
the thread pool together might well account for huge heap memory
consumption.
In that case, o.a.j.c.q.lucene.LazyTextExtractorField.ParsingTask.run()
seems to just save "TextExtractionError" instead.
Jukka seems to have experimented with an out-of-process solution for
more reliability:
- https://issues.apache.org/jira/browse/TIKA-416
Has anyone had any experience with that?
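The isolation described above (a failed extraction task does not crash the submitting thread, and the indexer can store a placeholder instead of the text) can be sketched stand-alone. The real code path is Jackrabbit's ParsingTask; this sketch simulates the extraction failure with a plain RuntimeException (a real OutOfMemoryError is an Error, but the Future captures it via ExecutionException in the same way):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: text extraction runs as a task on an executor, so an error thrown
// inside it is captured by the Future instead of propagating to the caller.
public class AsyncExtractionFailure {
    static String indexText(ExecutorService executor) throws InterruptedException {
        Callable<String> parse = () -> {
            // Stands in for Tika/PDFBox blowing up mid-extraction.
            throw new RuntimeException("simulated extraction failure");
        };
        Future<String> parsed = executor.submit(parse);
        try {
            return parsed.get();
        } catch (ExecutionException e) {
            // The submitting thread survives; the document can be indexed
            // with a placeholder instead of the extracted text.
            return "TextExtractionError";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        System.out.println(indexText(executor));
        executor.shutdown();
    }
}
```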

Kind regards,

Woonsan
w....@onehippo.com www.onehippo.com
Boston - 745 Atlantic Ave, 8th Floor, Boston MA 02111
Amsterdam - Oosteinde 11, 1017 WT Amsterdam

Ard Schrijvers

Feb 11, 2015, 10:16:16 AM
to hippo-c...@googlegroups.com
Afaik not, but as I mentioned in the other thread, the biggest issue
(every cluster node doing the pdf extraction separately) is solved by
doing the extraction once. The cms does this in
org.hippoecm.frontend.editor.plugins.resource.ResourceHelper, see [1].
Possibly you could also do the pdf extraction with a fixed thread
pool, or synchronized within a static method, if you expect that many
concurrent pdf extractions might be required.
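The fixed-thread-pool idea can be sketched like this; the extraction body is a stub (a real one would call Tika/PDFBox, as ResourceHelper does), and the pool size of 2 is an arbitrary example value to tune against your heap:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: funnel all pdf extractions through a small fixed thread pool so
// only a bounded number hold extraction buffers in memory at the same time.
public class BoundedExtraction {
    static final int MAX_CONCURRENT = 2;  // tune to your heap size
    static final ExecutorService POOL =
            Executors.newFixedThreadPool(MAX_CONCURRENT);

    static final AtomicInteger active = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    static void extract(String pdfName) {
        int now = active.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);  // record peak concurrency
        try {
            Thread.sleep(50);                   // pretend to parse the pdf
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10; i++) {
            final String name = "doc-" + i + ".pdf";
            POOL.submit(() -> extract(name));
        }
        POOL.shutdown();
        POOL.awaitTermination(10, TimeUnit.SECONDS);
        // Never more than MAX_CONCURRENT extractions ran at once.
        System.out.println("peak concurrent extractions: " + peak.get());
    }
}
```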

Regards Ard


[1] https://svn.onehippo.org/repos/hippo/hippo-cms7/cms/trunk/api/src/main/java/org/hippoecm/frontend/editor/plugins/resource/ResourceHelper.java