java.lang.OutOfMemoryError: Java heap space - help


Grigoriy Shmigol

Jul 11, 2014, 1:45:04 PM
to cascadi...@googlegroups.com
Need assistance with figuring out why Cascading is running out of heap space while running in local mode.

Scenario:
Cascading - 2.5.5
mode - local
I have two CSV files with about 3.5M rows and about 20 columns; size: 1,005 MB.
The data is tab-separated and has quotes around it.

The code is trying to read in the data but gets the error below.

2014-07-10 18:24:19,361 WARN   [pool-2-thread-2] Cascade M[logWarn] - [] flow failed: ...
java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.substring(String.java:1913)
    at java.lang.String.subSequence(String.java:1946)
    at java.util.regex.Pattern.split(Pattern.java:1202)
    at cascading.scheme.util.DelimitedParser.createSplit(DelimitedParser.java:253)
    at cascading.scheme.util.DelimitedParser.onlyParseLine(DelimitedParser.java:417)
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:342)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:650)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


YourKit reports a deadlock, but perhaps this one is due to an inefficiency in how Cascading processes large CSV files in local mode:

pool-1109-thread-1 <--- Frozen for at least 48m 27 sec
java.lang.StringBuilder.toString()
cascading.scheme.util.DelimitedParser.onlyParseLine(LineReader, StringBuilder, boolean)
cascading.scheme.util.DelimitedParser.parseLine(LineReader, boolean)
cascading.scheme.local.TextDelimited.source(FlowProcess, SourceCall)
cascading.tuple.TupleEntrySchemeIterator.getNext()
cascading.tuple.TupleEntrySchemeIterator.hasNext()
cascading.flow.stream.SourceStage.map(Object)
cascading.flow.stream.SourceStage.call()<2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()

Grigoriy Shmigol

Jul 11, 2014, 1:57:47 PM
to cascadi...@googlegroups.com
Forgot to add that the file is being read in using TextDelimited:

    final Tap sourceLhs = new FileTap(new TextDelimited(true, CompareToolConstants.DELIMITER), inputFileNameLhs);

Bertrand Dechoux

Jul 13, 2014, 9:34:59 PM
to cascadi...@googlegroups.com
You might want to check the format of your file. From what I remember, TextDelimited is well behaved if the format is correct.
A missing or extra quote might be responsible for what you are seeing.
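A minimal, library-free sketch of the kind of check Bertrand suggests (this is not Cascading code; it assumes the double-quote character is the quote, so adjust for your quoting scheme):

```java
// Scans a tab/comma-delimited text for the first line whose double-quote
// count is odd - a likely unterminated quoted field, which can make a
// delimited parser keep accumulating input across lines.
public class QuoteCheck {
    // Returns the 1-based number of the first unbalanced line, or -1 if all
    // lines have an even number of quote characters.
    public static long firstUnbalancedLine(String content) {
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            long quotes = lines[i].chars().filter(c -> c == '"').count();
            if (quotes % 2 != 0)
                return i + 1;
        }
        return -1;
    }
}
```

Running this over both input files before the flow starts would point directly at a malformed row, if one exists.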

Regards

Bertrand Dechoux

Grigoriy Shmigol

Jul 14, 2014, 4:11:01 PM
to cascadi...@googlegroups.com
Bertrand - thank you for your response. I thought the same, and have cleaned the input files - essentially escaping every non-letter/non-numeric character.

I am still seeing the error, but now it seems to have moved into java.util.Arrays.copyOfRange, which suggests that the results collection is simply too large for the heap I am allocating. I will certainly try a larger heap (just to see if it works), but I would rather fix the issue. Any recommendations?

So, more details:
a simple outer join with a custom filter that removes rows matching known-good criteria.
2 CSV files, LHS and RHS, each with 2,986,923 rows and 25 columns. Note: this worked fine on files with 1,322,267 rows and 25 columns.

Heap: 24 GB
file size: 588 MB (each)
only 1 pipe running
operations:
 CoGroup with OuterJoin
 Each with new Identity
 Each with a custom filter

Also, does it mean that I have reached some sort of limitation of what Cascading can do in local mode?

java.util.Arrays.copyOfRange(char[], int, int)
java.lang.String.<init>(char[], int, int)
java.lang.String.substring(int, int)
java.lang.String.subSequence(int, int)
java.util.regex.Pattern.split(CharSequence, int)
cascading.scheme.util.DelimitedParser.createSplit(String, Pattern, int)

cascading.scheme.util.DelimitedParser.onlyParseLine(LineReader, StringBuilder, boolean)
cascading.scheme.util.DelimitedParser.parseLine(LineReader, boolean)
cascading.scheme.local.TextDelimited.source(FlowProcess, SourceCall)
cascading.tuple.TupleEntrySchemeIterator.getNext()
cascading.tuple.TupleEntrySchemeIterator.hasNext()
cascading.flow.stream.SourceStage.map(Object)
cascading.flow.stream.SourceStage.call()<2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()
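For what it's worth, the trace above bottoms out in Pattern.split -> String.substring -> Arrays.copyOfRange: on Java 7+, every field split out of a line is a freshly copied String, so parsing allocates roughly one String per column per row. With ~3M rows x 25 columns that is tens of millions of short-lived objects before the join buffers anything. A rough, non-Cascading illustration of that per-line cost:

```java
import java.util.regex.Pattern;

// Illustrates the allocation pattern in the stack trace: Pattern.split
// returns one newly allocated String per field on the line.
public class SplitCost {
    // Returns the number of Strings allocated when splitting one line;
    // limit -1 keeps trailing empty fields, as delimited parsers usually do.
    public static int allocationsPerLine(String line, String delimiter) {
        return Pattern.compile(delimiter, Pattern.LITERAL).split(line, -1).length;
    }
}
```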


--
You received this message because you are subscribed to a topic in the Google Groups "cascading-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/cascading-user/69Tr4OW0w6k/unsubscribe.
To unsubscribe from this group and all its topics, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/aa1f1a5c-34c9-4aca-84f1-50c148a25fe6%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Chris K Wensel

Jul 14, 2014, 4:24:48 PM
to cascadi...@googlegroups.com
Just want to point out that you are running in Cascading local mode (FileTap), so you _will_ have memory issues with large inputs if you are doing a GroupBy or CoGroup.

It wasn't designed to run at scale; use Hadoop for that.
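To make Chris's point concrete, here is a toy sketch (not Cascading's actual implementation) of why a single-process, non-spilling join is heap-bound: both sides of a full outer join must be held in memory keyed by the grouping field, so heap use grows linearly with input size.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy full outer join over two in-memory maps, keyed by the grouping field.
// The point: every row of both inputs lives on the heap at once, which is
// roughly what a local-mode CoGroup implies for large inputs.
public class InMemoryOuterJoin {
    public static Map<String, String[]> outerJoin(Map<String, String> lhs,
                                                  Map<String, String> rhs) {
        Map<String, String[]> result = new TreeMap<>();
        TreeSet<String> keys = new TreeSet<>(lhs.keySet());
        keys.addAll(rhs.keySet());
        for (String key : keys)
            // a side with no match contributes null, as in an outer join
            result.put(key, new String[] { lhs.get(key), rhs.get(key) });
        return result;
    }
}
```

A Hadoop-mode CoGroup avoids this by sorting on the grouping key and streaming matched groups, so memory use is bounded per group rather than per file.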

ckw


Grigoriy Shmigol

Jul 15, 2014, 1:01:39 AM
to cascadi...@googlegroups.com
Chris, thank you for your response. Yes, I understand regarding local mode vs. Hadoop. Any recommendations on how to proceed with Hadoop: version, architecture, memory/CPU/storage requirements?


Andre Kelpe

Jul 15, 2014, 4:57:07 AM
to cascadi...@googlegroups.com
You can get started pretty easily by using EMR (http://aws.amazon.com/elasticmapreduce/). We have bootstrap actions for Driven (http://docs.cascading.io/driven/1.0/getting-started/index.html#emr) and the SDK (http://www.cascading.org/sdk/) to make your life even easier :)

- André


