java.lang.OutOfMemoryError: Java heap space - help


Grigoriy Shmigol

Jul 11, 2014, 1:45:04 PM
to cascadi...@googlegroups.com
Need assistance with figuring out why Cascading is running out of heap space while running in local mode.

Scenario:
Cascading - 2.5.5
mode - local
I have two CSV files with about 3.5M rows and about 20 columns; size: 1,005 MB.
The data is tab-separated and has quotes around it.

The code is trying to read in the data but gets the error below.

2014-07-10 18:24:19,361 WARN   [pool-2-thread-2] Cascade M[logWarn] - [] flow failed: ...
java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.substring(String.java:1913)
    at java.lang.String.subSequence(String.java:1946)
    at java.util.regex.Pattern.split(Pattern.java:1202)
    at cascading.scheme.util.DelimitedParser.createSplit(DelimitedParser.java:253)
    at cascading.scheme.util.DelimitedParser.onlyParseLine(DelimitedParser.java:417)
    at cascading.scheme.util.DelimitedParser.parseLine(DelimitedParser.java:342)
    at cascading.scheme.local.TextDelimited.source(TextDelimited.java:650)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
    at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


YourKit reports a deadlock, but perhaps this one is due to an inefficiency in how Cascading processes large CSV files in local mode:

pool-1109-thread-1 <--- Frozen for at least 48m 27 sec
java.lang.StringBuilder.toString()
cascading.scheme.util.DelimitedParser.onlyParseLine(LineReader, StringBuilder, boolean)
cascading.scheme.util.DelimitedParser.parseLine(LineReader, boolean)
cascading.scheme.local.TextDelimited.source(FlowProcess, SourceCall)
cascading.tuple.TupleEntrySchemeIterator.getNext()
cascading.tuple.TupleEntrySchemeIterator.hasNext()
cascading.flow.stream.SourceStage.map(Object)
cascading.flow.stream.SourceStage.call()<2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()

Grigoriy Shmigol

Jul 11, 2014, 1:57:47 PM
to cascadi...@googlegroups.com
Forgot to add that the file is being read in using TextDelimited:

    final Tap sourceLhs = new FileTap(new TextDelimited(true, CompareToolConstants.DELIMITER), inputFileNameLhs);

Bertrand Dechoux

Jul 13, 2014, 9:34:59 PM
to cascadi...@googlegroups.com
You might want to check the format of your file. From what I remember, TextDelimited is well behaved if the format is correct.
A missing or extra quote might be responsible for what you are seeing.
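A minimal, library-free sketch of the kind of check Bertrand suggests (this is not Cascading code; it assumes the double-quote character is the quote, so adjust for your quoting scheme):

```java
// Scans a tab/comma-delimited text for the first line whose double-quote
// count is odd - a likely unterminated quoted field, which can make a
// delimited parser keep accumulating input across lines.
public class QuoteCheck {
    // Returns the 1-based number of the first unbalanced line, or -1 if all
    // lines have an even number of quote characters.
    public static long firstUnbalancedLine(String content) {
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            long quotes = lines[i].chars().filter(c -> c == '"').count();
            if (quotes % 2 != 0)
                return i + 1;
        }
        return -1;
    }
}
```

Running this over both input files before the flow starts would point directly at a malformed row, if one exists.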

Regards

Bertrand Dechoux

Grigoriy Shmigol

Jul 14, 2014, 4:11:01 PM
to cascadi...@googlegroups.com
Bertrand - thank you for your response. I thought the same, and have cleaned the input files - essentially escaping every non-letter/non-numeric character.

I am still seeing the error, but now it seems to have moved into java.util.Arrays.copyOfRange, which suggests that the results collection is simply too large for the heap I am allocating. I will certainly try a larger heap (just to see if it works), but I would rather fix the issue. Any recommendations?

So, more details:
a simple outer join with a custom filter that removes rows matching known-good criteria.
2 CSV files, LHS and RHS, each with 2,986,923 rows and 25 columns. Note: this worked fine on files with 1,322,267 rows and 25 columns.

Heap: 24 GB
file size: 588 MB (each)
only 1 pipe running
operations:
 CoGroup with OuterJoin
 Each with new Identity
 Each with a custom filter

Also, does it mean that I have reached some sort of limitation of what Cascading can do in local mode?

java.util.Arrays.copyOfRange(char[], int, int)
java.lang.String.<init>(char[], int, int)
java.lang.String.substring(int, int)
java.lang.String.subSequence(int, int)
java.util.regex.Pattern.split(CharSequence, int)
cascading.scheme.util.DelimitedParser.createSplit(String, Pattern, int)

cascading.scheme.util.DelimitedParser.onlyParseLine(LineReader, StringBuilder, boolean)
cascading.scheme.util.DelimitedParser.parseLine(LineReader, boolean)
cascading.scheme.local.TextDelimited.source(FlowProcess, SourceCall)
cascading.tuple.TupleEntrySchemeIterator.getNext()
cascading.tuple.TupleEntrySchemeIterator.hasNext()
cascading.flow.stream.SourceStage.map(Object)
cascading.flow.stream.SourceStage.call()<2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()
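For what it's worth, the trace above bottoms out in Pattern.split -> String.substring -> Arrays.copyOfRange: on Java 7+, every field split out of a line is a freshly copied String, so parsing allocates roughly one String per column per row. With ~3M rows x 25 columns that is tens of millions of short-lived objects before the join buffers anything. A rough, non-Cascading illustration of that per-line cost:

```java
import java.util.regex.Pattern;

// Illustrates the allocation pattern in the stack trace: Pattern.split
// returns one newly allocated String per field on the line.
public class SplitCost {
    // Returns the number of Strings allocated when splitting one line;
    // limit -1 keeps trailing empty fields, as delimited parsers usually do.
    public static int allocationsPerLine(String line, String delimiter) {
        return Pattern.compile(delimiter, Pattern.LITERAL).split(line, -1).length;
    }
}
```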


--
You received this message because you are subscribed to a topic in the Google Groups "cascading-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/cascading-user/69Tr4OW0w6k/unsubscribe.
To unsubscribe from this group and all its topics, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/aa1f1a5c-34c9-4aca-84f1-50c148a25fe6%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Chris K Wensel

Jul 14, 2014, 4:24:48 PM
to cascadi...@googlegroups.com
Just want to point out that you are running in Cascading local mode (FileTap), so you _will_ have memory issues with large inputs if you are doing a GroupBy or CoGroup.

It wasn't designed to run at scale; use Hadoop for that.
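To make Chris's point concrete, here is a toy sketch (not Cascading's actual implementation) of why a single-process, non-spilling join is heap-bound: both sides of a full outer join must be held in memory keyed by the grouping field, so heap use grows linearly with input size.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy full outer join over two in-memory maps, keyed by the grouping field.
// The point: every row of both inputs lives on the heap at once, which is
// roughly what a local-mode CoGroup implies for large inputs.
public class InMemoryOuterJoin {
    public static Map<String, String[]> outerJoin(Map<String, String> lhs,
                                                  Map<String, String> rhs) {
        Map<String, String[]> result = new TreeMap<>();
        TreeSet<String> keys = new TreeSet<>(lhs.keySet());
        keys.addAll(rhs.keySet());
        for (String key : keys)
            // a side with no match contributes null, as in an outer join
            result.put(key, new String[] { lhs.get(key), rhs.get(key) });
        return result;
    }
}
```

A Hadoop-mode CoGroup avoids this by sorting on the grouping key and streaming matched groups, so memory use is bounded per group rather than per file.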

ckw


Grigoriy Shmigol

Jul 15, 2014, 1:01:39 AM
to cascadi...@googlegroups.com
Chris, thank you for your response. Yes, I understand regarding local mode vs. Hadoop. Any recommendations on how to proceed with Hadoop: version, architecture, memory/CPU/storage requirements?


Andre Kelpe

Jul 15, 2014, 4:57:07 AM
to cascadi...@googlegroups.com
You can get started pretty easily by using EMR (http://aws.amazon.com/elasticmapreduce/). We have bootstrap actions for Driven (http://docs.cascading.io/driven/1.0/getting-started/index.html#emr) and the SDK (http://www.cascading.org/sdk/) to make your life even easier :)

- André


