How to debug a GC overhead error in Scalding


Hugo Gävert

Oct 15, 2013, 12:46:41 PM
to cascadi...@googlegroups.com
Hi!

I'm running a Scalding job that basically aggregates a monoid in a few different passes. The monoid is a class with a few fields, some of which contain Maps. Everything seems fine, but I keep hitting java.lang.OutOfMemoryError: GC overhead limit exceeded. The pipeline has quite a few steps, so I'm still trying to figure out where exactly it fails, but I don't think there should be many keys with a huge number of events. In one situation I was quite certain it was joining at most 2 events per key; here it could have been more, but the data had already been aggregated at a finer level than in this phase, if I'm not totally mistaken. The only thing I can guess is that one or more of those maps somehow becomes extremely large, but I don't know.

So, the question is: what is the best way to start debugging where I'm running out of memory? Or what is causing the GC overhead limit exceeded, when this is basically all I have in one of the failed task logs (tail):

2013-10-15 18:22:13,753 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 70
2013-10-15 18:22:13,761 INFO org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2013-10-15 18:22:13,765 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 177
2013-10-15 18:22:13,829 INFO org.apache.hadoop.mapred.MapTask: data buffer = 141054448/176318064
2013-10-15 18:22:13,829 INFO org.apache.hadoop.mapred.MapTask: record buffer = 463994/579993
2013-10-15 18:22:13,831 INFO cascading.tuple.hadoop.TupleSerialization: using default comparator: com.twitter.scalding.IntegralComparator
2013-10-15 18:22:13,831 INFO cascading.tuple.hadoop.TupleSerialization: using hadoop serializations from the job conf: cascading.tuple.hadoop.TupleSerialization,org.apache.hadoop.io.serializer.WritableSerialization,com.twitter.scalding.serialization.KryoHadoop
2013-10-15 18:22:13,831 INFO cascading.tuple.hadoop.TupleSerialization: adding serialization token: 127, for classname: org.apache.hadoop.io.BytesWritable
2013-10-15 18:22:13,832 INFO cascading.tuple.hadoop.TupleSerialization: using default comparator: com.twitter.scalding.IntegralComparator
2013-10-15 18:22:13,836 INFO cascading.tuple.hadoop.TupleSerialization: using hadoop serializations from the job conf: cascading.tuple.hadoop.TupleSerialization,org.apache.hadoop.io.serializer.WritableSerialization,com.twitter.scalding.serialization.KryoHadoop
2013-10-15 18:22:13,836 INFO cascading.tuple.hadoop.TupleSerialization: adding serialization token: 127, for classname: org.apache.hadoop.io.BytesWritable
2013-10-15 18:22:13,848 INFO cascading.flow.hadoop.FlowMapper: cascading version: Concurrent, Inc - Cascading 2.1.6
2013-10-15 18:22:13,848 INFO cascading.flow.hadoop.FlowMapper: child jvm opts:  -Xmx1073741824
2013-10-15 18:22:16,934 INFO cascading.flow.hadoop.FlowMapper: sourcing from: TempHfs["SequenceFile[['cookie', 'session', 'profile', 'shogunList', 'shogun']]"][283867434/_pipe_0__pipe_1/]
2013-10-15 18:22:16,934 INFO cascading.flow.hadoop.FlowMapper: sinking to: GroupBy(_pipe_0*_pipe_1)[by:[{1}:'cookie']]
2013-10-15 18:22:16,937 INFO cascading.pipe.assembly.AggregateBy: using threshold value: 100000
2013-10-15 18:22:16,942 INFO cascading.tuple.hadoop.TupleSerialization: using default comparator: com.twitter.scalding.IntegralComparator
2013-10-15 18:24:28,387 ERROR cascading.flow.stream.TrapHandler: caught Throwable, no trap available, rethrowing
2013-10-15 18:24:50,597 ERROR cascading.flow.stream.SourceStage: caught throwable
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
        at java.lang.StringBuilder.<init>(StringBuilder.java:68)
        at java.lang.StackTraceElement.toString(StackTraceElement.java:157)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.lang.Throwable.printStackTrace(Throwable.java:512)
        at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
        at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
        at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:83)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
        at cascading.flow.stream.TrapHandler.handleReThrowableException(TrapHandler.java:103)
        at cascading.flow.stream.TrapHandler.handleException(TrapHandler.java:121)
        at cascading.flow.stream.TrapHandler.handleException(TrapHandler.java:115)
        at cascading.flow.stream.ElementStage.handleException(ElementStage.java:134)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:107)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
        at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
        at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
2013-10-15 18:24:50,598 INFO org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:231)
        at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:211)
        at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:50)
        at java.io.InputStreamReader.<init>(InputStreamReader.java:57)
        at java.io.FileReader.<init>(FileReader.java:55)
        at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:514)
        at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:189)
        at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401)
        at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:821)
        at org.apache.hadoop.mapred.Task.updateCounters(Task.java:843)
        at org.apache.hadoop.mapred.Task.access$500(Task.java:72)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:670)
        at java.lang.Thread.run(Thread.java:662)

2013-10-15 18:25:02,303 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-10-15 18:25:02,305 WARN org.apache.hadoop.mapred.Child: Error running child
cascading.flow.FlowException: internal error during mapper execution
        at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:138)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
        at java.lang.StringBuilder.<init>(StringBuilder.java:68)
        at java.lang.StackTraceElement.toString(StackTraceElement.java:157)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.lang.Throwable.printStackTrace(Throwable.java:512)
        at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
        at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
        at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:83)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.slf4j.impl.Log4jLoggerAdapter.error(Log4jLoggerAdapter.java:576)
        at cascading.flow.stream.TrapHandler.handleReThrowableException(TrapHandler.java:103)
        at cascading.flow.stream.TrapHandler.handleException(TrapHandler.java:121)
        at cascading.flow.stream.TrapHandler.handleException(TrapHandler.java:115)
        at cascading.flow.stream.ElementStage.handleException(ElementStage.java:134)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:107)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
        at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
        at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)


Any suggestions on debugging?

-- 
HG

Chris K Wensel

Oct 15, 2013, 12:59:05 PM
to cascadi...@googlegroups.com
FWIW, Cascading 2.2 cleaned up much of the handling for the OOME case; it might be easier to debug.


Oscar Boykin

Oct 15, 2013, 1:10:26 PM
to cascadi...@googlegroups.com
You could try the 2.2 dep. It should *just work*.

Usually an OOM comes when you use toList, mkString, or some other collection-building operation.
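To make that concrete, here is a minimal sketch in the fields API (the source, field names and types are invented for the example): toList has to buffer every value of a group in memory before your function even runs, so one very common key can blow the heap, while reduce streams through the group and only keeps the running accumulator.

import com.twitter.scalding._

class BufferingVsStreaming(args: Args) extends Job(args) {
  val events = Tsv(args("input"), ('id, 'score)).read

  // Buffers the whole group for a key as one in-memory List: risky if any key is hot.
  val buffered = events.groupBy('id) { _.toList[Double]('score -> 'scores) }

  // Streams through the group, holding only the running sum.
  val streamed = events.groupBy('id) { _.reduce[Double]('score -> 'total)(_ + _) }

  streamed.write(Tsv(args("output")))
}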



--
Oscar Boykin :: @posco :: http://twitter.com/posco

Hugo Gävert

Oct 15, 2013, 2:09:11 PM
to cascadi...@googlegroups.com
Hi!

How do I get 2.2? I'm running Scalding 0.8.11; my code breaks with 0.9.0rc1 (implicit value for mode... haven't had time to figure out what that means). But is that really the version that gets me 2.2?

The groupBy operations I do really look like this:
.groupBy('id) { _.reduce('profile -> 'profile)((accu: userProfile, curr: userProfile) => accu + curr) }
and the userProfile contains Algebird's Max, a few smallish Maps, and then decayed maps (DecayedVector), which I think might contain a few more keys, but still nothing really huge. There is no toList, mkString, or the like directly, so it must be the summing of those monoids somehow (all the fields of the class are summed by Algebird).
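For reference, a simplified sketch of what such a profile monoid might look like (field names invented for illustration; this is not the actual class). The important property is that summing two profiles unions their map keys, so the accumulator for one grouping key can keep growing even when every individual input is small.

import com.twitter.algebird.{ Max, Semigroup }

case class UserProfile(
  lastSeen: Max[Long],               // Algebird Max: summing keeps the larger timestamp
  categoryCounts: Map[String, Long]  // map sum: keys are unioned, values are added
)

object UserProfile {
  implicit val semigroup: Semigroup[UserProfile] = Semigroup.from { (a, b) =>
    // The merged map holds the union of both key sets, so its size is bounded
    // only by the number of distinct map keys seen under one grouping key.
    val mergedCounts = (a.categoryCounts.keySet ++ b.categoryCounts.keySet).iterator.map { k =>
      k -> (a.categoryCounts.getOrElse(k, 0L) + b.categoryCounts.getOrElse(k, 0L))
    }.toMap
    UserProfile(
      if (a.lastSeen.get >= b.lastSeen.get) a.lastSeen else b.lastSeen,
      mergedCounts
    )
  }
}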

-- 
HG

Oscar Boykin

Oct 15, 2013, 2:13:48 PM
to cascadi...@googlegroups.com
1) Just update your build to point to the latest Cascading 2.2 (it is backwards compatible).

2) If you are keeping a few "smallish maps", my prediction is that they are not small. There is some key you didn't expect that is very common (like a 0 user id or something) and you are creating huge buckets; a quick way to check is sketched after this list.

3) These sed rules should help with porting to 0.9.0: https://gist.github.com/johnynek/6632488 Let me know if you have any issues.
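For point 2, one way to look for hot keys (a hypothetical sketch; the source and field names are made up): count the records per id and keep the largest groups, so a degenerate key such as a null or 0 user id stands out immediately.

import com.twitter.scalding._

class HotKeyCheck(args: Args) extends Job(args) {
  Tsv(args("input"), ('id, 'profile)).read
    .groupBy('id) { _.size('count) }          // records per key
    .groupAll { _.sortBy('count).reverse }    // single reducer, but the counts table is small
    .limit(100)                               // keep only the 100 biggest keys
    .write(Tsv(args("output")))
}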

Hugo Gävert

Oct 15, 2013, 2:57:31 PM
to cascadi...@googlegroups.com
Hi!

I like the sed script :-)

I don't quite get the Cascading 2.2 part... this is what I have for dependencies in sbt:
// Scalding dependencies
resolvers ++= Seq(
  "twitter-maven" at "http://maven.twttr.com"
)

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core" % "0.9.0rc1",
  "com.twitter" %% "scalding-date" % "0.9.0rc1",
  "com.twitter" %% "scalding-args" % "0.9.0rc1",
  "org.apache.hadoop" % "hadoop-core" % "0.20.2" % "provided"
)

So Scalding pulls Cascading in, but it seems that 0.9.0rc1 still pulls in 2.1.6.

-- 
HG

Andre Kelpe

Oct 15, 2013, 3:03:36 PM
to cascadi...@googlegroups.com
If you add cascading-hadoop and cascading-core in version 2.2.0 to your project, sbt will use those instead of the ones coming in via Scalding. You will also have to move to a newer Hadoop dependency, since we have dropped support for 0.20.2; anything from 1.x will work. See also: http://www.cascading.org/support/compatibility/
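A sketch of what that override could look like in build.sbt (the Conjars resolver and the exact version strings here are assumptions; adjust them to what your cluster needs):

// Cascading artifacts are assumed to come from the Conjars repository.
resolvers += "conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  "cascading" % "cascading-core"   % "2.2.0",
  "cascading" % "cascading-hadoop" % "2.2.0",
  "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
)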

- André

Hugo Gävert

Oct 18, 2013, 7:38:43 AM
to cascadi...@googlegroups.com
Hi!

2.2 didn't help at all with the GC limit. I included the Cascading libraries as André suggested. The Hadoop dependency is only "provided", and I'm not quite sure what the actual version on the cluster is (it's Cloudera 2.0.0-mr1-cdh4.3.0, but I can't figure out which Hadoop version that actually corresponds to).

I'll now go with the assumption that I'm actually running out of memory because I'm building an (immutable) collection in the aggregations. The error happens on the map side, probably inside MRMAggregator. Basically I now need a new monoid, one that limits the size of the contained map... Finding out where the problem is inside a complicated job is really difficult; I wish there were other ways than trial and error.
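One possible shape for such a size-capped map semigroup, as a hand-rolled sketch (not an existing Algebird class): sum the two maps, then keep only the entries with the largest values. Dropping entries makes plus only approximately associative, so results can depend on aggregation order; something like Algebird's SketchMap would be a more principled approximate alternative.

import com.twitter.algebird.Semigroup

class CappedMapSemigroup[K](maxKeys: Int) extends Semigroup[Map[K, Long]] {
  def plus(a: Map[K, Long], b: Map[K, Long]): Map[K, Long] = {
    // Sum the two maps key by key.
    val summed = (a.keySet ++ b.keySet).iterator.map { k =>
      k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))
    }.toMap
    // Cap the size by keeping the maxKeys largest entries.
    if (summed.size <= maxKeys) summed
    else summed.toSeq.sortBy { case (_, v) => -v }.take(maxKeys).toMap
  }
}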

Thanks for the comments.

-- 
HG


On Tuesday, October 15, 2013 at 8:10:26 PM UTC+3, Oscar Boykin wrote: