pyspark memory usage


Евгений Шишкин

Oct 15, 2013, 10:12:55 AM
to spark...@googlegroups.com
Hello,

I set up spark-0.8.0-incubating-bin-cdh4 on a 5-node cluster.

I limited SPARK_WORKER_MEMORY to 2g, and there are 4 cores per node, so I expected the total memory consumption by Spark to be 512 MB + 2 GB.
The Spark web UI shows Memory: 10.0 GB Total, 0.0 B Used (i.e. 5 nodes × 2 GB).

Then I tried to run the simple wordcount.py from the examples on an HDFS file whose size is 11 GB (the script is sketched below, after the stack trace).
Spark launched 4 workers per node and did not limit its total memory consumption to 2 GB: top showed RES consumption of about 750 MB, and then
Out of memory: Kill process 26336 (python) score 97 or sacrifice child
Killed process 26336, UID 500, (python) total-vm:969696kB, anon-rss:782976kB, file-rss:196kB

and in the logs:

INFO cluster.ClusterTaskSetManager: Loss was due to org.apache.spark.SparkException
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:167)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:173)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:116)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:193)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:149)
        at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
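
For reference, the job is essentially the stock word count. A rough sketch of what the wordcount.py example does (reconstructed from memory, so the exact script in the 0.8.0 release may differ; the master URL and HDFS path are placeholders):

    from operator import add
    from pyspark import SparkContext

    # Connect to the standalone master (URL is a placeholder).
    sc = SparkContext("spark://master:7077", "PythonWordCount")

    # Read the input file from HDFS (path is a placeholder).
    lines = sc.textFile("hdfs:///path/to/file")

    # Classic word count: split into words, emit (word, 1), sum per word.
    counts = (lines.flatMap(lambda line: line.split(' '))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # collect() pulls the final counts back to the driver.
    for word, count in counts.collect():
        print("%s: %i" % (word, count))

Nothing in it caches the RDD; it is just flatMap, map and reduceByKey.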

So I could not finish the task. Yes, Spark resubmitted it, but it kept getting OOM-killed.

On a smaller file, Spark worked fine.

So the question is: why does Spark not limit its memory accordingly, and how can I analyze files larger than RAM with it?

Thanks.
