troubleshooting: java.lang.OutOfMemoryError: Java heap space


seanm

Oct 30, 2012, 2:04:24 PM
to spark...@googlegroups.com
Trying to figure out why I'm getting OOM errors. I'm loading 61 million lines and splitting each line into a HashMap (unfortunately I can't use an array to save space). I'm using Kryo, and I've tried setting persist to MEMORY_ONLY_SER and MEMORY_AND_DISK_SER.

When I set MEMORY_AND_DISK_SER, why don't I see it spilling to disk before it OOMs? I have tried setting spark.boundedMemoryCache.memoryFraction to a lower value, but it still seems to die in the same spot.

If I disable caching/persisting, it is able to run and complete just fine.  Does anyone have some insight?
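Roughly what the job looks like, as a simplified sketch (the master URL, input path, parsing, and field names below are placeholders, not the real ones):

```scala
import spark.SparkContext
import spark.storage.StorageLevel

// 0.6-style configuration: the serializer is chosen via a system
// property, set before the SparkContext is created.
System.setProperty("spark.serializer", "spark.KryoSerializer")

val sc = new SparkContext("spark://master:7077", "LoadLines")

// Each line becomes a HashMap of field name -> value
// (placeholder parsing; the real schema differs).
val records = sc.textFile("hdfs://...")
  .map { line =>
    val m = new java.util.HashMap[String, String]()
    line.split('\t').zipWithIndex.foreach { case (v, i) =>
      m.put("f" + i, v)
    }
    m
  }
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

records.count()
```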


Here is the output up to where it dies:


12/10/30 17:38:24 INFO scheduler.DAGScheduler: Completed ResultTask(0, 225)
12/10/30 17:38:24 INFO cluster.TaskSetManager: Starting task 0.0:285 as TID 285 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:38:24 INFO cluster.TaskSetManager: Serialized task 0.0:285 as 2673 bytes in 0 ms
12/10/30 17:38:43 INFO storage.BlockManagerMasterActor: Added rdd_3_235 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.9 GB)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Finished TID 235 in 147888 ms (progress: 231/1030)
12/10/30 17:38:43 INFO scheduler.DAGScheduler: Completed ResultTask(0, 235)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Starting task 0.0:286 as TID 286 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Serialized task 0.0:286 as 2673 bytes in 0 ms
12/10/30 17:38:54 INFO storage.BlockManagerMasterActor: Added rdd_3_240 in memory on 10.164.14.7:36860 (size: 165.3 MB, free: 4.2 GB)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Finished TID 240 in 145081 ms (progress: 232/1030)
12/10/30 17:38:54 INFO scheduler.DAGScheduler: Completed ResultTask(0, 240)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Starting task 0.0:287 as TID 287 on slave worker-20121030050635-10.164.14.7-8080: 10.164.14.7 (non-preferred)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Serialized task 0.0:287 as 2673 bytes in 0 ms
12/10/30 17:38:55 INFO storage.BlockManagerMasterActor: Added rdd_3_231 in memory on 10.164.14.4:57606 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:02 INFO storage.BlockManagerMasterActor: Added rdd_3_236 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.7 GB)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Finished TID 236 in 167146 ms (progress: 233/1030)
12/10/30 17:39:02 INFO scheduler.DAGScheduler: Completed ResultTask(0, 236)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Starting task 0.0:288 as TID 288 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Serialized task 0.0:288 as 2673 bytes in 1 ms
12/10/30 17:39:13 INFO storage.BlockManagerMasterActor: Added rdd_3_248 in memory on 10.164.14.6:43423 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Finished TID 248 in 144358 ms (progress: 234/1030)
12/10/30 17:39:13 INFO scheduler.DAGScheduler: Completed ResultTask(0, 248)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Starting task 0.0:289 as TID 289 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Serialized task 0.0:289 as 2673 bytes in 0 ms
12/10/30 17:39:22 INFO storage.BlockManagerMasterActor: Added rdd_3_237 in memory on 10.164.14.2:59549 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:22 INFO storage.BlockManagerMasterActor: Added rdd_3_229 in memory on 10.164.14.2:59549 (size: 165.3 MB, free: 4.0 GB)
12/10/30 17:39:30 INFO storage.BlockManagerMasterActor: Added rdd_3_231 in memory on 10.164.14.4:57606 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Finished TID 231 in 210445 ms (progress: 235/1030)
12/10/30 17:39:30 INFO scheduler.DAGScheduler: Completed ResultTask(0, 231)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Starting task 0.0:290 as TID 290 on slave worker-20121030050634-10.164.14.4-8080: 10.164.14.4 (non-preferred)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Serialized task 0.0:290 as 2673 bytes in 0 ms
12/10/30 17:39:31 INFO storage.BlockManagerMasterActor: Added rdd_3_242 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.5 GB)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Finished TID 242 in 177412 ms (progress: 236/1030)
12/10/30 17:39:31 INFO scheduler.DAGScheduler: Completed ResultTask(0, 242)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Starting task 0.0:291 as TID 291 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Serialized task 0.0:291 as 2673 bytes in 0 ms
12/10/30 17:39:32 INFO cluster.TaskSetManager: Finished TID 229 in 217530 ms (progress: 237/1030)
12/10/30 17:39:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 229)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Starting task 0.0:292 as TID 292 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Serialized task 0.0:292 as 2673 bytes in 1 ms
12/10/30 17:39:32 INFO cluster.TaskSetManager: Finished TID 237 in 193598 ms (progress: 238/1030)
12/10/30 17:39:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 237)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Starting task 0.0:293 as TID 293 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Serialized task 0.0:293 as 2673 bytes in 1 ms
12/10/30 17:39:43 INFO storage.BlockManagerMasterActor: Added rdd_3_233 in memory on 10.164.14.2:59549 (size: 165.4 MB, free: 3.8 GB)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Finished TID 233 in 214881 ms (progress: 239/1030)
12/10/30 17:39:43 INFO scheduler.DAGScheduler: Completed ResultTask(0, 233)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Starting task 0.0:294 as TID 294 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Serialized task 0.0:294 as 2673 bytes in 0 ms
12/10/30 17:39:51 INFO storage.BlockManagerMasterActor: Added rdd_3_222 in memory on 10.164.14.4:57606 (size: 165.4 MB, free: 4.0 GB)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Finished TID 222 in 246331 ms (progress: 240/1030)
12/10/30 17:39:51 INFO scheduler.DAGScheduler: Completed ResultTask(0, 222)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Starting task 0.0:295 as TID 295 on slave worker-20121030050634-10.164.14.4-8080: 10.164.14.4 (non-preferred)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Serialized task 0.0:295 as 2673 bytes in 0 ms
12/10/30 17:39:55 INFO storage.BlockManagerMasterActor: Added rdd_3_260 in memory on 10.164.14.6:43423 (size: 165.2 MB, free: 4.0 GB)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Finished TID 260 in 165127 ms (progress: 241/1030)
12/10/30 17:39:55 INFO scheduler.DAGScheduler: Completed ResultTask(0, 260)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Starting task 0.0:296 as TID 296 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Serialized task 0.0:296 as 2673 bytes in 0 ms
12/10/30 17:39:58 INFO cluster.TaskSetManager: Lost TID 232 (task 0.0:232)
12/10/30 17:39:58 INFO cluster.TaskSetManager: Loss was due to java.lang.OutOfMemoryError: Java heap space
        at it.unimi.dsi.fastutil.bytes.ByteArrays.grow(ByteArrays.java:170)
        at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:97)
        at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:292)
        at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:81)
        at spark.serializer.SerializationStream$class.writeAll(Serializer.scala:58)
        at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:73)
        at spark.storage.BlockManager.dataSerialize(BlockManager.scala:834)
        at spark.storage.MemoryStore.putValues(MemoryStore.scala:59)
        at spark.storage.BlockManager.put(BlockManager.scala:593)
        at spark.CacheTracker.getOrCompute(CacheTracker.scala:215)
        at spark.RDD.iterator(RDD.scala:159)
        at spark.scheduler.ResultTask.run(ResultTask.scala:18)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)



Shivaram Venkataraman

Oct 30, 2012, 2:11:16 PM
to spark...@googlegroups.com
Are you on the 0.6 release? If so, could you try setting
"spark.storage.memoryFraction" to a lower value? I think with the 0.6
release, this variable controls the amount of memory used for caching.
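In 0.6 this is read as a Java system property, so one way to set it (a sketch; it has to happen before the SparkContext is created) is:

```scala
// Lower the fraction of the heap reserved for Spark's block cache.
// Must be set on the driver before the SparkContext is constructed.
System.setProperty("spark.storage.memoryFraction", "0.3")
```

If I recall the mechanism correctly, it can also be passed to the workers via SPARK_JAVA_OPTS in spark-env.sh, e.g. -Dspark.storage.memoryFraction=0.3.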

Also you can find more info on tuning memory usage at
http://spark-project.org/docs/0.6.0/tuning.html

Thanks
Shivaram

seanm

Oct 30, 2012, 2:41:23 PM
to spark...@googlegroups.com
Yes, I am on 0.6.  I will try it at an even lower setting.  I set it to 0.4 and I got this:


12/10/30 18:30:28 INFO cluster.TaskSetManager: Serialized task 0.0:287 as 2673 bytes in 0 ms
12/10/30 18:30:29 INFO storage.BlockManagerMasterActor: Added rdd_3_231 in memory on 10.164.14.7:42294 (size: 165.3 MB, free: 3.9 GB)
12/10/30 18:30:29 INFO cluster.TaskSetManager: Finished TID 231 in 164689 ms (progress: 233/1030)
12/10/30 18:30:29 INFO scheduler.DAGScheduler: Completed ResultTask(0, 231)
12/10/30 18:30:29 INFO cluster.TaskSetManager: Starting task 0.0:288 as TID 288 on slave worker-20121030050635-10.164.14.7-8080: 10.164.14.7 (non-preferred)
12/10/30 18:30:29 INFO cluster.TaskSetManager: Serialized task 0.0:288 as 2673 bytes in 0 ms
12/10/30 18:30:46 INFO storage.BlockManagerMasterActor: Added rdd_3_248 in memory on 10.164.14.6:51906 (size: 165.3 MB, free: 4.1 GB)
12/10/30 18:30:46 INFO cluster.TaskSetManager: Finished TID 248 in 120715 ms (progress: 234/1030)
12/10/30 18:30:46 INFO scheduler.DAGScheduler: Completed ResultTask(0, 248)
12/10/30 18:30:46 INFO cluster.TaskSetManager: Starting task 0.0:289 as TID 289 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 18:30:46 INFO cluster.TaskSetManager: Serialized task 0.0:289 as 2673 bytes in 0 ms
12/10/30 18:30:57 INFO storage.BlockManagerMasterActor: Added rdd_3_242 in memory on 10.164.14.7:42294 (size: 165.3 MB, free: 3.8 GB)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Finished TID 242 in 138863 ms (progress: 235/1030)
12/10/30 18:30:57 INFO scheduler.DAGScheduler: Completed ResultTask(0, 242)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Starting task 0.0:290 as TID 290 on slave worker-20121030050635-10.164.14.7-8080: 10.164.14.7 (non-preferred)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Serialized task 0.0:290 as 2673 bytes in 0 ms
12/10/30 18:30:57 INFO storage.BlockManagerMasterActor: Added rdd_3_244 in memory on 10.164.14.7:42294 (size: 165.4 MB, free: 3.6 GB)
12/10/30 18:30:57 INFO storage.BlockManagerMasterActor: Added rdd_3_255 in memory on 10.164.14.6:51906 (size: 165.3 MB, free: 4.0 GB)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Finished TID 244 in 138518 ms (progress: 236/1030)
12/10/30 18:30:57 INFO scheduler.DAGScheduler: Completed ResultTask(0, 244)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Starting task 0.0:291 as TID 291 on slave worker-20121030050635-10.164.14.7-8080: 10.164.14.7 (non-preferred)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Serialized task 0.0:291 as 2673 bytes in 0 ms
12/10/30 18:30:57 INFO cluster.TaskSetManager: Finished TID 255 in 121425 ms (progress: 237/1030)
12/10/30 18:30:57 INFO scheduler.DAGScheduler: Completed ResultTask(0, 255)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Starting task 0.0:292 as TID 292 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 18:30:57 INFO cluster.TaskSetManager: Serialized task 0.0:292 as 2673 bytes in 0 ms
12/10/30 18:31:00 INFO storage.BlockManagerMasterActor: Added rdd_3_229 in memory on 10.164.14.4:39042 (size: 165.3 MB, free: 4.2 GB)
12/10/30 18:31:10 INFO cluster.TaskSetManager: Lost TID 229 (task 0.0:229)
12/10/30 18:31:10 INFO cluster.TaskSetManager: Loss was due to spark.SparkException: Error communicating with CacheTracker
        at spark.CacheTracker.askTracker(CacheTracker.scala:125)
        at spark.CacheTracker.communicate(CacheTracker.scala:131)
        at spark.CacheTracker.notifyFromBlockManager(CacheTracker.scala:161)
        at spark.storage.BlockManager.notifyCacheTracker(BlockManager.scala:758)
        at spark.storage.BlockManager.put(BlockManager.scala:636)
        at spark.CacheTracker.getOrCompute(CacheTracker.scala:215)
        at spark.RDD.iterator(RDD.scala:159)
        at spark.scheduler.ResultTask.run(ResultTask.scala:18)
        at spark.executor.Executor$TaskRunner.run(Executor.scala:76)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

....


12/10/30 18:32:59 INFO storage.BlockManagerMasterActor: Added rdd_3_279 in memory on 10.164.14.8:33350 (size: 165.1 MB, free: 3.2 GB)
12/10/30 18:32:59 INFO cluster.TaskSetManager: Finished TID 279 in 187778 ms (progress: 260/1030)
12/10/30 18:32:59 INFO scheduler.DAGScheduler: Completed ResultTask(0, 279)
12/10/30 18:32:59 INFO cluster.TaskSetManager: Starting task 0.0:315 as TID 316 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 18:32:59 INFO cluster.TaskSetManager: Serialized task 0.0:315 as 2673 bytes in 0 ms
12/10/30 18:33:04 INFO cluster.TaskSetManager: Lost TID 262 (task 0.0:262)
12/10/30 18:33:04 INFO cluster.TaskSetManager: Loss was due to java.lang.OutOfMemoryError: Java heap space

seanm

Oct 30, 2012, 3:22:34 PM
to spark...@googlegroups.com
I tried increasing spark.kryoserializer.buffer.mb and lowering spark.boundedMemoryCache.memoryFraction to 0.2. It still dies in about the same spot and doesn't seem to be spilling to disk.

Sean

Shivaram Venkataraman

Oct 30, 2012, 9:24:25 PM
to spark...@googlegroups.com
What are the values for Xmx, Xms, and SPARK_MEM in your setup? From
the logs, what I can see is that the BlockManager thinks it has more
memory available ("size: 165.3 MB, free: 4.2 GB"), but somehow the
tasks OOM despite that.

To answer your question of why you don't see any disk writes: the
BlockManager starts writing to disk only when it runs out of memory,
and right now it doesn't look like it has run out of its 4.2 GB.

Thanks
Shivaram

seanm

Oct 31, 2012, 2:06:48 AM
to spark...@googlegroups.com
I'm using 14G for SPARK_MEM, and I'm setting 2g for Xmx and Xms. I'm only setting Xmx and Xms on the driver, though, so I'm guessing the workers are picking up whatever the default value is from running ./start-all.sh.

Thanks
Sean

Shivaram Venkataraman

Oct 31, 2012, 4:23:54 PM
to spark...@googlegroups.com
My guess is that the JVM on the slave is also picking up a 2g memory
limit and that leads to the OOM. You can try to look at the JVM
arguments on the slaves using a tool like top or ps.
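If it's easier, the heap limit can also be checked from inside a small job rather than via ps (a sketch, assuming an existing SparkContext sc; run enough tasks that every slave gets at least one):

```scala
// Run trivial tasks across the cluster and report each worker's
// max heap. 100 tasks is arbitrary -- just enough to land on every
// slave at least once; dedupe the (host, heap) pairs on the driver.
val heaps = sc.parallelize(1 to 100, 100).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
  (host, maxHeapMb)
}.collect().toSet

// One (hostname, MB) pair per distinct worker JVM configuration.
heaps.foreach(println)
```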

The other reason could be that a single partition is somehow much
larger than the amount of memory available to the JVM. Even when the
storage level is set to spill to disk, the code assumes that spilling
happens at the granularity of whole partitions, so each partition must
itself fit in memory.
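If that's what's happening here, reading the input with more partitions should shrink each one (a sketch; the second argument to textFile is the minimum number of splits, and the path is a placeholder):

```scala
// Split the 61M lines across ~2000 partitions so each partition's
// serialized block stays small relative to the executor heap.
val lines = sc.textFile("hdfs://.../input", 2000)
```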

Thanks
Shivaram

seanm

Oct 31, 2012, 5:13:17 PM
to spark...@googlegroups.com
Thanks, Shivaram. I will check the workers to make sure they're not picking up the 2g setting. I'm not seeing the issue at the moment; I now believe it was caused by not setting the parallelism correctly, so the partitions were not fitting in memory. It wasn't apparent to me what the issue was at the time, but it makes sense now.

Thanks

Sean