I'm trying to figure out why I'm getting OOM errors. I'm loading 61 million lines and splitting each line into a hashmap (unfortunately I can't use an array to save space). I'm using Kryo and I've tried setting persist to MEMORY_ONLY_SER and MEMORY_AND_DISK_SER.
When I set MEMORY_AND_DISK_SER, why don't I see it spilling to disk before it OOMs? I have tried setting spark.boundedMemoryCache.memoryFraction to a lower value, but it still seems to die in the same spot.
If I disable caching/persisting, it is able to run and complete just fine. Does anyone have some insight?
Here is the output up to where it dies:
12/10/30 17:38:24 INFO scheduler.DAGScheduler: Completed ResultTask(0, 225)
12/10/30 17:38:24 INFO cluster.TaskSetManager: Starting task 0.0:285 as TID 285 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:38:24 INFO cluster.TaskSetManager: Serialized task 0.0:285 as 2673 bytes in 0 ms
12/10/30 17:38:43 INFO storage.BlockManagerMasterActor: Added rdd_3_235 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.9 GB)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Finished TID 235 in 147888 ms (progress: 231/1030)
12/10/30 17:38:43 INFO scheduler.DAGScheduler: Completed ResultTask(0, 235)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Starting task 0.0:286 as TID 286 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:38:43 INFO cluster.TaskSetManager: Serialized task 0.0:286 as 2673 bytes in 0 ms
12/10/30 17:38:54 INFO storage.BlockManagerMasterActor: Added rdd_3_240 in memory on 10.164.14.7:36860 (size: 165.3 MB, free: 4.2 GB)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Finished TID 240 in 145081 ms (progress: 232/1030)
12/10/30 17:38:54 INFO scheduler.DAGScheduler: Completed ResultTask(0, 240)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Starting task 0.0:287 as TID 287 on slave worker-20121030050635-10.164.14.7-8080: 10.164.14.7 (non-preferred)
12/10/30 17:38:54 INFO cluster.TaskSetManager: Serialized task 0.0:287 as 2673 bytes in 0 ms
12/10/30 17:38:55 INFO storage.BlockManagerMasterActor: Added rdd_3_231 in memory on 10.164.14.4:57606 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:02 INFO storage.BlockManagerMasterActor: Added rdd_3_236 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.7 GB)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Finished TID 236 in 167146 ms (progress: 233/1030)
12/10/30 17:39:02 INFO scheduler.DAGScheduler: Completed ResultTask(0, 236)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Starting task 0.0:288 as TID 288 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:39:02 INFO cluster.TaskSetManager: Serialized task 0.0:288 as 2673 bytes in 1 ms
12/10/30 17:39:13 INFO storage.BlockManagerMasterActor: Added rdd_3_248 in memory on 10.164.14.6:43423 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Finished TID 248 in 144358 ms (progress: 234/1030)
12/10/30 17:39:13 INFO scheduler.DAGScheduler: Completed ResultTask(0, 248)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Starting task 0.0:289 as TID 289 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 17:39:13 INFO cluster.TaskSetManager: Serialized task 0.0:289 as 2673 bytes in 0 ms
12/10/30 17:39:22 INFO storage.BlockManagerMasterActor: Added rdd_3_237 in memory on 10.164.14.2:59549 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:22 INFO storage.BlockManagerMasterActor: Added rdd_3_229 in memory on 10.164.14.2:59549 (size: 165.3 MB, free: 4.0 GB)
12/10/30 17:39:30 INFO storage.BlockManagerMasterActor: Added rdd_3_231 in memory on 10.164.14.4:57606 (size: 165.3 MB, free: 4.1 GB)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Finished TID 231 in 210445 ms (progress: 235/1030)
12/10/30 17:39:30 INFO scheduler.DAGScheduler: Completed ResultTask(0, 231)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Starting task 0.0:290 as TID 290 on slave worker-20121030050634-10.164.14.4-8080: 10.164.14.4 (non-preferred)
12/10/30 17:39:30 INFO cluster.TaskSetManager: Serialized task 0.0:290 as 2673 bytes in 0 ms
12/10/30 17:39:31 INFO storage.BlockManagerMasterActor: Added rdd_3_242 in memory on 10.164.14.8:38141 (size: 165.3 MB, free: 3.5 GB)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Finished TID 242 in 177412 ms (progress: 236/1030)
12/10/30 17:39:31 INFO scheduler.DAGScheduler: Completed ResultTask(0, 242)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Starting task 0.0:291 as TID 291 on slave worker-20121030050635-10.164.14.8-8080: 10.164.14.8 (non-preferred)
12/10/30 17:39:31 INFO cluster.TaskSetManager: Serialized task 0.0:291 as 2673 bytes in 0 ms
12/10/30 17:39:32 INFO cluster.TaskSetManager: Finished TID 229 in 217530 ms (progress: 237/1030)
12/10/30 17:39:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 229)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Starting task 0.0:292 as TID 292 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Serialized task 0.0:292 as 2673 bytes in 1 ms
12/10/30 17:39:32 INFO cluster.TaskSetManager: Finished TID 237 in 193598 ms (progress: 238/1030)
12/10/30 17:39:32 INFO scheduler.DAGScheduler: Completed ResultTask(0, 237)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Starting task 0.0:293 as TID 293 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:32 INFO cluster.TaskSetManager: Serialized task 0.0:293 as 2673 bytes in 1 ms
12/10/30 17:39:43 INFO storage.BlockManagerMasterActor: Added rdd_3_233 in memory on 10.164.14.2:59549 (size: 165.4 MB, free: 3.8 GB)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Finished TID 233 in 214881 ms (progress: 239/1030)
12/10/30 17:39:43 INFO scheduler.DAGScheduler: Completed ResultTask(0, 233)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Starting task 0.0:294 as TID 294 on slave worker-20121030050635-10.164.14.2-8080: 10.164.14.2 (non-preferred)
12/10/30 17:39:43 INFO cluster.TaskSetManager: Serialized task 0.0:294 as 2673 bytes in 0 ms
12/10/30 17:39:51 INFO storage.BlockManagerMasterActor: Added rdd_3_222 in memory on 10.164.14.4:57606 (size: 165.4 MB, free: 4.0 GB)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Finished TID 222 in 246331 ms (progress: 240/1030)
12/10/30 17:39:51 INFO scheduler.DAGScheduler: Completed ResultTask(0, 222)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Starting task 0.0:295 as TID 295 on slave worker-20121030050634-10.164.14.4-8080: 10.164.14.4 (non-preferred)
12/10/30 17:39:51 INFO cluster.TaskSetManager: Serialized task 0.0:295 as 2673 bytes in 0 ms
12/10/30 17:39:55 INFO storage.BlockManagerMasterActor: Added rdd_3_260 in memory on 10.164.14.6:43423 (size: 165.2 MB, free: 4.0 GB)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Finished TID 260 in 165127 ms (progress: 241/1030)
12/10/30 17:39:55 INFO scheduler.DAGScheduler: Completed ResultTask(0, 260)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Starting task 0.0:296 as TID 296 on slave worker-20121030050635-10.164.14.6-8080: 10.164.14.6 (non-preferred)
12/10/30 17:39:55 INFO cluster.TaskSetManager: Serialized task 0.0:296 as 2673 bytes in 0 ms
12/10/30 17:39:58 INFO cluster.TaskSetManager: Lost TID 232 (task 0.0:232)
12/10/30 17:39:58 INFO cluster.TaskSetManager: Loss was due to java.lang.OutOfMemoryError: Java heap space
at it.unimi.dsi.fastutil.bytes.ByteArrays.grow(ByteArrays.java:170)
at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:97)
at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:292)
at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:81)
at spark.serializer.SerializationStream$class.writeAll(Serializer.scala:58)
at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:73)
at spark.storage.BlockManager.dataSerialize(BlockManager.scala:834)
at spark.storage.MemoryStore.putValues(MemoryStore.scala:59)
at spark.storage.BlockManager.put(BlockManager.scala:593)
at spark.CacheTracker.getOrCompute(CacheTracker.scala:215)
at spark.RDD.iterator(RDD.scala:159)
at spark.scheduler.ResultTask.run(ResultTask.scala:18)
at spark.executor.Executor$TaskRunner.run(Executor.scala:76)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
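
For a sense of scale, here is a rough back-of-the-envelope estimate based on the log above. It assumes every one of the 1030 partitions serializes to about the same 165.3 MB as the blocks shown, which is only a guess since just a handful of blocks are visible. The function name is made up for this sketch and is not part of Spark.

```python
# Figures read off the log above: the stage has 1030 tasks (one per
# partition) and each cached block is reported as roughly 165.3 MB.
NUM_PARTITIONS = 1030
BLOCK_SIZE_MB = 165.3  # assumption: all partitions are about this size


def total_serialized_gb(num_partitions: int, block_size_mb: float) -> float:
    """Estimate the serialized size of the whole RDD in GB (1 GB = 1024 MB)."""
    return num_partitions * block_size_mb / 1024


if __name__ == "__main__":
    total = total_serialized_gb(NUM_PARTITIONS, BLOCK_SIZE_MB)
    print(f"Estimated serialized RDD size: {total:.1f} GB")
```

If that assumption holds, the full serialized RDD would be far larger than the few GB of free cache each worker reports, so I would have expected to see blocks going to disk well before the job finished.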