It works when I pass in a dataset of 5 or 10 GB.
However, when I pass in a 15 GB dataset, it fails. I am wondering whether I am running into some kind of memory limitation.
I get hundreds of errors like this one:
13/04/23 10:55:51 INFO local.LocalScheduler: Running ResultTask(0, 377)
13/04/23 10:55:51 INFO storage.BlockManager: Started 0 remote gets in 1 ms
13/04/23 10:55:51 INFO local.LocalScheduler: Size of task 377 is 1667 bytes
13/04/23 10:55:51 INFO storage.BlockManager: Started 0 remote gets in 0 ms
13/04/23 10:55:51 INFO storage.BlockManager: Started 0 remote gets in 0 ms
13/04/23 10:55:52 INFO storage.BlockManager: Started 0 remote gets in 0 ms
13/04/23 10:55:52 ERROR local.LocalScheduler: Exception in task 373
java.io.FileNotFoundException: /var/folders/g_/djj279317y12n7wn7wc6vfj00000gn/T/spark-local-20130423105354-3d97/01/shuffle_0_60_373 (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at spark.storage.DiskStore.getBytes(DiskStore.scala:85)
at spark.storage.DiskStore.getValues(DiskStore.scala:92)
at spark.storage.BlockManager.getLocal(BlockManager.scala:269)
at spark.storage.BlockManager$$anonfun$getMultiple$5.apply(BlockManager.scala:566)
at spark.storage.BlockManager$$anonfun$getMultiple$5.apply(BlockManager.scala:565)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at spark.storage.BlockManager.getMultiple(BlockManager.scala:565)
at spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:48)
at spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:31)
at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
at spark.RDD.iterator(RDD.scala:195)
at spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:19)
at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
at spark.RDD.iterator(RDD.scala:195)
at spark.rdd.MappedRDD.compute(MappedRDD.scala:12)
at spark.RDD.computeOrReadCheckpoint(RDD.scala:206)
at spark.RDD.iterator(RDD.scala:195)
at spark.scheduler.ResultTask.run(ResultTask.scala:76)
at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:74)
at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
I have attached the full log in case it helps.
Thanks,
Eric
Hadoop deals better with reduce operations where one task's data doesn't fit in memory (by being able to spill sort data to disk).
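If per-task reduce data not fitting in memory is indeed the problem here, one possible workaround on the Spark side (a sketch of my own, not something confirmed in this thread) is to spread the same shuffle over more, smaller reduce tasks so that each task holds less data at once. The input path, output path, key parsing, and the partition count of 400 below are illustrative assumptions, not values from the original job; the code uses the Spark 0.7-era "spark" package names seen in the stack trace above.

import spark.SparkContext
import spark.SparkContext._   // implicit conversions that provide reduceByKey on pair RDDs

object SmallerReduceTasks {
  def main(args: Array[String]) {
    // Local mode with 4 worker threads, matching the LocalScheduler in the log.
    val sc = new SparkContext("local[4]", "smaller-reduce-tasks")

    // Hypothetical 15 GB text input; the path is made up for illustration.
    val lines = sc.textFile("/data/big-input")

    // Toy key extraction: first tab-separated field as the key, a count of 1 as the value.
    val pairs = lines.map { line =>
      val fields = line.split('\t')
      (fields(0), 1L)
    }

    // Passing an explicit partition count (400 here, an arbitrary choice) spreads the
    // shuffle over more, smaller reduce tasks instead of a few large ones, so each
    // task has less data to hold in memory at once.
    val counts = pairs.reduceByKey(_ + _, 400)

    counts.saveAsTextFile("/data/output")
    sc.stop()
  }
}

Raising the partition count does not reduce the total amount of data shuffled; it only bounds how much any single reduce task has to keep in memory at one time.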