I am running Spark 0.7.2 on a five-node cluster; each machine has 64 GB of RAM and 24 cores. I started the master and the worker nodes manually (for some reason bin/start-all.sh doesn't work). I set SPARK_MEM to 60g and SPARK_WORKER_MEMORY to 64g.
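For reference, both variables are set in conf/spark-env.sh on every node, roughly as sketched below (trimmed to the two relevant lines; the rest of the file is omitted):

export SPARK_WORKER_MEMORY=64g   # total memory this worker may allocate to executors
export SPARK_MEM=60g             # JVM heap requested for each executor and the driver

To test the connectivity between the nodes, I ran a simple application: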
import java.net.InetAddress

// one task per core: 5 nodes x 24 cores = 120 partitions
val hostnames = sc.parallelize(1 to 24 * 5, 24 * 5).map(i =>
  InetAddress.getLocalHost().getHostName()
).collect()
The executors fail almost immediately; this is the driver-side log:

13/07/18 22:32:43 INFO SparkDeploySchedulerBackend: Granted executor ID app-20130718223243-0005/2 on host a.b.c.d with 24 cores, 128.0 MB RAM
13/07/18 22:32:43 INFO DAGScheduler: Got job 0 (collect at cluster.scala:23) with 120 output partitions (allowLocal=false)
13/07/18 22:32:43 INFO DAGScheduler: Final stage: Stage 0 (map at cluster.scala:21)
13/07/18 22:32:43 INFO DAGScheduler: Parents of final stage: List()
13/07/18 22:32:43 INFO Client$ClientActor: Executor updated: app-20130718223243-0005/0 is now RUNNING
13/07/18 22:32:43 INFO Client$ClientActor: Executor updated: app-20130718223243-0005/1 is now RUNNING
13/07/18 22:32:43 INFO Client$ClientActor: Executor updated: app-20130718223243-0005/0 is now FAILED (class java.io.IOException: Cannot run program "/.../lib/spark-0.7.2/run" (in directory "/.../lib/spark-0.7.2/work/app-20130718223243-0005/0"): java.io.IOException: error=12, Cannot allocate memory)
13/07/18 22:32:43 INFO SparkDeploySchedulerBackend: Executor app-20130718223243-0005/0 removed: class java.io.IOException: Cannot run program "/.../lib/spark-0.7.2/run" (in directory "/export/cnc_cup/lib/spark-0.7.2/work/app-20130718223243-0005/0"): java.io.IOException: error=12, Cannot allocate memory