Dataproc master does not see the correct number of Spark executors


poiuytrez

Oct 29, 2015, 11:23:18 AM
to Google Cloud Dataproc Discussions
Hello, 

Dataproc for Spark does not seem to be reliable at all! I am unable to create my Spark cluster with the correct number of executors.

1. Create a Dataproc cluster:
$ gcloud beta dataproc clusters create my-dataproc 

2. Follow the documentation to access the web UI [1]
3. Connect to the master node using gcloud
4. Run PySpark
$ pyspark
5. Check the number of executors on the Spark web UI

I got only 2 executors (screenshot 1). I should have 8 executors, as I have 2 worker nodes with 4 CPUs each.
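For context, the cluster shape corresponds to something like:

$ gcloud beta dataproc clusters create my-dataproc --num-workers 2 --worker-machine-type n1-standard-4

(the worker count and machine-type flags here are illustrative of the setup described, not necessarily the exact command used in step 1)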

6. After about 3 minutes, I got a lost executor error in my logs:
>>> 15/10/29 15:11:33 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 1 on guillaume-dataproc-w-1.c.databerries-staging.internal: remote Rpc client disassociated
15/10/29 15:11:33 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkE...@guillaume-dataproc-w-1.c.databerries-staging.internal:48238] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

7. If I check the number of executors in the web UI, I have only one executor remaining. 

[Screenshot 1: Spark web UI showing the executors]

[Cluster config screenshot]

Dennis Huo

Oct 29, 2015, 12:55:50 PM
to Google Cloud Dataproc Discussions
Hi there!

Thanks for checking in with us on this; for the most part, what you are seeing is just the difference in how Spark on YARN is configured versus Spark standalone. At the moment, YARN's reporting of "VCores Used" doesn't correspond to a real container reservation of cores; containers are granted based purely on the memory reservation. Overall there are a few things at play here:

  • The Spark AppMaster must run inside a container when running on YARN; it is sized to consume half of one machine, so in your case it should take 5GB and 2 cores, but the "VCores Used" would only show up as 1 since it's counted as a single container.
    • There appears to be a minor rounding bug which currently causes the AppMaster to overshoot the target memory allocation, leaving YARN unable to pack a full second executor onto the same machine as the AppMaster.
  • Dataproc then divides up half a machine for each Spark executor. You can check the Spark "Environment" tab and look at spark.executor.cores; if you're using 4-core machines it should say 2, and if you're using 8-core machines it should say 4. Each executor should also be grabbing 5GB. The reason for this is that each individual executor process carries some fixed overhead, so larger multi-core executors can perform better than lots of executors with only 1 core each. Note that Spark will still correctly pack multiple concurrent tasks onto a single executor based on spark.executor.cores.
    • If you want lots of small executors, you can try something like the following, though note that it will be somewhat thin on memory per executor:
      • pyspark --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=1 --conf spark.executor.memory=1400m --conf spark.executor.instances=99999
  • Dynamic allocation causes Spark to relinquish idle executors back to YARN, and unfortunately at the moment Spark prints that spammy but harmless "lost executor" message. This addresses the classic problem of Spark on YARN, where Spark would grab the maximum number of containers it thought it needed and then never give them up, effectively paralyzing the clusters it ran on. With dynamic allocation, when you start a long job Spark quickly allocates new containers (with something like an exponential ramp-up, so it can fill a full YARN cluster within a couple of minutes), and when idle it relinquishes executors with a similar ramp-down at an interval of about 60 seconds (if idle for 60 seconds, relinquish some executors).
    • If you want to disable dynamic allocation you can run:
      • pyspark --conf spark.dynamicAllocation.enabled=false
      • gcloud beta dataproc jobs submit spark --properties spark.dynamicAllocation.enabled=false --cluster <your-cluster> foo.jar
    • Alternatively, if you specify a fixed number of executors, it should also automatically disable dynamic allocation:
      • pyspark --conf spark.executor.instances=123
      • gcloud beta dataproc jobs submit spark --properties spark.executor.instances=123 --cluster <your-cluster> foo.jar
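Putting those pieces together for your 2-worker, 4-vCPU cluster, a fixed-size interactive session would look roughly like this (the instance count of 3 is an estimate based on the half-machine sizing above, with one slot taken by the AppMaster):

$ pyspark --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=3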
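And since you're running PySpark, the job-submission equivalent would be along these lines (my_script.py is just a placeholder for your own script):

$ gcloud beta dataproc jobs submit pyspark --properties spark.executor.instances=3 --cluster <your-cluster> my_script.py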