If I recall correctly, this is just an issue with how Spark-on-YARN works.
In MapReduce, map and reduce tasks are generally pretty short (seconds, or a couple of minutes at most). So if one job starts and fills up the entire cluster, and then you submit a second job, the second job will get slots to allocate its map and reduce containers within seconds.
However, Spark is different. As long as there are enough Spark tasks pending, the first Spark job will keep allocating executors until it fills the entire cluster. Executors are essentially long-running daemons: under dynamic allocation, they do not exit until they have been idle (run no tasks) for 1 minute by default. The Spark docs have a great explanation of how dynamic allocation works, and we discuss it in the Autoscaling docs as well.
This means that if you submit a second job, it has no room to allocate its app master or executors until the first job finishes, or at least until some of the first job's executors have sat idle for 1 minute and been released.
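If you mainly want the first job to give up idle executors faster, that timeout is configurable; a sketch (the 30s value is an arbitrary example, not a recommendation):

```properties
# spark-defaults.conf (or pass as --conf flags): release idle executors sooner.
# The default for spark.dynamicAllocation.executorIdleTimeout is 60s.
spark.dynamicAllocation.executorIdleTimeout=30s
```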
There are a couple ways to get around this:
1) Disable dynamic allocation: set spark.dynamicAllocation.enabled=false and explicitly set spark.executor.instances=<some-number>.
2) Or, keep dynamic allocation on, and set the max number of executors to a smaller number (spark.dynamicAllocation.maxExecutors=<some-number>).
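On the command line, the two options might look like this (the executor count of 4 and the job artifact name are placeholders you'd tune to your cluster):

```shell
# Option 1: disable dynamic allocation and pin the executor count.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=4 \
  your-job.jar

# Option 2: keep dynamic allocation, but cap it below cluster capacity
# so a second job's containers always have somewhere to go.
spark-submit \
  --conf spark.dynamicAllocation.maxExecutors=4 \
  your-job.jar
```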
Note that on n1-standard-1 VMs we run 1 executor per node, and on n1-standard-4 VMs we run 2 executors per node. Also note that the app master for each job takes one "slot". So a cluster of 2 n1-standard-1 worker VMs has 2 slots total; the app master takes 1, leaving room for only 1 executor, iirc.
After lunch, I'll run your repro and confirm this theory.