Hello everyone:
I recently attended Akshay Rai's talk about Dr. Elephant and found it very enlightening, especially because this topic is sometimes ignored despite having a huge impact on the cost, energy and resources used when running Big Data jobs. A bit of analysis of your jobs' metrics, as Dr. Elephant does, can prevent future cluster-usage problems.
Along these lines, we published a paper not long ago on automatically recommending the level of parallelism for Spark jobs. Parallelism parameters, like spark.executor.cores or spark.executor.memory, can have a huge impact on performance, but it's sometimes difficult to know which values you need. What we did was leverage the metrics from previous executions to train a boosted decision tree and predict the impact of changing Spark's parallelism parameters. Depending on the workload, different values for spark.executor.memory and spark.executor.cores can greatly reduce the execution time. The nice thing is that the model learns this on its own, so you don't need hand-crafted heuristics. I'm attaching the paper in case some of the concepts are useful for you guys.
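For anyone curious about the general idea, here is a minimal sketch (not the paper's actual pipeline): fit a gradient-boosted tree on metrics from past runs and use it to score candidate parallelism settings for a new job. The feature names, numbers and the sklearn model choice are illustrative assumptions, not what we used in the paper.

```python
# Sketch: predict run time from (cores, memory, input size) observed in past runs,
# then pick the candidate Spark config with the lowest predicted run time.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical history of past executions:
# columns = executor_cores, executor_memory_gb, input_gb; target = runtime in seconds
X = np.array([
    [2,  4,  50], [4,  8,  50], [8, 16,  50],
    [2,  4, 200], [4,  8, 200], [8, 16, 200],
])
y = np.array([900, 520, 410, 3600, 2000, 1500])

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# Score a few candidate configurations for a new (hypothetical) 120 GB job.
candidates = np.array([[2, 4, 120], [4, 8, 120], [8, 16, 120]])
predicted = model.predict(candidates)
best = candidates[int(np.argmin(predicted))]
print(f"Suggested: spark.executor.cores={best[0]}, spark.executor.memory={best[1]}g")
```

In practice you would of course feed in many more runs and richer metrics, but the point is that the tree picks up the workload/parameter interactions without hand-written rules.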
Keep up the good work! :)
Alvaro