I have not used standalone mode for a good while. Standard Dataproc uses YARN as the resource manager. Vanilla Dataproc is Google's answer to Hadoop in the cloud: move your analytics workload from on-premises to the cloud with little effort and the same look and feel. Google then introduced dynamic allocation of resources to cater for those apps that could not easily be migrated to Kubernetes (GKE). The doc states that without dynamic allocation, the application only asks for containers at the beginning of the job; with dynamic allocation, it will remove containers, or ask for new ones, as necessary. This still uses YARN. See here.

This approach was not necessarily very successful, as adding executors dynamically for larger workloads could freeze the Spark application itself. Reading the doc, it says the startup time for Serverless is 60 seconds, compared to 90 seconds for Dataproc on Compute Engine (the one where you set up your own Spark cluster on Dataproc tin boxes).
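For reference, this is roughly what enabling dynamic allocation looks like on a YARN-backed cluster. A minimal PySpark sketch, not taken from any Dataproc doc: the app name and executor counts are placeholders, and on YARN the external shuffle service normally has to be enabled so shuffle files outlive executors that get removed.

```python
from pyspark.sql import SparkSession

# Placeholder app name and sizing values -- adjust to your own cluster.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    # Let Spark add/remove executors based on the backlog of pending tasks.
    .config("spark.dynamicAllocation.enabled", "true")
    # On YARN, dynamic allocation normally relies on the external shuffle
    # service so shuffle data survives executor removal.
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # How long an executor must sit idle before it is released.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)

# Any normal workload; executors scale between 2 and 20 as demand changes.
df = spark.range(0, 10_000_000).selectExpr("id % 100 AS key", "id")
df.groupBy("key").count().show()
```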
The Dataproc Serverless for Spark autoscaling doc states: "Dataproc Serverless autoscaling is the default behavior, and uses Spark dynamic resource allocation to determine whether, how, and when to scale your workload." So the key point is that this is not standalone mode; it refers generally to Spark's dynamic resource allocation: "Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster."
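To make that concrete, these are the kinds of standard dynamic-allocation properties that Serverless autoscaling is driven by. This is only a sketch with made-up values; in practice you would typically pass such properties when submitting the batch rather than setting them in code.

```python
from pyspark.sql import SparkSession

# Illustrative values only -- Dataproc Serverless enables dynamic allocation
# by default, so setting these explicitly is just to show which knobs exist.
spark = (
    SparkSession.builder
    .appName("serverless-autoscaling-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "2")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "100")
    # Fraction of the "ideal" executor count actually requested (standard
    # Spark property; 1.0 requests one executor per pending task slot).
    .config("spark.dynamicAllocation.executorAllocationRatio", "1.0")
    .getOrCreate()
)
```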
Isn't this the standard Spark resource allocation? So why has this suddenly been elevated since Spark 3.2?
Someone may give a more qualified answer here :)