I have migrated a portion of a C application to run on Dataproc using PySpark jobs (reading from and writing to BigQuery; data volume around 10 GB). The C application that runs in 8 minutes in our local data centre is taking around 4 hours on Dataproc. Could someone advise me on the optimal Dataproc configuration? At present I am using the one below:
--master-machine-type n2-highmem-32 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n2-highmem-32 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 500 --image-version 1.4-debian10
I would really appreciate any help on the optimal Dataproc configuration.
Thanks,
Rahul
RPS: It is completely on GCP and we are using the Spark-BQ API.
2. Where is the perceived bottleneck - reads/writes to BigQuery from Dataproc?
RPS: This is like one big program with queries/searches running consecutively. We have broken up the big query by writing intermediate results to temporary BQ tables and reading them back again. It is mostly these writes and reads that are taking a lot of time.
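To illustrate, the round trip through a temporary BigQuery table looks roughly like the sketch below; the dataset, table and bucket names are placeholders, not our real ones. The connector stages written data in a GCS bucket (the temporaryGcsBucket option) before loading it into BigQuery, and the next step then reads the table back:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-roundtrip-sketch").getOrCreate()

    # Stand-in for the real intermediate result of one query/search step.
    step1_df = spark.range(1000)

    # Write the intermediate result to a temporary BigQuery table.
    # The connector first stages files in the GCS bucket, then loads them into BQ.
    (
        step1_df.write.format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
        .mode("overwrite")
        .save("my_dataset.tmp_step1")                    # placeholder temp table
    )

    # Read the temporary table back as the input to the next step.
    step2_input_df = (
        spark.read.format("bigquery")
        .load("my_dataset.tmp_step1")
    )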
3. What version of Spark are you using on Dataproc? Running the spark-shell command will tell you.
RPS: We are using image-version 1.4-debian10, which has Spark 2.4 installed on it
4. Are you using the Spark API to write to BQ from PySpark, or some JDBC driver?
RPS: Yes, we are using the Spark API: --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
Also, which version of Dataproc is deployed?
5. Both the Stages and Executors tabs will tell you individual tasks' processing times, which may point to the bottleneck, for example writing to BigQuery.
RPS: Most of the time is spent writing to the BigQuery tables.
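A simple way to put a number on this from inside the job (in addition to the Spark UI) is to wrap the write action in a timer; result_df, the bucket and the table below are placeholders:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-write-timing").getOrCreate()
    result_df = spark.range(1000)  # stand-in for the real intermediate result

    start = time.time()
    (
        result_df.write.format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
        .mode("overwrite")
        .save("my_dataset.tmp_step1")                    # placeholder temp table
    )
    print("BigQuery write took %.1f s" % (time.time() - start))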
6. Also, as you are using PySpark, you are running it in YARN and client mode. Dataproc is offered as IaaS, so unless you change the node types (tin boxes), it is already pretty optimised for what you have deployed.
RPS: We are running it in YARN mode (master='yarn-cluster').
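As far as I understand, Spark 2.x treats master 'yarn-cluster' as a deprecated alias for master 'yarn' with deploy mode 'cluster', and on Dataproc the job submission supplies both, so the script normally does not hard-code a master at all. A minimal sketch of how the session gets created (the app name is a placeholder):

    from pyspark.sql import SparkSession

    # Master and deploy mode come from the Dataproc/YARN job submission,
    # so the script itself only names the application.
    spark = (
        SparkSession.builder
        .appName("bq-migration-job")  # placeholder app name
        .getOrCreate()
    )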
Thanks,