I have migrated a portion of a C application to run on Dataproc using PySpark jobs (reading from and writing to BigQuery; data volume around 10 GB). The C application that runs in 8 minutes in our local data centre is taking around 4 hours on Dataproc. Could someone advise me on the optimal Dataproc configuration? At present I am using the one below:
--master-machine-type n2-highmem-32 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n2-highmem-32 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 500 --image-version 1.4-debian10
I would really appreciate any help on the optimal Dataproc configuration.
Thanks,
Rahul
RPS: It is completely on GCP and we are using the Spark-BQ API.
2. Where is the perceived bottleneck - reads/writes to BigQuery from Dataproc?
RPS: This is like one big program with queries/searches running consecutively. We have broken up the big query by writing intermediate results to temporary BQ tables and reading them back again. It is mostly these writes and reads that are taking a lot of time.
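To illustrate, the round trip through a temporary BigQuery table looks roughly like the sketch below; the dataset, table and bucket names are placeholders, not our real ones. The connector stages written data in a GCS bucket (the temporaryGcsBucket option) before loading it into BigQuery, and the next step then reads the table back:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-roundtrip-sketch").getOrCreate()

    # Stand-in for the real intermediate result of one query/search step.
    step1_df = spark.range(1000)

    # Write the intermediate result to a temporary BigQuery table.
    # The connector first stages files in the GCS bucket, then loads them into BQ.
    (
        step1_df.write.format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
        .mode("overwrite")
        .save("my_dataset.tmp_step1")                    # placeholder temp table
    )

    # Read the temporary table back as the input to the next step.
    step2_input_df = (
        spark.read.format("bigquery")
        .load("my_dataset.tmp_step1")
    )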
3. What version of Spark are you using on Dataproc? Running the spark-shell command will tell you.
RPS: We are using image-version 1.4-debian10, which has Spark 2.4 installed on it
4. Are you using the Spark API to write to BQ from PySpark, or some JDBC driver?
RPS: Yes, we are using the Spark API: --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
Also, which version of Dataproc is deployed?
5. Both the Stages and Executors tabs will tell you individual tasks' processing times, which may point to the bottleneck, for example writing to BigQuery.
RPS: Most of the time is spent writing to the BigQuery tables.
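A simple way to put a number on this from inside the job (in addition to the Spark UI) is to wrap the write action in a timer; result_df, the bucket and the table below are placeholders:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-write-timing").getOrCreate()
    result_df = spark.range(1000)  # stand-in for the real intermediate result

    start = time.time()
    (
        result_df.write.format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
        .mode("overwrite")
        .save("my_dataset.tmp_step1")                    # placeholder temp table
    )
    print("BigQuery write took %.1f s" % (time.time() - start))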
6. Also, as you are using PySpark, you are running it in YARN and client mode. Dataproc is offered as IaaS, so unless you change the node types (tin boxes), it is already pretty optimised for what you have deployed.
RPS: We are running it in YARN mode (master='yarn-cluster').
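As far as I understand, Spark 2.x treats master 'yarn-cluster' as a deprecated alias for master 'yarn' with deploy mode 'cluster', and on Dataproc the job submission supplies both, so the script normally does not hard-code a master at all. A minimal sketch of how the session gets created (the app name is a placeholder):

    from pyspark.sql import SparkSession

    # Master and deploy mode come from the Dataproc/YARN job submission,
    # so the script itself only names the application.
    spark = (
        SparkSession.builder
        .appName("bq-migration-job")  # placeholder app name
        .getOrCreate()
    )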
Thanks,