Excessive Google Cloud Storage repair time

Daniel Solow

Apr 23, 2021, 12:11:36 PM
to Google Cloud Dataproc Discussions
Using Dataproc version 2.0.6-ubuntu18, I'm seeing several Spark jobs essentially pause processing after writing output to Google Cloud Storage while the following message appears many times:

21/04/23 15:16:21 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem: Successfully repaired '.../_temporary/0/task_202104231515498442210457656115980_0335_m_000054/' directory.

This seems to hold the job up for quite a while, as Spark processing does not begin until it ends. Is this processing necessary? Can I avoid it? Is it expected to run this slowly?
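For context, the write is roughly of this shape (a simplified sketch, not my actual code; the bucket paths and partition column are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-write-example").getOrCreate()

# Placeholder input; the real job builds the DataFrame from other sources.
df = spark.read.parquet("gs://example-bucket/input/")

# Partitioned Parquet write to GCS. The output is staged under .../_temporary/
# by the file output committer before being moved into place.
df.write.partitionBy("some_column").parquet("gs://example-bucket/output/")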

Thanks

Mich Talebzadeh

Apr 23, 2021, 12:28:21 PM
to Google Cloud Dataproc Discussions
Where is the Spark job writing to?

Is it BigQuery?

Daniel Solow

Apr 23, 2021, 12:29:53 PM
to Google Cloud Dataproc Discussions
As I stated in my post, it's writing to Google Cloud Storage.

Mich Talebzadeh

Apr 23, 2021, 12:42:58 PM
to Google Cloud Dataproc Discussions
Sure, but I think that is referring to temporary storage used for buffering.

Something you would set with:

spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])


Have you checked that location to start with and removed old files?


Daniel Solow

Apr 23, 2021, 12:45:49 PM
to Google Cloud Dataproc Discussions
I am not setting any configuration option called "temporaryGcsBucket" -- this is behavior I'm noticing after moving from Spark 2 to Spark 3.

What concerns me is that it seems like these repairs are happening only on the driver (i.e. no parallelization) and hold up the rest of the job.

Mich Talebzadeh

Apr 23, 2021, 1:01:07 PM
to Google Cloud Dataproc Discussions


Hard to guess what is happening with Spark 3.1.1 without delving into the code. Do you have access to the Spark UI? What is the most time-consuming task?

Daniel Solow

Apr 23, 2021, 1:11:43 PM
to Google Cloud Dataproc Discussions
While this log message is being printed, no Spark processing is occurring. As I posted above, it seems to be happening only on the driver after I write output to GCS.

Mich Talebzadeh

unread,
Apr 23, 2021, 1:30:17 PM4/23/21
to Google Cloud Dataproc Discussions
I have not seen that behaviour myself. Is this PySpark running in client mode?

I tend to create temporary storage myself anyway on a given bucket, for example a bucket called tmp_storage_bucket with a tmp folder beneath it:

tmp_storage_bucket/tmp

I set the Spark parameters as below (in my case with Spark streaming):

        spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        spark.conf.set("temporaryGcsBucket", 'tmp_storage_bucket/tmp')
        spark.conf.set("spark.sql.streaming.checkpointLocation", 'tmp_storage_bucket/tmp')

Dennis Huo

Apr 23, 2021, 1:35:39 PM
to Google Cloud Dataproc Discussions
A few questions that can help pinpoint the issue:
  • Does this occur at the end of the job or in the middle of one?
  • When you say "Spark processing does not begin until it ends", are you talking about the processing of the next separate independent job, or of the next "stage" within a single job?
  • How are you performing the writes? Is it a DataFrame write or are you using lower-level RDD interfaces?
  • Is the final output GCS directory in the same bucket as where you're seeing the "repair" directory commands?
  • What is the value of fs.defaultFS?
  • Which old Dataproc image version were you using that did not exhibit this behavior? 
I suspect the "directory repair" itself isn't the root problem, because normally directory repair only happens if incompatible sources of data are being used or if there are unexpected sources of concurrency. In this case, the temporary directories might be getting cleaned up by a threadpool while others are listing their contents, which would be harmless in the cleanup phase of a Spark stage.

The main solution I would suggest trying is adding the following Spark property to your job:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Would love to hear if setting that property fixes your issue.
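If you're building the SparkSession yourself in PySpark, a minimal sketch of one way to pass it (the app name here is just a placeholder; it can equally be supplied as a cluster or job property):

from pyspark.sql import SparkSession

# Sketch: the spark.hadoop.* prefix forwards the property into the job's
# Hadoop configuration, selecting the v2 FileOutputCommitter algorithm.
spark = (SparkSession.builder
         .appName("my-job")  # placeholder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())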

Daniel Solow

Apr 23, 2021, 1:52:02 PM
to Google Cloud Dataproc Discussions
  • It occurs whenever Spark finishes writing output, which can be in the middle of a job or at the end of a job. It's especially slow when I am partitioning the output by several columns (but still not an unreasonable number).
  • The next stage within a single job
  • df.write.partitionBy(...).parquet("gs://...")
  • Yes, it's the same bucket
  • fs.defaultFS is hdfs://$cluster_name-m -- but I want to be clear that HDFS is not involved anywhere as far as I can tell
  • 1.4-ubuntu18 is a good example that was not having this issue

I'm trying a run with fs.gs.implicit.dir.repair.enable=false, and if that doesn't help I'll try your suggestion next.
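For anyone else hitting this, one way to set that property from PySpark when building the session (a sketch; it can also be passed as a job or cluster property at submit time):

from pyspark.sql import SparkSession

# Sketch: disable the GCS connector's implicit directory repair for this job.
# The spark.hadoop.* prefix forwards the setting into the Hadoop configuration
# that the GCS connector reads.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
         .getOrCreate())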

Daniel Solow

Apr 23, 2021, 1:53:42 PM
to Google Cloud Dataproc Discussions
It looks like setting fs.gs.implicit.dir.repair.enable=false fixed the problem, FYI.

Guilherme Nobre

Jun 14, 2021, 12:48:56 PM
to Google Cloud Dataproc Discussions
I'm having exactly the same issue when writing a DataFrame to BigQuery with the spark-bigquery-connector. I migrated from Spark 2.4.4 & 1.4-ubuntu18 to Spark 3.1.2 & 2.0-ubuntu18.
The logs get flooded with "GoogleCloudStorageFileSystem: Successfully repaired" messages when the writing process starts.
Here is the code:

df.write
.format("bigquery")
.option("temporaryGcsBucket", "gs://mybucket")
.option("partitionField", "timestamp")
.option("clusteredFields", "type")
.mode("append")
.save(tableId)

I'll try the fs.gs.implicit.dir.repair.enable option too.
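In case it's useful, one way to try that property on an already-running PySpark session is via the underlying Hadoop configuration (a sketch; depending on when the connector reads the configuration, passing it as a job property at submit time may be the more reliable route):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: turn off the GCS connector's implicit directory repair by writing
# directly to the Hadoop configuration of the running session.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.gs.implicit.dir.repair.enable", "false")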

Guilherme Nobre

Jun 14, 2021, 2:52:35 PM
to Google Cloud Dataproc Discussions
This worked for me:

--properties=spark.hadoop.fs.gs.implicit.dir.repair.enable=false