A few questions that can help pinpoint the issue:
- Does this occur at the end of the job or in the middle of one?
- When you say "spark processing does not begin until it ends", are you referring to the processing of the next separate, independent job, or of the next "stage" within a single job?
- How are you performing the writes? Are you writing out a DataFrame, or are you using lower-level RDD interfaces?
- Is the final output GCS directory in the same bucket as where you're seeing the "repair" directory commands?
- What is the value of fs.defaultFS?
- Which old Dataproc image version were you using that did not exhibit this behavior?
I suspect the "directory repair" itself isn't the root problem, since directory repair normally only happens when incompatible data sources are mixed or when there is unexpected concurrency. In this case, the temporary directories might be getting cleaned up by a threadpool while other threads are still listing their contents, which would be harmless during the cleanup phase of a Spark stage.
The main solution I would suggest trying is adding the following Spark property to your job:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
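
For reference, here's a minimal sketch of one way to apply that property programmatically, assuming a typical DataFrame write to GCS (the app name and output path below are placeholders, not taken from your job):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable the v2 FileOutputCommitter algorithm.
// The "spark.hadoop." prefix tells Spark to copy the setting into the
// Hadoop Configuration used by the output committer.
val spark = SparkSession.builder()
  .appName("gcs-write-example") // placeholder app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Example write; gs://my-bucket/output is a placeholder path.
spark.range(1000).write.mode("overwrite").parquet("gs://my-bucket/output")
```

You can also pass it at submit time with `--properties` on `gcloud dataproc jobs submit spark` instead of setting it in code.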
I'd love to hear whether setting that property fixes your issue.