PySpark, parquet and google storage


cvisi...@xebia.com

Feb 10, 2016, 2:07:16 AM
to Google Cloud Dataproc Discussions
Hi,

I'm using PySpark to write Parquet files to Google Cloud Storage, and I notice that Spark's default behavior of writing to the `_temporary` folder before moving all the files into place can take a long time on Google Cloud Storage.

I found this, which is probably a solution to the problem:

However, I can't seem to get at the Hadoop configuration (where this setting needs to be applied) from PySpark like I can in Java/Scala.
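
The closest I've found is reaching through PySpark's private _jsc handle to the JVM-side SparkContext, which I'd rather not rely on. A rough, untested sketch (the committer class is the one I'm hoping to switch to):

from pyspark import SparkContext

sc = SparkContext()
# _jsc is the underlying JavaSparkContext; hadoopConfiguration() returns the
# live org.apache.hadoop.conf.Configuration that this SparkContext uses.
sc._jsc.hadoopConfiguration().set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")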

Any suggestions on how to write Parquet files from Dataproc to gs://... efficiently using PySpark?

Thanks,
Constantijn

Dennis Huo

Feb 10, 2016, 1:34:13 PM
to Google Cloud Dataproc Discussions
Generally, Spark will wire anything specified as a Spark property with the "spark.hadoop.*" prefix into the underlying Hadoop configuration, after stripping off that prefix. See this code for the behavior:


It doesn't seem to be well documented, and I suppose it's not clear whether there would ever be plans to deprecate the functionality, but a lot of code has probably come to rely on it by now.

Of course, different classes might interact differently with the Hadoop configuration, and as far as I can tell, things which grab "new Configuration()" instead of going through SparkContext.hadoopConfiguration() may not pick up those wired configs, but either way it's worth a try:

Dataproc CLI:

gcloud beta dataproc jobs submit pyspark --properties spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter

From direct SSH session or other client:

pyspark --conf spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
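
If you'd rather set it programmatically than on the command line, building the property into the SparkConf before the context is created should be equivalent; a minimal sketch (untested, and the gs:// paths are just placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# The "spark.hadoop." prefix is stripped and the rest is copied into the
# Hadoop Configuration that Spark's Parquet writer consults.
conf = SparkConf().set(
    "spark.hadoop.spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.json("gs://some-bucket/input/")  # placeholder input
df.write.parquet("gs://some-bucket/output/")          # placeholder output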

cvisi...@xebia.com

Feb 10, 2016, 4:45:51 PM
to Google Cloud Dataproc Discussions
Thanks!

That helped shave another few minutes off my CSV-to-Parquet job.
Tuning my jobs for gs://... is taking some getting used to. The job started out taking 5.5 hours to transform 20 GB of CSV into Parquet; with this latest tweak I'm down to a much more acceptable 15 minutes (on 4 x n1-standard-8 workers).