PySpark, parquet and google storage


cvisi...@xebia.com

Feb 10, 2016, 2:07:16 AM
to Google Cloud Dataproc Discussions
Hi,

I'm using PySpark to write Parquet files to Google Cloud Storage, and I notice that Spark's default behavior of writing to the `_temporary` folder before moving all the files into place can take a long time on Google Cloud Storage.

I found this, which is probably a solution to the problem:

However, I can't seem to get at the Hadoop configuration (where this setting needs to be applied) from PySpark like I can in Java/Scala.
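
The closest I've found is reaching through PySpark's private _jsc handle to the JVM-side SparkContext, which I'd rather not rely on. A rough, untested sketch (the committer class is the one I'm hoping to switch to):

from pyspark import SparkContext

sc = SparkContext()
# _jsc is the underlying JavaSparkContext; hadoopConfiguration() returns the
# live org.apache.hadoop.conf.Configuration that this SparkContext uses.
sc._jsc.hadoopConfiguration().set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")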

Any suggestions on how to write Parquet files from Dataproc to gs://... efficiently using PySpark?

Thanks,
Constantijn

Dennis Huo

Feb 10, 2016, 1:34:13 PM
to Google Cloud Dataproc Discussions
Generally, Spark will wire anything specified as a Spark property with the "spark.hadoop.*" prefix into the underlying Hadoop configuration, after stripping off that prefix. See this code for the behavior:


It doesn't seem to be well documented, and I suppose it's not clear whether there would ever be plans to deprecate the functionality, but a lot of code has probably come to rely on it by now.

Of course, different classes might interact differently with the Hadoop configuration, and as far as I can tell, things which grab "new Configuration()" instead of going through SparkContext.hadoopConfiguration() may not pick up those wired configs, but either way it's worth a try:

Dataproc CLI:

gcloud beta dataproc jobs submit pyspark --properties spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter

From direct SSH session or other client:

pyspark --conf spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
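
If you'd rather set it programmatically than on the command line, building the property into the SparkConf before the context is created should be equivalent; a minimal sketch (untested, and the gs:// paths are just placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# The "spark.hadoop." prefix is stripped and the rest is copied into the
# Hadoop Configuration that Spark's Parquet writer consults.
conf = SparkConf().set(
    "spark.hadoop.spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.json("gs://some-bucket/input/")  # placeholder input
df.write.parquet("gs://some-bucket/output/")          # placeholder output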

cvisi...@xebia.com

Feb 10, 2016, 4:45:51 PM
to Google Cloud Dataproc Discussions
Thanks!

That helped shave another few minutes off my CSV-to-Parquet job.
Tuning my jobs for gs://... is taking some getting used to. The job started out taking 5.5 hours to transform 20 GB of CSV into Parquet; with this latest tweak I'm down to a much more acceptable 15 minutes (on 4 x n1-standard-8 workers).