How to avoid generating _SUCCESS, _committed, and _started files when writing CSV

sriram.v...@gmail.com

Feb 4, 2021, 11:06:42 PM
to Delta Lake Users and Developers
I am trying to generate CSV files by reading a Delta table in Databricks. The CSV files will be consumed by downstream systems, which read every file in an ADLS directory. Spark writes _SUCCESS, _committed, and _started files alongside the partitioned CSV files, and my downstream ETL process refuses to process the CSVs because of the presence of these three marker files. How can I stop them from being generated?
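
For context, here is a minimal sketch of the kind of write that produces these files (the paths and options are hypothetical, for illustration only):

# Read the Delta table and export it as CSV (hypothetical paths).
df = spark.read.format("delta").load("abfss://data@myaccount.dfs.core.windows.net/delta/events")

# Each partition becomes a part-*.csv file; alongside them, the committer
# also writes _SUCCESS, _started_<id>, and _committed_<id> marker files.
df.write.mode("overwrite").option("header", "true").csv(
    "abfss://data@myaccount.dfs.core.windows.net/export/events_csv"
)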

I have tried setting these three flags on the Databricks cluster:

mapreduce.fileoutputcommitter.marksuccessfuljobs=false
parquet.enable.summary-metadata=false
spark.hadoop.parquet.enable.summary-metadata=false
None of these three flags made a difference.

The flag below works, but it overrides the DBIO output committer:
spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
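
To avoid overriding DBIO for the whole cluster, one possibility (a sketch, assuming this conf is honored when set at session scope on your runtime version) is to set it only in the notebook or job that produces the CSV export:

# Switch to the vanilla Hadoop commit protocol for this session only
# (assumption: runtime-scoped setting takes effect on your Databricks runtime).
spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
)
# With the Hadoop committer active, this also suppresses the _SUCCESS marker.
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")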

Is there a way to achieve this with Delta and Databricks?

rohit haseja

Feb 5, 2021, 1:06:14 AM
to sriram.v...@gmail.com, Delta Lake Users and Developers
Hey Sriram,

I haven't seen a direct way of doing this, but you could use the following steps (sketched below):
1. Copy the CSV files to the target location using dbutils.fs.cp().
2. Once the data is copied and validated, use dbutils.fs.rm() to remove the staging folder where the _SUCCESS, _started, and _committed files are present.
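
A minimal sketch of those two steps (hypothetical paths; dbutils is available in Databricks notebooks):

# Hypothetical staging and target locations.
staging_dir = "abfss://data@myaccount.dfs.core.windows.net/tmp/events_csv"
target_dir = "abfss://data@myaccount.dfs.core.windows.net/export/events_csv"

# Step 1: copy only the part-*.csv data files to the target location.
for f in dbutils.fs.ls(staging_dir):
    if f.name.startswith("part-") and f.name.endswith(".csv"):
        dbutils.fs.cp(f.path, target_dir + "/" + f.name)

# Step 2: once the copied data is validated, remove the staging folder,
# which still contains the _SUCCESS, _started, and _committed markers.
dbutils.fs.rm(staging_dir, recurse=True)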

I am currently using this approach.
Thanks,
Rohit Haseja.
