overwrite mode is not working on Partiton Date folder

Vipinkumar Jha

unread,

Dec 14, 2021, 5:51:12 AM12/14/21

to Delta Lake Users and Developers

Hi All,

I am working on one requirement where I need to check whether current date folder exists or not in Processed_Path.

If it exist it should be deleted. I am able to do this

# function to check pathExistence

def path_exists(path):
try:
dbutils.fs.ls(path)
return True
except Exception as e:
if 'java.io.FileNotFoundException' in str(e):
return False
else:
raise

#if path_exists(process_date):
# dbutils.fs.rm(process_date,True)
# print("Path is Exist and got Removed")
#
#else:
#
# print("Path Does Not Exist")

I am able to check the current date folder and delete it. But it is falling to next step when trying to read the process_path by saying below error

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 89.0 failed 4 times, most recent failure: Lost task 1.3 in stage 89.0 (TID 145, 172.21.128.133, executor 0): com.databricks.sql.io.FileReadException: Error while reading file abfss:mnt/root/USA/process_date=2021-12-14/part-00000-1e134b56-b886-96ac-9faf-bc34cfda44df.c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions

by writing the data in processed_path I am using below command

df.write.mode("overwrite").format("delta").option("mergeSchema", "true").partitionBy("process_date").save(process_delta_path)

but overwrite is not working and it keep writing the new file in same folder.

I want to over write the process_date=2021-12-14 folder and write the new file in it.

can you please let me know where I am missing the thing which help me to achieve my requirement.

Shixiong(Ryan) Zhu

unread,

Dec 15, 2021, 10:02:38 PM12/15/21

to Vipinkumar Jha, Delta Lake Users and Developers

The data files in your table folder should be managed by Delta. Deleting any files in the table manually through file system operations such as `dbutils.fs` may cause issues like this.

The overwrite command you provided is the right way to delete data, but it will overwrite the entire table. If you just need to replace a partition/partitions, you can use replaceWhere.

When overwriting a table, Delta will create new parquet files but still keep the old files so that you can time travel back to a previous table version. Delta uses transaction logs to track which parquet files should be read and will ignore old files. You can take a look at this blog post to understand how Delta works: https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html

If you worry about the storage cost of old files, you can run vacuum to clean up old files.

Best Regards,

Ryan

--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/e1aa23b4-754e-4882-89ed-310a0f85599fn%40googlegroups.com.

Vipinkumar Jha

unread,

Dec 16, 2021, 4:05:40 AM12/16/21

to Delta Lake Users and Developers

Thanks for the clarification.

Even I sued the below command now

Delta_Df.write.mode("overwrite").format("delta").options("replaceWhere","process_date = '2021-12-14' ").option("mergeSchema", "true").partitionBy("process_date").save(process_delta_path)

but it is keeping ading a new file in processed_path folder.

My aim is to have only 1 file.

As there is another process in ADF which will copy the file from processed_path to sql table, so I dont want it to copy all files.

Is there any way to do the vaccume on folder path, if yes can you please share the synatx for same.