How to correctly use delta.logRetentionDuration in spark streaming context


Amar Laddha

Jul 27, 2020, 8:25:40 PM
to Delta Lake Users and Developers
Hi Team,

I am using the open source version of Delta 0.7.0, and our use case is to stream data and save it in Delta Lake.

Periodically I repartition to a small number of partitions so that reads are faster, and periodically I vacuum old versions of the data to reduce space.

But I have seen that the Delta log keeps growing, and the VACUUM command is not designed to remove log files.

I want to limit the log retention duration. Will the following work, and is it the correct way to retain/clean logs periodically?

1. spark.writeStream.option("delta.logRetentionDuration", "interval 1 hours").format("delta")....

2. delta.logRetentionDuration only removes unused logs older than the given interval, correct? In that case, keeping an interval of 1 hour should be safe, correct?

Thanks,
Amar

Tathagata Das

Jul 27, 2020, 8:49:12 PM
to Amar Laddha, Delta Lake Users and Developers
By default, the log is retained for 30 days; that is, log versions older than 30 days are automatically removed. "delta.logRetentionDuration" is a table property (not a write option) that controls this. See https://docs.delta.io/latest/delta-batch.html#-data-retention for details. You can use the "ALTER TABLE ... SET TBLPROPERTIES" SQL command to set that property.
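To make the distinction concrete, here is a minimal sketch of composing that ALTER TABLE statement. The helper name `retention_tblproperties_sql` is hypothetical; in a live session the resulting string would be passed to `spark.sql()`. Only the property name and SQL form come from the docs linked above.

```python
# Hypothetical helper: delta.logRetentionDuration is a TABLE PROPERTY,
# not a writeStream option, so it is set via ALTER TABLE ... SET TBLPROPERTIES.
def retention_tblproperties_sql(table_name, interval):
    """Compose the ALTER TABLE statement for the log retention property."""
    return (
        f"ALTER TABLE {table_name} SET TBLPROPERTIES "
        f"('delta.logRetentionDuration' = 'interval {interval}')"
    )

# In a live session: spark.sql(retention_tblproperties_sql("events", "2 days"))
stmt = retention_tblproperties_sql("events", "2 days")
print(stmt)
```

The table name "events" is just a placeholder for illustration.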

Why do you want to reduce the log retention duration? Is the size of the Delta log too large?

TD 

--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/d73c9169-832d-46be-9573-efb983784d97o%40googlegroups.com.

Amar Laddha

Jul 27, 2020, 9:04:42 PM
to Tathagata Das, Delta Lake Users and Developers
Thanks, Tathagata. I will try the ALTER TABLE command.

Yes, the log size is 253 GB while the actual data size is around 2.5 GB. This is my append-only ingestion table, where I am reading from Kafka every 500 milliseconds.

Thanks,
Amar

Amar Laddha

Jul 28, 2020, 6:37:15 PM
to Delta Lake Users and Developers
Hi Tathagata,

I added the table property on my Delta table through the Spark Thrift Server and restarted my Spark streaming job (append mode). As per the documentation, log files are automatically cleaned up after checkpoints are written.

But I don't see older delta logs getting cleaned up.

Details:
Spark: 3.0
Delta: 0.7.0
Table properties:

SHOW TBLPROPERTIES <table_name>;

transient_lastDdlTime    1595974271
delta.logRetentionDuration    interval 2 days

Can you please let me know if I am missing something?

Thanks,
Amar

Tathagata Das

Jul 28, 2020, 7:43:03 PM
to Amar Laddha, Delta Lake Users and Developers
I don't follow. You said "log files are automatically cleaned up," but you also said "I don't see older delta logs getting cleaned up." Can you elaborate and clarify?



Amar Laddha

Jul 29, 2020, 3:50:29 AM
to Tathagata Das, Delta Lake Users and Developers

I meant that the documentation link you shared says the log files will be automatically cleaned up after checkpoints are written.


After applying the retention period, I reran the streaming job, and I am compacting the data periodically. I saw that Delta created new checkpoints, so I was hoping it would also remove the log entries for unused versions, but it did not.

So I wanted to know if I have missed any steps.

Thanks,
Amar

Tathagata Das

Jul 31, 2020, 1:14:54 PM
to Amar Laddha, Delta Lake Users and Developers
If you keep running the usual write workloads for a while with the log retention set to 2 hours, then eventually the logs for versions more than 2 hours old should get cleaned up. However, there is some fuzziness there, as some of the log files may still be needed for versions that are less than 2 hours old, so you might see some delays. But if the table is continuously being written to, they should get cleaned up.
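The rule described above can be illustrated with a small sketch. This is not Delta's actual implementation, only an approximation of the semantics: a log version becomes deletable once it is older than the retention window AND a later checkpoint (itself past the cutoff) exists to anchor reads; until such a checkpoint exists, nothing is deleted, which is the "fuzziness" mentioned.

```python
from datetime import datetime, timedelta

# Illustrative only (NOT Delta's real cleanup code): which log versions
# could be deleted, given commit timestamps, checkpoint versions, and
# a retention window.
def deletable_versions(version_times, checkpoint_versions, now, retention):
    """Return versions whose log files are eligible for cleanup."""
    cutoff = now - retention
    # Need a checkpoint at or before the cutoff so old versions stay readable.
    anchors = [v for v, t in version_times.items()
               if v in checkpoint_versions and t <= cutoff]
    if not anchors:
        return []  # nothing can go yet: the delay/"fuzziness" in practice
    anchor = max(anchors)
    return sorted(v for v, t in version_times.items()
                  if t <= cutoff and v < anchor)

now = datetime(2020, 8, 1, 12, 0)
# Twelve commits, one per hour: version 0 is 12h old, version 11 is 1h old.
times = {v: now - timedelta(hours=12 - v) for v in range(12)}
print(deletable_versions(times, {10}, now, timedelta(hours=2)))  # versions 0..9
```

With a checkpoint only at the newest version (inside the retention window), the sketch deletes nothing, matching the observation that cleanup lags until a sufficiently old checkpoint exists.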

Amar Laddha

Jul 31, 2020, 1:42:02 PM
to Tathagata Das, Delta Lake Users and Developers
Thanks again, Tathagata. I will try it out.

Thanks,
Amar

Amar Laddha

Aug 3, 2020, 3:07:00 PM
to Tathagata Das, Delta Lake Users and Developers
Hi Tathagata,

I tried the following steps, but the older logs (along with the older checkpoint files) are still not being removed from the _delta_log folder, even after 12 hours with a retention interval of 2 hours.

1. Existing ingestion job that streams data and saves it to HDFS
2. Created a Hive table against an existing Delta table using the following command:

CREATE TABLE <TABLE_NAME>
USING DELTA
LOCATION '<HDFS_LOCATION>'
TBLPROPERTIES('delta.logRetentionDuration'='interval 2 hours');

3. Restarted the job
4. Compacted and vacuumed the data
5. The job has been running for more than 12 hours, but it didn't clean up the older logs

Please let me know if I am doing any step incorrectly.

Thanks,
Amar

Tathagata Das

Aug 4, 2020, 2:29:42 PM
to Amar Laddha, Delta Lake Users and Developers
Do you see any log files getting removed at all? That is, are there gaps in the log files after 12 hours, like:
- last 2 hours of log files present
- previous 10 hours of log files deleted
- very old log files still present


Amar Laddha

Aug 4, 2020, 2:50:45 PM
to Tathagata Das, Delta Lake Users and Developers
I checked that before I reached out. There are no gaps in the logs. I count the number of files every couple of hours, and it simply keeps growing.

Here are the names of initial files which are still present and are non-empty:
00000000000000000000.checkpoint.parquet
00000000000000000000.json
00000000000000000001.json
...
00000000000000000010.checkpoint.parquet
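One way to automate the gap check above is to parse the version numbers out of the _delta_log listing. A hedged sketch, assuming the standard naming scheme of twenty-digit, zero-padded versions with a `.json` suffix; `missing_versions` is a hypothetical helper name:

```python
import re

# Hypothetical helper: extract numeric versions from _delta_log file names
# like 00000000000000000005.json and report any missing versions
# (gaps would indicate that partial cleanup has happened).
def missing_versions(filenames):
    versions = sorted(int(m.group(1))
                      for f in filenames
                      if (m := re.fullmatch(r"(\d{20})\.json", f)))
    if not versions:
        return []
    present = set(versions)
    return [v for v in range(versions[0], versions[-1] + 1)
            if v not in present]

listing = ["00000000000000000000.checkpoint.parquet",
           "00000000000000000000.json",
           "00000000000000000001.json",
           "00000000000000000003.json"]
print(missing_versions(listing))  # prints [2]
```

An unbroken run of versions starting at 0, as observed here, would return an empty list, confirming that no log files have been cleaned up.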

Thanks,
Amar

Amar Laddha

Aug 5, 2020, 7:11:03 PM
to Tathagata Das, Delta Lake Users and Developers
Hi Tathagata,

I was also trying the following table property on another Delta table, and it didn't seem to auto-create/update the manifest files. Is it possible that none of the table properties are being picked up by Delta? That might explain why the logs are not getting cleaned up either. Is there any way I can check whether Delta is picking up those table properties?

ALTER TABLE <TABLE> SET TBLPROPERTIES('delta.compatibility.symlinkFormatManifest.enabled'='true');

Thanks,
Amar

Amar Laddha

Aug 7, 2020, 11:32:13 PM
to Tathagata Das, Delta Lake Users and Developers
Hi Tathagata,

Is there any other way to debug this issue? Let me know if you need logs or any additional details.

Thanks,
Amar