compaction and merge

Gourav Sengupta

Apr 8, 2021, 2:02:38 PM
to Delta Lake Users and Developers
Hi,

A streaming pipeline is writing data to a location using MERGE INTO, and I also have to run compaction because of the many small files it produces.

As per this link:
https://docs.delta.io/latest/concurrency-control.html#optimistic-concurrency-control
running compaction with dataChange set to false can lead to conflicts.
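
For context, the compaction pattern that page describes looks roughly like this; the path, target file count, and spark (the active SparkSession) are placeholders here, not my actual setup:

val path = "/delta/my-table"     // placeholder table location
val numFiles = 16                // placeholder target number of files

spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")   // rewrite the same data into fewer files
  .format("delta")
  .mode("overwrite")
  .save(path)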

It would be really great if someone could let me know the best way to resolve this. Should I stop the streaming job, run compaction, and then start the streaming job again?

Also, what about vacuum: does that lead to conflicts as well if I run it in a separate pipeline while data is being merged into the same streaming location?


Regards,
Gourav

Tathagata Das

Apr 8, 2021, 4:48:53 PM
to Gourav Sengupta, Delta Lake Users and Developers
Hi Gourav,

If you are running the merge from a streaming query using foreachBatch, then a simple strategy is to run the compaction inside foreachBatch as well (so there is no concurrency).

foreachBatch { 
  // run merge
  // if time since last compaction > 1 day, run compaction
}
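
Concretely, a minimal sketch of that pattern in Scala could look like the following; the table path, join condition, target file count, one-day interval, and checkpoint path are all placeholders to adapt, spark is the active SparkSession, and streamingDF stands for your source stream:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

val tablePath = "/delta/my-table"    // placeholder table location
var lastCompactionMs = 0L            // tracked on the driver across batches

def upsertAndMaybeCompact(batch: DataFrame, batchId: Long): Unit = {
  // 1. Merge the micro-batch into the Delta table.
  DeltaTable.forPath(spark, tablePath).as("t")
    .merge(batch.as("s"), "t.id = s.id")   // placeholder join condition
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()

  // 2. Compact at most once a day, inside the same foreachBatch call so it
  //    never runs concurrently with the merge.
  val now = System.currentTimeMillis()
  if (now - lastCompactionMs > 24L * 60 * 60 * 1000) {
    spark.read.format("delta").load(tablePath)
      .repartition(16)                     // placeholder target file count
      .write.format("delta")
      .option("dataChange", "false")       // compaction only, no logical change
      .mode("overwrite")
      .save(tablePath)
    lastCompactionMs = now
  }
}

streamingDF.writeStream
  .foreachBatch(upsertAndMaybeCompact _)
  .outputMode("update")
  .option("checkpointLocation", "/delta/checkpoints/my-table")   // placeholder
  .start()

For a partitioned table you would typically compact only recent partitions (for example with the replaceWhere write option) rather than rewriting the whole table in every compaction.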

Vacuum does not modify current data in the table, hence it cannot conflict with any concurrent operation. 
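
For example, a separate scheduled job could simply call something like this (placeholder path, retention left at the default):

io.delta.tables.DeltaTable.forPath(spark, "/delta/my-table").vacuum()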

TD



Gourav Sengupta

Apr 9, 2021, 4:34:53 AM
to Tathagata Das, Delta Lake Users and Developers
Hi TD,
Thank you so much for your kind response. Can I please draw your attention to another detail?
We end up with around 100 GB of compressed Parquet data in each partition, and compacting that much data needs a larger cluster than we want to keep running for a continuous streaming job. So my streaming job runs on a c.xlarge cluster and my compaction job on a c5d.4xlarge.

While I could certainly run the compaction job on the same cluster as the streaming job, doing so may not end up being cost optimal.

But let me give your suggestion a try and see how things go.


Regards,
Gourav Sengupta

Zohar Meir

May 26, 2021, 11:57:09 AM
to Delta Lake Users and Developers
FYI, if the merge operations in your pipeline are insert-only (no updates), this PR might interest you: https://github.com/delta-io/delta/pull/626

We're using a local build of Delta with this feature added, and we can run our streaming pipeline concurrently with a scheduled compaction job without any conflicts.
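
For reference, an insert-only merge in that sense is one with only a whenNotMatched clause, roughly like this (placeholder path and join key; newDataDF stands for the incoming batch):

import io.delta.tables.DeltaTable

DeltaTable.forPath(spark, "/delta/my-table").as("t")
  .merge(newDataDF.as("s"), "t.id = s.id")
  .whenNotMatched().insertAll()   // insert-only: no whenMatched update clause
  .execute()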
