Getting ConcurrentAppendException with S3 (multi-cluster option)


rameshkumar....@gmail.com

Apr 25, 2024, 2:43:10 PM
to Delta Lake Users and Developers
Hi All,

I am evaluating the S3 multi-cluster option with the following versions:

hadoop-aws:3.3.6,

delta-spark_2.12:3.0.0

delta-storage-s3-dynamodb:3.0.0

I still encounter a ConcurrentAppendException when performing merge operations in two different clusters.

I can see the entry in DynamoDB and the .tmp file in _delta_log.

delta.exceptions.ConcurrentAppendException: Files were added to partition [data_date=2023-09-14] by a concurrent update. Please try the operation again. Conflicting commit:

I followed the steps as mentioned on the page.

https://docs.delta.io/latest/delta-storage.html#requirements-s3-multi-cluster 
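For readers following along, the setup on that page comes down to a handful of Spark settings. The sketch below only builds the key/value pairs, so it runs without a cluster. The helper name `multi_cluster_conf` and the example table name and region are assumptions for illustration, not values from this thread.

```python
def multi_cluster_conf(ddb_table, ddb_region, schemes=("s3a",)):
    """Build Spark conf entries for the S3 multi-cluster (DynamoDB) LogStore.

    ddb_table and ddb_region are placeholders supplied by the caller.
    Every cluster writing to the table should use the same values.
    """
    conf = {}
    for scheme in schemes:
        # Route the given URI scheme (s3a, s3, s3n) through the DynamoDB-backed LogStore.
        conf[f"spark.delta.logStore.{scheme}.impl"] = "io.delta.storage.S3DynamoDBLogStore"
    conf["spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName"] = ddb_table
    conf["spark.io.delta.storage.S3DynamoDBLogStore.ddb.region"] = ddb_region
    return conf


# Hypothetical values, for illustration only.
example_conf = multi_cluster_conf("delta_log", "us-east-1")
```

Each pair would then be passed to the session builder, for example with `SparkSession.builder.config(key, value)`, alongside the `delta-storage-s3-dynamodb` package listed above.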

Has anyone successfully set up a multi-cluster? Do you have any idea how to resolve the issue I am facing?

Thanks,

Rameshkumar S

Burak Yavuz

Apr 25, 2024, 2:46:52 PM
to rameshkumar....@gmail.com, Delta Lake Users and Developers
The error you're receiving shows that it is working! You're getting a commit conflict: your concurrent MERGE operations are updating data in the same partition, so one of them needs to retry after the other succeeds.
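If the two merges are actually meant to touch different partitions, one commonly suggested way to reduce such conflicts is to pin the target partition explicitly in the merge condition, so that each transaction only reads the partition it needs. This does not help when both jobs genuinely write to the same partition, as seems to be the case here. A minimal sketch, assuming a hypothetical helper `partition_scoped_condition`, a join condition `s.id = t.id`, and a target alias `t`:

```python
def partition_scoped_condition(join_condition, partition_column, partition_value, target_alias="t"):
    """Append an explicit partition predicate on the target table to a merge condition."""
    # Escape single quotes so the value stays a valid SQL string literal.
    literal = str(partition_value).replace("'", "''")
    return f"({join_condition}) AND {target_alias}.{partition_column} = '{literal}'"


# Hypothetical join condition; the partition column and value are taken from the error message.
condition = partition_scoped_condition("s.id = t.id", "data_date", "2023-09-14")
```

The resulting string would be passed as the condition argument of the merge on the target table.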

Best,

Burak Yavuz

Software Engineer

Databricks Inc.

bu...@databricks.com

databricks.com




rameshkumar....@gmail.com

Apr 25, 2024, 3:13:23 PM
to Delta Lake Users and Developers
Hi Burak Yavuz,

Do you think I should try the merge operation again if I encounter the exception? 

The retry can also be performed in single-cluster mode, but what benefits does the multi-cluster option have?

Thanks,
Rameshkumar S

Burak Yavuz

Apr 29, 2024, 1:16:54 PM
to rameshkumar....@gmail.com, Delta Lake Users and Developers
Yes, you should just retry the merge operation when you hit this exception, and do the same for any ConcurrentTransactionException you may encounter.
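A retry along these lines could be sketched as below. The helper `retry_on_conflict` and its parameters are assumptions for illustration, not part of Delta Lake. It is written against plain Python exceptions so it runs without a cluster.

```python
import random
import time


def retry_on_conflict(operation, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry it when it raises one of the retry_on exceptions.

    operation should rebuild and execute the whole merge on every call,
    so each attempt runs against the latest table version.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Give up and surface the last conflict to the caller.
            # Exponential backoff with jitter, so two clusters do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

With delta-spark in Python, `retry_on` could be a tuple such as `(ConcurrentAppendException, ConcurrentTransactionException)` imported from `delta.exceptions`, and `operation` a function that runs the merge from start to finish.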

> The retry can also be performed in single-cluster mode, but what benefits does the multi-cluster option have?

Multi-cluster mode is for when you can't orchestrate these operations on the same cluster. Imagine you have one cluster that is performing streaming ingest to your Delta tables continuously. Then you may have a separate cluster that is performing compactions on your data to improve query performance. Then you may have another cluster performing GDPR deletes every 30 days.

Multi-cluster mode isn't necessarily about performance benefits; it's more about orchestration simplicity.

Best,

Burak Yavuz

Software Engineer

Databricks Inc.

bu...@databricks.com

databricks.com

