Delta table disaster recovery

Guru V

unread,

Jan 21, 2021, 6:44:49 AM1/21/21

to Delta Lake Users and Developers

Hi,

I am using Azure databricks delta table on Azure adls gen2 blob storage and based on recommendation on msdn, we should use data replication feature of blob storage. I have few questions for that recommendation

delta table clone is not recommended/appropriate for disaster recovery.
how can i have Active-Active, when for blob storage secondary region is only read-only. Or active-active is not recommended.

Thanks

Burak Yavuz

unread,

Jan 21, 2021, 1:43:56 PM1/21/21

to Guru V, Delta Lake Users and Developers

Hi Guru,

I would actually prefer using CLONE over data replication, because with Delta you need "controlled replication", essentially, data files need to be copied over before files in the transaction log. A data replication feature that doesn't know about these requirements may leave your table in an unusable state in a different region.

Best,

Burak Yavuz

Software Engineer

Databricks Inc.

bu...@databricks.com

databricks.com

--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/75fc3d3e-a55e-464d-b15b-4eca642c4d95n%40googlegroups.com.

Guru V

unread,

Jan 21, 2021, 2:06:10 PM1/21/21

to Delta Lake Users and Developers

HI Burak,

is there any documentation on strategy/approach to use Clone for disaster recovery? my understanding was clone is for one time copy of data (for specific version).

So if i have use Clone for DR, i have to write some job to run clone on regular schedule to continuously push data to secondary.

the Databricks Doc (Table utility commands — Databricks Documentation) doesn't mention DR as a possible scenario for Clone, but list only one time copy of data like scenarios for Clone.

Thanks

Gourav Sengupta

unread,

Jan 22, 2021, 3:51:20 AM1/22/21

to Guru V, Delta Lake Users and Developers

Hi Burak,

I am not entirely sure whether this is over kill or not but can we not use the streaming option in spark using delta as source for replication to create another active instance?

Regards

Gourav

To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/b2d425ac-78f3-400d-a075-4c9570f2dc11n%40googlegroups.com.

R. Tyler Croy

unread,

Jan 22, 2021, 12:29:14 PM1/22/21

to Guru V, Delta Lake Users and Developers

(replies inline)

On Thu, 21 Jan 2021, Guru V wrote:

> I am using Azure databricks delta table on Azure adls gen2 blob storage and
> based on recommendation on msdn, we should use data replication feature of blob
> storage. I have few questions for that recommendation
>

> 1. delta table clone is not recommended/appropriate for disaster recovery.
> 2. how can i have Active-Active, when for blob storage secondary region is

> only read-only. Or active-active is not recommended.

Greetings! I have also been keenly interested in this topic, and discussed it
with some folks at Databricks. Our requirements are basically to prevent
catastrophic loss of business critical data via:

* Erroneous rewriting of data by an automated job
* Inadvertent table drops through metastore automation.
* Overaggressive use of VACUUM command
* Failed manual sync/cleanup operations by
Data Engineering staff

It's important to consider whether you're worried about the transaction log
getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

We don't have a solution in place yet, but I believe the best way to think
about the problem is through the lens of what "restore" or "recovery" looks
like, and what data loss is acceptable.

For example, with a simple nightly rclone(.org) based snapshot of an S3 bucket, the
"restore" might mean copying the transction log and new parquet files _back_ to
the originating S3 bucket and *losing* up to 24 hours of data, since the
transaction logs would basically be rewound to the last backup point.

The more complex scenarios we have imagined could involve "replaying" or
"undoing" transactions, along with restoring parquet files to the window of
acceptable data loss. I have no thoughts right now on how we would safely
accomplish "fixing" a transaction log with transactions from backups, but this
is one end of the extreme we're considering :)

Hope you don't mind me sharing some thoughts, despite not being able to offer
any solutions at this point :)

Toodles

--
GitHub: https://github.com/rtyler

GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2

signature.asc

Gourav Sengupta

unread,

Jan 22, 2021, 2:32:14 PM1/22/21

to R. Tyler Croy, Guru V, Delta Lake Users and Developers

Dear Tyler,

for the first time (FIRST TIME EVER) ever in this group, as far as I remember, someone is actually stating the design problem/ goal rather than directly jumping to a technical solution and trying to make that work. Your email is absolutely fantastic :)

There are a few other conditions to consider along with the ones you have so correctly mentioned but the central ideas are based on understanding:

1. how does the business use the data, and

2. how to calculate the percentage of recoverability of data.

Making anything 100% available or recoverable gets mostly prohibitively expensive.

For a specific use case I use the following:

1. use the main source table with latest version of the record as delta streaming source

2. from the streaming source generate two tables:

1. TABLE 1: reconciliation table which does aggregates and finds out errors if errors are there then we update a table with the record identifier (PK)

2. TABLE 2: SCD III (append only table with 1 month glacier transition policy in S3 and 6 months deletions)

3. in case there are issues identified later on we audit the changes and roll changes back.

For another use case I just compare two different versions of data in delta and either rollback few records or not.

Metadata corruption has occurred in the past, but since we have backup in SCD therefore restoration is not complicated. Overall using a delta table as streaming input is the best design that Databricks could come out with.

But these are some specific use cases based on bespoke requirements and may not apply to you. Also they might not be the smartest ones as well - but they do cover almost all the failure scenarios to the point of 99% recoverability. If they sound stupid and unacceptable please do let me know.

Caveat: just to be sure none of the tables I work with have more than one pipeline writing to to them at any point in time.

Regards,

Gourav Sengupta

--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/20210122172902.vosx454nnxertgoy%40grape.

Guru V

unread,

Jan 27, 2021, 1:11:51 AM1/27/21

to Delta Lake Users and Developers

In my case there are 20+ tables to have disaster recovery and new developers can add new table. for every existing/new table it would be overhead to maintain streaming option to replicate data.

For active-passive scenario, are there any recommendation on approach to use to have secondary workspace as passive(no scheduled jobs execute). Presently every job code has logic to check for a config flag to see if it's active/passive. this config flag is updated by admin and error prone (if both flag are set to active).

Thanks

Paul Bauer

unread,

Apr 26, 2022, 7:08:20 PM4/26/22

to Delta Lake Users and Developers

Just curious if anyone is aware of any further best practices or guidance around DRP in Delta Lake since this conversation was posted? I'm trying to come up with a DRP for our data going forward, and finding any backup/recovery guidance or planning has been tricky to say the least :-)

Paul Bauer

unread,

Apr 26, 2022, 7:10:48 PM4/26/22

to Delta Lake Users and Developers

Forgot to mention - ideally to guard against multiple DRP scenarios...ie. corruption, accidental deletion, systems being compromised, etc. But any guidance is better than nothing :-)

Denny Lee

unread,

Apr 26, 2022, 11:43:02 PM4/26/22

to Paul Bauer, Delta Lake Users and Developers

Hi Paul,

I definitely would like others to chime in here as well but I did want to point out that one potential solution (as Burak noted earlier in the thread) is to use CLONE for disaster recovery purposes. The blog post (shameless plug here as I'm one of the co-authors) Attack of the Delta Clones (Against Disaster Recovery Availability Complexity) covers one of these scenarios.

HTH!

Denny

To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/7c2c1b44-cae3-4e1f-b1ec-97ad6a0f4606n%40googlegroups.com.

Paul Bauer

unread,

Apr 27, 2022, 12:03:12 AM4/27/22

to Delta Lake Users and Developers

Thanks Denny! Yours was actually the main blog I *did* find :-). I should have mentioned though, I'm using the OSS (non-databricks) version of Delta Lake - and from what I understand, clone functionality is coming later this year? Or does the open source version also have clone? (I'm not yet on 1.2)

Denny Lee

unread,

Apr 27, 2022, 12:09:29 AM4/27/22

to Paul Bauer, Delta Lake Users and Developers

Oh cool :). You are correct, cloning is coming later this year, here's the current roadmap https://github.com/delta-io/delta/issues/920 and clone is on it. HTH!

To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/29ef9e50-dc54-454b-ab0c-7ff90988b689n%40googlegroups.com.

Bauer, Paul

unread,

Jun 30, 2022, 6:02:00 PM6/30/22

to Denny Lee, Delta Lake Users and Developers

Hi Denny,

In light of this weeks announcement of the v2.0 delta lake preview, just wondering if you know of any more concrete timelines for adding clone into the OSS offering? I noticed the roadmap github issue seems to list it as a Q3 priority, but not sure if that means you’re targeting v2.0 GA, or a later release?

Thanks!

Paul

From: Denny Lee <denny...@gmail.com>
Date: Wednesday, 27 April 2022 at 14:09
To: Bauer, Paul <paul....@team.telstra.com>
Cc: Delta Lake Users and Developers <delta...@googlegroups.com>
Subject: Re: Delta table disaster recovery

You don't often get email from denny...@gmail.com. Learn why this is important

[External Email] This email was sent from outside the organisation – be cautious, particularly with links and attachments.

Denny Lee

unread,

Jul 11, 2022, 6:31:56 PM7/11/22

to Bauer, Paul, Delta Lake Users and Developers

Hi Paul,

We will send out a list for us to prioritize together the full list of features in the next week or so.