Delta table disaster recovery


Guru V

Jan 21, 2021, 6:44:49 AM
to Delta Lake Users and Developers
Hi,
I am using Azure Databricks Delta tables on Azure ADLS Gen2 blob storage, and based on a recommendation on MSDN, we should use the data replication feature of blob storage. I have a few questions about that recommendation:
  1. Delta table clone is not recommended/appropriate for disaster recovery.
  2. How can I have active-active when the blob storage secondary region is read-only? Or is active-active not recommended?

Thanks

Burak Yavuz

Jan 21, 2021, 1:43:56 PM
to Guru V, Delta Lake Users and Developers
Hi Guru,

I would actually prefer using CLONE over data replication, because with Delta you need "controlled replication": data files need to be copied over before the files in the transaction log. A data replication feature that doesn't know about this ordering may leave your table in an unusable state in the other region.

Best,

Burak Yavuz

Software Engineer

Databricks Inc.

bu...@databricks.com

databricks.com



--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/75fc3d3e-a55e-464d-b15b-4eca642c4d95n%40googlegroups.com.

Guru V

Jan 21, 2021, 2:06:10 PM
to Delta Lake Users and Developers
Hi Burak,

Is there any documentation on a strategy/approach for using CLONE for disaster recovery? My understanding was that clone is a one-time copy of data (for a specific version). So if I use CLONE for DR, I have to write a job that runs the clone on a regular schedule to continuously push data to the secondary region.

The Databricks doc (Table utility commands — Databricks Documentation) doesn't mention DR as a possible scenario for CLONE; it only lists one-time-copy-style scenarios.
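For what it's worth, re-running DEEP CLONE against an existing clone is documented as incremental (only new or changed files are copied), so a scheduled per-table clone job is a workable sync loop. A minimal sketch of building those statements (the table and storage names below are hypothetical; on Databricks each statement would be executed via `spark.sql`):

```python
def deep_clone_stmt(table: str, target_location: str) -> str:
    """Build the SQL a scheduled DR job would run for one table.
    Re-running DEEP CLONE against an existing clone copies only the
    incremental changes, giving periodic sync to the secondary.
    Table and location names are placeholders."""
    return (
        f"CREATE OR REPLACE TABLE {table}_dr "
        f"DEEP CLONE {table} "
        f"LOCATION '{target_location}'"
    )

# One statement per table; a scheduled job would loop over this dict
# and run spark.sql(stmt) for each entry.
tables = {
    "sales_orders": "abfss://dr@secondary.dfs.core.windows.net/sales_orders",
}
stmts = [deep_clone_stmt(t, loc) for t, loc in tables.items()]
```

Driving the loop from a table registry (rather than hand-written jobs) also addresses the case where new tables are added later: registering the table is enough to include it in the DR schedule.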

Thanks

Gourav Sengupta

Jan 22, 2021, 3:51:20 AM
to Guru V, Delta Lake Users and Developers
Hi Burak,
I am not entirely sure whether this is overkill, but could we not use Spark Structured Streaming with the Delta table as a source to replicate to another active instance?
Regards 
Gourav 

R. Tyler Croy

Jan 22, 2021, 12:29:14 PM
to Guru V, Delta Lake Users and Developers
(replies inline)

On Thu, 21 Jan 2021, Guru V wrote:

> I am using Azure databricks delta table on Azure adls gen2 blob storage and
> based on recommendation on msdn, we should use data replication feature of blob
> storage. I have few questions for that recommendation
>
> 1. delta table clone is not recommended/appropriate for disaster recovery.
> 2. how can i have Active-Active, when for blob storage secondary region is
> only read-only. Or active-active is not recommended.


Greetings! I have also been keenly interested in this topic, and discussed it
with some folks at Databricks. Our requirements are basically to prevent
catastrophic loss of business critical data via:

* Erroneous rewriting of data by an automated job
* Inadvertent table drops through metastore automation
* Overaggressive use of the VACUUM command
* Failed manual sync/cleanup operations by Data Engineering staff

It's important to consider whether you're worried about the transaction log
getting corrupted, files in storage (e.g. ADLS) disappearing, or both.

We don't have a solution in place yet, but I believe the best way to think
about the problem is through the lens of what "restore" or "recovery" looks
like, and what data loss is acceptable.

For example, with a simple nightly rclone (rclone.org) based snapshot of an S3 bucket, "restore" might mean copying the transaction log and new parquet files _back_ to the originating S3 bucket and *losing* up to 24 hours of data, since the transaction log would basically be rewound to the last backup point.

The more complex scenarios we have imagined could involve "replaying" or
"undoing" transactions, along with restoring parquet files to the window of
acceptable data loss. I have no thoughts right now on how we would safely
accomplish "fixing" a transaction log with transactions from backups, but this
is one end of the extreme we're considering :)



Hope you don't mind me sharing some thoughts, despite not being able to offer
any solutions at this point :)



Toodles

--
GitHub: https://github.com/rtyler

GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2

Gourav Sengupta

Jan 22, 2021, 2:32:14 PM
to R. Tyler Croy, Guru V, Delta Lake Users and Developers
Dear Tyler,

For the first time ever in this group, as far as I remember, someone is actually stating the design problem/goal rather than jumping straight to a technical solution and trying to make it work. Your email is absolutely fantastic :)

There are a few other conditions to consider along with the ones you have so correctly mentioned, but the central ideas are based on understanding:
1. how the business uses the data, and
2. how to calculate the percentage of recoverability of the data.
Making anything 100% available or recoverable is mostly prohibitively expensive.

For a specific use case I use the following:
1. use the main source table, with the latest version of each record, as a Delta streaming source
2. from the streaming source generate two tables:
    1. TABLE 1: a reconciliation table that computes aggregates and finds errors; if there are errors, we update a table with the record identifier (PK)
    2. TABLE 2: SCD Type III (an append-only table with a 1-month Glacier transition policy in S3 and deletion after 6 months)
3. in case issues are identified later on, we audit the changes and roll them back.

For another use case I just compare two different versions of the data in Delta and either roll back a few records or not.

Metadata corruption has occurred in the past, but since we have a backup in the SCD table, restoration is not complicated. Overall, using a Delta table as a streaming input is the best design Databricks could have come up with.

These are specific use cases based on bespoke requirements and may not apply to you. They might not be the smartest approaches either, but they cover almost all the failure scenarios to the point of 99% recoverability. If they sound stupid or unacceptable, please do let me know.

Caveat: just to be sure, none of the tables I work with has more than one pipeline writing to it at any point in time.

Regards,
Gourav Sengupta



Guru V

Jan 27, 2021, 1:11:51 AM
to Delta Lake Users and Developers
In my case there are 20+ tables that need disaster recovery, and new developers can add new tables. Maintaining streaming replication for every existing/new table would be an overhead.

For the active-passive scenario, are there any recommendations on keeping the secondary workspace passive (no scheduled jobs executing)? Presently every job's code checks a config flag to see whether it is active or passive. This flag is updated by an admin and is error-prone (e.g. if both flags are set to active).

Thanks
