On Thu, 21 Jan 2021, Guru V wrote:
> I am using Azure Databricks Delta tables on Azure ADLS Gen2 blob storage, and
> based on the recommendation on MSDN, we should use the data replication
> feature of blob storage. I have a few questions about that recommendation:
> 1. Delta table clone is not recommended/appropriate for disaster recovery?
> 2. How can I have active-active when the blob storage secondary region is
> read-only? Or is active-active not recommended?
Greetings! I have also been keenly interested in this topic, and discussed it
with some folks at Databricks. Our requirements are basically to prevent
catastrophic loss of business-critical data caused by:
* Erroneous rewriting of data by an automated job
* Inadvertent table drops through metastore automation
* Overly aggressive use of the VACUUM command (see the sketch after this list)
* Failed manual sync/cleanup operations by Data Engineering staff
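On the VACUUM point, the usual footgun is disabling Delta's retention check
and then vacuuming with a short window. A minimal sketch of the guard we have
in mind (PySpark on Databricks; the table path is hypothetical):

    # Keep Delta's safety check enabled: it refuses retention windows shorter
    # than the table's deletedFileRetentionDuration (default 7 days).
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

    # Retain 30 days (720 hours) of history so older snapshots remain
    # time-travelable instead of being physically deleted.
    spark.sql("VACUUM delta.`/mnt/lake/sales` RETAIN 720 HOURS")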
It's important to consider whether you're worried about the transaction log
getting corrupted, files in storage (e.g. ADLS) disappearing, or both.
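The two failure domains are easy to see from a notebook (paths here are
hypothetical); a Delta table is just Parquet data files plus a JSON
transaction log, and either half can be lost or corrupted independently:

    # Data files: the actual table contents.
    display(dbutils.fs.ls("/mnt/lake/sales"))             # part-*.snappy.parquet
    # Transaction log: ordered JSON commit files.
    display(dbutils.fs.ls("/mnt/lake/sales/_delta_log"))  # 00000000000000000000.json, ...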
We don't have a solution in place yet, but I believe the best way to think
about the problem is through the lens of what "restore" or "recovery" looks
like, and what data loss is acceptable.
For example, with a simple nightly rclone (rclone.org) snapshot of an S3
bucket, "restore" might mean copying the transaction log and new Parquet files
_back_ to the originating S3 bucket and *losing* up to 24 hours of data, since
the transaction log would effectively be rewound to the last backup point.
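Driven from Python, and with made-up remote and bucket names, that scheme
looks roughly like:

    import datetime
    import subprocess

    table = "tables/sales"  # hypothetical table prefix
    snap = datetime.date.today().isoformat()

    # Nightly snapshot: mirror the live table (log + Parquet) to a dated prefix.
    subprocess.run(
        ["rclone", "sync",
         f"s3:prod-bucket/{table}",
         f"s3:backup-bucket/{snap}/{table}"],
        check=True)

    # Restore: copy a chosen snapshot back over the live table, accepting
    # up to 24 hours of data loss.
    subprocess.run(
        ["rclone", "sync",
         f"s3:backup-bucket/2021-01-20/{table}",
         f"s3:prod-bucket/{table}"],
        check=True)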
The more complex scenarios we have imagined could involve "replaying" or
"undoing" transactions, along with restoring parquet files to the window of
acceptable data loss. I have no thoughts right now on how we would safely
accomplish "fixing" a transaction log with transactions from backups, but this
is one end of the extreme we're considering :)
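To make "replaying" slightly more concrete: each commit in _delta_log is a
file of JSON-line actions, so in principle a repair tool could walk them. A
toy inspection of one commit (the file name is hypothetical):

    import json

    # Each line of a Delta commit file is one action: add, remove,
    # commitInfo, metaData, or protocol.
    with open("00000000000000000042.json") as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                print("ADD   ", action["add"]["path"])
            elif "remove" in action:
                print("REMOVE", action["remove"]["path"])

Actually rebuilding a consistent log out of such actions is the hard, unsolved
part for us.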
Hope you don't mind me sharing some thoughts, despite not being able to offer
any solutions at this point :)