Delta Lake is an amazing product, and we have been using it to serve millions of customers all over the world.
To meet our service requirements, we have made several changes to Delta Lake, and we would now like to contribute them back to the community.
Our changes may not align with Delta Lake's direction and design principles, but we would like to describe our scenarios and why we need these changes, so that we can discuss how to adapt them to fit your design principles and direction.
Scenarios:
We are using Spark 3.3 and Delta Lake 2.3, and we only perform append operations with schema merging.
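For reference, our write path is a plain append with schema merging, roughly like the sketch below (the paths and names are placeholders, not our real ones):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("append-with-merge").getOrCreate()

// Placeholders for our real ADLS Gen2 path and input batch.
val tablePath = "abfss://data@account.dfs.core.windows.net/events"
val batch: DataFrame = spark.read.parquet("/staging/incoming")

// Append-only write with schema merging enabled, as described above.
batch.write
  .format("delta")
  .option("mergeSchema", "true") // merge new (possibly nested) columns into the table schema
  .mode("append")
  .save(tablePath)
```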
a. We use a Spark Structured Streaming job to write to ADLS Gen2 storage. Because Delta Lake reconstructs table state after every write, this introduces high latency and our streaming job cannot meet its SLA. We therefore added a config to control whether state reconstruction is skipped after a write; for checkpointing we always perform a full state reconstruction.
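For illustration, the knob looks roughly like this; the config name below is a placeholder for our internal setting, not an existing upstream Delta Lake option:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("low-latency-append").getOrCreate()

// Placeholder name for our internal flag; not an upstream Delta Lake config.
// When enabled, the commit path skips snapshot/state reconstruction after each
// write; checkpointing still performs a full state reconstruction.
spark.conf.set("spark.delta.custom.skipStateReconstructionAfterWrite", "true")

val input = spark.readStream.format("rate").load() // stand-in for our real source

input.writeStream
  .format("delta")
  .option("checkpointLocation", "abfss://chk@account.dfs.core.windows.net/events")
  .start("abfss://data@account.dfs.core.windows.net/events")
```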
b. For high availability, we store data in multiple storage accounts, say storages 1, 2, and 3. The streaming job writes batch 1 to storage 1, batch 2 to storage 2, batch 3 to storage 3, batch 4 to storage 1, and so on. Reads load all three storages' data and union them. In some cases the schemas differ between the three storages in deeply nested fields, so unionByName does not help; we therefore changed Delta Lake to accept a user-specified schema on read.
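A sketch of our read path: with our change, the schema passed to the reader is honored, so the three copies line up even when nested fields differ (stock Delta does not honor a user-specified schema on read; honoring it is the behavior we added). The schema shape and paths below are illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("multi-storage-read").getOrCreate()

// The schema every storage should be read as, including deeply nested fields
// that may be missing from older copies (illustrative shape).
val readSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StructType(Seq(
    StructField("source", StringType),
    StructField("details", StructType(Seq(
      StructField("region", StringType) // present in some storages only
    )))
  )))
))

val storagePaths = Seq(
  "abfss://data@account1.dfs.core.windows.net/events",
  "abfss://data@account2.dfs.core.windows.net/events",
  "abfss://data@account3.dfs.core.windows.net/events"
)

// Honoring .schema() on a Delta read is our change; stock Delta ignores it.
val unioned: DataFrame = storagePaths
  .map(p => spark.read.schema(readSchema).format("delta").load(p))
  .reduce(_ union _)
```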
c. During state reconstruction, listing the delta log should only list delta files created after the last checkpoint, but underneath it still lists from the beginning, which also slows down reads and writes. We changed it to use the ADLS Gen2 'listFrom' API.
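To make the problem concrete, the stock Hadoop-filesystem log store does roughly the following: a full directory listing followed by a client-side filter, so cost grows with the total log size rather than with the number of files since the last checkpoint. (This is a paraphrase for illustration, not the exact upstream code.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Paraphrase of the stock behavior: list the entire _delta_log directory and
// filter client-side. On a large log this pages through every entry since
// version 0, even when only files after the last checkpoint are needed.
def listFromClientSide(startFile: Path, conf: Configuration): Seq[FileStatus] = {
  val logDir = startFile.getParent
  val fs = logDir.getFileSystem(conf)
  fs.listStatus(logDir)
    .filter(_.getPath.getName >= startFile.getName) // lexicographic filter after full listing
    .sortBy(_.getPath.getName)
    .toSeq
}
```

Our change replaces the full listing with one that starts at the checkpoint file name on the server side, so only the tail of the log is enumerated.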
d. We also implemented our own changed-data query that skips state reconstruction and reads the delta log directly to get the files added within a time range, since we never perform update or delete operations.
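A rough sketch of the idea (not our exact implementation): because the table is append-only, the commit JSON files can be scanned directly for `add` actions within the requested time window, with no snapshot reconstruction. The path and timestamps are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("added-files-in-range").getOrCreate()

val tablePath = "abfss://data@account.dfs.core.windows.net/events" // placeholder
val (startTs, endTs) = (1672531200000L, 1672617600000L)            // example epoch millis

// Scan commit files directly for `add` actions in the window. Safe only
// because the table is append-only (no remove/update actions to reconcile).
val addedFiles = spark.read
  .json(s"$tablePath/_delta_log/*.json")
  .where(col("add").isNotNull &&
    col("add.modificationTime").between(startTs, endTs))
  .select(col("add.path").as("path"), col("add.modificationTime").as("modificationTime"))
```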
We would like to know what, in your view, is the best way to handle these scenarios, and how we can merge these changes.
Looking forward to your reply.
Thanks