Delta Lake is an amazing product, and we have been using it to serve millions of customers all over the world.
To meet our service requirements, we have made several changes to Delta Lake, and we would now like to contribute them back to the community.
Our changes may not align with Delta Lake's direction and design principles, but we would like to describe our scenarios and why we need these changes, so that we can discuss how to adapt them to fit your design principles and direction.
Scenarios:
We are using Spark 3.3 and Delta Lake 2.3, and we only perform append operations with schema merging.
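For reference, our write path is a plain append with schema merging, roughly like the sketch below (the paths and names are placeholders, not our real ones):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("append-with-merge").getOrCreate()

// Placeholders for our real ADLS Gen2 path and input batch.
val tablePath = "abfss://data@account.dfs.core.windows.net/events"
val batch: DataFrame = spark.read.parquet("/staging/incoming")

// Append-only write with schema merging enabled, as described above.
batch.write
  .format("delta")
  .option("mergeSchema", "true") // merge new (possibly nested) columns into the table schema
  .mode("append")
  .save(tablePath)
```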
a. We use a Spark Structured Streaming job to write to ADLS Gen2 storage. Because Delta Lake reconstructs table state after every write, this introduces high latency and our streaming job cannot meet its SLA. We therefore added a config to control whether state reconstruction is skipped after a write; for checkpointing we always perform a full state reconstruction.
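For illustration, the knob looks roughly like this; the config name below is a placeholder for our internal setting, not an existing upstream Delta Lake option:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("low-latency-append").getOrCreate()

// Placeholder name for our internal flag; not an upstream Delta Lake config.
// When enabled, the commit path skips snapshot/state reconstruction after each
// write; checkpointing still performs a full state reconstruction.
spark.conf.set("spark.delta.custom.skipStateReconstructionAfterWrite", "true")

val input = spark.readStream.format("rate").load() // stand-in for our real source

input.writeStream
  .format("delta")
  .option("checkpointLocation", "abfss://chk@account.dfs.core.windows.net/events")
  .start("abfss://data@account.dfs.core.windows.net/events")
```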
b. For high availability, we store data in multiple storage accounts, say storages 1, 2, and 3. The streaming job writes batch 1 to storage 1, batch 2 to storage 2, batch 3 to storage 3, batch 4 to storage 1, and so on. Reads load all three storages' data and union them. In some cases the schemas differ between the three storages in deeply nested fields, so unionByName does not help; we therefore changed Delta Lake to accept a user-specified schema on read.
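A sketch of our read path: with our change, the schema passed to the reader is honored, so the three copies line up even when nested fields differ (stock Delta does not honor a user-specified schema on read; honoring it is the behavior we added). The schema shape and paths below are illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("multi-storage-read").getOrCreate()

// The schema every storage should be read as, including deeply nested fields
// that may be missing from older copies (illustrative shape).
val readSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StructType(Seq(
    StructField("source", StringType),
    StructField("details", StructType(Seq(
      StructField("region", StringType) // present in some storages only
    )))
  )))
))

val storagePaths = Seq(
  "abfss://data@account1.dfs.core.windows.net/events",
  "abfss://data@account2.dfs.core.windows.net/events",
  "abfss://data@account3.dfs.core.windows.net/events"
)

// Honoring .schema() on a Delta read is our change; stock Delta ignores it.
val unioned: DataFrame = storagePaths
  .map(p => spark.read.schema(readSchema).format("delta").load(p))
  .reduce(_ union _)
```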
c. During state reconstruction, listing the delta log should only list delta files created after the last checkpoint, but underneath it still lists from the beginning, which also slows down reads and writes. We changed it to use the ADLS Gen2 'listFrom' API.
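To make the problem concrete, the stock Hadoop-filesystem log store does roughly the following: a full directory listing followed by a client-side filter, so cost grows with the total log size rather than with the number of files since the last checkpoint. (This is a paraphrase for illustration, not the exact upstream code.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Paraphrase of the stock behavior: list the entire _delta_log directory and
// filter client-side. On a large log this pages through every entry since
// version 0, even when only files after the last checkpoint are needed.
def listFromClientSide(startFile: Path, conf: Configuration): Seq[FileStatus] = {
  val logDir = startFile.getParent
  val fs = logDir.getFileSystem(conf)
  fs.listStatus(logDir)
    .filter(_.getPath.getName >= startFile.getName) // lexicographic filter after full listing
    .sortBy(_.getPath.getName)
    .toSeq
}
```

Our change replaces the full listing with one that starts at the checkpoint file name on the server side, so only the tail of the log is enumerated.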
d. We also implemented our own changed-data query that skips state reconstruction and reads the delta log directly to get the files added within a time range, since we never perform update or delete operations.
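A rough sketch of the idea (not our exact implementation): because the table is append-only, the commit JSON files can be scanned directly for `add` actions within the requested time window, with no snapshot reconstruction. The path and timestamps are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("added-files-in-range").getOrCreate()

val tablePath = "abfss://data@account.dfs.core.windows.net/events" // placeholder
val (startTs, endTs) = (1672531200000L, 1672617600000L)            // example epoch millis

// Scan commit files directly for `add` actions in the window. Safe only
// because the table is append-only (no remove/update actions to reconcile).
val addedFiles = spark.read
  .json(s"$tablePath/_delta_log/*.json")
  .where(col("add").isNotNull &&
    col("add.modificationTime").between(startTs, endTs))
  .select(col("add.path").as("path"), col("add.modificationTime").as("modificationTime"))
```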
We would like to know what, in your view, is the best way to handle these scenarios, and how we can merge these changes.
Looking forward to your reply.
Thanks