Delta Lake 4.0.0 is released!

Allison Portis

Jun 9, 2025, 12:24:59 PM
to delta...@googlegroups.com

We are excited to announce the final release of Delta Lake 4.0.0! This release includes several exciting new features.

Highlights

  • [Spark] Preview support for catalog-managed tables, a new table feature that transforms Delta Lake into a catalog-oriented lakehouse table format. This feature is still in the RFC stage, and as such, the protocol is still under development and is subject to change.
  • [Spark] Delta Connect is an extension for Spark Connect which enables the usage of Delta over Spark Connect, allowing Delta to be used with the decoupled client-server architecture of Spark Connect.
  • [Spark] Support for the Variant data type to enable semi-structured storage and data processing, for flexibility and performance.
  • [Spark] Support a new DROP FEATURE implementation that allows dropping table features instantly without truncating history.
  • [Kernel] Support for reading and writing version checksum.
  • [Kernel] Support reading log compaction files for better performance during snapshot construction, and support writing log compaction files as a post commit hook.
  • [Kernel] Support for the Clustered Table feature which enables defining and updating the clustering columns on a table.
  • [Kernel] Support for writing to row tracking enabled tables.
  • [Kernel] Support for writing file statistics to the Delta log when they are provided by the engine. This enables data skipping using query filters at read time.

Details for each component are below.

Sunset of Delta Standalone and dependent connectors

Delta Standalone and its dependent connectors, including Delta Flink and Delta Hive, are no longer under active development. Starting with Delta 4.0, we will not release these projects as part of the 4.x Delta releases. These connectors are in maintenance mode and, going forward, will only receive critical security fixes and high-severity bug patches in the 3.x series. We are committed to a full transition from Delta Standalone to Delta Kernel and a future Kernel-based Flink connector.

Delta Spark

Delta Spark 4.0 is built on Apache Spark™ 4.0. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.

The key features of this release are:

  • Delta Connect adds Spark Connect support to the Scala and Python APIs of Delta Lake for Apache Spark. Spark Connect is a new project released in Apache Spark 4.0 that adds a decoupled client-server infrastructure, allowing remote connectivity to Spark from everywhere. Delta Connect makes the DeltaTable interfaces compatible with the new Spark Connect protocol. For more information on how to use Delta Connect, see the Delta Connect documentation. Delta Connect is currently in preview. A minimal connection sketch follows this list.
  • Preview support for catalog-managed tables: Delta Spark now supports reading from and writing to tables that have the catalogOwned-preview feature enabled. This feature allows a catalog to broker all commits to the table it manages, giving the catalog the control and visibility it needs to prevent invalid operations (e.g. commits that violate foreign key constraints) and enforce security and access controls, and opens the door for future performance optimizations. Currently, write support includes INSERT, MERGE INTO, UPDATE, and DELETE operations.
    • Note: this feature is still in the RFC stage, and as such, the protocol is still under development and is subject to change. The catalogOwned-preview feature should not be enabled for production tables and tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
  • Support for the Variant data type: The Variant data type is a new Apache Spark data type that enables flexible and efficient processing of semi-structured data without a user-specified schema. Variant data does not require a fixed schema on write; instead, it is queried using a schema-on-read approach, which allows flexible ingestion and enables faster processing with the Spark Variant binary encoding format. This feature was originally released in preview as part of the Delta 4.0.0 Preview; as of 4.0.0 it is no longer in preview. Please see the documentation and the example for more details. A short example follows this list.
  • Preview support for shredded variants: Shredded variants are a storage optimization that allows efficient sub-field extraction at the cost of higher write overhead, yielding up to 20x read performance improvements. Shredded Variant data is stored according to the Parquet Variant Shredding specification. See the variantShredding RFC for more details.
    • Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
  • Type Widening now supports a broader set of type changes and is no longer in preview. This feature allows you to change the data type of a column in your Delta table without rewriting the underlying data files. See the type widening documentation for a list of all supported type changes and additional information. Delta 3.3 or above is required to read tables with type widening enabled. A short example follows this list.
  • Support dropping table features without truncating history: The existing drop feature implementation requires executing the command twice, with a 24 hour waiting period in between, and it truncates the history of the Delta table to the last 24 hours. The new DROP FEATURE implementation allows dropping features instantly without truncating history. Dropping a feature introduces a new writer feature to the table, the checkpointProtection feature.
    • Dropping a feature with the new behavior can be achieved as follows:
    ALTER TABLE table_name DROP FEATURE feature_name
    • We can still drop a feature with the old behavior as follows:
    ALTER TABLE table_name DROP FEATURE feature_name TRUNCATE HISTORY
    • The checkpointProtection feature can be dropped with history truncation.
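
A minimal Delta Connect connection sketch in Scala is below. It assumes a Spark Connect server with Delta Connect enabled is reachable at sc://localhost:15002 and that a table named people already exists; see the Delta Connect documentation for the exact server and dependency setup.

  import org.apache.spark.sql.SparkSession
  import io.delta.tables.DeltaTable

  // Connect to a remote Spark Connect server (the address is illustrative).
  val spark = SparkSession.builder()
    .remote("sc://localhost:15002")
    .getOrCreate()

  // The familiar DeltaTable API is used as usual; operations are executed on
  // the server over the Spark Connect protocol.
  val people = DeltaTable.forName(spark, "people")
  people.toDF.show()
  people.delete("age < 0")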
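
As a quick illustration of the Variant type (assuming a SparkSession named spark; the table and field names are hypothetical), a semi-structured payload can be ingested without declaring its schema and queried with schema-on-read:

  // Create a Delta table with a VARIANT column and insert semi-structured data.
  spark.sql("CREATE TABLE events (id BIGINT, payload VARIANT) USING DELTA")
  spark.sql("""INSERT INTO events SELECT 1, PARSE_JSON('{"user": "alice", "clicks": 3}')""")

  // Schema-on-read: extract a sub-field and cast it at query time.
  spark.sql("SELECT id, VARIANT_GET(payload, '$.clicks', 'int') AS clicks FROM events").show()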
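
And a short type-widening sketch (the table and column names are hypothetical; see the type widening documentation for the full list of supported changes):

  // Enable type widening on an existing Delta table, then widen a column type
  // in place without rewriting the underlying Parquet files.
  spark.sql("ALTER TABLE metrics SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
  spark.sql("ALTER TABLE metrics ALTER COLUMN clicks TYPE BIGINT")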

Other notable changes include:

  • Support dropping table features using the DeltaTable Scala/Python APIs with deltaTable.dropFeatureSupport (see the sketch after this list).
  • Support dropping the deletionVector table feature.
  • Support DataFrameReader options to unblock non-additive schema changes when streaming.
  • Invariant checks for DML commands to detect potential bugs in Delta or Spark earlier during execution and prevent committing the transaction in these cases.
  • Support the timestampdiff and timestampadd expressions for generated columns.
  • Support sorting within partitions when Z-ordering. This can be enabled using the Spark conf spark.databricks.io.skipping.mdc.sortWithinPartitions (disabled by default) to improve data skipping at the Parquet level; a sketch follows this list.
  • Miscellaneous bug fixes
    • Fix UPDATE and MERGE to resolve struct fields by-name instead of by-position for structs nested inside map types during an update.
    • Fix to throw a better exception when null arguments are provided in CDF queries in both the SQL and DataFrame APIs.
    • Update the Python and Scala Delta Table MERGE APIs to return a DataFrame with the affected rows instead of Unit to align with SQL behavior.
    • Fix a bug in conflict resolution for Optimize to correctly consider DVs.
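
A short sketch of dropping a table feature through the DeltaTable Scala API (assuming a SparkSession named spark; the table and feature names are illustrative):

  import io.delta.tables.DeltaTable

  // Drop a previously enabled table feature via the API instead of SQL.
  DeltaTable.forName(spark, "events").dropFeatureSupport("deletionVectors")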
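
And a sketch of Z-ordering with sorting within partitions enabled (the table and column names are hypothetical):

  // Opt in to sorting within partitions when Z-ordering to improve
  // data skipping at the Parquet level (disabled by default).
  spark.conf.set("spark.databricks.io.skipping.mdc.sortWithinPartitions", "true")

  // Run OPTIMIZE with Z-ordering on the chosen column.
  DeltaTable.forName(spark, "events")
    .optimize()
    .executeZOrderBy("event_time")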

Delta Kernel Java

The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write to Delta tables without the need to understand the Delta protocol details.

The key features of this release are:

  • Support loading and writing version checksum in Java Kernel: Java Kernel now supports loading and writing a version checksum for every table commit via a post-commit hook. Detailed metrics like file counts, table size, and data distribution histograms bring stronger consistency guarantees and better debugging tools to your data ecosystem. The checksum is also used to bypass reading multiple log files to retrieve the protocol and metadata actions in the Java Kernel, resulting in decreased snapshot initialization latency.
  • Support reading log-compaction files when reading the delta log during log-replay. This provides a speedup for Snapshot construction and therefore benefits any processes that require creating a snapshot, like scanning or writing to a table.
  • Support writing log compaction files as a post commit hook. If the table is in a state that requires a compaction file be created, this hook will be returned from the transaction commit. Invoking the hook will build and write the compaction file. The interval between compactions can be set on the TransactionBuilder via TransactionBuilder.withLogCompactionInterval (shown in the Kernel sketch after this list).
  • Support the clustered table feature. This enables Kernel to define and update the clustering columns on a table, making clustering information available for Delta clustering implementations. Users can now use txnBuilder.withClusteringColumns to create a clustered table or update existing clustering columns (shown in the Kernel sketch after this list).
  • Support collecting MetricsReports in Kernel for major operations including snapshot construction, scanning, and transactions and reporting them to the engine. These reports include metadata about the operation at hand as well as metrics pertaining to the operation. Engines can integrate with this framework by creating MetricsReporters. In the default engine we have provided a basic logging reporter that serializes reports and logs them using Log4J.
  • Support writing to tables that have row tracking enabled. For such tables, Kernel assigns unique fresh row IDs and row commit versions to all committed rows. Kernel also resolves conflicts with conflicting transactions that assign overlapping row IDs and row commit versions.
  • Improved table feature framework for better supporting reader and writer features and their upgrade/enablement story. This standardizes adding new features to Kernel and makes it easy to integrate them into the read/write paths.
    • Also as a part of this change, adds support for appending into tables with deletionVectors, v2Checkpoint, and timestampNtz enabled.
  • Support for writing file statistics to the Delta log. File statistics can be provided by the engine when calling generateAppendActions and Kernel will serialize them and write them to the Delta log. These statistics are used in reads to prune files based on query filters.
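
The Kernel sketch below (Scala over the Java Kernel API) ties a few of these together: it creates a clustered table, sets a log compaction interval, and invokes the returned post-commit hooks. The path, engine string, and exact builder signatures are assumptions based on the names above, so treat this as an outline rather than exact API usage.

  import org.apache.hadoop.conf.Configuration
  import io.delta.kernel.{Operation, Table}
  import io.delta.kernel.defaults.engine.DefaultEngine
  import io.delta.kernel.expressions.Column
  import io.delta.kernel.types.{IntegerType, StringType, StructType}
  import io.delta.kernel.utils.CloseableIterable

  val engine = DefaultEngine.create(new Configuration())
  val table = Table.forPath(engine, "/tmp/clustered_table") // hypothetical path

  val schema = new StructType()
    .add("id", IntegerType.INTEGER)
    .add("category", StringType.STRING)

  // Create a clustered table and ask Kernel to write a log compaction file
  // every 10 commits.
  val txn = table
    .createTransactionBuilder(engine, "example-connector/1.0", Operation.CREATE_TABLE)
    .withSchema(engine, schema)
    .withClusteringColumns(engine, java.util.List.of(new Column("category")))
    .withLogCompactionInterval(10)
    .build(engine)

  // Commit (no data files in this sketch) and invoke any post-commit hooks,
  // e.g. checksum or log compaction writes.
  val result = txn.commit(engine, CloseableIterable.emptyIterable())
  result.getPostCommitHooks.forEach(hook => hook.threadSafeInvoke(engine))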

Other notable changes include:

  • Additional functionality in Transaction and TransactionBuilder (a combined sketch follows this list)
    • Support adding and removing Domain Metadata during a Transaction.
      • Optimization to avoid loading the existing domain metadata from the snapshot when no domain metadata is removed in a transaction.
    • Support setting and unsetting user-facing table properties with txnBuilder.withTableProperties and txnBuilder.withTablePropertiesRemoved.
    • Support getting the read table version on the Kernel Transaction API with getReadTableVersion.
    • Support configuring the number of retries Kernel attempts when encountering a concurrent transaction during commit with txnBuilder.withMaxRetries.
    • Support manually enabling table features by setting delta.feature.<featureName> to “supported” in the table properties.
    • Support returning PostCommitHooks as part of TransactionCommitResult. This allows Kernel to generalize post-commit operations like checkpointing or writing CRC and enables the engine to control their invocation.
  • Support writing to tables with the invariants table feature present when no invariants are defined in the schema. This is useful because, as a legacy feature, invariants often end up in the table protocol even when they are not active.
  • Expression support in the DefaultExpressionHandler and for data skipping
    • Support the STARTS_WITH expression in the default ExpressionHandler.
    • Support the SUBSTRING expression in the default ExpressionHandler.
    • Support data skipping using file statistics for the NullSafeEquals expression.
  • Miscellaneous bug fixes
    • Fix a bug where protocol validation was skipped during reads when the metadata action is seen after the protocol action.
    • Fix to be able to read partition columns with ISO8601 formatted timestamps adjusted to UTC.
    • Fix a bug where encountering a commit info action with operationParameters containing non-uniform values would throw an exception.
    • Correctly handle preview and graduated table features when auto-enabling new features in the metadata.
    • Fix the default parquet reader to not use package private classes from parquet-mr to prevent IllegalAccess errors when Kernel and Parquet libraries are loaded in different classloaders.
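
Continuing the earlier Kernel sketch (same engine and table; method names follow this announcement, exact signatures may differ), the TransactionBuilder additions look roughly like this:

  // Start a write transaction, set a user-facing table property, manually mark
  // a feature as supported, and cap the number of commit retries.
  val writeTxn = table
    .createTransactionBuilder(engine, "example-connector/1.0", Operation.WRITE)
    .withTableProperties(engine, java.util.Map.of(
      "delta.logRetentionDuration", "interval 30 days",
      "delta.feature.timestampNtz", "supported"))
    .withMaxRetries(3)
    .build(engine)

  // The table version the transaction read from is available for diagnostics.
  val readVersion = writeTxn.getReadTableVersion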

Note: basic schema evolution support via providing an updated schema to the txnBuilder.withSchema method is close to completion and just missed the code cutoff for this release. Look out for this exciting change soon!

Delta Sharing

In this release of Delta Sharing Spark we have upgraded delta-sharing-client from 1.2.2 to 1.3.2. This enables the following changes:

  • Upgrade Spark to version 4.0.0: platform upgrades bump Spark to version 4.0.0, Java to 17, and Scala to 2.13.
  • Optimized cache usage for improved performance: Simplified key structures of the Spark Parquet IO cache, which enables cache reuse in the Spark Parquet IO layer for identical queries to speed up performance.
  • Enhanced logging and error propagation for better observability
    • Added detailed logging to critical Delta Sharing client code paths to aid debugging. This will help with identifying the root cause of client side exceptions.
    • Improved error propagation by surfacing server-side error messages to the client in streaming query failure scenarios.

Limitations

In Delta Spark, UniForm with Iceberg is currently unavailable because Iceberg does not yet support Spark 4.0. This will be enabled in a future release.



Allison Portis
Software Engineer