Delta Lake 2.1.0 Release


Allison Portis

Aug 31, 2022, 1:37:32 PM

to delta...@googlegroups.com

Hi all,

We are excited to announce the release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

Documentation: https://docs.delta.io/2.1.0/index.html
Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
Python artifacts: https://pypi.org/project/delta-spark/2.1.0/

The key features in this release are as follows:

Support for Apache Spark 3.3.
Support for [TIMESTAMP | VERSION] AS OF in SQL. With Spark 3.3, Delta now supports time travel in SQL, so older data can be queried directly. Time travel is now available both in SQL and through the DataFrame API.
Support for Trigger.AvailableNow when streaming from a Delta table. Spark 3.3 introduces Trigger.AvailableNow, which runs a streaming query like Trigger.Once but can process the available data in multiple batches. This is now supported when using Delta tables as a streaming source.
Support for SHOW COLUMNS to return the list of columns in a table.
Support for DESCRIBE DETAIL in the Scala and Python DeltaTable API. Retrieve detailed information about a Delta table using the DeltaTable API and in SQL.
Support for returning operation metrics from SQL Delete, Merge, and Update commands. Previously these SQL commands returned an empty DataFrame; now they return a DataFrame with useful metrics about the operation performed.
Optimize performance improvements

Added a config to use repartition(1) instead of coalesce(1) in Optimize for better performance when compacting many small files.
Improve Optimize performance by using a queue-based approach to parallelize the compaction jobs.
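
To make two of the features above concrete, here is a minimal Python sketch rather than anything taken from the Delta sources. The helper names `time_travel_query` and `drain_available_data`, and the table name `events` used afterwards, are hypothetical and exist only for illustration. The first helper builds a SQL time travel statement; the second assumes an active SparkSession with Delta Lake 2.1.0 configured and uses PySpark's `trigger(availableNow=True)` keyword.

```python
from typing import Optional


def time_travel_query(table: str,
                      version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads an older snapshot of a Delta table.

    Hypothetical helper: exactly one of `version` or `timestamp` must be given.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    if "'" in timestamp:
        # Keep the sketch simple: refuse input that would break the string literal.
        raise ValueError("timestamp must not contain single quotes")
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def drain_available_data(spark, source_path: str, sink_path: str,
                         checkpoint_path: str) -> None:
    """Copy everything currently in a Delta source to a Delta sink, then stop.

    Assumes `spark` is a SparkSession (Spark 3.3+) with Delta Lake configured.
    All three paths are placeholders supplied by the caller.
    """
    query = (
        spark.readStream.format("delta").load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # may split the backlog into several batches
        .start(sink_path)
    )
    query.awaitTermination()
```

With a live session, `spark.sql(time_travel_query("events", version=5))` should return the table as of version 5, and the same helper with `timestamp="2022-08-01 00:00:00"` should return the snapshot as of that time.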

Other notable changes

Support for using variables in the VACUUM and OPTIMIZE SQL commands.
Improvements for CONVERT TO DELTA with catalog tables.

Autofill the partition schema from the catalog when it’s not provided.
Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Only data files under the directories of active partitions are committed, rather than all data files in the table directory.

Support for Change Data Feed (CDF) batch reads on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. See the documentation for more details.
Improve Update performance by enabling schema pruning in the first pass.
Fix for DeltaTableBuilder to preserve table property case of non-delta properties when setting properties.
Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
Fix for consistent timestamps in a MERGE command.
Fix for incorrect operation metrics for DataFrame writes with a replaceWhere option.
Fix for a bug in Merge that sometimes caused empty files to be committed to the table.
Change in log4j properties file format. Apache Spark upgraded log4j from 1.x to 2.x, which uses a different format for the log4j properties file. Refer to the Spark upgrade notes.
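
As an illustration of the CDF batch reads mentioned above, the following sketch assembles the reader options for a bounded change feed read. The helper name `change_feed_options` and the table name `events` are hypothetical. The option keys `readChangeFeed`, `startingVersion`, and `endingVersion` are the ones described in the Delta change data feed documentation, and the table is assumed to have change data feed enabled.

```python
from typing import Dict, Optional


def change_feed_options(starting_version: int,
                        ending_version: Optional[int] = None) -> Dict[str, str]:
    """Return DataFrameReader options for a batch read of the change data feed.

    Hypothetical helper: it only assembles the option map, it does not touch Spark.
    """
    if starting_version < 0:
        raise ValueError("starting_version must be non-negative")
    options = {
        "readChangeFeed": "true",
        "startingVersion": str(starting_version),
    }
    if ending_version is not None:
        if ending_version < starting_version:
            raise ValueError("ending_version must be >= starting_version")
        options["endingVersion"] = str(ending_version)
    return options
```

With a SparkSession in hand, the options could be applied as `spark.read.format("delta").options(**change_feed_options(0, 10)).table("events")`. Per the note above, on a column mapping enabled table this is expected to work only if DROP COLUMN and RENAME COLUMN have not been used.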

Benchmark framework update:

Improvements to the benchmark framework (initial version added in version 1.2.0) including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.

Allison Portis
Software Engineer
allison...@databricks.com
www.databricks.com
