We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
The key features in this release are as follows:
LIMIT pushdown into Delta scan. Improve the performance of queries containing LIMIT clauses by pushing the LIMIT down into the Delta scan during query planning. The Delta scan uses the LIMIT and the file-level row counts to reduce the number of files scanned, so queries read far fewer files. This could make LIMIT queries faster by 10-100x depending upon the table size.
Aggregate pushdown into Delta scan for SELECT COUNT(*). Aggregation queries such as SELECT COUNT(*) on Delta tables are satisfied using file-level row counts in Delta table metadata rather than counting rows in the underlying data files. This significantly reduces the query time, as the query only needs to read the table metadata, and could make full table count queries faster by 10-100x.
Support for collecting file-level statistics as part of the CONVERT TO DELTA command. These statistics can help speed up queries on the Delta table. Statistics are now collected by default as part of the CONVERT TO DELTA command. To disable statistics collection, specify the NO STATISTICS clause in the command. Example: CONVERT TO DELTA table_name NO STATISTICS
Improve performance of the DELETE command by pruning the columns to read when searching for files to rewrite.
Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which was used by DynamoDB’s TTL feature to clean up expired items. This timestamp value has been fixed and the table attribute renamed from commitTime to expireTime. If you already have TTL enabled, please follow the migration steps here.
Fix non-deterministic behavior during MERGE when working with sources that are non-deterministic.
Remove the restrictions for using Delta tables with column mapping in certain Streaming + CDF cases. Previously, Streaming + CDF was blocked whenever the Delta table had column mapping enabled, even if no columns had been renamed or dropped.
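The LIMIT and SELECT COUNT(*) pushdowns described above both rely on file-level row counts. The following is a minimal Python sketch of that idea, not Delta's actual implementation: it models the `add` actions of a Delta log as dictionaries whose `stats` JSON string carries a `numRecords` field. The file names, row counts, and helper functions (`num_records`, `files_for_limit`, `count_star`) are hypothetical and exist only for illustration. The sketch also ignores predicates and partition filters.

```python
import json
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for the `add` actions of a Delta log.
# Each action's `stats` field is a JSON string; `numRecords` is the
# file-level row count. Paths and counts are made up for this example.
ADD_ACTIONS: List[Dict[str, str]] = [
    {"path": "part-0000.parquet", "stats": json.dumps({"numRecords": 400})},
    {"path": "part-0001.parquet", "stats": json.dumps({"numRecords": 350})},
    {"path": "part-0002.parquet", "stats": json.dumps({"numRecords": 250})},
]


def num_records(action: Dict[str, str]) -> Optional[int]:
    """Return the file-level row count, or None if the file has no stats."""
    stats = action.get("stats")
    if not stats:
        return None
    return json.loads(stats).get("numRecords")


def files_for_limit(actions: List[Dict[str, str]], limit: int) -> List[str]:
    """Pick a prefix of files whose known row counts cover `limit` rows.

    Files without statistics are still scanned, but they do not count
    toward the covered total, so the selection stays conservative.
    """
    selected: List[str] = []
    covered = 0
    for action in actions:
        if covered >= limit:
            break
        selected.append(action["path"])
        count = num_records(action)
        if count is not None:
            covered += count
    return selected


def count_star(actions: List[Dict[str, str]]) -> Optional[int]:
    """Answer COUNT(*) from metadata alone.

    Returns None when any file lacks a row count, in which case a caller
    would have to fall back to counting rows in the data files.
    """
    total = 0
    for action in actions:
        count = num_records(action)
        if count is None:
            return None
        total += count
    return total


if __name__ == "__main__":
    print(files_for_limit(ADD_ACTIONS, 10))
    print(count_star(ADD_ACTIONS))
```

In this sketch, a small LIMIT touches only the first file, and the full table count never opens a data file. That is the source of the speedups quoted above.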
Other notable changes
[...] where() calls in Optimize scala/python API.
[...] or _ in CONVERT TO DELTA command.
[...] MERGE INTO when there are multiple UPDATE clauses and one of the UPDATEs is with a schema evolution.
[...] SparkSession object is not found when using Delta APIs
[...] last_checkpoint file fails.
[...] AvailableNow trigger on a Delta table.
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
spark-submit --packages io.delta:delta-core_2.12:2.2.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ examples/examples.py
[...] 2.2.0rc1 by just providing the --packages io.delta:delta-core_2.12:2.2.0rc1 argument.
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1102/</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.2.0rc1</version>
</dependency>
libraryDependencies += "io.delta" %% "delta-core" % "2.2.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1102/"
pip install -i https://test.pypi.org/simple/ delta-spark==2.2.0rc1
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña, Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)