We are excited to announce the release of Delta Lake 1.1.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
The release notes are available at https://github.com/delta-io/delta/releases/tag/v1.1.0 and you can find the official documentation at https://docs.delta.io/1.1.0/index.html.
The release artifacts are also available on Maven Central and PyPI.
Performance improvements in MERGE operation
On partitioned tables, MERGE operations will automatically repartition the output data before writing to files. This ensures better performance out-of-the-box for both the MERGE operation as well as subsequent read operations.
On very wide tables (e.g., ~1000 or more columns), the MERGE operation can be faster, since it now avoids quadratic complexity when resolving column names.
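The automatic repartitioning above is on by default in 1.1.0. As a sketch (assuming a running SparkSession with Delta Lake configured; verify the configuration name against the 1.1.0 docs), it can be toggled like so:

```python
# Repartition MERGE output on a partitioned table's partition columns before
# writing files (default: enabled in Delta Lake 1.1.0). Disable only if the
# extra shuffle hurts your workload.
spark.conf.set(
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "false"
)
```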
Support for passing Hadoop configurations via DataFrameReader/Writer options - You can now set Hadoop FileSystem configurations (e.g., access credentials) via DataFrameReader/Writer options. Previously, the only way to pass such configurations was through the Spark session configuration, which applied the same values to all reads and writes. Now you can set them to different values for each read and write. See the documentation for more details.
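For example, a per-read credential can be passed directly to the reader rather than set session-wide. This is a sketch (assumes a running SparkSession with Delta Lake 1.1.0; the option key shown is an illustrative Hadoop FileSystem setting, substitute whatever your storage system requires):

```python
# Credentials scoped to this one read, instead of the whole Spark session.
# Placeholders like <storage-account> are illustrative and must be filled in.
df = (
    spark.read.format("delta")
    .option("fs.azure.account.key.<storage-account>.blob.core.windows.net", "<key>")
    .load("wasbs://<container>@<storage-account>.blob.core.windows.net/events")
)
```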
Support for arbitrary expressions in `replaceWhere` DataFrameWriter option - Instead of expressions only on partition columns, you can now use arbitrary expressions in the `replaceWhere` DataFrameWriter option. That is, you can replace arbitrary data in a table directly with DataFrame writes. See the documentation for more details.
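As a sketch (assuming a running SparkSession and an existing DataFrame `df`; the path, column names, and predicate are illustrative):

```python
# Overwrite only the rows matching an arbitrary predicate -- before 1.1.0,
# replaceWhere accepted predicates on partition columns only.
(
    df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "start_date >= '2021-01-01' AND country = 'US'")
    .save("/tmp/delta/events")
)
```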
Improvements to nested field resolution and schema evolution in MERGE operation on array of structs - When applying the MERGE operation on a target table having a column typed as an array of nested structs, the nested columns between the source and target data are now resolved by name and not by position in the struct. This ensures structs in arrays have a consistent behavior with structs outside arrays. When automatic schema evolution is enabled for MERGE, nested columns in structs in arrays will follow the same evolution rules (e.g., column added if no column by the same name exists in the table) as columns in structs outside arrays. See the documentation for more details.
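The behavior above applies when automatic schema evolution is turned on for MERGE. A minimal sketch (assumes a running SparkSession, a Delta table at the given path, and a `source` DataFrame; names and path are illustrative):

```python
from delta.tables import DeltaTable

# Enable automatic schema evolution for MERGE. With 1.1.0, nested columns
# inside arrays of structs in the source are matched to the target by name,
# and new nested fields are added under the same rules as top-level fields.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, "/tmp/delta/target")
(
    target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```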
Support for Generated Columns in MERGE operation - You can now apply MERGE operations on tables having Generated Columns.
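For context, a table with a Generated Column can be defined with the builder API and then used as a MERGE target. A sketch (assumes a running SparkSession; table and column names are illustrative):

```python
from delta.tables import DeltaTable

# Create a table whose eventDate column is generated from eventTime.
# Such a table can now be the target of a MERGE; the generated column
# is computed automatically for inserted rows.
(
    DeltaTable.create(spark)
    .tableName("events")
    .addColumn("eventId", "BIGINT")
    .addColumn("eventTime", "TIMESTAMP")
    .addColumn("eventDate", "DATE", generatedAlwaysAs="CAST(eventTime AS DATE)")
    .execute()
)
```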
Fix for rare data corruption issue on GCS - Experimental GCS support released in Delta Lake 1.0 had a rare bug that could lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).
Fix for the incorrect return object in Python `DeltaTable.convertToDelta()` - This existing API now returns the correct Python object of type `delta.tables.DeltaTable` instead of an incorrectly typed, and therefore unusable, object.
Python type annotations - We have added Python type annotations, which improve auto-completion in editors that support type hints. Optionally, you can enable static type checking through mypy or built-in tools (for example, PyCharm's).
Other notable changes
Removed support for reading tables with certain special characters in the partition column name. See the migration guide for details.
Support for “delta.`path`” in `DeltaTable.forName()` for consistency with other APIs
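As a sketch (assumes a running SparkSession with a Delta table at the given illustrative path):

```python
from delta.tables import DeltaTable

# forName now also accepts the delta.`<path>` form, so path-based tables
# can be referenced the same way as in SQL and other APIs.
dt = DeltaTable.forName(spark, "delta.`/tmp/delta/events`")
```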
Improvements to DeltaTableBuilder API introduced in Delta 1.0.0
Fix for a bug that prevented passing multiple partition columns in Python `DeltaTableBuilder.partitionBy`.
Throw an error when a column's data type is not specified.
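The builder improvements above can be sketched as follows (assumes a running SparkSession; table and column names are illustrative):

```python
from delta.tables import DeltaTable

# DeltaTableBuilder in Python now accepts multiple partition columns, and
# fails fast if a column's data type is omitted.
(
    DeltaTable.createOrReplace(spark)
    .tableName("sales")
    .addColumn("id", "BIGINT")
    .addColumn("country", "STRING")
    .addColumn("date", "DATE")
    .partitionBy("country", "date")  # multiple partition columns now work
    .execute()
)
```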
Improved support for MERGE/UPDATE/DELETE on temp views.
Support for setting `userMetadata` in the commit information when creating or replacing tables.
Fix for an incorrect analysis exception in MERGE with multiple INSERT and UPDATE clauses and automatic schema evolution enabled.
Fix for incorrect handling of special characters (e.g., spaces) in paths by MERGE/UPDATE/DELETE operations.
Fix for Vacuum parallel mode being affected by Adaptive Query Execution, which is enabled by default in Apache Spark 3.2.
Fix for the earliest valid time travel version.
Fix for Hadoop configurations not being used to write checkpoints.
