delta lake merge schema compatibility

Jiwei Cao

Jun 29, 2021, 11:32:46 AM
to Delta Lake Users and Developers
Hi all,
    I recently ran into a schema compatibility issue while using Delta merge to update data.
    The scenario is this: I have multiple jobs updating a single Delta table via merge, with schema autoMerge enabled. What I observe is that when one job's merge changes the schema of the table by adding a new column, the other jobs running at the same time fail with the error:

ERROR - Job error running for <table>, with exception The schema of your Delta table has changed in an incompatible way since your DataFrame or DeltaTable object was created. Please redefine your DataFrame or DeltaTable object.

When I rerun the failed jobs later, the error goes away. I wonder why this happens, and is there a way I can make my jobs more robust?
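For context, each job does roughly the following (a rough sketch only; the table path, join key, and column names below are placeholders, not my actual ones):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta Lake already configured for the session

# Let merge evolve the table schema when the source carries new columns
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, "/data/shared_table")   # placeholder path
updates = spark.read.parquet("/data/incoming_batch")       # may carry a brand-new column

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")              # placeholder join condition
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# If another job's merge adds a column between DeltaTable.forPath() and execute(),
# this now-stale DeltaTable object is the one that hits the error quoted above.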

Jiwei

Tathagata Das

Jun 29, 2021, 1:02:33 PM
to Jiwei Cao, Delta Lake Users and Developers
Do you have any more information about the failure? For example:
- Full stack trace? 
- Schema of the data you are merging? Schema of the table? 
- What Delta version are you using? 

TD

Jiwei Cao

Jun 30, 2021, 5:32:34 PM
to Tathagata Das, Delta Lake Users and Developers
Hi Tathagata, let me get some examples to share.
--
Jiwei Cao
Sr. Machine Learning Engineer
Outreach

Jacek Laskowski

Jul 6, 2021, 10:46:25 AM
to Jiwei Cao, Delta Lake Users and Developers
Hi Jiwei,

I think I ran into it today myself!

The reason was that I had defined a Delta table with an id column of type int, while the data I was trying to append (with the overwrite option) used longs.
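Roughly this shape (a minimal sketch, with made-up paths and data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is available in the session

# Table originally created with id as int
spark.createDataFrame([(1, "a")], "id int, name string") \
    .write.format("delta").save("/tmp/ids")

# A later write uses id as long; overwriting the schema replaces it, and any
# DataFrame/DeltaTable object created against the old int schema becomes stale
# and can hit the same "incompatible way" error on its next use.
spark.createDataFrame([(2, "b")], "id long, name string") \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/ids")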

I think the description of the spark.databricks.delta.checkLatestSchemaOnRead option [1] can clear things up a bit (if not completely):

    In Delta, we always try to give users the latest version of their data without having to call REFRESH TABLE or redefine their DataFrames when used in the context of streaming. There is a possibility that the schema of the latest version of the table may be incompatible with the schema at the time of DataFrame creation. This flag enables a check that ensures that users won't read corrupt data if the source schema changes in an incompatible way.
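So if the check itself is what is biting you, it can in principle be relaxed with that flag (a sketch only; this also gives up the protection described above, so redefining the DataFrame/DeltaTable object and retrying, as the error message itself suggests, is usually the better fix):

# Relaxes the stale-schema check quoted above; use with care.
spark.conf.set("spark.databricks.delta.checkLatestSchemaOnRead", "false")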

HTH

