We are using Delta Table Merge statement but we would like to only update rows in the destination when both the ID between our source and destination are matched AND the data in that row is actually different. I've accomplished this in the past in SQL Server DW by using an EXCEPT clause to produce a set of rows where the data in at least one column was different
Example:
Source Table
ID - NAME - CITY
1 - Logan - Austin
2 - Boyd - Houston
Destination Table
ID - NAME - CITY
1 - Logan - Austin
2 - Boyd - Austin
merge into destination using source
on
destination.id =
source.idwhen matched and (
source.name !=
destination.name or source.city != destination.city) then
update set *
when not matched then insert *
The result of the merge should be the row with ID=2 gets updated in the destination but ID=1 does not since values would not change.
I would like to be able to do this in an effective way for tables that have 100+ columns.
Is creating an MD5 hash of the set of columns to potentially be updated a good option?
What other options do I have besides a giant list of not equals conditions?
The rows that actually get updated or inserted from the merge would have a timestamp on them that would indicate to downstream processes so they could update their datasets too. Passing millions of records along that didn't have any data changed will flood our API with too much data.
thanks,
Logan
Austin, TX