Hi everyone,
I have a production issue (Spark 3.1.0, Databricks Runtime 9.1) with multiple structured streaming queries (notebooks written in Python).
From time to time one of the queries will fail with an error.
My structured streaming queries simply replicate data from tables in one delta lake to identical tables in another delta lake.
I have nightly maintenance jobs (VACUUM and OPTIMIZE).
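For reference, the maintenance job looks roughly like this (a sketch of the pattern in a Databricks notebook, not the exact job; `bronze.events` is a placeholder table name):

```python
# Sketch of a nightly Delta maintenance notebook (Databricks / PySpark).
# OPTIMIZE compacts small files into larger ones; VACUUM deletes data
# files no longer referenced by the table once they pass the retention
# threshold. "bronze.events" is a hypothetical table name.
spark.sql("OPTIMIZE bronze.events")
spark.sql("VACUUM bronze.events RETAIN 168 HOURS")  # 168 hours = the 7-day default
```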
I have narrowed the problem down to delta tables that see little to no data updates. A structured streaming query reading from such a table will keep referencing an old table version and never advance. However, every night when the vacuum job runs it creates new entries in the delta transaction log. Eventually, delta runs its course and deletes transaction log files once the log retention period expires (i.e. it deletes the transaction log file referenced by the structured streaming query's checkpoint).
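One mitigation I am considering is extending `delta.logRetentionDuration` on the slow-moving source tables, so that the checkpointed log files outlive the long gaps between updates. A sketch, assuming Delta's documented table property (`source_db.slow_table` and the 90-day interval are placeholders):

```python
# Sketch: keep transaction log files around longer for a slow-moving
# Delta table, so a lagging stream's checkpointed version is not
# deleted by log cleanup. Table name and interval are examples only.
spark.sql("""
    ALTER TABLE source_db.slow_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 90 days'
    )
""")
```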
My way of mitigating this is to reset that particular query's checkpoint and restart it from the right table version.
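Concretely, the reset looks something like this (the paths, table names, and version number are placeholders; `startingVersion` is Delta's documented option for starting a stream from a specific table version):

```python
# Sketch of restarting a replication stream from a known-good table
# version with a fresh checkpoint. All names/paths/versions are examples.
df = (spark.readStream
      .format("delta")
      .option("startingVersion", 1234)  # table version to resume from
      .table("source_db.slow_table"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/slow_table_v2")  # new checkpoint dir
   .outputMode("append")
   .toTable("target_db.slow_table"))  # toTable() starts the query (Spark 3.1+)
```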
My question is: is that typical, or am I missing something and causing my own agony?
Thanks,
Yasmine