Hello Ryan, Burak,
Thank you for the quick and accurate response. I'm using Delta 0.7.0.
You're right, the issue is fixed in 0.8.0; I have just tried the fix.
However, I now have another problem: in 0.8.0 I'm not able to start a streaming query that should be able to resume without data loss.
My delta state (after log cleanup):
00000000000000000180.json
00000000000000000180.checkpoint.parquet
00000000000000000181.json
00000000000000000182.json
00000000000000000183.json
Stream checkpoint state:
_checkpoint/offsets/4 -> "reservoirVersion": 172, "isStartingVersion": false
_checkpoint/offsets/5 -> "reservoirVersion": 182, "isStartingVersion": false (committed)
I'm getting this exception:
org.apache.spark.sql.streaming.StreamingQueryException: The stream from your Delta table was expecting process data from version 172,
but the earliest available version in the _delta_log directory is 180
Since all data up to version 181 has already been processed downstream, and the Delta logs needed to continue are still present, the stream should be able to start.
It is the same initialization symptom as the one from my previous mail, although this time having isStartingVersion: false is not required to reproduce it.
The problem is that on stream start, Spark calls the getBatch method twice:
first, for initialization only, with the checkpointed offsets offsets/4 (start = 172) and offsets/5 (end = 182);
and a second time with the correct offsets (start = 182, end = 184), whose DataFrame is the one actually used for the query.
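For what it's worth, here is a minimal toy model of the situation (plain Python, not Delta's actual code; the function and variable names are my own illustrative assumptions). It shows why the initialization pass fails: the stale offsets/4 start version (172) precedes the earliest retained _delta_log entry (180), even though the committed offset (182) is still fully covered by the log:

```python
def validate_start_version(requested_start, available_log_versions):
    """Toy version-availability check: raise if the requested start
    version precedes the earliest retained _delta_log entry
    (i.e. after log cleanup)."""
    earliest = min(available_log_versions)
    if requested_start < earliest:
        raise RuntimeError(
            f"expecting to process data from version {requested_start}, "
            f"but the earliest available version in the _delta_log "
            f"directory is {earliest}"
        )
    return requested_start

# Retained log entries from the state above: 180 (checkpoint) through 183.
available = [180, 181, 182, 183]

# Second getBatch call (committed offset): start = 182 is covered, so it passes.
validate_start_version(182, available)

# First (initialization-only) getBatch call: start = 172 predates the
# earliest retained log entry, so the whole stream start fails.
```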