Hi everyone,
Recently we upgraded our Spark Structured Streaming process to Spark 3.1.1 and Delta 0.8 (previously it ran on Spark 2.4.7 and Delta 0.6.1).
Everything works fine, except that corrupted checkpoint files (the Parquet files) are occasionally written as part of the Delta log, which causes the streaming process to fail because the latest snapshot can't be loaded.
As a workaround, we remove the corrupted checkpoint file and edit _last_checkpoint to point to the latest valid checkpoint file, after which the streaming process recovers successfully.
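For reference, here is a minimal sketch of that manual recovery, assuming the Delta log is reachable as a local path; the table path and version numbers are made up for illustration:

import json
import os
import pyarrow.parquet as pq

delta_log = "/data/my_table/_delta_log"   # illustrative table path
corrupted_version = 1250                  # version of the corrupted checkpoint (example)
last_valid_version = 1240                 # latest checkpoint that still reads cleanly (example)

# 1. Remove the corrupted checkpoint file (names are zero-padded to 20 digits).
os.remove(os.path.join(delta_log, f"{corrupted_version:020d}.checkpoint.parquet"))

# 2. Rewrite _last_checkpoint to point at the latest valid checkpoint.
#    "size" is the number of actions in the checkpoint, i.e. its row count.
valid_path = os.path.join(delta_log, f"{last_valid_version:020d}.checkpoint.parquet")
size = pq.ParquetFile(valid_path).metadata.num_rows
with open(os.path.join(delta_log, "_last_checkpoint"), "w") as f:
    json.dump({"version": last_valid_version, "size": size}, f)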
Unfortunately, we haven't found a way to reproduce the issue; it happens once every few days, at random, across different streaming processes.
I know that Delta 0.8 doesn't officially support Spark 3.1.1, but I wonder if that could be related.
Do you have any other ideas about why this could happen?
Any help would be appreciated.
Best,
Gabi
----------------------------------------------------------------------------------------------------------------------------------------------
The exception raised when we try to read the corrupted checkpoint file is:
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Negative length: -2
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.checkContainerReadLength(TCompactProtocol.java:737)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readMapBegin(TCompactProtocol.java:578)
at org.apache.parquet.format.InterningProtocol.readMapBegin(InterningProtocol.java:166)
at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:119)
at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1047)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
at org.apache.parquet.format.Util.read(Util.java:213)
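For completeness, triggering the exception requires nothing special; a plain Spark read of the suspect checkpoint file, roughly like the following (path is illustrative), is enough:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-check").getOrCreate()

# Reading the suspect checkpoint file directly raises the TProtocolException above.
spark.read.parquet(
    "/data/my_table/_delta_log/00000000000000001250.checkpoint.parquet"
).count()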