Corrupted checkpoint file in Delta Log

Pi Gabi

Apr 20, 2021, 4:12:38 AM
to Delta Lake Users and Developers
Hi everyone,

Recently we upgraded our Spark Structured Streaming process to Spark 3.1.1 and Delta 0.8 (previously it ran on Spark 2.4.7 and Delta 0.6.1).
Everything works just fine, except that corrupted checkpoint files (the parquet files) are randomly written as part of the Delta log, which causes the streaming process to fail because the latest snapshot can't be loaded.
As a workaround, we remove the corrupted checkpoint file and change _last_checkpoint to point to the latest valid checkpoint file, and the streaming process recovers successfully.
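
For anyone hitting the same thing, here is a rough sketch of that workaround in PySpark. The table path is illustrative, it assumes the _delta_log directory is reachable with plain file I/O (on HDFS/S3 you'd use the corresponding filesystem API), and it assumes single-part checkpoints:

import glob
import json
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

delta_log = "/data/events/_delta_log"  # hypothetical table location

# Single-part checkpoints are named <20-digit version>.checkpoint.parquet.
ckpts = sorted(glob.glob(os.path.join(delta_log, "*.checkpoint.parquet")))
corrupted, previous = ckpts[-1], ckpts[-2]  # the newest one is the bad one

os.remove(corrupted)  # drop the corrupted checkpoint file

# _last_checkpoint is a one-line JSON file: {"version": <n>, "size": <actions>}.
prev_version = int(os.path.basename(previous).split(".")[0])
prev_size = spark.read.parquet(previous).count()  # number of actions in it
with open(os.path.join(delta_log, "_last_checkpoint"), "w") as f:
    json.dump({"version": prev_version, "size": prev_size}, f)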

Unfortunately, we haven't found a way to reproduce the issue: it happens once every few days, randomly, in different streaming processes.

I know that Delta 0.8 doesn't officially support Spark 3.1.1, but could that be related?
Do you have any other ideas about why this could happen?

Any help would be appreciated.

Best,
Gabi

----------------------------------------------------------------------------------------------------------------------------------------------

The exception raised when we try to read the corrupted checkpoint file is:

Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Negative length: -2
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.checkContainerReadLength(TCompactProtocol.java:737)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readMapBegin(TCompactProtocol.java:578)
at org.apache.parquet.format.InterningProtocol.readMapBegin(InterningProtocol.java:166)
at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:119)
at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1047)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
at org.apache.parquet.format.Util.read(Util.java:213)
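
To pinpoint which checkpoint parquet is the bad one, we just try to read each of them directly; roughly like this (path illustrative):

import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for path in sorted(glob.glob("/data/events/_delta_log/*.checkpoint.parquet")):
    try:
        n = spark.read.parquet(path).count()  # count() forces every page to be read
        print("OK      %s (%d actions)" % (path, n))
    except Exception as e:  # a corrupted page surfaces as an exception like the one above
        print("CORRUPT %s: %s" % (path, e))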


Jacek Laskowski

Apr 21, 2021, 5:32:55 AM
to Pi Gabi, Delta Lake Users and Developers
Hi,

spark-3.1.1 and delta 0.8?! They're incompatible. You should downgrade to Spark 3.0.2. See https://github.com/delta-io/delta/issues/594
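
For reference, pinning the supported pairing in PySpark looks roughly like this (a sketch; package coordinates per Maven Central):

from pyspark.sql import SparkSession

# Spark 3.0.x with delta-core 0.8.0 is the supported combination.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)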


Gourav Sengupta

Apr 21, 2021, 6:12:11 AM
to Jacek Laskowski, Pi Gabi, Delta Lake Users and Developers
Hi Jacek,
Spot on :) You almost read my mind.

Hi Gabi,
We are also planning to do this migration, and the steps are quite simple; if you look through the history of this group, TD has already answered my question on this wonderfully. So far we have run pipelines on Spark 2.4.x with Delta 0.5.0, stopped them, and then run pipelines on the same location with Spark 3.0.1 and Delta 0.8.0, and things look fine.
Could you please describe the steps you took to identify and recover from the error in a bit more detail? Maybe I am missing something and need to look into the checkpoint files more deeply. I would be grateful.


Regards,
Gourav

Pi Gabi

Apr 22, 2021, 3:01:57 AM
to Delta Lake Users and Developers
Jacek and Gourav - thanks for your responses.

Do you know what, specifically, in Spark 3.1.1 might cause this issue? Unlike the MERGE operation, which is a known issue and can be reproduced, our issue with the corrupted Delta log checkpoint occurs sporadically.
If you could point out changes in Spark 3.1.1 that might cause it, that would be great.

Gourav - identifying the issue is pretty easy: the Spark Structured Streaming process suddenly fails to load the latest snapshot from the Delta log, and the Spark app fails with a `java.io.EOFException`.
Restarting the streaming process doesn't help; it keeps failing to load the latest snapshot.
After some investigation, we found that the problem is caused by the latest checkpoint file, which is corrupted, so we remove that parquet file and change the _last_checkpoint file to point to the last valid checkpoint. After that, we restart the streaming process and it works properly again.
Let me know if you have more questions about it.

Best,
Gabi

On Wednesday, April 21, 2021 at 1:12:11 PM UTC+3, gourav....@gmail.com wrote:

Jacek Laskowski

Apr 22, 2021, 7:45:50 AM
to Pi Gabi, Delta Lake Users and Developers
Hi Gabi,

> Do you know what, specifically, in Spark 3.1.1 might cause this issue?

Not really, sorry. Given it's an unsupported version combination, you're treading on thin ice with 3.1.1 and 0.8.0. Good luck!

Gourav Sengupta

Apr 22, 2021, 7:50:18 AM
to Pi Gabi, Delta Lake Users and Developers
Hi Gabi,

Thanks a ton for your kind response; it was detailed and helped me quite a bit.


Regards,
Gourav Sengupta
