compressed chunk failed checksum


Ariel Cohen

Nov 29, 2021, 4:48:01 PM11/29/21
to scyllad...@googlegroups.com
I am seeing the following error about 1000 times per minute, from the same shard on one of the nodes.
What causes it, and what should I do to resolve it?
I'm using Scylla 4.4.2-0.20210522.2b29568bf
 
[shard 50] compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying
 
 
Thanks
 
 

Avi Kivity

Nov 30, 2021, 2:24:37 AM11/30/21
to scyllad...@googlegroups.com, Ariel Cohen
It's an indication of data file corruption. Check if previous logs mention the bad sstable. If they don't, open an issue and we'll enhance the logs.

To recover, you can trash the node and rebuild it, or if you identify the bad sstable you can stop the node, delete it, and repair. Note deleting sstables has the chance to cause data resurrection (deleted data becomes undeleted), so trashing the node is the safer option.
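The "stop, delete, repair" path can be sketched roughly as below. The data directory layout, the keyspace/table names (`ks`/`tbl`), and the sstable generation (`md-1234`) are hypothetical placeholders; substitute the sstable your logs actually name.

```shell
# Hypothetical sketch of the "delete the bad sstable and repair" path.
# Keyspace "ks", table "tbl", and generation "md-1234" are placeholders.

sudo systemctl stop scylla-server

# Move the suspect sstable's component files aside rather than deleting
# them outright, so they can still be inspected or restored later.
sudo mkdir -p /var/lib/scylla/quarantine
sudo mv /var/lib/scylla/data/ks/tbl-*/md-1234-big-* /var/lib/scylla/quarantine/

sudo systemctl start scylla-server

# Repair pulls the missing data back from the other replicas. Per the
# note above, this can still resurrect deleted data.
nodetool repair ks tbl
```

If the bad sstable cannot be identified, replacing the node and letting it rebuild from the rest of the cluster remains the safer option.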
--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/trinity-8c28bb6a-12b4-4597-bb5e-13c22c0940a0-1638222479195%403c-app-mailcom-bs15.


Raphael S. Carvalho

Nov 30, 2021, 6:05:28 AM11/30/21
to ScyllaDB users, Ariel Cohen
On Tue, Nov 30, 2021 at 4:24 AM Avi Kivity <a...@scylladb.com> wrote:
>
> It's an indication of data file corruption. Check if previous logs mention the bad sstable. If they don't, open an issue and we'll enhance the logs.

That's https://github.com/scylladb/scylla/commit/2c8d84b864fa4f3b59edb25fa4aa48451d8b7e0c,
but it will only be available in the upcoming Scylla 4.6.

Ariel, can you please share details about your disk, filesystem, and
kernel version? Looking at the kernel logs is also worthwhile, to see
whether a disk is consistently failing.
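A minimal sketch of the kind of information being requested here. The device name and the availability of smartmontools are assumptions about the host:

```shell
# Kernel and filesystem details.
uname -r
df -T /var/lib/scylla   # filesystem type under the Scylla data directory

# Recent kernel messages that look I/O-related.
dmesg -T | grep -iE 'i/o error|ata|scsi|md0' | tail -n 50

# SMART health for a member of the RAID array (requires smartmontools;
# /dev/sde is an example device -- repeat for each member disk).
sudo smartctl -H /dev/sde
```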


Ariel Cohen

Nov 30, 2021, 10:53:50 AM11/30/21
to Raphael S. Carvalho, ScyllaDB users
There are no disk failure messages in dmesg.

Could this be because I stopped the server without running 'nodetool drain' first?
How can I find the corrupted sstables now?


$ uname -a
Linux ******* 3.10.0-1127.19.1.el7.x86_64

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.8 (Maipo)

$ mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Jun 8 12:50:06 2021
Raid Level : raid0
Array Size : 41254580224 (39343.43 GiB 42244.69 GB)
Raid Devices : 22
Total Devices : 22
Persistence : Superblock is persistent

Update Time : Tue Jun 8 12:50:06 2021
State : clean
Active Devices : 22
Working Devices : 22
Failed Devices : 0
Spare Devices : 0

Chunk Size : 1024K

Consistency Policy : none

Name : *******:0 (local to host *******)
UUID : ecbea2b6:79d218a9:d236e1c8:9f10cd55
Events : 0

Number Major Minor RaidDevice State
0 8 65 0 active sync /dev/sde1
1 8 81 1 active sync /dev/sdf1
2 8 97 2 active sync /dev/sdg1
3 8 113 3 active sync /dev/sdh1
4 8 129 4 active sync /dev/sdi1
5 8 145 5 active sync /dev/sdj1
6 8 161 6 active sync /dev/sdk1
7 8 177 7 active sync /dev/sdl1
8 8 193 8 active sync /dev/sdm1
9 8 209 9 active sync /dev/sdn1
10 8 225 10 active sync /dev/sdo1
11 8 241 11 active sync /dev/sdp1
12 65 1 12 active sync /dev/sdq1
13 65 17 13 active sync /dev/sdr1
14 65 33 14 active sync /dev/sds1
15 65 49 15 active sync /dev/sdt1
16 65 65 16 active sync /dev/sdu1
17 65 81 17 active sync /dev/sdv1
18 65 97 18 active sync /dev/sdw1
19 65 113 19 active sync /dev/sdx1
20 65 129 20 active sync /dev/sdy1
21 65 145 21 active sync /dev/sdz1


Thanks
 
 
 


Avi Kivity

Nov 30, 2021, 11:03:44 AM11/30/21
to scyllad...@googlegroups.com, Ariel Cohen, Raphael S. Carvalho, Benny Halevy
It's not caused by an operational error. This is either a hardware error
or a serious (and very rare) bug in Scylla.


There is a nodetool scrub command that can be used to validate sstables,
but I'm not sure which version it appeared in.
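For reference, the basic invocation looks like the sketch below; the keyspace/table names (`ks`/`tbl`) are placeholders, and the available options should be checked against the documentation for your release:

```shell
# Scrub the sstables of one table; "ks" and "tbl" are placeholder names.
nodetool scrub ks tbl

# Newer Scylla releases add scrub modes (including a validate-only pass);
# check `nodetool help scrub` on your version before relying on them.
```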