Scylla scylla-server-2.3.1 cluster collapse due to low storage on 1 node


Andrei

May 31, 2021, 9:50:12 AM
to ScyllaDB users
I have a scylla cluster with 4 nodes in AWS on i3.large instances.

The node below ran out of space on the Scylla partition and crashed (not the whole node, only the Scylla services):

[root@ip-10--83 log]# uptime
 13:46:17 up 478 days, 14:12,  1 user,  load average: 0.66, 0.62, 0.71

[root@ip-10--83 log]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.5G     0  7.5G   0% /dev
tmpfs           7.5G     0  7.5G   0% /dev/shm
tmpfs           7.5G  710M  6.8G  10% /run
tmpfs           7.5G     0  7.5G   0% /sys/fs/cgroup
/dev/xvda1       10G  5.6G  4.5G  56% /
/dev/nvme0n1    443G  257G  186G  59% /var/lib/scylla
tmpfs           1.5G     0  1.5G   0% /run/user/1000


Each node has 2 vCPUs, 14GB of RAM and 480GB NVME storage.

[root@ip-centos]# rpm -qa | grep scylla
scylla-libstdc++73-7.3.1-1.2.el7.centos.x86_64
scylla-kernel-conf-2.3.1-0.20181021.336c77166.el7.x86_64
scylla-debuginfo-2.3.1-0.20181021.336c77166.el7.x86_64
scylla-libgcc73-7.3.1-1.2.el7.centos.x86_64
scylla-2.3.1-0.20181021.336c77166.el7.x86_64
scylla-env-1.1-1.el7.noarch
scylla-tools-2.3.1-20181021.823346d3b0.el7.noarch
scylla-jmx-2.3.1-20181021.5fcbf8e.el7.noarch
scylla-conf-2.3.1-0.20181021.336c77166.el7.x86_64
scylla-tools-core-2.3.1-20181021.823346d3b0.el7.noarch
scylla-server-2.3.1-0.20181021.336c77166.el7.x86_64

At the beginning of each month a maintenance script is run on each node during the first 4 days of the month. For example, on the 1st of May at midnight the script ran on node 1, on the 2nd of May at midnight on node 2, and so on.
This is the script :

[root@ip-centos]# cat scylla_restart.sh
#!/bin/bash
sudo systemctl restart scylla-jmx
sleep 60
nodetool cleanup
sleep 60
nodetool repair
sleep 60
nodetool drain
sleep 60
sudo systemctl stop scylla-server
sleep 60
sudo systemctl start scylla-server
sleep 60
nodetool status >> /tmp/cron.log
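The staggered monthly schedule described above can be expressed as one crontab entry per node (a sketch; the script path and log file are assumptions based on the script shown):

```shell
# Hypothetical crontab entries implementing the monthly stagger:
# node 1 runs at midnight on the 1st, node 2 on the 2nd, and so on.
# On node 1:
0 0 1 * * /root/scylla_restart.sh >> /tmp/cron.log 2>&1
# On node 2:
0 0 2 * * /root/scylla_restart.sh >> /tmp/cron.log 2>&1
# On node 3:
0 0 3 * * /root/scylla_restart.sh >> /tmp/cron.log 2>&1
# On node 4:
0 0 4 * * /root/scylla_restart.sh >> /tmp/cron.log 2>&1
```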

So on the 4th of May one node crashed because its disk filled up:

May  4 08:09:36 ip-83 scylla: [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention
May  4 08:09:36 ip-83 scylla: [shard 0] storage_service - Disk error: std::system_error (error system:28, No space left on device)
May  4 08:09:36 ip--83 scylla: [shard 0] storage_service - Stop transport: starts
May  4 08:09:36 ip--83 scylla: [shard 0] gossip - My status = NORMAL
May  4 08:09:36 ip-83 scylla: [shard 0] gossip - Announcing shutdown

Then at the same moment the other 3 nodes crashed (by nodes I mean the Scylla services, not the entire VMs). Apparently the 3 nodes crashed because of a segmentation fault:

May  4 08:09:36 ip-47 scylla: [shard 0] gossip - InetAddress 10.xxx.xxx.83 is now DOWN, status = shutdown
May  4 08:09:36 ip-47 scylla: [shard 0] stream_session - stream_manager: Close all stream_session with peer = 10.xxx.xxx.83 in on_dead
May  4 08:09:36 ip-47 scylla: Segmentation fault on shard 0.
May  4 08:09:36 ip-47 scylla: Backtrace:
May  4 08:09:36 ip-47 scylla: 0x00000000006885c2
May  4 08:09:36 ip-47 scylla: 0x00000000005a36bc
May  4 08:09:36 ip-47 scylla: 0x00000000005a3965
May  4 08:09:36 ip-47 scylla: 0x00000000005a39b3
May  4 08:09:36 ip-47 scylla: /lib64/libpthread.so.0+0x000000000000f6cf
May  4 08:09:36 ip-47 scylla: 0x0000000001b3a96e
May  4 08:09:36 ip-47 scylla: 0x0000000001bbecc1
May  4 08:09:36 ip-47 scylla: 0x000000000126665e
May  4 08:09:36 ip-47 scylla: 0x00000000005a7f3b
May  4 08:09:36 ip-47 scylla: 0x000000000058332b
May  4 08:09:36 ip-47 scylla: 0x0000000000580614
May  4 08:09:36 ip-47 scylla: 0x0000000000652286
May  4 08:09:36 ip-47 scylla: 0x000000000079383f
May  4 08:09:36 ip-47 scylla: 0x0000000000518e77
May  4 08:09:36 ip-47 scylla: /lib64/libc.so.6+0x0000000000022444
May  4 08:09:36 ip-47 scylla: 0x000000000057ecfd
May  4 08:09:36 ip-47 kernel: scylla[14834]: segfault at 1 ip 0000000001b3a96f sp 00007fff44595be0 error 4 in scylla[400000+4d73000]
May  4 08:09:36 ip-47 kernel: Code: 10 48 85 db 0f 84 f1 01 00 00 66 0f ef e4 45 31 e4 f2 0f 11 64 24 18 f2 0f 11 64 24 10 66 0f 1f 44 00 00 48 8b 7b 10 48 8b 07 <ff> 10 48 85 c0 0f 85 c6 00 00 00 48 8b 1b 48 85 db 75 e6 4c 8b 75

An important mention here is that the cluster was originally built with i3.xlarge instances but was migrated to i3.large around December 2020 for cost savings.
Looking in CloudWatch I noticed an increase in CPU usage, network usage and disk operations starting in December, occurring exactly when the maintenance script runs on one of the 4 nodes. I noticed this behaviour on all nodes.
I attached some CloudWatch screenshots (cp03-all-monitors.PNG, CP03-sdb-cpu.PNG).

Another important mention is that the application is not more heavily used now compared to December; on the contrary, there are fewer users now and the project was decommissioned.
Given all this information I have a few questions, please help me.

Is this hardware config good enough? How can I assess this myself?
Are the commands in the maintenance script useful? Should I modify them to prevent such crashes?
Right now I'm thinking to go back to i3.xlarge instances or add 1 more node to the cluster. Which would be the best approach?
For additional info please feel free to ask me.

Thanks, 
Andrei

Avi Kivity

May 31, 2021, 9:58:08 AM
to scyllad...@googlegroups.com, Andrei, Asias He

2.3.1 is an outdated version. It's no longer supported; you should keep your nodes running updated versions with all the fixes.


It's also very important to keep enough free space on the nodes (by pruning snapshots, deleting unneeded data, and growing the cluster if needed).


My recommendation is to resolve the out-of-space error (if you can do this by deleting snapshots), recover the cluster, and upgrade all the way to 4.4.


If you cannot recover space, you may need to replace the node with a larger instance and rebuild it. I'm not sure what the best strategy is. Copying Asias for guidance.


Andrei

Jun 3, 2021, 5:58:39 PM
to ScyllaDB users

Thanks for the reply Avi. I'm very new to Scylla, but I read about the nodetool commands we use in that maintenance script and some of them don't make sense to me.
I don't know who created that "script".
According to the official Scylla documentation, "nodetool cleanup" is used to remove data from nodes after a token ring modification, i.e. when the cluster is expanded with new nodes or when a node is removed from the cluster.
This is not the case for us, so I don't see why we would run this command at the beginning of every month. What do you think? Does it help, or should we remove this command from the script?
The next command, "nodetool repair": I think it makes sense to run it in order to sync data across all nodes periodically, but according to the docs maybe we should run it more often, because we do indeed do regular deletions; we keep 1 year of data and delete everything that's older.
The last command, "nodetool drain": I don't see its purpose. According to the docs it is used before an upgrade or a maintenance operation. Why run it after "repair" and "cleanup"?

Asias He

Jun 4, 2021, 3:08:44 AM
to ScyllaDB users
On Fri, Jun 4, 2021 at 5:58 AM Andrei <fauxnom...@gmail.com> wrote:

Thanks for the reply Avi. I'm very new to Scylla, but I read about the nodetool commands we use in that maintenance script and some of them don't make sense to me.
I don't know who created that "script".

I do not know either ;-) You do not need to run cleanup regularly. Only do it after a new node is added to the cluster.
 
According to the official Scylla documentation, "nodetool cleanup" is used to remove data from nodes after a token ring modification, i.e. when the cluster is expanded with new nodes or when a node is removed from the cluster.
This is not the case for us, so I don't see why we would run this command at the beginning of every month. What do you think? Does it help, or should we remove this command from the script?

The periodic script is not very useful except for the repair part.  You can use scylla-manager instead of the script to schedule repair.
 
The next command, "nodetool repair": I think it makes sense to run it in order to sync data across all nodes periodically, but according to the docs maybe we should run it more often, because we do indeed do regular deletions; we keep 1 year of data and delete everything that's older.
The last command, "nodetool drain": I don't see its purpose. According to the docs it is used before an upgrade or a maintenance operation. Why run it after "repair" and "cleanup"?

nodetool drain helps to cleanly shut down a Scylla server. There is no need to run it before nodetool repair or cleanup.
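In other words, drain belongs only immediately before a planned shutdown. A minimal sketch of such a restart, assuming a systemd-managed Scylla node as in the original script:

```shell
# Clean restart of a Scylla node: drain flushes memtables and stops the node
# from accepting new writes, so the subsequent stop is safe.
nodetool drain
sudo systemctl stop scylla-server
# ... planned maintenance or upgrade happens here ...
sudo systemctl start scylla-server
nodetool status   # verify the node comes back as UN (Up/Normal)
```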

 

Asias He

Jun 4, 2021, 3:10:35 AM
to Avi Kivity, ScyllaDB users, Andrei
On Mon, May 31, 2021 at 9:58 PM Avi Kivity <a...@scylladb.com> wrote:

2.3.1 is an outdated version. It's no longer supported; you should keep your nodes running updated versions with all the fixes.


It's also very important to keep enough free space on the nodes (by pruning snapshots, deleting unneeded data, and growing the cluster if needed).


My recommendation is to resolve the out-of-space error (if you can do this by deleting snapshots), recover the cluster, and upgrade all the way to 4.4.


If you cannot recover space, you may need to replace the node with a larger instance and rebuild it. I'm not sure what the best strategy is. Copying Asias for guidance.


The easiest way is to start new instances with more storage to replace the old instance one by one using the replace_address option.
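For reference, a rough sketch of that replacement flow on the new instance (the paths are assumptions and the exact option name should be checked against the target version's docs; newer releases also support replace_address_first_boot):

```shell
# Hypothetical: replace an undersized node with a new, larger instance.
# On the NEW node, before its very first start, point it at the old node's IP:
echo 'replace_address: 10.xxx.xxx.83' | sudo tee -a /etc/scylla/scylla.yaml
sudo systemctl start scylla-server   # the node streams data from the replicas
nodetool status                      # wait until the new node shows as UN
# Remove the replace_address line from scylla.yaml before any later restart.
```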


--
Asias

Andrei

Jul 26, 2021, 6:46:43 AM
to ScyllaDB users
Hello guys. I'm reviving this discussion as I encountered some new issues/bugs.
I changed the frequency of the maintenance script from monthly to weekly and also removed from it the nodetool commands which are not useful.
For example on this cluster with 3 nodes we have the following :
Node 1: 
0 0 * * 6 /home/centos/scylla_repair.sh >> /tmp/repair.log  2>&1
Node 2:
0 12 * * 6 /home/centos/scylla_repair.sh >> /tmp/repair.log  2>&1
Node 3:
0 0 * * SUN /home/centos/scylla_repair.sh >> /tmp/repair.log  2>&1

[centos@ip-10-251-135-197 ~]$ cat /home/centos/scylla_repair.sh
#!/bin/bash

sudo systemctl restart scylla-jmx
sleep 60
sudo nodetool repair > /tmp/repair.log 2>&1
sleep 60
sudo nodetool status >> /tmp/repair.log

With the exception of default_time_to_live we used the following settings for all tables:

cqlsh> select * from system_schema.tables where keyspace_name='cldtx' limit 1;

 keyspace_name | table_name           | bloom_filter_fp_chance | caching                                      | comment | compaction                                | compression | crc_check_chance | dclocal_read_repair_chance | default_time_to_live | extensions | flags        | gc_grace_seconds | id                                   | max_index_interval | memtable_flush_period_in_ms | min_index_interval | read_repair_chance | speculative_retry
---------------+----------------------+------------------------+----------------------------------------------+---------+-------------------------------------------+-------------+------------------+----------------------------+----------------------+------------+--------------+------------------+--------------------------------------+--------------------+-----------------------------+--------------------+--------------------+-------------------
         cldtx | account_notification |                   0.01 | {'keys': 'ALL', 'rows_per_partition': 'ALL'} |         | {'class': 'SizeTieredCompactionStrategy'} |            {} |                1 |                        0.1 |                    0 |           {} | {'compound'} |           864000 | 945a58b0-5341-11ea-a67a-000000000000 |               2048 |                           0 |                128 |                  0 |    99.0PERCENTILE

(1 rows)


So during weekends, at 00:00 on Saturday we run repair on the first node of the cluster, at 12:00 on Saturday on the 2nd node, and so on. I changed the repair frequency to 1 week because we use gc_grace_seconds = 864000 (10 days) and it is recommended to have at least 1 repair before gc_grace_seconds is reached, in order to avoid data resurrection.
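As a quick sanity check that the new weekly schedule satisfies this recommendation, a minimal sketch:

```shell
# The repair interval must stay below gc_grace_seconds, otherwise tombstones
# can be purged before every replica has seen them (risking data resurrection).
gc_grace_seconds=864000               # 10 days, as set on the tables
repair_interval=$((7 * 24 * 3600))    # weekly cron schedule = 604800 seconds
if [ "$repair_interval" -lt "$gc_grace_seconds" ]; then
  echo "OK: repair every $repair_interval s fits inside gc_grace_seconds ($gc_grace_seconds s)"
else
  echo "WARNING: repairs may not complete within gc_grace_seconds"
fi
```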
The issue which made me revive this discussion is related to storage utilization. I tried to do a major compaction on all tables on all nodes from all clusters, hoping this would free up some storage space.
For this I used this procedure, but I'm not sure I understood all the steps very well: https://docs.scylladb.com/kb/tombstones-flush/
I followed these steps:
1. On Node 1 I checked when the last repair was started and set gc_grace_seconds to the time elapsed since that repair started. For example, if on Node 1 the last repair started this Saturday at 00:00 and I did a major compaction now, I'd set gc_grace_seconds to 222360 seconds.
2. I ran nodetool flush to persist the data from RAM into SSTables on disk (on Node 1).
3. I ran nodetool compact (on Node 1) and waited for it to finish.
4. I repeated steps 1-3 on the rest of the nodes sequentially (Node 2 and Node 3).
5. I checked the results and was very happy: the compaction shrunk the tables by up to 50%.
Before compaction:
UN Node 1    115.76 GB 
UN Node 2    83.23 GB 
UN Node 3    89 GB 
After compaction :
UN Node 1     55.27   GB 
UN Node 2     51.64   GB 
UN Node 3     54.54  GB
6. Changed the gc_grace_seconds back to the original default value of 864000 on one of the nodes in the cluster (knowing that the change will propagate across all nodes via gossip).
Immediately I noticed I had run into a bug: because of it, altering gc_grace_seconds on a table resets the default TTL to 0 on that table, so after step 1 I had the default TTL set to 0 on all tables in the keyspace.
7. Changed the default TTL back to the original values on all tables on one node of the cluster (knowing that the change will propagate across the entire cluster).
I checked the changes and all seemed to be OK. For steps 6 and 7 I used 2 .cql scripts containing "ALTER" commands for each table.
Example :
[root@ip-10-251--183 centos]# head -4 set_default_ttl.cql
ALTER TABLE cldtx.account_notification WITH default_time_to_live = 0;
ALTER TABLE cldtx.application_graph WITH default_time_to_live = 31622400;
ALTER TABLE cldtx.application_notification WITH default_time_to_live = 0;
ALTER TABLE cldtx.application_notification_by_datacenter WITH default_time_to_live = 0;
[root@ip-10-251--183 centos]#
[root@ip-10-251--183 centos]#
[root@ip-10-251--183 centos]# head -3 set_gc_grace.cql
ALTER TABLE cldtx.account_notification WITH gc_grace_seconds = 864000;
ALTER TABLE cldtx.application_graph WITH gc_grace_seconds = 864000;
ALTER TABLE cldtx.application_notification WITH gc_grace_seconds = 864000;
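As a cross-check of the 222360-second figure from step 1: it corresponds to 2 days, 13 hours and 46 minutes elapsed since Saturday 00:00 (that decomposition is my reconstruction; the exact timestamp isn't stated in the thread):

```shell
# 222360 s = 2 days + 13 hours + 46 minutes since the last repair started.
elapsed=$((2 * 86400 + 13 * 3600 + 46 * 60))
echo "$elapsed"   # prints 222360
```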

Today all my enthusiasm was gone when I noticed that the storage utilization had returned to levels similar to those before the major compaction.
After the repairs from this weekend the tables grew a lot.
For example, on the cluster I mentioned above we now have:
UN   Node 1    93.76 GB  
UN   Node 2    95.75 GB   
UN  Node 3     100.39 GB

The question is: do you have any idea why this happened? Is this another bug? Please note that we still use an outdated Scylla version (2.3.1).
Did I do something wrong which led to data resurrection?

Thanks,
Andrei

Avi Kivity

Aug 7, 2021, 2:26:29 PM
to scyllad...@googlegroups.com, Andrei

It will be super helpful to upgrade. That's much more productive than trying to chase a bug which may well have already been fixed.


2.3.1 was released in October 2018, almost three years ago.

Avi Kivity

Aug 17, 2021, 11:31:47 AM
to faux nom, ScyllaDB users

4.4.4. Note that you should upgrade one minor version at a time - 2.3 -> 3.0 -> 3.2 -> 3.3 etc.

On 17/08/2021 16.32, faux nom wrote:
Hi Avi, thanks for the input. I agree with you. I will try an upgrade on some of our lower environments and see the results.
Our HW setup is the following :
Model vCPU Memory (GiB) Networking Performance Storage (TB)
i3.large 2 15.25 Up to 10 Gigabit 1 x 0.475 NVMe SSD

The OS is CentOS Linux release 7.5.1804 (Core)

Given this setup, which version do you recommend to upgrade to ?

Andrei

Aug 23, 2021, 12:35:27 PM
to ScyllaDB users
Hi Avi. I upgraded all the way up to 4.3.6 on a test environment, one version at a time as you advised, and everything looks normal. The first positive thing I noticed is that one bug is killed - the bug which was causing the default TTL to reset to "0" when gc_grace_seconds was altered.
If no issues are observed with this version, can I upgrade directly from 2.3.1 to 4.3.6? (There are 10 versions in between and it's a bit tiresome to walk through all of them.)

Avi Kivity

Aug 23, 2021, 12:47:11 PM
to scyllad...@googlegroups.com, Andrei

I advise against jump-upgrading.


We have a policy of dropping wire compatibility with 2-year-old branches. 2.3 was released three years ago, so the nodes may not even be able to talk to each other.


You could upgrade a major release at a time (2.3 -> 3.0 -> 4.4). These have the code to talk to each other. But we only ever test upgrading across one minor release, so you'll be the first one to test those code paths. It is brave, but you don't want to be brave in production. It will also be hard to assist you if something goes wrong.

Tomer Sandler

Aug 23, 2021, 5:35:28 PM
to scyllad...@googlegroups.com, Andrei
Another option would be to deploy a NEW cluster with Scylla 4.4.<latest> and perform data migration to the new cluster.
This can be a hot/cold migration - you can read more about migration strategies here



--
Tomer Sandler
ScyllaDB

Andrei

Aug 23, 2021, 6:52:39 PM
to ScyllaDB users
Thanks Tomer, I'll take that option into consideration as well.
A few more questions:
The biggest concern with jump-upgrading is that nodes with different versions may not be able to communicate with each other, and this can cause downtime for the cluster.
But as soon as all nodes have the same version, there shouldn't be any other issues, right?

On this environment where I tested 4.3.6 today, we previously had 2.3.1. Some time ago I followed a procedure from the Scylla docs and ran a major compaction on all nodes in the cluster. The space usage decreased by ~50% on each node, but as soon as the repair jobs kicked in during the weekend the space usage increased back to the original values from before the compaction, so I thought this was another bug...
Now, on the same test environment, with the new version, I ran a major compaction again and here are the results:

With v4.3.6 , BEFORE compaction :

|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.22.31.89  86.65 GB   256          ?       4b8022cc-f829-4474-a6c9-33e9429e9154  1c
UN  10.22.31.5   99 GB      256          ?       3717e251-4d1b-4b37-b137-fbad4f028213  1a
UN  10.22.31.38  76.72 GB   256          ?       feabcd56-21d9-4bec-acf7-089c6a6002dd  1b
UN  10.22.31.22  80.79 GB   256          ?       79fb2737-a871-4b78-aa57-0be915d08797  1a

With v4.3.6, AFTER compaction:

--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.22.31.89  9.82 GB    256          ?       4b8022cc-f829-4474-a6c9-33e9429e9154  1c
UN  10.22.31.5   10.34 GB   256          ?       3717e251-4d1b-4b37-b137-fbad4f028213  1a
UN  10.22.31.38  9.25 GB    256          ?       feabcd56-21d9-4bec-acf7-089c6a6002dd  1b
UN  10.22.31.22  9.44 GB    256          ?       79fb2737-a871-4b78-aa57-0be915d08797  1a

With v4.3.6, after compaction and after repair

--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.22.31.89  9.81 GB    256          ?       4b8022cc-f829-4474-a6c9-33e9429e9154  1c
UN  10.22.31.5   10.33 GB   256          ?       3717e251-4d1b-4b37-b137-fbad4f028213  1a
UN  10.22.31.38  9.24 GB    256          ?       feabcd56-21d9-4bec-acf7-089c6a6002dd  1b
UN  10.22.31.22  9.44 GB    256          ?       79fb2737-a871-4b78-aa57-0be915d08797  1a

Immediately after compaction I ran repair on each node and it seems the space usage remained the same.
So after compaction we can see an ~90% reduction in space usage. Is this normal? Is this how bad the state of the cluster was before the upgrade & compaction?
I was concerned we had lost data after this compaction, but I checked the app which uses this cluster and everything seems normal: we still have 1 year worth of data and the data is accessed properly.
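For reference, the reduction on the first node can be computed directly from the loads reported above (a quick awk sketch):

```shell
# Load on node 10.22.31.89 dropped from 86.65 GB to 9.82 GB after compaction.
pct=$(awk 'BEGIN { printf "%.0f", (1 - 9.82 / 86.65) * 100 }')
echo "${pct}% reduction"   # prints "89% reduction"
```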

Tomer Sandler

Aug 24, 2021, 12:22:04 AM
to scyllad...@googlegroups.com
Scylla 3.0 introduced support for the `mc` sstable format, which consumes less disk space. Scylla 4.3 introduced support for the `md` sstable format.
After an upgrade, running `nodetool upgradesstables` (or just firing a major compaction like you did) will compact the sstables, and the newly created ones will use the newly supported sstable file format.
This is just one storage footprint optimization I can think of; there are probably more.
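A minimal sketch of that rewrite step, run on each node after the node itself has been upgraded (using the keyspace name from earlier in the thread):

```shell
# Rewrite sstables into the newest format supported by the running version.
nodetool upgradesstables          # all keyspaces
# or limit it to a single keyspace, e.g.:
nodetool upgradesstables cldtx
```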

It would be VERY hard to go over 3 years of developments spreading over tens of patch, minor and major version releases.
You can dive into our release notes and read about all that happened from 2.2 until 4.4 (which is A LOT).

Tomer Sandler

Aug 24, 2021, 12:26:51 AM
to scyllad...@googlegroups.com
BTW, clearing snapshots, running `nodetool cleanup` (upon adding/removing nodes from the cluster), and choosing the "right" compaction strategy can also dramatically affect storage utilization.
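For the snapshot part, the relevant commands are roughly (a sketch; the tag name is hypothetical):

```shell
# Snapshots are hard links that keep otherwise-deleted sstables on disk.
nodetool listsnapshots            # see which snapshots are consuming space
nodetool clearsnapshot            # drop all snapshots on this node
nodetool clearsnapshot -t mytag   # or drop only the snapshot tagged "mytag"
```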
Some reading:
- https://ultrabug.fr/Tech%20Blog/2019/2019-03-29-scylla-four-ways-to-optimize-your-disk-space-consumption/
- https://scylladb.medium.com/capacity-planning-with-style-57eddbbf92d1

Andrei

Aug 24, 2021, 11:00:45 AM
to ScyllaDB users

Thanks Tomer for the explanation, and thanks a lot Scylla team! I tried a "jump-upgrade" from 2.3.1 --> 3.0 --> 4.3.6 and everything looks OK.
After this upgrade everything works better and the issues we faced are gone.