New node's disk becomes full and the node cannot boot when scaling a Scylla cluster from 1 rack to 2 racks.


Cong Guo

<cong.guo@smartnews.com>
Aug 9, 2022, 2:49:58 AM
to ScyllaDB users
Hi,

Originally, I ran a Scylla cluster with 10 nodes in a single rack (A) on Kubernetes via scylla-operator, and then planned to scale from 1 rack to 2, so I added a new rack (B) and applied the change. Unfortunately, the replica placement policy tries to allocate at least one replica per rack, and nodes in rack B have to be started one by one, so the first new node's disk fills up quickly and it cannot join the cluster.

Steps:
1. Modify the CR from rack A to racks A & B.
2. Apply the change.
3. A new node in rack B is spawned and starts receiving streamed data from the nodes in rack A.
4. The new node in rack B crashes from lack of disk space; it seems every node in rack A tries to stream a replica to the single new node in rack B.

Do you know the right way to scale out to more racks without hitting this issue?

```
spec:
  agentRepository: docker.io/scylladb/scylla-manager-agent
  agentVersion: 2.5.0
  alternator:
    port: 8000
    writeIsolation: only_rmw_uses_lwt
  automaticOrphanedNodeCleanup: true
  cpuset: true
  datacenter:
    name: us-west-1
    racks:
    - agentResources:
        limits:
          cpu: 50m
          memory: 100M
        requests:
          cpu: 50m
          memory: 100M
      members: 1
      name: a
      .......
    - agentResources:
        limits:
          cpu: 50m
          memory: 100M
        requests:
          cpu: 50m
          memory: 100M
      members: 1
      name: b
      ...
```
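The failure in step 4 follows directly from rack-aware replica placement. A toy simulation (a sketch in the spirit of NetworkTopologyStrategy, not Scylla's actual algorithm; node names, token space, and RF=3 are made up for illustration) shows why the single rack-B node ends up in every replica set:

```python
import hashlib
from typing import Dict, List, Tuple

TOKEN_SPACE = 10**6  # made-up token space for the toy model

def replicas_for(token: int,
                 ring: List[Tuple[int, str]],
                 racks: Dict[str, str],
                 rf: int) -> List[str]:
    """Walk the ring clockwise from `token`, first taking one node per
    distinct rack, then filling remaining slots with any unused nodes."""
    start = next((i for i, (t, _) in enumerate(ring) if t >= token), 0)
    chosen: List[str] = []
    seen_racks = set()
    for i in range(len(ring)):                 # pass 1: distinct racks
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen and racks[node] not in seen_racks:
            chosen.append(node)
            seen_racks.add(racks[node])
            if len(chosen) == rf:
                return chosen
    for i in range(len(ring)):                 # pass 2: fill remainder
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
            if len(chosen) == rf:
                break
    return chosen

# 10 nodes in rack "a", a single new node "b0" in rack "b".
racks = {f"a{i}": "a" for i in range(10)}
racks["b0"] = "b"
ring = sorted(
    (int(hashlib.md5(n.encode()).hexdigest(), 16) % TOKEN_SPACE, n)
    for n in racks
)

samples = range(0, TOKEN_SPACE, TOKEN_SPACE // 100)
hits = sum("b0" in replicas_for(t, ring, racks, rf=3) for t in samples)
print(f"b0 appears in {hits}/{len(samples)} sampled replica sets")
# With only 2 racks and RF=3, pass 1 always picks one node from each
# rack, so b0 joins every replica set: a full copy of the data.
```

Since rack B has only one node, the rack-diversity pass can never skip it, which matches the observation that all of rack A streams a replica to it.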

Thanks, 
Cong

Avi Kivity

<avi@scylladb.com>
Aug 9, 2022, 3:33:30 AM
to scylladb-users@googlegroups.com, Asias He

I don't think there is a good way now.


Perhaps we should modify NetworkTopologyStrategy with an active_racks parameter (or maybe inactive_racks) that excludes a rack from being used. But then we'll need to stream data when it changes, because the replication topology will be affected.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/e5e98925-2f3a-49eb-ba61-27e0618d391bn%40googlegroups.com.

Cong Guo

<cong.guo@smartnews.com>
Aug 9, 2022, 4:13:44 AM
to ScyllaDB users
I see. As I understand it, new nodes claim token ranges randomly and thus evenly; if Scylla prevented the new nodes in the new rack from being used until they were all ready to receive streamed data from the old nodes, there would be some intermediate states that are hard to handle. Thanks a lot.

Michael Wohlwend

<micha-1@fantasymail.de>
Aug 9, 2022, 5:42:48 AM
to ScyllaDB users, 'Cong Guo' via ScyllaDB users
I have tried this too, changing the rack layout of a cluster.
It just doesn't work: the single node in the new rack gets a whole replica,
and even if the disks are large enough, afterwards the node cannot handle its
replica alone.

So either configure more than one rack from the beginning if there is
even only a small chance you will be able to get more racks, or (and this requires
more hardware) create a second datacenter with the correct racks and enough
nodes per rack to handle the data. After integrating the new datacenter you can
remove the old one and add its nodes to the new datacenter.

Maybe one should be able to completely define the new rack at once, before streaming,
and then start the streaming to the new rack. This would distribute the data across
more nodes from the beginning, and there wouldn't be a single node holding a whole
replica.

Michael




horschi

<horschi@gmail.com>
Aug 9, 2022, 6:06:47 AM
to scylladb-users@googlegroups.com
Cassandra had an option called auto_bootstrap, and I think it exists in Scylla too. Perhaps the nodes in the new rack could be started with auto_bootstrap=false. Once all nodes in the rack are started, the data streaming can be triggered using nodetool rebuild.

Problem (beyond the fact that I have not tested it): this violates consistency, as the new rack is missing data. Ideally, it would also be possible to disable reads for those nodes.
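The idea above would look roughly like this as an operational sketch (untested, as the author says, and note Avi's reply below that Scylla removed auto_bootstrap; paths and options follow Cassandra conventions):

```shell
# UNTESTED sketch of the auto_bootstrap idea; Scylla removed this
# option per the reply below, so this is Cassandra-style only.

# 1. On each new rack-B node, disable bootstrap streaming before the
#    first start (config file path is the usual default):
echo "auto_bootstrap: false" >> /etc/scylla/scylla.yaml

# 2. Start every rack-B node and wait until all of them show UN in:
nodetool status

# 3. Then trigger streaming on each rack-B node; rebuild pulls data
#    from existing replicas instead of bootstrap-time streaming:
nodetool rebuild
```

The caveat in the message stands: between steps 2 and 3 the new rack holds no data, so reads routed to it would violate consistency.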



Cong Guo

<cong.guo@smartnews.com>
Aug 9, 2022, 6:13:57 AM
to ScyllaDB users
Thank you Michael. Yes, creating a new cluster and migrating the data is a solution for now. I will try auto_bootstrap=false as well.

Avi Kivity

<avi@scylladb.com>
Aug 9, 2022, 6:16:17 AM
to scylladb-users@googlegroups.com, horschi, Asias He

We killed auto_bootstrap since it was a synonym for "I want to lose data".


I think we'll have to defer this until we have Raft for topology, since it involves streaming from multiple nodes. The procedure might look like this:


1. Disable the new rack.

2. Start adding nodes; they don't receive any data since they're on the wrong rack.

3. Enable the new rack.

4. ScyllaDB starts streaming to the new rack, but serves reads only from the old racks (like regular bootstrap, but s/node/rack/).

5. Streaming completes and the new nodes start serving reads.

Cong Guo

<cong.guo@smartnews.com>
Aug 9, 2022, 6:23:39 AM
to ScyllaDB users
Understood; can't wait to try the new feature. Thank you Avi and Christian.

Jean Carlo

<jean.jeancarl48@gmail.com>
Aug 9, 2022, 4:11:08 PM
to scylladb-users@googlegroups.com
Hello,


There is another solution: create a new DC, but do not replicate data to it at first. The new DC should have your target racks and nodes configured, yet it stays empty at bootstrap because no keyspace replicates onto it.

Once all nodes in the new DC are up and running, you can run a rebuild on every new node, pulling data from the old DC. This streams the data, and you shouldn't have the disk-space problem.

My 2 cents


Jean Carlo

<jean.jeancarl48@gmail.com>
Aug 9, 2022, 4:15:00 PM
to scylladb-users@googlegroups.com
I forgot to say: once you are done with the rebuild, decommission the old DC.
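End to end, this new-DC procedure can be sketched as follows (untested; the keyspace name `ks`, old DC `us-west-1` from the CR above, and new DC `us-west-2` are placeholders, and replication factors must match your own setup):

```shell
# UNTESTED sketch of migrating to a new DC with the desired rack layout.

# 1. Bring up the new DC with the target racks. Keep it empty by NOT
#    listing it in any keyspace's replication options yet.

# 2. Once all new-DC nodes show UN in `nodetool status`, start
#    replicating to the new DC (placeholder names and RFs):
cqlsh -e "ALTER KEYSPACE ks WITH replication = {
  'class': 'NetworkTopologyStrategy', 'us-west-1': 3, 'us-west-2': 3}"

# 3. On EVERY node in the new DC, stream existing data from the old DC:
nodetool rebuild -- us-west-1

# 4. When all rebuilds finish, stop replicating to the old DC:
cqlsh -e "ALTER KEYSPACE ks WITH replication = {
  'class': 'NetworkTopologyStrategy', 'us-west-2': 3}"

# 5. Decommission the old-DC nodes one at a time:
nodetool decommission
```

Because the whole new DC exists before any streaming starts, the data is distributed across all new nodes at once, avoiding the single-node disk blowup from the original report.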

Cong Guo

<cong.guo@smartnews.com>
Aug 9, 2022, 10:03:09 PM
to ScyllaDB users
That's great, I will try it. Thank you Jean!