[QUEUED scylladb next] Merge 'Cluster features on raft: new procedure for joining group 0' from Piotr Dulikowski

58 views

Skip to first unread message

Commit Bot

<bot@cloudius-systems.com>

unread,

Sep 27, 2023, 3:58:04 AM9/27/23

to scylladb-dev@googlegroups.com, Kamil Braun

From: Kamil Braun <kbr...@scylladb.com>
Committer: Kamil Braun <kbr...@scylladb.com>
Branch: next

Merge 'Cluster features on raft: new procedure for joining group 0' from Piotr Dulikowski

This PR implements a new procedure for joining nodes to group 0, based
on the description in the "Cluster features on Raft (v2)" document. This
is a continuation of the previous PRs related to cluster features on
raft (https://github.com/scylladb/scylladb/pull/14722,
https://github.com/scylladb/scylladb/pull/14232), and the last piece
necessary to replace cluster feature checks in gossip.

Current implementation relies on gossip shadow round to fetch the set of
enabled features, determine whether the node supports all of the enabled
features, and joins only if it is safe. As we are moving management of
cluster features to group 0, we encounter a problem: the contents of
group 0 itself may depend on features, hence it is not safe to join it
unless we perform the feature check which depends on information in
group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is
modified, and verification of features and other parameters is offloaded
to an existing node in group 0. Instead of directly asking the discovery
leader to unconditionally add the node to the configuration with
`GROUP0_MODIFY_CONFIG`, two different RPCs are added:
`JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as
follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It
sends a bunch of information describing the node, including supported
cluster features. The discovery leader verifies some of the parameters
and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be
joined (i.e. the node in `none` state), verifies its properties -
including cluster features - and then informs the new node about the
decision via `JOIN_NODE_RESPONSE`: whether the new node will be joined
or not.
- If the new node is supposed to be joined, the topology
coordinator will add it to group 0 and the new node will start
its group 0 server.
- Otherwise, the request is removed from `system.topology` and
the new node is shut down.

The procedure is designed to be idempotent. A node can safely retry
joining the cluster even if it shut down while in the middle of the
previous attempt and restarted with a different configuration - a random
UUID called "join cookie" is used to differentiate the attempts.

More details about the RPC are described in `topology-over-raft.md`.

Fixes: #15152

Closes scylladb/scylladb#15196

* github.com:scylladb/scylladb:
tests: mark test_blocked_bootstrap as skipped
storage_service: do not check features in shadow round
storage_service: remove raft_{boostrap,replace}
topology_coordinator: relax the check in enable_features
raft_group0: insert replaced node info before server setup
storage_service: use join node rpc to join the cluster
topology_coordinator: handle joining nodes
topology_state_machine: add join_group0 state
storage_service: add join node RPC handlers
raft: expose current_leader in raft::server
storage_service: extract wait_for_live_nodes_timeout constant
raft_group0: abstract out node joining handshake
storage_service: pass raft_topology_change_enabled on rpc init
rpc: add new join handshake verbs
docs: document the new join procedure
topology_state_machine: add supported_features to replica_state
storage_service: check destination host ID in raft verbs
group_state_machine: take reference to raft address map
raft_group0: expose joined_group0

---
diff --git a/configure.py b/configure.py
--- a/configure.py
+++ b/configure.py
@@ -1318,6 +1318,7 @@ def find_ninja():
'idl/position_in_partition.idl.hh',
'idl/experimental/broadcast_tables_lang.idl.hh',
'idl/storage_service.idl.hh',
+ 'idl/join_node.idl.hh',
'idl/utils.idl.hh',
]

diff --git a/db/system_keyspace.cc b/db/system_keyspace.cc
--- a/db/system_keyspace.cc
+++ b/db/system_keyspace.cc
@@ -2502,6 +2502,11 @@ future<service::topology> system_keyspace::load_topology_state() {
ignored_ids = decode_nodes_ids(deserialize_set_column(*topology(), row, "ignore_nodes"));
}

+ std::set<sstring> supported_features;
+ if (row.has("supported_features")) {
+ supported_features = decode_features(deserialize_set_column(*topology(), row, "supported_features"));
+ }
+
if (row.has("topology_request")) {
auto req = service::topology_request_from_string(row.get_as<sstring>("topology_request"));
ret.requests.emplace(host_id, req);
@@ -2572,7 +2577,7 @@ future<service::topology> system_keyspace::load_topology_state() {
if (map) {
map->emplace(host_id, service::replica_state{
nstate, std::move(datacenter), std::move(rack), std::move(release_version),
- ring_slice, shard_count, ignore_msb});
+ ring_slice, shard_count, ignore_msb, std::move(supported_features)});
}
}

@@ -2644,9 +2649,11 @@ future<service::topology> system_keyspace::load_topology_state() {
some_row.get_as<sstring>("global_topology_request"));
ret.global_request.emplace(req);
}
- }

- ret.features = decode_topology_features_state(std::move(rs));
+ if (some_row.has("enabled_features")) {
+ ret.enabled_features = decode_features(deserialize_set_column(*topology(), some_row, "enabled_features"));
+ }
+ }

co_return ret;
}
diff --git a/docs/dev/topology-over-raft.md b/docs/dev/topology-over-raft.md
--- a/docs/dev/topology-over-raft.md
+++ b/docs/dev/topology-over-raft.md
@@ -224,3 +224,65 @@ There are also a few static columns for cluster-global properties:
- `unpublished_cdc_generations` - the IDs of the committed yet unpublished CDC generations
- `global_topology_request` - if set, contains one of the supported global topology requests
- `new_cdc_generation_data_uuid` - used in `commit_cdc_generation` state, the time UUID of the generation to be committed
+
+# Join procedure
+
+In topology on raft mode, new nodes need to go through a new handshake procedure
+before they can join the cluster, including joining group 0. The handshake
+happens during raft discovery and replaced the old `GROUP0_MODIFY_CONFIG`.
+
+Two RPC verbs are introduced:
+
+- `JOIN_NODE_REQUEST`
+ Sent by the node that wishes to join the cluster. It contains some basic
+ information about the node that will have to be verified. Some of them
+ can be verified by the node that receives the RPC - which can result in
+ an immediate failure of the RPC - but others need to be verified
+ by the topology coordinator.
+
+- `JOIN_NODE_RESPONSE`
+ Sent by the topology coordinator back to the joining node. It contains
+ coordinator's decision about the node - whether the request was accepted
+ or rejected.
+
+The procedure is not retryable. If the joining node crashes before finishing it,
+it might get rejected after restart. In order to retry adding the node, the data
+directory must be deleted first.
+
+*The procedure*
+
+If the node didn't join group 0, it sends `JOIN_NODE_REQUEST` to any existing
+node in the cluster. The receiving node can either:
+- Accept the request and tell the new node to wait for `JOIN_NODE_RESPONSE`.
+- Reject the request. This can happen if:
+ - The request does not satisfy some validity checks done by the receiving node
+ (e.g. cluster name does not match),
+ - There already is an existing request to join this node (which can happen if
+ the node crashed during the previous attempt to join),
+ - The node is already a part of the cluster,
+ - The node was removed from the cluster.
+
+The new node should not process `JOIN_NODE_RESPONSE` RPC until it is explicitly
+ordered by node that handles `JOIN_NODE_REQUEST`. This is necessary so that
+it doesn't accidentally process a join response from a previous attempt, if
+there was any. Similarly, the group 0 server must not be started on the new node
+until the `JOIN_NODE_REQUEST` RPC succeeds, otherwise, if it crashes during
+the prodecude and disables some features before restart it might receive some
+commands it is not prepared to understand.
+
+The topology coordinator will read the request from `system.topology`. It will
+perform additional verification that couldn't be done by the recipient
+of `JOIN_NODE_REQUEST` (e.g. check whether the node supports all cluster
+features). Then:
+- If verification was successful, the node will be transitioned to `join_group0`
+ state, then added to group 0 and `JOIN_NODE_RESPONSE` will be sent by
+ the topology coordinator. Afterwards, the usual bootstrap/replace procedure
+ continues.
+- If verification was unsuccessful, topology coordinator will move the node
+ to `left` state and then send a `JOIN_NODE_RESPONSE` to the joining node,
+ rejecting it.
+
+In case of failures like timeouts, connection issues, etc., the topology
+coordinator keeps retrying. Eventually, it can give up and just move the new
+node to the `left` state, either due to a timeout or an operator intervention,
+but this is not implemented yet.
diff --git a/idl/join_node.idl.hh b/idl/join_node.idl.hh
--- a/idl/join_node.idl.hh
+++ b/idl/join_node.idl.hh
@@ -0,0 +1,56 @@
+/*
+ * Copyright 2023-present ScyllaDB
+ */
+
+/*
+ * SPDX-License-Identifier: AGPL-3.0-or-later
+ */
+
+namespace service {
+
+struct join_node_request_params {
+ raft::server_id host_id;
+ std::optional<raft::server_id> replaced_id;
+ std::vector<sstring> ignore_nodes;
+ sstring cluster_name;
+ sstring snitch_name;
+ sstring datacenter;
+ sstring rack;
+ sstring release_version;
+ uint32_t num_tokens;
+ uint32_t shard_count;
+ uint32_t ignore_msb;
+ std::vector<sstring> supported_features;
+};
+
+struct join_node_request_result {
+ struct ok {};
+ struct rejected {
+ sstring reason;
+ };
+
+ std::variant<
+ service::join_node_request_result::ok,
+ service::join_node_request_result::rejected
+ > result;
+};
+
+struct join_node_response_params {
+ struct accepted {};
+
+ struct rejected {
+ sstring reason;
+ };
+
+ std::variant<
+ service::join_node_response_params::accepted,
+ service::join_node_response_params::rejected
+ > response;
+};
+
+struct join_node_response_result {};
+
+verb join_node_request (raft::server_id dst_id, service::join_node_request_params) -> service::join_node_request_result;
+verb join_node_response (raft::server_id dst_id, service::join_node_response_params) -> service::join_node_response_result;
+
+}
diff --git a/idl/storage_service.idl.hh b/idl/storage_service.idl.hh
--- a/idl/storage_service.idl.hh
+++ b/idl/storage_service.idl.hh
@@ -49,8 +49,8 @@ struct raft_topology_snapshot {

struct raft_topology_pull_params {};

-verb raft_topology_cmd (raft::term_t term, uint64_t cmd_index, service::raft_topology_cmd) -> service::raft_topology_cmd_result;
-verb [[cancellable]] raft_pull_topology_snapshot (service::raft_topology_pull_params) -> service::raft_topology_snapshot;
-verb [[cancellable]] tablet_stream_data (locator::global_tablet_id);
-verb [[cancellable]] tablet_cleanup (locator::global_tablet_id);
+verb raft_topology_cmd (raft::server_id dst_id, raft::term_t term, uint64_t cmd_index, service::raft_topology_cmd) -> service::raft_topology_cmd_result;
+verb [[cancellable]] raft_pull_topology_snapshot (raft::server_id dst_id, service::raft_topology_pull_params) -> service::raft_topology_snapshot;
+verb [[cancellable]] tablet_stream_data (raft::server_id dst_id, locator::global_tablet_id);
+verb [[cancellable]] tablet_cleanup (raft::server_id dst_id, locator::global_tablet_id);
}
diff --git a/main.cc b/main.cc
--- a/main.cc
+++ b/main.cc
@@ -1587,7 +1587,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
});

supervisor::notify("starting storage service", true);
- ss.local().init_messaging_service_part(sys_dist_ks).get();
+ ss.local().init_messaging_service_part(sys_dist_ks, raft_topology_change_enabled).get();
auto stop_ss_msg = defer_verbose_shutdown("storage service messaging", [&ss] {
ss.local().uninit_messaging_service_part().get();
});
diff --git a/message/messaging_service.cc b/message/messaging_service.cc
--- a/message/messaging_service.cc
+++ b/message/messaging_service.cc
@@ -48,6 +48,7 @@
#include "full_position.hh"
#include "db/per_partition_rate_limit_info.hh"
#include "service/topology_state_machine.hh"
+#include "service/raft/join_node.hh"
#include "idl/consistency_level.dist.hh"
#include "idl/tracing.dist.hh"
#include "idl/result.dist.hh"
@@ -77,6 +78,7 @@
#include "idl/per_partition_rate_limit_info.dist.hh"
#include "idl/storage_proxy.dist.hh"
#include "idl/storage_service.dist.hh"
+#include "idl/join_node.dist.hh"
#include "message/rpc_protocol_impl.hh"
#include "idl/consistency_level.dist.impl.hh"
#include "idl/tracing.dist.impl.hh"
@@ -118,6 +120,7 @@
#include "idl/forward_request.dist.hh"
#include "idl/forward_request.dist.impl.hh"
#include "idl/storage_service.dist.impl.hh"
+#include "idl/join_node.dist.impl.hh"

namespace netw {

@@ -566,6 +569,8 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::GROUP0_MODIFY_CONFIG:
case messaging_verb::GET_GROUP0_UPGRADE_STATE:
case messaging_verb::RAFT_TOPOLOGY_CMD:
+ case messaging_verb::JOIN_NODE_REQUEST:
+ case messaging_verb::JOIN_NODE_RESPONSE:
// See comment above `TOPOLOGY_INDEPENDENT_IDX`.
// DO NOT put any 'hot' (e.g. data path) verbs in this group,
// only verbs which are 'rare' and 'cheap'.
diff --git a/message/messaging_service.hh b/message/messaging_service.hh
--- a/message/messaging_service.hh
+++ b/message/messaging_service.hh
@@ -184,7 +184,9 @@ enum class messaging_verb : int32_t {
RAFT_PULL_TOPOLOGY_SNAPSHOT = 65,
TABLET_STREAM_DATA = 66,
TABLET_CLEANUP = 67,
- LAST = 68,
+ JOIN_NODE_REQUEST = 68,
+ JOIN_NODE_RESPONSE = 69,
+ LAST = 70,
};

} // namespace netw
diff --git a/raft/server.cc b/raft/server.cc
--- a/raft/server.cc
+++ b/raft/server.cc
@@ -91,6 +91,7 @@ class server_impl : public rpc_server, public server {
std::pair<index_t, term_t> log_last_idx_term() override;
void elapse_election() override;
bool is_leader() override;
+ raft::server_id current_leader() const override;
void tick() override;
raft::server_id id() const override;
void set_applier_queue_max_size(size_t queue_max_size) override;
@@ -1645,6 +1646,10 @@ bool server_impl::is_leader() {
return _fsm->is_leader();
}

+raft::server_id server_impl::current_leader() const {
+ return _fsm->current_leader();
+}
+
void server_impl::elapse_election() {
while (_fsm->election_elapsed() < ELECTION_TIMEOUT) {
_fsm->tick();
diff --git a/raft/server.hh b/raft/server.hh
--- a/raft/server.hh
+++ b/raft/server.hh
@@ -228,6 +228,11 @@ public:
// The information is only relevant for the current_term() only
virtual bool is_leader() = 0;

+ // Returns ID of the server that is thought to be the current leader.
+ // Returns an empty ID if there is an ongoing election.
+ // The information is only relevant for the current_term() only.
+ virtual raft::server_id current_leader() const = 0;
+
// The function should be called periodically to advance logical clock.
virtual void tick() = 0;

diff --git a/service/raft/group0_state_machine.cc b/service/raft/group0_state_machine.cc
--- a/service/raft/group0_state_machine.cc
+++ b/service/raft/group0_state_machine.cc
@@ -34,6 +34,7 @@
#include "db/system_keyspace.hh"
#include "service/storage_proxy.hh"
#include "service/raft/raft_group0_client.hh"
+#include "service/raft/raft_address_map.hh"
#include "partition_slice_builder.hh"
#include "timestamp.hh"
#include "utils/overloaded_functor.hh"
@@ -199,16 +200,25 @@ future<> group0_state_machine::load_snapshot(raft::snapshot_id id) {
_ss._topology_state_machine.event.broadcast();
}

-future<> group0_state_machine::transfer_snapshot(gms::inet_address from, raft::snapshot_descriptor snp) {
+future<> group0_state_machine::transfer_snapshot(raft::server_id from_id, raft::snapshot_descriptor snp) {
+ // FIXME: The translation will ultimately be done by messaging_service
+ auto from_ip = _address_map.find(from_id);
+ if (!from_ip.has_value()) {
+ // This is virtually impossible. We've just received the
+ // snapshot from the sender and must have updated our
+ // address map with its IP address.
+ const auto msg = format("Failed to apply snapshot from {}: ip address of the sender is not found", from_ip);
+ co_await coroutine::return_exception(raft::transport_error(msg));
+ }
try {
// Note that this may bring newer state than the group0 state machine raft's
// log, so some raft entries may be double applied, but since the state
// machine is idempotent it is not a problem.

auto holder = _gate.hold();

- slogger.trace("transfer snapshot from {} index {} snp id {}", from, snp.idx, snp.id);
- netw::messaging_service::msg_addr addr{from, 0};
+ slogger.trace("transfer snapshot from {} index {} snp id {}", from_ip, snp.idx, snp.id);
+ netw::messaging_service::msg_addr addr{*from_ip, 0};
auto& as = _abort_source;

// (Ab)use MIGRATION_REQUEST to also transfer group0 history table mutation besides schema tables mutations.
@@ -219,7 +229,7 @@ future<> group0_state_machine::transfer_snapshot(gms::inet_address from, raft::s
on_internal_error(slogger, "Expected MIGRATION_REQUEST to return canonical mutations");
}

- auto topology_snp = co_await ser::storage_service_rpc_verbs::send_raft_pull_topology_snapshot(&_mm._messaging, addr, as, service::raft_topology_pull_params{});
+ auto topology_snp = co_await ser::storage_service_rpc_verbs::send_raft_pull_topology_snapshot(&_mm._messaging, addr, as, from_id, service::raft_topology_pull_params{});

auto history_mut = extract_history_mutation(*cm, _sp.data_dictionary());

diff --git a/service/raft/group0_state_machine.hh b/service/raft/group0_state_machine.hh
--- a/service/raft/group0_state_machine.hh
+++ b/service/raft/group0_state_machine.hh
@@ -24,6 +24,9 @@ class storage_proxy;
class storage_service;
struct group0_state_machine_merger;

+template <typename C> class raft_address_map_t;
+using raft_address_map = raft_address_map_t<seastar::lowres_clock>;
+
struct schema_change {
// Mutations of schema tables (such as `system_schema.keyspaces`, `system_schema.tables` etc.)
// e.g. computed from a DDL statement (keyspace/table/type create/drop/alter etc.)
@@ -84,17 +87,19 @@ class group0_state_machine : public raft_state_machine {
migration_manager& _mm;
storage_proxy& _sp;
storage_service& _ss;
+ raft_address_map& _address_map;
seastar::gate _gate;
abort_source _abort_source;

future<> merge_and_apply(group0_state_machine_merger& merger);
public:
- group0_state_machine(raft_group0_client& client, migration_manager& mm, storage_proxy& sp, storage_service& ss) : _client(client), _mm(mm), _sp(sp), _ss(ss) {}
+ group0_state_machine(raft_group0_client& client, migration_manager& mm, storage_proxy& sp, storage_service& ss, raft_address_map& address_map)
+ : _client(client), _mm(mm), _sp(sp), _ss(ss), _address_map(address_map) {}
future<> apply(std::vector<raft::command_cref> command) override;
future<raft::snapshot_id> take_snapshot() override;
void drop_snapshot(raft::snapshot_id id) override;
future<> load_snapshot(raft::snapshot_id id) override;
- future<> transfer_snapshot(gms::inet_address from, raft::snapshot_descriptor snp) override;
+ future<> transfer_snapshot(raft::server_id from_id, raft::snapshot_descriptor snp) override;
future<> abort() override;
};

diff --git a/service/raft/join_node.hh b/service/raft/join_node.hh
--- a/service/raft/join_node.hh
+++ b/service/raft/join_node.hh
@@ -0,0 +1,62 @@
+/*
+ * Copyright 2023-present ScyllaDB
+ */
+
+/*
+ * SPDX-License-Identifier: AGPL-3.0-or-later
+ */
+
+#pragma once
+
+#include <set>
+#include <seastar/core/sstring.hh>
+#include "raft/raft.hh"
+
+namespace service {
+
+struct join_node_request_params {
+ raft::server_id host_id;
+ std::optional<raft::server_id> replaced_id;
+ std::vector<sstring> ignore_nodes;
+ sstring cluster_name;
+ sstring snitch_name;
+ sstring datacenter;
+ sstring rack;
+ sstring release_version;
+ uint32_t num_tokens;
+ uint32_t shard_count;
+ uint32_t ignore_msb;
+ std::vector<sstring> supported_features;
+};
+
+struct join_node_request_result {
+ // Request was successfully placed and will be processed
+ // by the topology coordinator.
+ struct ok {};
+
+ // The request was immediately rejected, most likely due to some
+ // parameters being incorrect or incompatible with the cluster.
+ struct rejected {
+ sstring reason;
+ };
+
+ std::variant<ok, rejected> result;
+};
+
+struct join_node_response_params {
+ // The topology coordinator accepts and wants to add the joining node
+ // to group 0 and to the cluster in general.
+ struct accepted {};
+
+ // The topology coordinator rejects the node, most likely due to some
+ // parameters being incorrect or incompatible with the cluster.
+ struct rejected {
+ sstring reason;
+ };
+
+ std::variant<accepted, rejected> response;
+};
+
+struct join_node_response_result {};
+
+}
diff --git a/service/raft/raft_group0.cc b/service/raft/raft_group0.cc
--- a/service/raft/raft_group0.cc
+++ b/service/raft/raft_group0.cc
@@ -7,6 +7,7 @@
*/
#include <source_location>

+#include "service/raft/group0_fwd.hh"
#include "service/raft/raft_group0.hh"
#include "service/raft/raft_rpc.hh"
#include "service/raft/raft_sys_table_storage.hh"
@@ -197,7 +198,7 @@ const raft::server_id& raft_group0::load_my_id() {

raft_server_for_group raft_group0::create_server_for_group0(raft::group_id gid, raft::server_id my_id, service::storage_service& ss, cql3::query_processor& qp,
service::migration_manager& mm) {
- auto state_machine = std::make_unique<group0_state_machine>(_client, mm, qp.proxy(), ss);
+ auto state_machine = std::make_unique<group0_state_machine>(_client, mm, qp.proxy(), ss, _raft_gr.address_map());
auto rpc = std::make_unique<group0_rpc>(_raft_gr.direct_fd(), *state_machine, _ms.local(), _raft_gr.address_map(), gid, my_id);
// Keep a reference to a specific RPC class.
auto& rpc_ref = *rpc;
@@ -414,7 +415,7 @@ future<> raft_group0::leadership_monitor_fiber() {
}
}

-future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_voter, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm,
+future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, shared_ptr<service::group0_handshaker> handshaker, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm,
db::system_keyspace& sys_ks) {
assert(this_shard_id() == 0);
assert(!joined_group0());
@@ -460,6 +461,8 @@ future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_
// In a fresh cluster this will trigger an empty snapshot transfer which is redundant but correct.
// See #14066.
nontrivial_snapshot = true;
+ } else {
+ co_await handshaker->pre_server_start(g0_info);
}
// Bootstrap the initial configuration
co_await raft_sys_table_storage(qp, group0_id, my_id)
@@ -481,15 +484,9 @@ future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_
break;
}

- auto timeout = db::timeout_clock::now() + std::chrono::milliseconds{1000};
- netw::msg_addr peer(g0_info.ip_addr);
- try {
- // TODO: aborts?
- co_await ser::group0_rpc_verbs::send_group0_modify_config(&_ms.local(), peer, timeout, group0_id, {{my_addr, as_voter}}, {});
+ // TODO: aborts?
+ if (co_await handshaker->post_server_start(g0_info)) {
break;
- } catch (std::runtime_error& e) {
- // Retry
- group0_log.warn("failed to modify config at peer {}: {}. Retrying.", g0_info.id, e);
}

// Try again after a pause
@@ -508,6 +505,41 @@ future<> raft_group0::join_group0(std::vector<gms::inet_address> seeds, bool as_
group0_log.info("server {} joined group 0 with group id {}", my_id, group0_id);
}

+shared_ptr<service::group0_handshaker> raft_group0::make_legacy_handshaker(bool can_vote) {
+ struct legacy_handshaker : public group0_handshaker {
+ service::raft_group0& _group0;
+ netw::messaging_service& _ms;
+ bool _can_vote;
+
+ legacy_handshaker(service::raft_group0& group0, netw::messaging_service& ms, bool can_vote)
+ : _group0(group0)
+ , _ms(ms)
+ , _can_vote(can_vote)
+ {}
+
+ future<> pre_server_start(const group0_info& info) override {
+ // Nothing to do in this step
+ co_return;
+ }
+
+ future<bool> post_server_start(const group0_info& g0_info) override {
+ netw::msg_addr peer(g0_info.ip_addr);
+ auto timeout = db::timeout_clock::now() + std::chrono::milliseconds{1000};
+ auto my_id = _group0.load_my_id();
+ raft::server_address my_addr{my_id, {}};
+ try {
+ co_await ser::group0_rpc_verbs::send_group0_modify_config(&_ms, peer, timeout, g0_info.group0_id, {{my_addr, _can_vote}}, {});
+ co_return true;
+ } catch (std::runtime_error& e) {
+ group0_log.warn("failed to modify config at peer {}: {}. Retrying.", g0_info.id, e);
+ co_return false;
+ }
+ };
+ };
+
+ return make_shared<legacy_handshaker>(*this, _ms.local(), can_vote);
+}
+
struct group0_members {
const raft::server& _group0_server;
const raft_address_map& _address_map;
@@ -623,7 +655,7 @@ future<> raft_group0::setup_group0_if_exist(db::system_keyspace& sys_ks, service
}

future<> raft_group0::setup_group0(
- db::system_keyspace& sys_ks, const std::unordered_set<gms::inet_address>& initial_contact_nodes,
+ db::system_keyspace& sys_ks, const std::unordered_set<gms::inet_address>& initial_contact_nodes, shared_ptr<group0_handshaker> handshaker,
std::optional<replace_info> replace_info, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm) {
if (!co_await use_raft()) {
co_return;
@@ -634,24 +666,6 @@ future<> raft_group0::setup_group0(
co_return;
}

- std::vector<gms::inet_address> seeds;
- for (auto& addr: initial_contact_nodes) {
- if (addr != _gossiper.get_broadcast_address()) {
- seeds.push_back(addr);
- }
- }
-
- group0_log.info("setup_group0: joining group 0...");
- co_await join_group0(std::move(seeds), false /* non-voter */, ss, qp, mm, sys_ks);
- group0_log.info("setup_group0: successfully joined group 0.");
-
- // Start group 0 leadership monitor fiber.
- _leadership_monitor = leadership_monitor_fiber();
-
- utils::get_local_injector().inject("stop_after_joining_group0", [&] {
- throw std::runtime_error{"injection: stop_after_joining_group0"};
- });
-
if (replace_info) {
// Insert the replaced node's (Raft ID, IP address) pair into `raft_address_map`.
// In general, the mapping won't be obtained through the regular gossiping route:
@@ -669,6 +683,24 @@ future<> raft_group0::setup_group0(
_raft_gr.address_map().opt_add_entry(replace_info->raft_id, replace_info->ip_addr);
}

+ std::vector<gms::inet_address> seeds;
+ for (auto& addr: initial_contact_nodes) {
+ if (addr != _gossiper.get_broadcast_address()) {
+ seeds.push_back(addr);
+ }
+ }
+
+ group0_log.info("setup_group0: joining group 0...");
+ co_await join_group0(std::move(seeds), std::move(handshaker), ss, qp, mm, sys_ks);
+ group0_log.info("setup_group0: successfully joined group 0.");
+
+ // Start group 0 leadership monitor fiber.
+ _leadership_monitor = leadership_monitor_fiber();
+
+ utils::get_local_injector().inject("stop_after_joining_group0", [&] {
+ throw std::runtime_error{"injection: stop_after_joining_group0"};
+ });
+
// Enter `synchronize` upgrade state in case the cluster we're joining has recently enabled Raft
// and is currently in the middle of `upgrade_to_group0()`. For that procedure to finish
// every member of group 0 (now including us) needs to enter `synchronize` state.
@@ -1604,7 +1636,8 @@ future<> raft_group0::do_upgrade_to_group0(group0_upgrade_state start_state, ser

if (!joined_group0()) {
upgrade_log.info("Joining group 0...");
- co_await join_group0(co_await _sys_ks.load_peers(), true, ss, qp, mm, _sys_ks);
+ auto handshaker = make_legacy_handshaker(true); // Voter
+ co_await join_group0(co_await _sys_ks.load_peers(), std::move(handshaker), ss, qp, mm, _sys_ks);
} else {
upgrade_log.info(
"We're already a member of group 0."
diff --git a/service/raft/raft_group0.hh b/service/raft/raft_group0.hh
--- a/service/raft/raft_group0.hh
+++ b/service/raft/raft_group0.hh
@@ -64,6 +64,29 @@ private:
persistent_discovery(discovery_peer my_addr, const peer_list&, cql3::query_processor&);
};

+// Abstracts away the details of a handshake used to join an existing group 0.
+// None of its methods are called if the node creates a new group 0.
+class group0_handshaker {
+public:
+ virtual ~group0_handshaker() {}
+
+ // Called after initial discovery run is finished, but before starting
+ // the raft server. This is only called once and is not retried in case
+ // it or `post_server_start` fails.
+ virtual future<> pre_server_start(const group0_info& info) = 0;
+
+ // Performs a handshake that should result in this node being added to group 0.
+ // Called after the raft server is started.
+ //
+ // - If the function returns true, the node has been added to group 0
+ // successfully.
+ // - If the function returns false, this step has failed but should be
+ // retried.
+ // - An exception is interpreted as an irrecoverable error and causes
+ // setup_group0 to fail.
+ virtual future<bool> post_server_start(const group0_info& info) = 0;
+};
+
class raft_group0 {
seastar::gate _shutdown_gate;
seastar::abort_source& _abort_source;
@@ -139,7 +162,7 @@ public:
// Cannot be called twice.
//
// Also make sure to call `finish_setup_after_join` after the node has joined the cluster and entered NORMAL state.
- future<> setup_group0(db::system_keyspace&, const std::unordered_set<gms::inet_address>& initial_contact_nodes,
+ future<> setup_group0(db::system_keyspace&, const std::unordered_set<gms::inet_address>& initial_contact_nodes, shared_ptr<group0_handshaker> handshaker,
std::optional<replace_info>, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm);

// Call during the startup procedure before networking is enabled.
@@ -226,6 +249,12 @@ public:
// completes and current state of group0 is not RECOVERY.
future<> remove_from_raft_config(raft::server_id id);

+ // Creates a group0 adder which uses the legacy GROUP0_MODIFY_CONFIG RPC.
+ // It is meant to be used as a fallback when a proper handshake procedure
+ // cannot be used (e.g. when completing the upgrade or group0 procedures
+ // or when joining an old cluster which does not support JOIN_NODE RPC).
+ shared_ptr<group0_handshaker> make_legacy_handshaker(bool can_vote);
+
raft_group0_client& client() {
return _client;
}
@@ -236,6 +265,9 @@ public:
return _raft_gr.group0();
}

+ // Returns true after the group 0 server has been started.
+ bool joined_group0() const;
+
const raft_address_map& address_map() const;
private:
static void init_rpc_verbs(raft_group0& shard0_this);
@@ -244,7 +276,6 @@ private:
// Stop the group 0 server and remove it from the raft_group_registry.
future<> stop_group0();

- bool joined_group0() const;
future<bool> raft_upgrade_complete() const;

// Handle peer_exchange RPC
@@ -281,6 +312,10 @@ private:
//
// Uses `seeds` as contact points to discover other servers which will be part of group 0.
//
+ // If the server becomes the discovery leader, meaning that is it elected as the server
+ // which creates group 0, then it will become a voter. Otherwise it will use the `handshaker`
+ // to ask the discovery leader to join the node to the group, and the voting status
+ // depends on how exactly this is implemented.
// `as_voter` determines whether the server joins as a voter. If `false`, it will join
// as a non-voter with one exception: if it becomes the 'discovery leader', meaning that
// it is elected as the server which creates group 0, it will become a voter.
@@ -292,14 +327,14 @@ private:
// Persists group 0 ID on disk at the end so subsequent restarts of Scylla process can detect that group 0
// has already been joined and the server initialized.
//
- // `as_voter` should be initially `false` when joining an existing cluster; calling `become_voter`
- // will cause it to become a voter. `as_voter` should be `true` when upgrading, when we create
- // a group 0 using all nodes in the cluster.
+ // When joining an existing cluster, the provided handshaker should result in the node being added
+ // as a non-voter. When the cluster is being upgraded and group 0 is being created, the node
+ // should be joined as a voter.
//
// Preconditions: Raft local feature enabled
// and we haven't initialized group 0 yet after last Scylla start (`joined_group0()` is false).
// Postcondition: `joined_group0()` is true.
- future<> join_group0(std::vector<gms::inet_address> seeds, bool as_voter, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm, db::system_keyspace& sys_ks);
+ future<> join_group0(std::vector<gms::inet_address> seeds, shared_ptr<group0_handshaker> handshaker, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm, db::system_keyspace& sys_ks);

// Start an existing Raft server for the cluster-wide group 0.
// Assumes the server was already added to the group earlier so we don't attempt to join it again.
diff --git a/service/raft/raft_rpc.cc b/service/raft/raft_rpc.cc
--- a/service/raft/raft_rpc.cc
+++ b/service/raft/raft_rpc.cc
@@ -175,15 +175,7 @@ future<raft::read_barrier_reply> raft_rpc::execute_read_barrier(raft::server_id
}

future<raft::snapshot_reply> raft_rpc::apply_snapshot(raft::server_id from, raft::install_snapshot snp) {
- auto ip_addr = _address_map.find(from);
- if (!ip_addr.has_value()) {
- // This is virtually impossible. We've just received the
- // snapshot from the sender and must have updated our
- // address map with its IP address.
- const auto msg = format("Failed to apply snapshot from {}: ip address of the sender is not found", from);
- co_return coroutine::exception(std::make_exception_ptr(raft::transport_error(msg)));
- }
- co_await _sm.transfer_snapshot(*ip_addr, snp.snp);
+ co_await _sm.transfer_snapshot(from, snp.snp);
co_return co_await raft_with_gate(_shutdown_gate, [&] {
return _client->apply_snapshot(from, std::move(snp));
});
diff --git a/service/raft/raft_state_machine.hh b/service/raft/raft_state_machine.hh
--- a/service/raft/raft_state_machine.hh
+++ b/service/raft/raft_state_machine.hh
@@ -16,7 +16,7 @@ namespace service {
// Snapshot transfer is delegated to a state machine implementation
class raft_state_machine : public raft::state_machine {
public:
- virtual future<> transfer_snapshot(gms::inet_address from, raft::snapshot_descriptor snp) = 0;
+ virtual future<> transfer_snapshot(raft::server_id from_id, raft::snapshot_descriptor snp) = 0;
};

} // end of namespace service
diff --git a/service/storage_service.cc b/service/storage_service.cc
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -11,6 +11,8 @@

#include "storage_service.hh"
#include "dht/boot_strapper.hh"
+#include <boost/range/adaptor/filtered.hpp>
+#include <boost/range/adaptor/transformed.hpp>
#include <seastar/core/distributed.hh>
#include <seastar/util/defer.hh>
#include <seastar/coroutine/as_future.hh>
@@ -83,6 +85,8 @@
#include "idl/storage_service.dist.hh"
#include "service/storage_proxy.hh"
#include "service/raft/raft_address_map.hh"
+#include "service/raft/join_node.hh"
+#include "idl/join_node.dist.hh"
#include "protocol_server.hh"
#include "types/set.hh"
#include "node_ops/node_ops_ctl.hh"
@@ -105,6 +109,8 @@ namespace service {

static logging::logger slogger("storage_service");

+static constexpr std::chrono::seconds wait_for_live_nodes_timeout{30};
+
storage_service::storage_service(abort_source& abort_source,
distributed<replica::database>& db, gms::gossiper& gossiper,
sharded<db::system_keyspace>& sys_ks,
@@ -339,13 +345,13 @@ future<> storage_service::topology_state_load() {
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state();

co_await _feature_service.container().invoke_on_all([&] (gms::feature_service& fs) {
- return fs.enable(boost::copy_range<std::set<std::string_view>>(_topology_state_machine._topology.features.enabled_features));
+ return fs.enable(boost::copy_range<std::set<std::string_view>>(_topology_state_machine._topology.enabled_features));
});

// Update the legacy `enabled_features` key in `system.scylla_local`.
// It's OK to update it after enabling features because `system.topology` now
// is the source of truth about enabled features.
- co_await _sys_ks.local().save_local_enabled_features(_topology_state_machine._topology.features.enabled_features, false);
+ co_await _sys_ks.local().save_local_enabled_features(_topology_state_machine._topology.enabled_features, false);

const auto& am = _group0->address_map();
auto id2ip = [this, &am] (raft::server_id id) -> future<gms::inet_address> {
@@ -425,6 +431,8 @@ future<> storage_service::topology_state_load() {
return read_new_t::no;
}
switch (*state) {
+ case topology::transition_state::join_group0:
+ [[fallthrough]];
case topology::transition_state::tablet_migration:
[[fallthrough]];
case topology::transition_state::commit_cdc_generation:
@@ -1036,7 +1044,7 @@ class topology_coordinator {
auto result = utils::fb_utilities::is_me(*ip) ?
co_await _raft_topology_cmd_handler(_sys_dist_ks, _term, cmd_index, cmd) :
co_await ser::storage_service_rpc_verbs::send_raft_topology_cmd(
- &_messaging, netw::msg_addr{*ip}, _term, cmd_index, cmd);
+ &_messaging, netw::msg_addr{*ip}, id, _term, cmd_index, cmd);
if (result.status == raft_topology_cmd_result::command_status::fail) {
co_await coroutine::exception(std::make_exception_ptr(
std::runtime_error(::format("failed status returned from {}/{}", id, *ip))));
@@ -1388,15 +1396,14 @@ class topology_coordinator {
}

// Preconditions:
- // - There are no nodes that are trying to join and there are no topology
- // operations in progress
+ // - There are no topology operations in progress
// - `features_to_enable` represents a set of features that are currently
// marked as supported by all normal nodes and it is not empty
future<> enable_features(group0_guard guard, std::set<sstring> features_to_enable) {
- if (!_topo_sm._topology.new_nodes.empty() || !_topo_sm._topology.transition_nodes.empty()) {
+ if (!_topo_sm._topology.transition_nodes.empty()) {
on_internal_error(slogger,
- "topology coordinator attempted to enable features even though there are"
- " joining nodes or topology operations in progress");
+ "topology coordinator attempted to enable features even though there is"
+ " a topology operations in progress");
}

if (utils::get_local_injector().enter("raft_topology_suppress_enabling_features")) {
@@ -1409,16 +1416,22 @@ class topology_coordinator {
co_return;
}

- // If we are here, then we noticed that all nodes support some features
- // that are not enabled yet. Perform a global barrier to make sure that:
+ // If we are here, then we noticed that all normal nodes support some
+ // features that are not enabled yet. Perform a global barrier to make
+ // sure that:
//
- // 1. All nodes saw (and persisted) a view of the system.topology table
- // that is equal to what the topology coordinator sees (or newer,
+ // 1. All normal nodes saw (and persisted) a view of the system.topology
+ // table that is equal to what the topology coordinator sees (or newer,
// but in that case updating the topology state will fail),
- // 2. None of the nodes is restarting at the moment and trying to
+ // 2. None of the normal nodes is restarting at the moment and trying to
// downgrade (this is done by a special check in the barrier handler).
//
- // After we get a successful confirmation from each node, we have
+ // It's sufficient to only include normal nodes because:
+ //
+ // - There are no transitioning nodes due to the precondition,
+ // - New and left nodes are not part of group 0.
+ //
+ // After we get a successful confirmation from each normal node, we have
// a guarantee that they won't attempt to revoke support for those
// features. That's because we do not allow nodes to boot without
// a feature that is supported by all nodes in the cluster, even if
@@ -1612,7 +1625,7 @@ class topology_coordinator {
slogger.info("raft topology: Initiating tablet streaming of {} to {}", gid, trinfo.pending_replica);
auto dst = trinfo.pending_replica.host;
return ser::storage_service_rpc_verbs::send_tablet_stream_data(&_messaging,
- netw::msg_addr(id2ip(dst)), _as, gid);
+ netw::msg_addr(id2ip(dst)), _as, raft::server_id(dst.uuid()), gid);
})) {
transition_to(locator::tablet_transition_stage::write_both_read_new);
}
@@ -1628,7 +1641,7 @@ class topology_coordinator {
locator::tablet_replica dst = locator::get_leaving_replica(tmap.get_tablet_info(gid.tablet), trinfo);
slogger.info("raft topology: Initiating tablet cleanup of {} on {}", gid, dst);
return ser::storage_service_rpc_verbs::send_tablet_cleanup(&_messaging,
- netw::msg_addr(id2ip(dst.host)), _as, gid);
+ netw::msg_addr(id2ip(dst.host)), _as, raft::server_id(dst.host.uuid()), gid);
})) {
transition_to(locator::tablet_transition_stage::end_migration);
}
@@ -1740,7 +1753,7 @@ class topology_coordinator {
return std::make_pair(true, std::move(guard));
}

- if (!_topo_sm._topology.features.calculate_not_yet_enabled_features().empty()) {
+ if (!_topo_sm._topology.calculate_not_yet_enabled_features().empty()) {
return std::make_pair(true, std::move(guard));
}

@@ -1765,7 +1778,7 @@ class topology_coordinator {
co_return true;
}

- if (auto feats = _topo_sm._topology.features.calculate_not_yet_enabled_features(); !feats.empty()) {
+ if (auto feats = _topo_sm._topology.calculate_not_yet_enabled_features(); !feats.empty()) {
co_await enable_features(std::move(guard), std::move(feats));
co_return true;
}
@@ -1778,6 +1791,45 @@ class topology_coordinator {
}

switch (*tstate) {
+ case topology::transition_state::join_group0: {
+ auto node = get_node_to_work_on(std::move(guard));
+ node = co_await finish_accepting_node(std::move(node));
+ switch (node.rs->state) {
+ case node_state::bootstrapping: {
+ assert(node.rs->ring);
+ auto tmptr = get_token_metadata_ptr();
+ auto [gen_uuid, guard, mutation] = co_await prepare_and_broadcast_cdc_generation_data(
+ tmptr, take_guard(std::move(node)), bootstrapping_info{node.rs->ring->tokens, *node.rs});
+
+ topology_mutation_builder builder(guard.write_timestamp());
+
+ // Write the new CDC generation data through raft.
+ builder.set_transition_state(topology::transition_state::commit_cdc_generation)
+ .set_new_cdc_generation_data_uuid(gen_uuid);
+ auto reason = ::format(
+ "bootstrap: insert CDC generation data (UUID: {})", gen_uuid);
+ co_await update_topology_state(std::move(guard), {std::move(mutation), builder.build()}, reason);
+ }
+ break;
+ case node_state::replacing: {
+ assert(node.rs->ring);
+
+ topology_mutation_builder builder(node.guard.write_timestamp());
+
+ builder.set_transition_state(topology::transition_state::tablet_draining)
+ .set_version(_topo_sm._topology.version + 1);
+ co_await update_topology_state(take_guard(std::move(node)), {builder.build()},
+ "replace: transition to tablet_draining");
+ }
+ break;
+ default:
+ on_internal_error(slogger,
+ format("raft topology: topology is in join_group0 state, but the node"
+ " being worked on ({}) is in unexpected state '{}'; should be"
+ " either 'bootstrapping' or 'replacing'", node.id, node.rs->state));
+ }
+ }
+ break;
case topology::transition_state::commit_cdc_generation: {
// make sure all nodes know about new topology and have the new CDC generation data
// (we require all nodes to be alive for topo change for now)
@@ -2028,8 +2080,47 @@ class topology_coordinator {
slogger.info("raft topology: coordinator fiber found a node to work on id={} state={}", node.id, node.rs->state);

switch (node.rs->state) {
- case node_state::normal:
case node_state::none: {
+ if (_topo_sm._topology.normal_nodes.empty()) {
+ slogger.info("raft topology: skipping join node handshake for the first node in the cluster");
+ } else {
+ auto validation_result = validate_joining_node(node);
+ if (auto* reject = std::get_if<join_node_response_params::rejected>(&validation_result)) {
+ // Transition to left
+ topology_mutation_builder builder(node.guard.write_timestamp());
+ builder.with_node(node.id)
+ .del("topology_request")
+ .set("node_state", node_state::left);
+ auto reason = ::format("bootstrap: node rejected");
+ co_await update_topology_state(std::move(node.guard), {builder.build()}, reason);
+
+ slogger.info("raft topology: rejected node moved to left state {}", node.id);
+
+ // Keep trying to send the rejection to the node, give up after some time.
+ const auto send_reject_deadline = lowres_clock::now() + std::chrono::seconds(30);
+ while (true) {
+ try {
+ co_await respond_to_joining_node(node.id, join_node_response_params{
+ .response = std::move(validation_result),
+ });
+ break;
+ } catch (const std::runtime_error& e) {
+ slogger.warn("raft topology: attempt to send rejection response to {} failed: {}.", node.id, e.what());
+ }
+ if (lowres_clock::now() > send_reject_deadline) {
+ co_await sleep_abortable(std::chrono::seconds(1), _as);
+ } else {
+ slogger.warn("raft topology: failed to deliver rejection response to {} within {}s. Will not retry.", node.id, 30);
+ break;
+ }
+ }
+
+ break;
+ }
+ }
+ }
+ [[fallthrough]];
+ case node_state::normal: {
// if the state is none there have to be either 'join' or 'replace' request
// if the state is normal there have to be either 'leave', 'remove' or 'rebuild' request
topology_mutation_builder builder(node.guard.write_timestamp());
@@ -2043,19 +2134,14 @@ class topology_coordinator {
auto bootstrap_tokens = dht::boot_strapper::get_random_bootstrap_tokens(
tmptr, num_tokens, dht::check_token_endpoint::yes);

- auto [gen_uuid, guard, mutation] = co_await prepare_and_broadcast_cdc_generation_data(
- tmptr, take_guard(std::move(node)), bootstrapping_info{bootstrap_tokens, *node.rs});
-
- // Write chosen tokens and CDC generation data through raft.
- builder.set_transition_state(topology::transition_state::commit_cdc_generation)
- .set_new_cdc_generation_data_uuid(gen_uuid)
+ // Write chosen tokens through raft.
+ builder.set_transition_state(topology::transition_state::join_group0)
.with_node(node.id)
.set("node_state", node_state::bootstrapping)
.del("topology_request")
.set("tokens", bootstrap_tokens);
- auto reason = ::format(
- "bootstrap: assign tokens and insert CDC generation data (UUID: {})", gen_uuid);
- co_await update_topology_state(std::move(guard), {std::move(mutation), builder.build()}, reason);
+ auto reason = ::format("bootstrap: accept node and assign tokens");
+ co_await update_topology_state(std::move(node.guard), {builder.build()}, reason);
break;
}
case topology_request::leave:
@@ -2087,13 +2173,13 @@ class topology_coordinator {
auto it = _topo_sm._topology.normal_nodes.find(replaced_id);
assert(it != _topo_sm._topology.normal_nodes.end());
assert(it->second.ring && it->second.state == node_state::normal);
- builder.set_transition_state(topology::transition_state::tablet_draining)
- .set_version(_topo_sm._topology.version + 1)
+ builder.set_transition_state(topology::transition_state::join_group0)
.with_node(node.id)
.set("node_state", node_state::replacing)
.del("topology_request")
.set("tokens", it->second.ring->tokens);
- co_await update_topology_state(take_guard(std::move(node)), {builder.build()}, "start replace");
+ co_await update_topology_state(take_guard(std::move(node)), {builder.build()},
+ "replace: accept node and take ownership of the replaced node's tokens");
break;
}
case topology_request::rebuild: {
@@ -2194,6 +2280,59 @@ class topology_coordinator {
}
};

+ std::variant<join_node_response_params::accepted, join_node_response_params::rejected>
+ validate_joining_node(const node_to_work_on& node) {
+ if (node.rs->state == node_state::replacing) {
+ auto replaced_id = std::get<replace_param>(node.req_param.value()).replaced_id;
+ if (!_topo_sm._topology.normal_nodes.contains(replaced_id)) {
+ return join_node_response_params::rejected {
+ .reason = ::format("Cannot replace node {} because it is not in the 'normal' state", replaced_id),
+ };
+ }
+ }
+
+ std::vector<sstring> unsupported_features;
+ const auto& supported_features = node.rs->supported_features;
+ std::ranges::set_difference(node.topology->enabled_features, supported_features, std::back_inserter(unsupported_features));
+ if (!unsupported_features.empty()) {
+ slogger.warn("raft topology: node {} does not understand some features: {}", node.id, unsupported_features);
+ return join_node_response_params::rejected{
+ .reason = format("Feature check failed. The node does not support some features that are enabled by the cluster: {}",
+ unsupported_features),
+ };
+ }
+
+ return join_node_response_params::accepted {};
+ }
+
+ future<node_to_work_on> finish_accepting_node(node_to_work_on&& node) {
+ if (_topo_sm._topology.normal_nodes.empty()) {
+ // This is the first node, it joins without the handshake.
+ co_return std::move(node);
+ }
+
+ auto id = node.id;
+
+ assert(!_topo_sm._topology.transition_nodes.empty());
+ if (!_raft.get_configuration().contains(id)) {
+ co_await _raft.modify_config({raft::config_member({id, {}}, {})}, {});
+ }
+
+ release_node(std::move(node));
+ co_await respond_to_joining_node(id, join_node_response_params{
+ .response = join_node_response_params::accepted{},
+ });
+ co_return retake_node(co_await start_operation(), id);
+ }
+
+ future<> respond_to_joining_node(raft::server_id id, join_node_response_params&& params) {
+ auto ip = id2ip(locator::host_id(id.uuid()));
+ co_await ser::join_node_rpc_verbs::send_join_node_response(
+ &_messaging, netw::msg_addr(ip), id,
+ std::move(params)
+ );
+ }
+
// Returns true if the state machine was transitioned into tablet migration path.
future<bool> maybe_start_tablet_migration(group0_guard);

@@ -2387,98 +2526,103 @@ std::unordered_set<raft::server_id> storage_service::find_raft_nodes_from_hoeps(
return ids;
}

-future<> storage_service::raft_replace(raft::server& raft_server, raft::server_id replaced_id, gms::inet_address replaced_ip) {
- auto ignore_nodes_strs = utils::split_comma_separated_list(_db.local().get_config().ignore_dead_nodes_for_replace());
- std::list<locator::host_id_or_endpoint> ignore_nodes_params;
- for (const auto& n : ignore_nodes_strs) {
- ignore_nodes_params.emplace_back(n);
- }
+canonical_mutation storage_service::build_mutation_from_join_params(const join_node_request_params& params, service::group0_guard& guard) {
+ topology_mutation_builder builder(guard.write_timestamp());
+ auto& node_builder = builder.with_node(params.host_id)
+ .set("node_state", node_state::none)
+ .set("datacenter", params.datacenter)
+ .set("rack", params.rack)
+ .set("release_version", params.release_version)
+ .set("num_tokens", params.num_tokens)
+ .set("shard_count", params.shard_count)
+ .set("ignore_msb", params.ignore_msb)
+ .set("supported_features", boost::copy_range<std::set<sstring>>(params.supported_features));

- // Read barrier to access the latest topology. Quorum of nodes has to be alive.
- co_await raft_server.read_barrier(&_abort_source);
+ if (params.replaced_id) {
+ std::list<locator::host_id_or_endpoint> ignore_nodes_params;
+ for (const auto& n : params.ignore_nodes) {
+ ignore_nodes_params.emplace_back(n);
+ }

- auto it = _topology_state_machine._topology.find(raft_server.id());
- if (it && it->second.state != node_state::replacing) {
- throw std::runtime_error(::format("Cannot do \"replace address\" operation with a node that is in state: {}", it->second.state));
+ auto ignored_ids = find_raft_nodes_from_hoeps(ignore_nodes_params);
+
+ node_builder
+ .set("topology_request", topology_request::replace)
+ .set("replaced_id", *params.replaced_id)
+ .set("ignore_nodes", ignored_ids);
+ } else {
+ node_builder
+ .set("topology_request", topology_request::join);
}

- // add myself to topology with request to replace
- while (!_topology_state_machine._topology.contains(raft_server.id())) {
- auto guard = co_await _group0->client().start_operation(&_abort_source);
+ return builder.build();
+}

- auto it = _topology_state_machine._topology.normal_nodes.find(replaced_id);
- if (it == _topology_state_machine._topology.normal_nodes.end()) {
- throw std::runtime_error(::format("Cannot replace node {}/{} because it is not in the 'normal' state", replaced_ip, replaced_id));
- }
+class join_node_rpc_handshaker : public service::group0_handshaker {
+private:
+ service::storage_service& _ss;
+ const join_node_request_params& _req;

- auto ignored_ids = find_raft_nodes_from_hoeps(ignore_nodes_params);
+public:
+ join_node_rpc_handshaker(service::storage_service& ss, const join_node_request_params& req)
+ : _ss(ss)
+ , _req(req)
+ {}

- slogger.info("raft topology: adding myself to topology for replace: {} replacing {}, ignored nodes: {}", raft_server.id(), replaced_id, ignored_ids);
- auto& rs = it->second;
- topology_mutation_builder builder(guard.write_timestamp());
- builder.with_node(raft_server.id())
- .set("node_state", node_state::none)
- .set("datacenter", rs.datacenter)
- .set("rack", rs.rack)
- .set("release_version", version::release())
- .set("topology_request", topology_request::replace)
- .set("replaced_id", replaced_id)
- .set("ignore_nodes", ignored_ids)
- .set("num_tokens", _db.local().get_config().num_tokens())
- .set("shard_count", smp::count)
- .set("ignore_msb", _db.local().get_config().murmur3_partitioner_ignore_msb_bits())
- .set("supported_features", _feature_service.supported_feature_set());
- topology_change change{{builder.build()}};
- group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, ::format("replace {}/{}: add myself ({}) to topology", replaced_id, replaced_ip, raft_server.id()));
- try {
- co_await _group0->client().add_entry(std::move(g0_cmd), std::move(guard), &_abort_source);
- } catch (group0_concurrent_modification&) {
- slogger.info("raft topology: replace: concurrent operation is detected, retrying.");
- }
+ future<> pre_server_start(const group0_info& g0_info) override {
+ slogger.info("raft topology: join: sending the join request to {}", g0_info.ip_addr);
+
+ auto result = co_await ser::join_node_rpc_verbs::send_join_node_request(
+ &_ss._messaging.local(), netw::msg_addr(g0_info.ip_addr), g0_info.id, _req);
+ std::visit(overloaded_functor {
+ [this] (const join_node_request_result::ok&) {
+ slogger.info("raft topology: join: request to join placed, waiting"
+ " for the response from the topology coordinator");
+
+ _ss._join_node_request_done.set_value();
+ },
+ [] (const join_node_request_result::rejected& rej) {
+ throw std::runtime_error(
+ format("the topology coordinator rejected request to join the cluster: {}", rej.reason));
+ },
+ }, result.result);
+
+ co_return;
}

- co_return;
-}
+ future<bool> post_server_start(const group0_info& g0_info) override {
+ // Processing of the response is done in `join_node_response_handler`.
+ // Wait for it to complete.
+ co_await _ss._join_node_response_done.get_shared_future(lowres_clock::now() + std::chrono::seconds(30));
+ slogger.info("raft topology: join: success");
+ co_return true;
+ }
+};

-future<> storage_service::raft_bootstrap(raft::server& raft_server) {
- // We try to find ourself in the topology without doing read barrier
- // first to not require quorum of live nodes during regular boot. But
- // if we are not in the topology it either means this is the first boot
- // or we failed during bootstrap so do a read barrier (which requires
- // quorum to be alive) and re-check.
- if (!_topology_state_machine._topology.contains(raft_server.id())) {
+future<> storage_service::raft_initialize_discovery_leader(raft::server& raft_server, const join_node_request_params& params) {
+ if (_topology_state_machine._topology.is_empty()) {
co_await raft_server.read_barrier(&_abort_source);
}

- while (!_topology_state_machine._topology.contains(raft_server.id())) {
- slogger.info("raft topology: adding myself to topology: {}", raft_server.id());
- // Current topology does not contains this node. Bootstrap is needed!
+ while (_topology_state_machine._topology.is_empty()) {
+ if (params.replaced_id.has_value()) {
+ throw std::runtime_error(::format("Cannot perform a replace operation because this is the first node in the cluster"));
+ }
+
+ slogger.info("raft topology: adding myself as the first node to the topology");
auto guard = co_await _group0->client().start_operation(&_abort_source);
+
+ auto insert_join_request_mutation = build_mutation_from_join_params(params, guard);
+
+ // We are the first node and we define the cluster.
+ // Set the enabled_features field to our features.
topology_mutation_builder builder(guard.write_timestamp());
- builder.with_node(raft_server.id())
- .set("node_state", node_state::none)
- .set("datacenter", _snitch.local()->get_datacenter())
- .set("rack", _snitch.local()->get_rack())
- .set("release_version", version::release())
- .set("topology_request", topology_request::join)
- .set("num_tokens", _db.local().get_config().num_tokens())
- .set("shard_count", smp::count)
- .set("ignore_msb", _db.local().get_config().murmur3_partitioner_ignore_msb_bits())
- .set("supported_features", _feature_service.supported_feature_set());
- if (_topology_state_machine._topology.is_empty()) {
- // We see ourselves as the first node. Try to immediately enable
- // the set of features that we currently support.
- //
- // After this entry is successfully applied, the features will be
- // implicitly enabled.
- //
- // This is a temporary workaround until we have a proper feature check
- // using the upcoming JOIN_NODE handshake, which will also explicitly
- // enable features.
- builder.add_enabled_features(_feature_service.supported_feature_set());
- }
- topology_change change{{builder.build()}};
- group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard, "bootstrap: add myself to topology");
+ builder.add_enabled_features(boost::copy_range<std::set<sstring>>(params.supported_features));
+ auto enable_features_mutation = builder.build();
+
+ topology_change change{{std::move(enable_features_mutation), std::move(insert_join_request_mutation)}};
+ group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard,
+ "bootstrap: adding myself as the first node to the topology");
try {
co_await _group0->client().add_entry(std::move(g0_cmd), std::move(guard), &_abort_source);
} catch (group0_concurrent_modification&) {
@@ -2501,12 +2645,11 @@ future<> storage_service::update_topology_with_local_metadata(raft::server& raft
}

auto& replica_state = it->second;
- const auto& replica_supported_features = _topology_state_machine._topology.features.normal_supported_features[raft_server.id()];

return replica_state.shard_count == local_shard_count
&& replica_state.ignore_msb == local_ignore_msb
&& replica_state.release_version == local_release_version
- && replica_supported_features == local_supported_features;
+ && replica_state.supported_features == local_supported_features;
};

// We avoid performing a read barrier if we're sure that our metadata stored in topology
@@ -2544,8 +2687,8 @@ future<> storage_service::update_topology_with_local_metadata(raft::server& raft
// Fortunately, there is no risk that this feature was marked as enabled
// because it requires that the current node responded to a barrier
// request - which will fail in this situation.
- const auto& enabled_features = _topology_state_machine._topology.features.enabled_features;
- const auto unsafe_to_disable_features = _topology_state_machine._topology.features.calculate_not_yet_enabled_features();
+ const auto& enabled_features = _topology_state_machine._topology.enabled_features;
+ const auto unsafe_to_disable_features = _topology_state_machine._topology.calculate_not_yet_enabled_features();
_feature_service.check_features(enabled_features, unsafe_to_disable_features);

slogger.info("raft topology: updating topology with local metadata");
@@ -2642,9 +2785,11 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac
co_await check_for_endpoint_collision(initial_contact_nodes, loaded_peer_features);
} else {
auto local_features = _feature_service.supported_feature_set();
- slogger.info("Checking remote features with gossip, initial_contact_nodes={}", initial_contact_nodes);
+ slogger.info("Performing gossip shadow round, initial_contact_nodes={}", initial_contact_nodes);
co_await _gossiper.do_shadow_round(initial_contact_nodes);
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }
_gossiper.check_snitch_name_matches(_snitch.local()->get_name());
// Check if the node is already removed from the cluster
auto local_host_id = get_token_metadata().get_my_id();
@@ -2796,13 +2941,36 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac
});

slogger.info("Waiting for nodes {} to be alive", sync_nodes);
- co_await _gossiper.wait_alive(sync_nodes, std::chrono::seconds{30});
+ co_await _gossiper.wait_alive(sync_nodes, wait_for_live_nodes_timeout);
slogger.info("Nodes {} are alive", sync_nodes);
}

assert(_group0);
+
+ join_node_request_params join_params {
+ .host_id = _group0->load_my_id(),
+ .cluster_name = _db.local().get_config().cluster_name(),
+ .snitch_name = _db.local().get_snitch_name(),
+ .datacenter = _snitch.local()->get_datacenter(),
+ .rack = _snitch.local()->get_rack(),
+ .release_version = version::release(),
+ .num_tokens = _db.local().get_config().num_tokens(),
+ .shard_count = smp::count,
+ .ignore_msb = _db.local().get_config().murmur3_partitioner_ignore_msb_bits(),
+ .supported_features = boost::copy_range<std::vector<sstring>>(_feature_service.supported_feature_set()),
+ };
+
+ if (raft_replace_info) {
+ join_params.replaced_id = raft_replace_info->raft_id;
+ join_params.ignore_nodes = utils::split_comma_separated_list(_db.local().get_config().ignore_dead_nodes_for_replace());
+ }
+
// if the node is bootstrapped the functin will do nothing since we already created group0 in main.cc
- co_await _group0->setup_group0(_sys_ks.local(), initial_contact_nodes, raft_replace_info, *this, *_qp, _migration_manager.local());
+ ::shared_ptr<group0_handshaker> handshaker = _raft_topology_change_enabled
+ ? ::make_shared<join_node_rpc_handshaker>(*this, join_params)
+ : _group0->make_legacy_handshaker(false);
+ co_await _group0->setup_group0(_sys_ks.local(), initial_contact_nodes, std::move(handshaker),
+ raft_replace_info, *this, *_qp, _migration_manager.local());

raft::server* raft_server = co_await [this] () -> future<raft::server*> {
if (!_raft_topology_change_enabled) {
@@ -2838,12 +3006,11 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac
supervisor::notify("starting system distributed keyspace");
co_await sys_dist_ks.invoke_on_all(&db::system_distributed_keyspace::start);

- if (is_replacing()) {
- assert(raft_replace_info);
- co_await raft_replace(*raft_server, raft_replace_info->raft_id, raft_replace_info->ip_addr);
- } else {
- co_await raft_bootstrap(*raft_server);
- }
+ // Nodes that are not discovery leaders have their join request inserted
+ // on their behalf by an existing node in the cluster during the handshake.
+ // Discovery leaders on the other need to insert the join request themselves,
+ // we do that here.
+ co_await raft_initialize_discovery_leader(*raft_server, join_params);

// Wait until we enter one of the final states
co_await _topology_state_machine.event.when([this, raft_server] {
@@ -3733,9 +3900,9 @@ future<> storage_service::drain_on_shutdown() {
_drain_finished.get_future() : do_drain();
}

-future<> storage_service::init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks) {
- return container().invoke_on_all([&sys_dist_ks] (storage_service& local) {
- return local.init_messaging_service(sys_dist_ks);
+future<> storage_service::init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks, bool raft_topology_change_enabled) {
+ return container().invoke_on_all([&sys_dist_ks, raft_topology_change_enabled] (storage_service& local) {
+ return local.init_messaging_service(sys_dist_ks, raft_topology_change_enabled);
});
}

@@ -3948,9 +4115,11 @@ future<> storage_service::check_for_endpoint_collision(std::unordered_set<gms::i
bool found_bootstrapping_node = false;
auto local_features = _feature_service.supported_feature_set();
do {
- slogger.info("Checking remote features with gossip");
+ slogger.info("Performing gossip shadow round");
_gossiper.do_shadow_round(initial_contact_nodes).get();
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }
_gossiper.check_snitch_name_matches(_snitch.local()->get_name());
auto addr = get_broadcast_address();
if (!_gossiper.is_safe_for_bootstrap(addr)) {
@@ -4026,10 +4195,12 @@ storage_service::prepare_replacement_info(std::unordered_set<gms::inet_address>
}

// make magic happen
- slogger.info("Checking remote features with gossip");
+ slogger.info("Performing gossip shadow round");
co_await _gossiper.do_shadow_round(initial_contact_nodes);
- auto local_features = _feature_service.supported_feature_set();
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ auto local_features = _feature_service.supported_feature_set();
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }

// now that we've gossiped at least once, we should be able to find the node we're replacing
if (replace_host_id) {
@@ -5778,8 +5949,8 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(shar
// be rare, but it can happen and we can detect it right here.
std::exception_ptr ex;
try {
- const auto& enabled_features = _topology_state_machine._topology.features.enabled_features;
- const auto unsafe_to_disable_features = _topology_state_machine._topology.features.calculate_not_yet_enabled_features();
+ const auto& enabled_features = _topology_state_machine._topology.enabled_features;
+ const auto unsafe_to_disable_features = _topology_state_machine._topology.calculate_not_yet_enabled_features();
_feature_service.check_features(enabled_features, unsafe_to_disable_features);
} catch (const gms::unsupported_feature_exception&) {
ex = std::current_exception();
@@ -6187,20 +6358,285 @@ future<> storage_service::cleanup_tablet(locator::global_tablet_id tablet) {
});
}

-void storage_service::init_messaging_service(sharded<db::system_distributed_keyspace>& sys_dist_ks) {
+future<join_node_request_result> storage_service::join_node_request_handler(join_node_request_params params) {
+ join_node_request_result result;
+ slogger.info("raft topology: received request to join from host_id: {}", params.host_id);
+
+ if (params.cluster_name != _db.local().get_config().cluster_name()) {
+ result.result = join_node_request_result::rejected{
+ .reason = ::format("Cluster name check failed. This node cannot join the cluster "
+ "because it expected cluster name \"{}\" and not \"{}\"",
+ params.cluster_name,
+ _db.local().get_config().cluster_name()),
+ };
+ co_return result;
+ }
+
+ if (params.snitch_name != _db.local().get_snitch_name()) {
+ result.result = join_node_request_result::rejected{
+ .reason = ::format("Snitch name check failed. This node cannot join the cluster "
+ "because it uses \"{}\" and not \"{}\"",
+ params.snitch_name,
+ _db.local().get_snitch_name()),
+ };
+ co_return result;
+ }
+
+ co_await _topology_state_machine.event.when([this] {
+ // The first node defines the cluster and inserts its entry to the
+ // `system.topology` without checking anything. It is unlikely but
+ // possible that the `join_node_request_handler` fires before the first
+ // node inserts its entry, therefore we might need to wait
+ // until that happens, here.
+ return !_topology_state_machine._topology.is_empty();
+ });
+
+ auto& g0_server = _group0->group0_server();
+ if (params.replaced_id && *params.replaced_id == g0_server.current_leader()) {
+ // There is a peculiar case that can happen if the leader is killed
+ // and then replaced very quickly:
+ //
+ // - Cluster with nodes `A`, `B`, `C` - `A` is the topology
+ // coordinator/group0 leader,
+ // - `A` is killed,
+ // - New node `D` attempts to replace `A` with the same IP as `A`,
+ // sends `join_node_request` rpc to node `B`,
+ // - Node `B` handles the RPC and wants to perform group0 operation
+ // and wants to perform a barrier - still thinks that `A`
+ // is the leader and is alive, sends an RPC to its IP,
+ // - `D` accidentally receives the request that was meant to `A`
+ // but throws an exception because of host_id mismatch,
+ // - Failure is propagated back to `B`, and then to `D` - and `D`
+ // fails the replace operation.
+ //
+ // We can try to detect if this failure might happen: if the new node
+ // is going to replace but the ID of the replaced node is the same
+ // as the leader, wait for a short while until a reelection happens.
+ // If replaced ID == leader ID, then this indicates either the situation
+ // above or an operator error (actually trying to replace a live node).
+
+ const auto timeout = std::chrono::seconds(10);
+
+ slogger.warn("raft topology: the node {} which was requested to be"
+ " replaced has the same ID as the current group 0 leader ({});"
+ " this looks like an attempt to join a node with the same IP"
+ " as a leader which might have just crashed; waiting for"
+ " a reelection",
+ params.host_id, g0_server.current_leader());
+
+ abort_source as;
+ timer<lowres_clock> t;
+ t.set_callback([&as] {
+ as.request_abort();
+ });
+ t.arm(timeout);
+
+ try {
+ while (!g0_server.current_leader() || *params.replaced_id == g0_server.current_leader()) {
+ // FIXME: Wait for the next term instead of sleeping in a loop
+ // Waiting for state change is not enough because a new leader
+ // might be chosen without us going through the candidate state.
+ co_await sleep_abortable(std::chrono::milliseconds(100), as);
+ }
+ } catch (abort_requested_exception&) {
+ slogger.warn("raft topology: the node {} tries to replace the"
+ " current leader {} but the leader didn't change within"
+ " {}s. Rejecting the node",
+ params.host_id,
+ *params.replaced_id,
+ std::chrono::duration_cast<std::chrono::seconds>(timeout).count());
+
+ result.result = join_node_request_result::rejected{
+ .reason = format(
+ "It is only allowed to replace dead nodes, however the"
+ " node that was requested to be replaced is still seen"
+ " as the group0 leader after {}s, which indicates that"
+ " it might be still alive. You are either trying to replace"
+ " a live node or trying to replace a node very quickly"
+ " after it went down and reelection didn't happen within"
+ " the timeout. Refusing to continue",
+ std::chrono::duration_cast<std::chrono::seconds>(timeout).count()),
+ };
+ co_return result;
+ }
+ }
+
+ while (true) {
+ auto guard = co_await _group0->client().start_operation(&_abort_source);
+
+ if (const auto *p = _topology_state_machine._topology.find(params.host_id)) {
+ const auto& rs = p->second;
+ if (rs.state == node_state::left) {
+ slogger.warn("raft topology: the node {} attempted to join",
+ " but it was removed from the cluster. Rejecting"
+ " the node",
+ params.host_id);
+ result.result = join_node_request_result::rejected{
+ .reason = "The node has already been removed from the cluster",
+ };
+ } else {
+ slogger.warn("raft topology: the node {} attempted to join",
+ " again after an unfinished attempt but it is no longer"
+ " allowed to do so. Rejecting the node",
+ params.host_id);
+ result.result = join_node_request_result::rejected{
+ .reason = "The node requested to join before but didn't finish the procedure. "
+ "Please clear the data directory and restart.",
+ };
+ }
+ co_return result;
+ }
+
+ auto mutation = build_mutation_from_join_params(params, guard);
+
+ topology_change change{{std::move(mutation)}};
+ group0_command g0_cmd = _group0->client().prepare_command(std::move(change), guard,
+ format("raft topology: placing join request for {}", params.host_id));
+ try {
+ co_await _group0->client().add_entry(std::move(g0_cmd), std::move(guard), &_abort_source);
+ break;
+ } catch (group0_concurrent_modification&) {
+ slogger.info("raft topology: join_node_request: concurrent operation is detected, retrying.");
+ }
+ }
+
+ slogger.info("raft topology: placed join request for {}", params.host_id);
+
+ // Success
+ result.result = join_node_request_result::ok {};
+ co_return result;
+}
+
+future<join_node_response_result> storage_service::join_node_response_handler(join_node_response_params params) {
+ assert(this_shard_id() == 0);
+
+ // Usually this handler will only run once, but there are some cases where we might get more than one RPC,
+ // possibly happening at the same time, e.g.:
+ //
+ // - Another node becomes the topology coordinator while the old one waits for the RPC,
+ // - Topology coordinator finished the RPC but failed to update the group 0 state.
+
+ // Serialize handling the responses.
+ auto lock = co_await get_units(_join_node_response_handler_mutex, 1);
+
+ // Wait until we sent and completed the join_node_request RPC
+ co_await _join_node_request_done.get_shared_future(_abort_source);
+
+ if (_join_node_response_done.available()) {
+ // We already handled this RPC. No need to retry it. Return immediately for idempotence.
+ slogger.info("raft topology: the node got join_node_response RPC for the second time, ignoring");
+ co_return join_node_response_result{};
+ }
+
+ co_return co_await std::visit(overloaded_functor {
+ [&] (const join_node_response_params::accepted& acc) -> future<join_node_response_result> {
+ // Do a read barrier to read/initialize the topology state
+ auto& raft_server = _group0->group0_server();
+ co_await raft_server.read_barrier(&_abort_source);
+
+ // Calculate nodes to ignore
+ // TODO: ignore_dead_nodes setting for bootstrap
+ std::unordered_set<raft::server_id> ignored_ids;
+ auto my_request_it =
+ _topology_state_machine._topology.req_param.find(_group0->load_my_id());
+ if (my_request_it != _topology_state_machine._topology.req_param.end()) {
+ if (auto* replace = std::get_if<service::replace_param>(&my_request_it->second)) {
+ ignored_ids = replace->ignored_ids;
+ ignored_ids.insert(replace->replaced_id);
+ }
+ }
+
+ // After this RPC finishes, repair or streaming will be run, and
+ // both of them require this node to see the normal nodes as UP.
+ // This condition might not be true yet as this information is
+ // propagated through gossip. In order to reduce the chance of
+ // repair/streaming failure, wait here until we see normal nodes
+ // as UP (or the timeout elapses).
+ const auto& amap = _group0->address_map();
+ std::vector<gms::inet_address> sync_nodes;
+ // FIXME: https://github.com/scylladb/scylladb/issues/12279
+ // Keep trying to translate host IDs to IPs until all are available in gossip
+ // Ultimately, we should take this information from token_metadata
+ const auto sync_nodes_resolve_deadline = lowres_clock::now() + wait_for_live_nodes_timeout;
+ while (true) {
+ sync_nodes.clear();
+ std::vector<raft::server_id> untranslated_ids;
+ for (const auto& [id, _] : _topology_state_machine._topology.normal_nodes) {
+ if (ignored_ids.contains(id)) {
+ continue;
+ }
+ if (auto ip = amap.find(id)) {
+ sync_nodes.push_back(*ip);
+ } else {
+ untranslated_ids.push_back(id);
+ }
+ }
+
+ if (!untranslated_ids.empty()) {
+ if (lowres_clock::now() > sync_nodes_resolve_deadline) {
+ throw std::runtime_error(format(
+ "Failed to obtain IP addresses of nodes that should be seen"
+ " as alive within {}s",
+ std::chrono::duration_cast<std::chrono::seconds>(wait_for_live_nodes_timeout).count()));
+ }
+
+ static logger::rate_limit rate_limit{std::chrono::seconds(1)};
+ slogger.log(log_level::warn, rate_limit, "raft topology: cannot map nodes {} to ips, retrying.",
+ untranslated_ids);
+
+ co_await sleep_abortable(std::chrono::milliseconds(5), _abort_source);
+ } else {
+ break;
+ }
+ }
+
+ slogger.info("raft topology: coordinator accepted request to join, "
+ "waiting for nodes {} to be alive before responding and continuing",
+ sync_nodes);
+ co_await _gossiper.wait_alive(sync_nodes, wait_for_live_nodes_timeout);
+ slogger.info("raft topology: nodes {} are alive", sync_nodes);
+
+ // Unblock waiting join_node_rpc_handshaker::post_server_start,
+ // which will start the raft server and continue
+ _join_node_response_done.set_value();
+
+ co_return join_node_response_result{};
+ },
+ [&] (const join_node_response_params::rejected& rej) -> future<join_node_response_result> {
+ auto eptr = std::make_exception_ptr(std::runtime_error(
+ format("the topology coordinator rejected request to join the cluster: {}", rej.reason)));
+ _join_node_response_done.set_exception(std::move(eptr));
+
+ co_return join_node_response_result{};
+ },
+ }, params.response);
+}
+
+void storage_service::init_messaging_service(sharded<db::system_distributed_keyspace>& sys_dist_ks, bool raft_topology_change_enabled) {
_messaging.local().register_node_ops_cmd([this] (const rpc::client_info& cinfo, node_ops_cmd_request req) {
auto coordinator = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
return container().invoke_on(0, [coordinator, req = std::move(req)] (auto& ss) mutable {
return ss.node_ops_cmd_handler(coordinator, std::move(req));
});
});
- ser::storage_service_rpc_verbs::register_raft_topology_cmd(&_messaging.local(), [this, &sys_dist_ks] (raft::term_t term, uint64_t cmd_index, raft_topology_cmd cmd) {
- return container().invoke_on(0, [&sys_dist_ks, cmd = std::move(cmd), term, cmd_index] (auto& ss) {
+ auto handle_raft_rpc = [&cont = container()] (raft::server_id dst_id, auto handler) {
+ return cont.invoke_on(0, [dst_id, handler = std::move(handler)] (auto& ss) mutable {
+ if (!ss._group0 || !ss._group0->joined_group0()) {
+ throw std::runtime_error("The node did not join group 0 yet");
+ }
+ if (ss._group0->load_my_id() != dst_id) {
+ throw raft_destination_id_not_correct(ss._group0->load_my_id(), dst_id);
+ }
+ return handler(ss);
+ });
+ };
+ ser::storage_service_rpc_verbs::register_raft_topology_cmd(&_messaging.local(), [&sys_dist_ks, handle_raft_rpc] (raft::server_id dst_id, raft::term_t term, uint64_t cmd_index, raft_topology_cmd cmd) {
+ return handle_raft_rpc(dst_id, [&sys_dist_ks, cmd = std::move(cmd), term, cmd_index] (auto& ss) {
return ss.raft_topology_cmd_handler(sys_dist_ks, term, cmd_index, cmd);
});
});
- ser::storage_service_rpc_verbs::register_raft_pull_topology_snapshot(&_messaging.local(), [this] (raft_topology_pull_params params) {
- return container().invoke_on(0, [] (auto& ss) -> future<raft_topology_snapshot> {
+ ser::storage_service_rpc_verbs::register_raft_pull_topology_snapshot(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, raft_topology_pull_params params) {
+ return handle_raft_rpc(dst_id, [] (storage_service& ss) -> future<raft_topology_snapshot> {
if (!ss._raft_topology_change_enabled) {
co_return raft_topology_snapshot{};
}
@@ -6248,22 +6684,35 @@ void storage_service::init_messaging_service(sharded<db::system_distributed_keys
};
});
});
- ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [this] (locator::global_tablet_id tablet) {
- return container().invoke_on(0, [tablet] (auto& ss) -> future<> {
+ ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
+ return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.stream_tablet(tablet);
});
});
- ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [this] (locator::global_tablet_id tablet) {
- return container().invoke_on(0, [tablet] (auto& ss) -> future<> {
+ ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
+ return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.cleanup_tablet(tablet);
});
});
+ if (raft_topology_change_enabled) {
+ ser::join_node_rpc_verbs::register_join_node_request(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, service::join_node_request_params params) {
+ return handle_raft_rpc(dst_id, [params = std::move(params)] (auto& ss) mutable {
+ return ss.join_node_request_handler(std::move(params));
+ });
+ });
+ ser::join_node_rpc_verbs::register_join_node_response(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, service::join_node_response_params params) {
+ return handle_raft_rpc(dst_id, [params = std::move(params)] (auto& ss) mutable {
+ return ss.join_node_response_handler(std::move(params));
+ });
+ });
+ }
}

future<> storage_service::uninit_messaging_service() {
return when_all_succeed(
_messaging.local().unregister_node_ops_cmd(),
- ser::storage_service_rpc_verbs::unregister(&_messaging.local())
+ ser::storage_service_rpc_verbs::unregister(&_messaging.local()),
+ ser::join_node_rpc_verbs::unregister(&_messaging.local())
).discard_result();
}

diff --git a/service/storage_service.hh b/service/storage_service.hh
--- a/service/storage_service.hh
+++ b/service/storage_service.hh
@@ -88,6 +88,12 @@ class storage_service;
class storage_proxy;
class migration_manager;
class raft_group0;
+class group0_info;
+
+struct join_node_request_params;
+struct join_node_request_result;
+struct join_node_response_params;
+struct join_node_response_result;

enum class disk_error { regular, commit };

@@ -180,7 +186,7 @@ public:

// Needed by distributed<>
future<> stop();
- void init_messaging_service(sharded<db::system_distributed_keyspace>& sys_dist_ks);
+ void init_messaging_service(sharded<db::system_distributed_keyspace>& sys_dist_ks, bool raft_topology_change_enabled);
future<> uninit_messaging_service();

future<> load_tablet_metadata();
@@ -329,7 +335,7 @@ public:
* API.
* \see init_server_without_the_messaging_service_part
*/
- future<> init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks);
+ future<> init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks, bool raft_topology_change_enabled);
/*!
* \brief Uninit the messaging service part of the service.
*/
@@ -794,10 +800,9 @@ private:

future<raft_topology_cmd_result> raft_topology_cmd_handler(sharded<db::system_distributed_keyspace>& sys_dist_ks, raft::term_t term, uint64_t cmd_index, const raft_topology_cmd& cmd);

- future<> raft_bootstrap(raft::server&);
+ future<> raft_initialize_discovery_leader(raft::server&, const join_node_request_params& params);
future<> raft_decomission();
future<> raft_removenode(locator::host_id host_id, std::list<locator::host_id_or_endpoint> ignore_nodes_params);
- future<> raft_replace(raft::server&, raft::server_id, gms::inet_address);
future<> raft_rebuild(sstring source_dc);
future<> raft_check_and_repair_cdc_streams();
future<> update_topology_with_local_metadata(raft::server&);
@@ -812,6 +817,16 @@ private:
// Applies received raft snapshot to local state machine persistent storage
// raft_group0_client::_read_apply_mutex must be held
future<> merge_topology_snapshot(raft_topology_snapshot snp);
+
+ canonical_mutation build_mutation_from_join_params(const join_node_request_params& params, service::group0_guard& guard);
+
+ future<join_node_request_result> join_node_request_handler(join_node_request_params params);
+ future<join_node_response_result> join_node_response_handler(join_node_response_params params);
+ shared_promise<> _join_node_request_done;
+ shared_promise<> _join_node_response_done;
+ semaphore _join_node_response_handler_mutex{1};
+
+ friend class join_node_rpc_handshaker;
};

}
diff --git a/service/topology_state_machine.cc b/service/topology_state_machine.cc
--- a/service/topology_state_machine.cc
+++ b/service/topology_state_machine.cc
@@ -9,6 +9,8 @@

#include "topology_state_machine.hh"

+#include <boost/range/adaptors.hpp>
+
namespace service {

logging::logger tsmlogger("topology_state_machine");
@@ -36,11 +38,11 @@ bool topology::contains(raft::server_id id) {
left_nodes.contains(id);
}

-std::set<sstring> topology_features::calculate_not_yet_enabled_features() const {
+std::set<sstring> calculate_not_yet_enabled_features(const std::set<sstring>& enabled_features, const auto& supported_features) {
std::set<sstring> to_enable;
bool first = true;

- for (const auto& [id, supported_features] : normal_supported_features) {
+ for (const auto& supported_features : supported_features) {
if (!first && to_enable.empty()) {
break;
}
@@ -62,6 +64,22 @@ std::set<sstring> topology_features::calculate_not_yet_enabled_features() const
return to_enable;
}

+std::set<sstring> topology_features::calculate_not_yet_enabled_features() const {
+ return ::service::calculate_not_yet_enabled_features(
+ enabled_features,
+ normal_supported_features | boost::adaptors::map_values);
+}
+
+std::set<sstring> topology::calculate_not_yet_enabled_features() const {
+ return ::service::calculate_not_yet_enabled_features(
+ enabled_features,
+ normal_nodes
+ | boost::adaptors::map_values
+ | boost::adaptors::transformed([] (const replica_state& rs) -> const std::set<sstring>& {
+ return rs.supported_features;
+ }));
+}
+
size_t topology::size() const {
return normal_nodes.size() + transition_nodes.size() + new_nodes.size();
}
@@ -75,6 +93,7 @@ std::ostream& operator<<(std::ostream& os, const fencing_token& fencing_token) {
}

static std::unordered_map<topology::transition_state, sstring> transition_state_to_name_map = {
+ {topology::transition_state::join_group0, "join group0"},
{topology::transition_state::commit_cdc_generation, "commit cdc generation"},
{topology::transition_state::write_both_read_old, "write both read old"},
{topology::transition_state::write_both_read_new, "write both read new"},
diff --git a/service/topology_state_machine.hh b/service/topology_state_machine.hh
--- a/service/topology_state_machine.hh
+++ b/service/topology_state_machine.hh
@@ -81,6 +81,7 @@ struct replica_state {
std::optional<ring_slice> ring; // if engaged contain the set of tokens the node owns together with their state
size_t shard_count;
uint8_t ignore_msb;
+ std::set<sstring> supported_features;
};

struct topology_features {
@@ -96,6 +97,7 @@ struct topology_features {

struct topology {
enum class transition_state: uint16_t {
+ join_group0,
commit_cdc_generation,
tablet_draining,
write_both_read_old,
@@ -140,8 +142,8 @@ struct topology {
// The IDs of the commited yet unpublished CDC generations sorted by timestamps.
std::vector<cdc::generation_id_v2> unpublished_cdc_generations;

- // Describes the state of the features of normal nodes
- topology_features features;
+ // Set of features that are considered to be enabled by the cluster.
+ std::set<sstring> enabled_features;

// Find only nodes in non 'left' state
const std::pair<const raft::server_id, replica_state>* find(raft::server_id id) const;
@@ -151,6 +153,9 @@ struct topology {
size_t size() const;
// Are there any non-left nodes?
bool is_empty() const;
+
+ // Calculates a set of features that are supported by all normal nodes but not yet enabled.
+ std::set<sstring> calculate_not_yet_enabled_features() const;
};

struct raft_topology_snapshot {
diff --git a/test/topology_experimental_raft/test_blocked_bootstrap.py b/test/topology_experimental_raft/test_blocked_bootstrap.py
--- a/test/topology_experimental_raft/test_blocked_bootstrap.py
+++ b/test/topology_experimental_raft/test_blocked_bootstrap.py
@@ -11,6 +11,7 @@
logger = logging.getLogger(__name__)

+...@pytest.mark.skip(reason = "can't make it work with the new join procedure, without error recovery")
@pytest.mark.asyncio
async def test_blocked_bootstrap(manager: ManagerClient):
"""

Commit Bot

<bot@cloudius-systems.com>

unread,

Sep 28, 2023, 5:45:45 AM9/28/23

to scylladb-dev@googlegroups.com, Kamil Braun

- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
- If the node is accepted, the coordinator transitions it to `boostrap`/`replace` state and transitions the topology to `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that the node was accepted.
- Otherwise, the node is moved to `left` state, told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected and it shuts down.

The procedure is not retryable - if a node fails to do it from start to end and crashes in between, it will not be allowed to retry it with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add it again (so that a new host_id is generated).

More details about the procedure and the RPC are described in `topology-over-raft.md`.

@@ -1310,6 +1310,7 @@ def find_ninja():

@@ -83,6 +83,8 @@

#include "idl/storage_service.dist.hh"
#include "service/storage_proxy.hh"
#include "service/raft/raft_address_map.hh"
+#include "service/raft/join_node.hh"
+#include "idl/join_node.dist.hh"
#include "protocol_server.hh"
#include "types/set.hh"
#include "node_ops/node_ops_ctl.hh"

@@ -105,6 +107,8 @@ namespace service {

static logging::logger slogger("storage_service");

+static constexpr std::chrono::seconds wait_for_live_nodes_timeout{30};
+
storage_service::storage_service(abort_source& abort_source,
distributed<replica::database>& db, gms::gossiper& gossiper,
sharded<db::system_keyspace>& sys_ks,

@@ -339,13 +343,13 @@ future<> storage_service::topology_state_load() {

_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state();

co_await _feature_service.container().invoke_on_all([&] (gms::feature_service& fs) {
- return fs.enable(boost::copy_range<std::set<std::string_view>>(_topology_state_machine._topology.features.enabled_features));
+ return fs.enable(boost::copy_range<std::set<std::string_view>>(_topology_state_machine._topology.enabled_features));
});

// Update the legacy `enabled_features` key in `system.scylla_local`.
// It's OK to update it after enabling features because `system.topology` now
// is the source of truth about enabled features.
- co_await _sys_ks.local().save_local_enabled_features(_topology_state_machine._topology.features.enabled_features, false);
+ co_await _sys_ks.local().save_local_enabled_features(_topology_state_machine._topology.enabled_features, false);

const auto& am = _group0->address_map();
auto id2ip = [this, &am] (raft::server_id id) -> future<gms::inet_address> {

@@ -425,6 +429,8 @@ future<> storage_service::topology_state_load() {

return read_new_t::no;
}
switch (*state) {
+ case topology::transition_state::join_group0:
+ [[fallthrough]];
case topology::transition_state::tablet_migration:
[[fallthrough]];
case topology::transition_state::commit_cdc_generation:

@@ -1036,7 +1042,7 @@ class topology_coordinator {

auto result = utils::fb_utilities::is_me(*ip) ?
co_await _raft_topology_cmd_handler(_sys_dist_ks, _term, cmd_index, cmd) :
co_await ser::storage_service_rpc_verbs::send_raft_topology_cmd(
- &_messaging, netw::msg_addr{*ip}, _term, cmd_index, cmd);
+ &_messaging, netw::msg_addr{*ip}, id, _term, cmd_index, cmd);
if (result.status == raft_topology_cmd_result::command_status::fail) {
co_await coroutine::exception(std::make_exception_ptr(
std::runtime_error(::format("failed status returned from {}/{}", id, *ip))));

@@ -1388,15 +1394,14 @@ class topology_coordinator {

}

// Preconditions:
- // - There are no nodes that are trying to join and there are no topology
- // operations in progress
+ // - There are no topology operations in progress
// - `features_to_enable` represents a set of features that are currently
// marked as supported by all normal nodes and it is not empty
future<> enable_features(group0_guard guard, std::set<sstring> features_to_enable) {
- if (!_topo_sm._topology.new_nodes.empty() || !_topo_sm._topology.transition_nodes.empty()) {
+ if (!_topo_sm._topology.transition_nodes.empty()) {
on_internal_error(slogger,
- "topology coordinator attempted to enable features even though there are"
- " joining nodes or topology operations in progress");
+ "topology coordinator attempted to enable features even though there is"
+ " a topology operations in progress");
}

if (utils::get_local_injector().enter("raft_topology_suppress_enabling_features")) {

@@ -1409,16 +1414,22 @@ class topology_coordinator {

@@ -1612,7 +1623,7 @@ class topology_coordinator {

slogger.info("raft topology: Initiating tablet streaming of {} to {}", gid, trinfo.pending_replica);
auto dst = trinfo.pending_replica.host;
return ser::storage_service_rpc_verbs::send_tablet_stream_data(&_messaging,
- netw::msg_addr(id2ip(dst)), _as, gid);
+ netw::msg_addr(id2ip(dst)), _as, raft::server_id(dst.uuid()), gid);
})) {
transition_to(locator::tablet_transition_stage::write_both_read_new);
}

@@ -1628,7 +1639,7 @@ class topology_coordinator {

locator::tablet_replica dst = locator::get_leaving_replica(tmap.get_tablet_info(gid.tablet), trinfo);
slogger.info("raft topology: Initiating tablet cleanup of {} on {}", gid, dst);
return ser::storage_service_rpc_verbs::send_tablet_cleanup(&_messaging,
- netw::msg_addr(id2ip(dst.host)), _as, gid);
+ netw::msg_addr(id2ip(dst.host)), _as, raft::server_id(dst.host.uuid()), gid);
})) {
transition_to(locator::tablet_transition_stage::end_migration);
}

@@ -1740,7 +1751,7 @@ class topology_coordinator {

return std::make_pair(true, std::move(guard));
}

- if (!_topo_sm._topology.features.calculate_not_yet_enabled_features().empty()) {
+ if (!_topo_sm._topology.calculate_not_yet_enabled_features().empty()) {
return std::make_pair(true, std::move(guard));
}

@@ -1765,7 +1776,7 @@ class topology_coordinator {

co_return true;
}

- if (auto feats = _topo_sm._topology.features.calculate_not_yet_enabled_features(); !feats.empty()) {
+ if (auto feats = _topo_sm._topology.calculate_not_yet_enabled_features(); !feats.empty()) {
co_await enable_features(std::move(guard), std::move(feats));
co_return true;
}

@@ -1778,6 +1789,45 @@ class topology_coordinator {

@@ -2028,8 +2078,47 @@ class topology_coordinator {

@@ -2043,19 +2132,14 @@ class topology_coordinator {

auto bootstrap_tokens = dht::boot_strapper::get_random_bootstrap_tokens(
tmptr, num_tokens, dht::check_token_endpoint::yes);

- auto [gen_uuid, guard, mutation] = co_await prepare_and_broadcast_cdc_generation_data(
- tmptr, take_guard(std::move(node)), bootstrapping_info{bootstrap_tokens, *node.rs});
-
- // Write chosen tokens and CDC generation data through raft.
- builder.set_transition_state(topology::transition_state::commit_cdc_generation)
- .set_new_cdc_generation_data_uuid(gen_uuid)
+ // Write chosen tokens through raft.
+ builder.set_transition_state(topology::transition_state::join_group0)
.with_node(node.id)
.set("node_state", node_state::bootstrapping)
.del("topology_request")
.set("tokens", bootstrap_tokens);
- auto reason = ::format(
- "bootstrap: assign tokens and insert CDC generation data (UUID: {})", gen_uuid);
- co_await update_topology_state(std::move(guard), {std::move(mutation), builder.build()}, reason);
+ auto reason = ::format("bootstrap: accept node and assign tokens");
+ co_await update_topology_state(std::move(node.guard), {builder.build()}, reason);
break;
}
case topology_request::leave:

@@ -2087,13 +2171,13 @@ class topology_coordinator {

auto it = _topo_sm._topology.normal_nodes.find(replaced_id);
assert(it != _topo_sm._topology.normal_nodes.end());
assert(it->second.ring && it->second.state == node_state::normal);
- builder.set_transition_state(topology::transition_state::tablet_draining)
- .set_version(_topo_sm._topology.version + 1)
+ builder.set_transition_state(topology::transition_state::join_group0)
.with_node(node.id)
.set("node_state", node_state::replacing)
.del("topology_request")
.set("tokens", it->second.ring->tokens);
- co_await update_topology_state(take_guard(std::move(node)), {builder.build()}, "start replace");
+ co_await update_topology_state(take_guard(std::move(node)), {builder.build()},
+ "replace: accept node and take ownership of the replaced node's tokens");
break;
}
case topology_request::rebuild: {

@@ -2194,6 +2278,59 @@ class topology_coordinator {

@@ -2387,98 +2524,106 @@ std::unordered_set<raft::server_id> storage_service::find_raft_nodes_from_hoeps(

+ auto ignored_ids = find_raft_nodes_from_hoeps(ignore_nodes_params);

- auto it = _topology_state_machine._topology.find(raft_server.id());
- if (it && it->second.state != node_state::replacing) {
- throw std::runtime_error(::format("Cannot do \"replace address\" operation with a node that is in state: {}", it->second.state));

+ // Group 0 has been started. Allow the join_node_response to be handled.
+ _ss._join_node_group0_started.set_value();

-future<> storage_service::raft_bootstrap(raft::server& raft_server) {
- // We try to find ourself in the topology without doing read barrier
- // first to not require quorum of live nodes during regular boot. But
- // if we are not in the topology it either means this is the first boot
- // or we failed during bootstrap so do a read barrier (which requires
- // quorum to be alive) and re-check.
- if (!_topology_state_machine._topology.contains(raft_server.id())) {
+ // Processing of the response is done in `join_node_response_handler`.
+ // Wait for it to complete.
+ co_await _ss._join_node_response_done.get_shared_future(lowres_clock::now() + std::chrono::seconds(30));
+ slogger.info("raft topology: join: success");
+ co_return true;
+ }
+};

@@ -2501,12 +2646,11 @@ future<> storage_service::update_topology_with_local_metadata(raft::server& raft

}

auto& replica_state = it->second;
- const auto& replica_supported_features = _topology_state_machine._topology.features.normal_supported_features[raft_server.id()];

return replica_state.shard_count == local_shard_count
&& replica_state.ignore_msb == local_ignore_msb
&& replica_state.release_version == local_release_version
- && replica_supported_features == local_supported_features;
+ && replica_state.supported_features == local_supported_features;
};

// We avoid performing a read barrier if we're sure that our metadata stored in topology

@@ -2544,8 +2688,8 @@ future<> storage_service::update_topology_with_local_metadata(raft::server& raft

// Fortunately, there is no risk that this feature was marked as enabled
// because it requires that the current node responded to a barrier
// request - which will fail in this situation.
- const auto& enabled_features = _topology_state_machine._topology.features.enabled_features;
- const auto unsafe_to_disable_features = _topology_state_machine._topology.features.calculate_not_yet_enabled_features();
+ const auto& enabled_features = _topology_state_machine._topology.enabled_features;
+ const auto unsafe_to_disable_features = _topology_state_machine._topology.calculate_not_yet_enabled_features();
_feature_service.check_features(enabled_features, unsafe_to_disable_features);

slogger.info("raft topology: updating topology with local metadata");

@@ -2642,9 +2786,11 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac

co_await check_for_endpoint_collision(initial_contact_nodes, loaded_peer_features);
} else {
auto local_features = _feature_service.supported_feature_set();
- slogger.info("Checking remote features with gossip, initial_contact_nodes={}", initial_contact_nodes);
+ slogger.info("Performing gossip shadow round, initial_contact_nodes={}", initial_contact_nodes);
co_await _gossiper.do_shadow_round(initial_contact_nodes);
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }
_gossiper.check_snitch_name_matches(_snitch.local()->get_name());
// Check if the node is already removed from the cluster
auto local_host_id = get_token_metadata().get_my_id();

@@ -2796,13 +2942,36 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac

@@ -2838,12 +3007,11 @@ future<> storage_service::join_token_ring(sharded<db::system_distributed_keyspac

supervisor::notify("starting system distributed keyspace");
co_await sys_dist_ks.invoke_on_all(&db::system_distributed_keyspace::start);

- if (is_replacing()) {
- assert(raft_replace_info);
- co_await raft_replace(*raft_server, raft_replace_info->raft_id, raft_replace_info->ip_addr);
- } else {
- co_await raft_bootstrap(*raft_server);
- }
+ // Nodes that are not discovery leaders have their join request inserted
+ // on their behalf by an existing node in the cluster during the handshake.
+ // Discovery leaders on the other need to insert the join request themselves,
+ // we do that here.
+ co_await raft_initialize_discovery_leader(*raft_server, join_params);

// Wait until we enter one of the final states
co_await _topology_state_machine.event.when([this, raft_server] {

@@ -3733,9 +3901,9 @@ future<> storage_service::drain_on_shutdown() {

_drain_finished.get_future() : do_drain();
}

-future<> storage_service::init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks) {
- return container().invoke_on_all([&sys_dist_ks] (storage_service& local) {
- return local.init_messaging_service(sys_dist_ks);
+future<> storage_service::init_messaging_service_part(sharded<db::system_distributed_keyspace>& sys_dist_ks, bool raft_topology_change_enabled) {
+ return container().invoke_on_all([&sys_dist_ks, raft_topology_change_enabled] (storage_service& local) {
+ return local.init_messaging_service(sys_dist_ks, raft_topology_change_enabled);
});
}

@@ -3948,9 +4116,11 @@ future<> storage_service::check_for_endpoint_collision(std::unordered_set<gms::i

bool found_bootstrapping_node = false;
auto local_features = _feature_service.supported_feature_set();
do {
- slogger.info("Checking remote features with gossip");
+ slogger.info("Performing gossip shadow round");
_gossiper.do_shadow_round(initial_contact_nodes).get();
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }
_gossiper.check_snitch_name_matches(_snitch.local()->get_name());
auto addr = get_broadcast_address();
if (!_gossiper.is_safe_for_bootstrap(addr)) {

@@ -4026,10 +4196,12 @@ storage_service::prepare_replacement_info(std::unordered_set<gms::inet_address>

}

// make magic happen
- slogger.info("Checking remote features with gossip");
+ slogger.info("Performing gossip shadow round");
co_await _gossiper.do_shadow_round(initial_contact_nodes);
- auto local_features = _feature_service.supported_feature_set();
- _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ if (!_raft_topology_change_enabled) {
+ auto local_features = _feature_service.supported_feature_set();
+ _gossiper.check_knows_remote_features(local_features, loaded_peer_features);
+ }

// now that we've gossiped at least once, we should be able to find the node we're replacing
if (replace_host_id) {

@@ -5778,8 +5950,8 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(shar

// be rare, but it can happen and we can detect it right here.
std::exception_ptr ex;
try {
- const auto& enabled_features = _topology_state_machine._topology.features.enabled_features;
- const auto unsafe_to_disable_features = _topology_state_machine._topology.features.calculate_not_yet_enabled_features();
+ const auto& enabled_features = _topology_state_machine._topology.enabled_features;
+ const auto unsafe_to_disable_features = _topology_state_machine._topology.calculate_not_yet_enabled_features();
_feature_service.check_features(enabled_features, unsafe_to_disable_features);
} catch (const gms::unsupported_feature_exception&) {
ex = std::current_exception();

@@ -6187,20 +6359,285 @@ future<> storage_service::cleanup_tablet(locator::global_tablet_id tablet) {

@@ -6248,22 +6685,39 @@ void storage_service::init_messaging_service(sharded<db::system_distributed_keys

};
});
});
- ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [this] (locator::global_tablet_id tablet) {
- return container().invoke_on(0, [tablet] (auto& ss) -> future<> {
+ ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
+ return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.stream_tablet(tablet);
});
});
- ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [this] (locator::global_tablet_id tablet) {
- return container().invoke_on(0, [tablet] (auto& ss) -> future<> {
+ ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
+ return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.cleanup_tablet(tablet);
});
});
+ if (raft_topology_change_enabled) {
+ ser::join_node_rpc_verbs::register_join_node_request(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, service::join_node_request_params params) {
+ return handle_raft_rpc(dst_id, [params = std::move(params)] (auto& ss) mutable {
+ return ss.join_node_request_handler(std::move(params));
+ });
+ });

+ ser::join_node_rpc_verbs::register_join_node_response(&_messaging.local(), [this] (raft::server_id dst_id, service::join_node_response_params params) {
+ return container().invoke_on(0, [dst_id, params = std::move(params)] (auto& ss) mutable -> future<join_node_response_result> {
+ co_await ss._join_node_group0_started.get_shared_future(ss._abort_source);

+ if (ss._group0->load_my_id() != dst_id) {
+ throw raft_destination_id_not_correct(ss._group0->load_my_id(), dst_id);
+ }

+ co_return co_await ss.join_node_response_handler(std::move(params));

@@ -812,6 +817,17 @@ private:

// Applies received raft snapshot to local state machine persistent storage
// raft_group0_client::_read_apply_mutex must be held
future<> merge_topology_snapshot(raft_topology_snapshot snp);
+
+ canonical_mutation build_mutation_from_join_params(const join_node_request_params& params, service::group0_guard& guard);
+
+ future<join_node_request_result> join_node_request_handler(join_node_request_params params);
+ future<join_node_response_result> join_node_response_handler(join_node_response_params params);
+ shared_promise<> _join_node_request_done;

+ shared_promise<> _join_node_group0_started;

Commit Bot

<bot@cloudius-systems.com>

unread,

Sep 28, 2023, 7:55:29 PM9/28/23

to scylladb-dev@googlegroups.com, Kamil Braun

From: Kamil Braun <kbr...@scylladb.com>
Committer: Kamil Braun <kbr...@scylladb.com>