Gossip implementation and node liveness


Unmesh Joshi

Jan 8, 2021, 7:41:26 AM
to CockroachDB
Hi,

I was going through the CockroachDB gossip protocol implementation.
One of the things I was trying to understand in the CockroachDB gossip design is why it maintains client connections to specific cluster nodes.
In other designs, like SWIM gossip or Cassandra, the implementation relies on choosing a random cluster node for every gossip round.
Was there any specific reason for not choosing a random node for each gossip round? I can see that it allows maintaining per-client versions, which avoids an extra message round to exchange info version numbers and determine the delta of gossip state (e.g. Cassandra needs a three-way handshake to pass version numbers, and HashiCorp memberlist does a full state sync periodically without any kind of versioning), but I am not sure if that was the only reason, so I wanted to confirm.
Won't this design be more vulnerable to network partitions?
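
To illustrate what I mean by per-client versions, here is a minimal sketch of the general idea (my own made-up types in Go, not CockroachDB's actual gossip code): because a node remembers, per peer, the highest version it has already sent, a gossip round can ship just the newer infos without a prior round trip to exchange version numbers.

package main

import "fmt"

// Info is a single gossiped key/value with a version number (illustrative
// type, not CockroachDB's).
type Info struct {
	Key     string
	Value   string
	Version int64
}

// Node holds the local gossip state plus, per peer, the highest version
// already sent to that peer.
type Node struct {
	infos    map[string]Info
	sentUpTo map[string]int64
}

// DeltaFor returns only the infos the peer has not seen yet and advances the
// per-peer high-water mark, so no extra round trip is needed to negotiate
// version numbers before sending.
func (n *Node) DeltaFor(peer string) []Info {
	var delta []Info
	highWater := n.sentUpTo[peer]
	for _, info := range n.infos {
		if info.Version > n.sentUpTo[peer] {
			delta = append(delta, info)
			if info.Version > highWater {
				highWater = info.Version
			}
		}
	}
	n.sentUpTo[peer] = highWater
	return delta
}

func main() {
	n := &Node{
		infos: map[string]Info{
			"node:1:addr": {Key: "node:1:addr", Value: "10.0.0.1", Version: 7},
			"node:2:addr": {Key: "node:2:addr", Value: "10.0.0.2", Version: 3},
		},
		sentUpTo: map[string]int64{"peer-A": 5},
	}
	// Only the version-7 info is sent; peer-A already has everything up to 5.
	fmt.Println(n.DeltaFor("peer-A"))
}

With a fresh random peer each round there is no such per-peer state to lean on, hence the version-exchange handshake in Cassandra-style designs.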

I also see that node liveness is persisted using the standard Raft-backed persistence for key ranges, but node address resolution still seems to rely on node addresses being gossiped. In particular, the getLivenessLocked method seems to consult only the in-memory gossip state; is the persisted liveness not consulted?

Thanks,
Unmesh 


Andrei Matei

Jan 8, 2021, 12:15:26 PM
to Unmesh Joshi, Peter Mattis, CockroachDB
+ Peter
 
I was going through the CockroachDB gossip protocol implementation.
One of the things I was trying to understand in the CockroachDB gossip design is why it maintains client connections to specific cluster nodes.
In other designs, like SWIM gossip or Cassandra, the implementation relies on choosing a random cluster node for every gossip round.
Was there any specific reason for not choosing a random node for each gossip round? I can see that it allows maintaining per-client versions, which avoids an extra message round to exchange info version numbers and determine the delta of gossip state (e.g. Cassandra needs a three-way handshake to pass version numbers, and HashiCorp memberlist does a full state sync periodically without any kind of versioning), but I am not sure if that was the only reason, so I wanted to confirm.
Won't this design be more vulnerable to network partitions?

I think the main reason for maintaining long-lived connections is that the gossip network tries to "optimize" itself into a spanning tree, so the number of connections is supposed to be minimized (at least somewhat). Maybe Peter knows more.
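
For intuition, a rough sketch of that self-optimizing behavior (my own illustration with made-up names and a made-up heuristic, not the actual gossip code): keep a small, bounded set of long-lived connections and periodically cull the one that has contributed the least new information, so redundant links get pruned and the overlay trends toward a sparse, spanning-tree-like shape.

package main

import "fmt"

// maxPeers is an assumed bound on long-lived peer connections, not an actual
// CockroachDB setting.
const maxPeers = 3

type conn struct {
	peer       string
	freshInfos int // infos first learned via this connection since the last cull
}

// cullLeastUseful drops the connection that delivered the fewest new infos
// once we are over the peer budget; mostly-redundant links disappear over time.
func cullLeastUseful(conns []conn) []conn {
	if len(conns) <= maxPeers {
		return conns
	}
	worst := 0
	for i, c := range conns {
		if c.freshInfos < conns[worst].freshInfos {
			worst = i
		}
	}
	fmt.Printf("culling mostly-redundant connection to %s\n", conns[worst].peer)
	return append(conns[:worst], conns[worst+1:]...)
}

func main() {
	conns := []conn{
		{peer: "n2", freshInfos: 40},
		{peer: "n3", freshInfos: 2}, // nearly everything from n3 arrived via n2 first
		{peer: "n4", freshInfos: 25},
		{peer: "n5", freshInfos: 31},
	}
	fmt.Println(cullLeastUseful(conns))
}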
 
I also see that node liveness is persisted using the standard Raft-backed persistence for key ranges, but node address resolution still seems to rely on node addresses being gossiped. In particular, the getLivenessLocked method seems to consult only the in-memory gossip state; is the persisted liveness not consulted?

Right. The authoritative liveness source is the KV store; each node "heartbeats" its liveness record every few seconds. That (necessarily stale) information is also disseminated through gossip, so that nodes have it on hand. Most code paths that are interested in the status (or address, etc.) of a different node use this disseminated data, for expediency. Operations where races and staleness are not acceptable do consult the KV store, though. In particular, when a node wants to take leases away from another node that it believes to be "dead", it does so using a conditional put (an atomic compare-and-swap) on the target's liveness record.
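
To make that last point concrete, here is a rough sketch of the shape of that operation (my own illustrative types and helpers in Go, not CockroachDB's actual API; the epoch bump is just a stand-in for "the write that declares the node dead"): the caller only wins if the liveness record it read is still the current one, so a concurrent heartbeat from the supposedly dead node makes the attempt fail rather than race.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// Liveness is an illustrative stand-in for a node's liveness record.
type Liveness struct {
	NodeID     int
	Epoch      int64
	Expiration int64 // when the last heartbeat's claim to "aliveness" runs out
}

// kvStore stands in for the Raft-backed KV range that holds liveness records.
type kvStore struct {
	mu      sync.Mutex
	records map[int]Liveness
}

var errConditionFailed = errors.New("conditional put failed: record changed")

// ConditionalPut writes newVal only if the stored record still equals expVal,
// mimicking the compare-and-swap semantics of a KV conditional put.
func (s *kvStore) ConditionalPut(nodeID int, expVal, newVal Liveness) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.records[nodeID] != expVal {
		return errConditionFailed
	}
	s.records[nodeID] = newVal
	return nil
}

// declareDead is what a node might attempt before taking leases away from a
// peer it believes to be dead: a compare-and-swap against the record it read.
// If the "dead" node heartbeated in the meantime, the record changed and the
// attempt fails instead of racing.
func declareDead(s *kvStore, stale Liveness) error {
	next := stale
	next.Epoch++ // illustrative write; the CAS is the important part
	return s.ConditionalPut(stale.NodeID, stale, next)
}

func main() {
	s := &kvStore{records: map[int]Liveness{2: {NodeID: 2, Epoch: 5, Expiration: 100}}}

	// Our (possibly stale) view of node 2, e.g. as obtained via gossip.
	fromGossip := Liveness{NodeID: 2, Epoch: 5, Expiration: 100}

	// Meanwhile the "dead" node manages a heartbeat, updating its record.
	s.records[2] = Liveness{NodeID: 2, Epoch: 5, Expiration: 200}

	// The attempt to declare it dead now fails rather than racing.
	fmt.Println(declareDead(s, fromGossip))
}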

Hope this helps,

- Andrei
 


