Quorum: which nodes have a right to vote?


Luca Morandini

Jun 14, 2023, 2:11:54 AM
to us...@couchdb.apache.org
Folks,

A student (I teach CouchDB as part of a Cloud Computing course) pointed
out that, on a 4-node, 3-replica cluster, the database should stop
accepting requests when 2 nodes are down.

His rationale is: the quorum (assuming its default value of 2) can in
principle be reached, but since some of the shards are not present on both
surviving nodes, the quorum of replicas cannot be reached for those shards
even when there are still 2 nodes standing.

This did not chime with my experience, hence I ran a little experiment (a
rough sketch of the equivalent HTTP calls follows the list):
- set up a cluster with 4 nodes and cluster parameters set to
"q":8,"n":3,"w":2,"r":2;
- created a database;
- added a few documents;
- stopped 2 nodes out of 4;
- added another 10,000 documents without a hiccup.
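
For reference, the database creation and bulk insert look roughly like the
Python sketch below (assuming an admin user on localhost:5984 and a
hypothetical database called "quorumtest"; with "n":3 the default write and
read quorums are already 2, and they can also be passed per request as ?w=
and ?r=):

    import requests

    BASE = "http://admin:password@127.0.0.1:5984"  # assumed credentials/host

    # Create the database with 8 shard ranges and 3 replicas of each range.
    print(requests.put(f"{BASE}/quorumtest", params={"q": 8, "n": 3}).json())

    # Bulk-insert 10,000 small documents in a single request.
    docs = [{"_id": f"doc-{i:05d}", "value": i} for i in range(10000)]
    resp = requests.post(f"{BASE}/quorumtest/_bulk_docs", json={"docs": docs})
    print(resp.status_code, len(resp.json()))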

I checked the two surviving nodes: each held 6 shard files covering the 8
shard ranges, so 4 ranges were present on both surviving nodes and 4 were
present on only one of them.
Therefore, about 5,000 of the write operations must have hit the
un-replicated shards.
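
Incidentally, the shard-to-node assignment can also be read over HTTP
instead of from the shard files on disk; a minimal sketch, assuming the
same host and the hypothetical "quorumtest" database:

    import requests

    BASE = "http://admin:password@127.0.0.1:5984"  # assumed credentials/host

    # Lists every shard range and the nodes holding a copy of it.
    shards = requests.get(f"{BASE}/quorumtest/_shards").json()["shards"]
    for shard_range, nodes in shards.items():
        print(shard_range, "->", nodes)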

In other words, who has the vote in a quorum election: all the nodes, or
only the nodes that host the shard with the sought document?

Cheers,

Luca Morandini

Robert Newson

Jun 14, 2023, 3:23:42 AM
to user
Hi,

There are no votes, no elections and there are no leader nodes.

CouchDB chooses availability over consistency and will accept reads/writes even if only one node (that hosts the shard ranges being read/written) is up.

In a 3-node, 3-replica cluster, where every node hosts a copy of every shard, any single node can be up to allow all reads and writes to succeed.

Every node in the cluster can coordinate a read or write. The coordinator creates N concurrent, independent read/write requests and sends them to the appropriate nodes (those the shard map indicates for that document id). The coordinator waits for a quorum of replies, up to the request timeout, before merging those replies into the HTTP response to the client. If at least one write occurred, CouchDB will return a 202 status code; if quorum was reached, a 201 is returned. Reads return 200 whether quorum is reached or not; the difference is that you get a faster reply if quorum is reached, otherwise you are waiting for the timeout.
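
In client code the 201/202 distinction is easy to observe; a minimal Python sketch, assuming a hypothetical "quorumtest" database and document id:

    import requests

    BASE = "http://admin:password@127.0.0.1:5984"  # assumed credentials/host

    resp = requests.put(f"{BASE}/quorumtest/example-doc", json={"value": 42})
    if resp.status_code == 201:
        print("quorum reached: at least w replicas acknowledged the write")
    elif resp.status_code == 202:
        print("write accepted, but fewer than w replicas replied in time")
    else:
        print("write failed:", resp.status_code, resp.text)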

When couchdb believes nodes to be down, the quorum is implicitly lowered to avoid the latency penalty.

In your scenario the two offline nodes would not get the writes at the time, for obvious reasons, but once they are back up they will receive those writes from the surviving nodes, restoring the expected N level of redundancy.

B.

Luca Morandini

Jun 14, 2023, 4:23:34 AM
to us...@couchdb.apache.org
On Wed, 14 Jun 2023 at 17:23, Robert Newson <rne...@apache.org> wrote:

>
> There are no votes, no elections and there are no leader nodes.
>

As I see it, when there is a quorum to reach there is an implicit vote,
but never mind.


> When couchdb believes nodes to be down, the quorum is implicitly lowered to
> avoid the latency penalty.
>

So, it is kind of a "soft quorum".

Going back to my original question: only the nodes that host the shards are
queried, but when there are not enough surviving nodes the quorum is
lowered.

As a corollary, I assume that when at least one shard is no
longer reachable (none of the surviving nodes hosts it) the cluster stops
accepting requests on that database: is that so?

Thanks for the answer,

Luca Morandini

Robert Newson

Jun 14, 2023, 11:20:37 AM
to user
Hi,

They are not votes; we are simply waiting to hear the first two of the three expected responses before returning a response to the client. No node will revert its write if another node's write fails. Each of the three nodes might return a different status code due to ordering (e.g., one node might return 201 for a write while another returns 409, which ends up adding a conflict to the document).
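
Such conflicts are not included in a normal read unless you ask for them; a minimal sketch, again assuming a hypothetical "quorumtest" database and document id:

    import requests

    BASE = "http://admin:password@127.0.0.1:5984"  # assumed credentials/host

    # The winning revision is returned; "_conflicts" lists any losing leaf revisions.
    doc = requests.get(f"{BASE}/quorumtest/example-doc",
                       params={"conflicts": "true"}).json()
    print(doc["_rev"], doc.get("_conflicts", []))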

CouchDB uses the shard map to know which nodes should host the document (based solely on its id) and directs reads and writes to those nodes only, correct.
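
The placement of an individual document can be inspected the same way; a sketch, assuming the same hypothetical database and document id as above:

    import requests

    BASE = "http://admin:password@127.0.0.1:5984"  # assumed credentials/host

    # Returns the shard range the id hashes to and the nodes that host that range.
    info = requests.get(f"{BASE}/quorumtest/_shards/example-doc").json()
    print("range:", info["range"], "nodes:", info["nodes"])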

If no node that hosts the range is available, you will get an error trying to read or write a document in that range and no write happens; there is no "hinted handoff" in our variant of Dynamo.

If the above happens for all ranges then, yes, all reads and writes for that database will fail.

B.