Leadership transfer & protecting against disruptive servers

Konstantin Osipov

Apr 16, 2021, 2:03:49 PM

to raft...@googlegroups.com

Hi,

I found this problem while reading the etcd raft source code.
Maybe there is no issue, or maybe it's not visible in a real-world
CockroachDB or etcd deployment. This makes it unclear where to
report it, so I decided to take my chances with this list.

According to section 3.10 of the Raft PhD thesis, when a leader is
asked to transfer leadership to another cluster member, it stays in
the leader role until the new member
- runs an election
- becomes a leader
- and sends the old leader AppendEntries with a new term,
converting it to a follower.

If this doesn't happen within the election timeout, the leadership
transfer is aborted.
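
The leader-side behavior described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not etcd's implementation: the `Leader` class and all of its fields and methods are hypothetical names, time is passed in explicitly as seconds, and refusing client requests during the transfer is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leader:
    """Hypothetical leader-side state for a leadership transfer."""
    term: int
    election_timeout: float  # seconds
    role: str = "leader"
    transferee: Optional[str] = None
    transfer_started: Optional[float] = None

    def start_transfer(self, target: str, now: float) -> str:
        # Record the target and start time; the caller sends the message.
        self.transferee = target
        self.transfer_started = now
        return "TimeoutNow"

    def accepts_client_requests(self) -> bool:
        # Assumption: the log is closed for new appends while transferring.
        return self.role == "leader" and self.transferee is None

    def on_append_entries(self, term: int) -> None:
        # AppendEntries with a newer term means the transferee won.
        if term > self.term:
            self.term = term
            self.role = "follower"
            self.transferee = None
            self.transfer_started = None

    def tick(self, now: float) -> None:
        # Abort the transfer if an election timeout passed without a new leader.
        if (self.transferee is not None
                and now - self.transfer_started >= self.election_timeout):
            self.transferee = None
            self.transfer_started = None
```

After an abort the node is still the leader in its old term and accepts client requests again; after a successful transfer it is a follower in the new term.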

However, it seems the PhD did not consider how this is going
to work with sticky leadership. Imagine a cluster of two nodes,
where one node is trying to transfer leadership to the other.
After the leader sends TimeoutNow, the lead transferee will become
a candidate and start an election. However, it will get no votes,
since the current leader will not vote for it according to 4.2.3
Disruptive Servers.

While it may sound trivial to amend 4.2.3 to have an exception for
3.10, I wonder what the point is of the old leader staying in the
awkward "transferring leadership" role after sending the TimeoutNow
RPC. If it converts to a follower instead, it will achieve the
desired effect and will not require the concept of "aborting" a
leadership transfer after an election timeout.

Thanks,

--
Konstantin Osipov, Moscow, Russia

Henrik Ingo

Apr 16, 2021, 3:48:57 PM

to raft...@googlegroups.com

Kostja,

At least in MongoDB, a similar handover was implemented in order to minimize the amount of time the cluster has no leader. You essentially want the new leader to be ready to take over at the exact moment the old one stops servicing client requests.

The next level is to implement some synchronization around the last AppendEntries call, to minimize the number of writes that fail to commit due to the planned failover.

Henrik

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/20210416180344.GA143650%40starling.

Oren Eini (Ayende Rahien)

Apr 16, 2021, 3:53:46 PM

to raft...@googlegroups.com

The reason you need to have the leader stay until deposed is that you run the risk of the chosen successor failing.

The way to handle that is to have the new candidate announce that this is a forced election and that sticky leadership should be ignored. That would make the rest of the system behave properly.

The good thing about it is that, sans failure, it is a very predictable process. Anyone who implements leadership transfer has to implement this; otherwise, with sticky leadership, it won't work at all, as you noted.
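
The interaction between sticky leadership and a forced election can be sketched as follows. This is a minimal Python illustration, not code from etcd or any other implementation: `RequestVote.forced`, `Server`, and `handle_request_vote` are hypothetical names, and the usual log up-to-date check is omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class RequestVote:
    term: int
    candidate: str
    forced: bool = False  # hypothetical flag: election triggered by TimeoutNow


@dataclass
class Server:
    term: int
    min_election_timeout: float  # seconds
    last_leader_contact: float   # a leader treats itself as "just heard from"
    voted_for: str = ""

    def handle_request_vote(self, msg: RequestVote, now: float) -> bool:
        # Sticky leadership: ignore vote requests while a leader seems alive,
        # unless the candidate announces a forced election.
        leader_alive = now - self.last_leader_contact < self.min_election_timeout
        if leader_alive and not msg.forced:
            return False
        if msg.term < self.term:
            return False
        if msg.term > self.term:
            self.term = msg.term
            self.voted_for = ""
        # Log up-to-date check omitted in this sketch.
        if self.voted_for in ("", msg.candidate):
            self.voted_for = msg.candidate
            return True
        return False
```

In a two-node cluster the candidate needs the old leader's vote. Without the flag the request is dropped and the old leader's term is unchanged; with `forced=True` the old leader adopts the new term and grants the vote.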

Konstantin Osipov

Apr 19, 2021, 9:10:16 AM

to 'Oren Eini (Ayende Rahien)' via raft-dev

* 'Oren Eini (Ayende Rahien)' via raft-dev <raft...@googlegroups.com> [21/04/16 22:54]:

> The reason you need to have the leader stay until deposed is
> that you are running the risk of the chosen successor failing.
> The way to handle that is to have the new candidate announce
> that this is a forced election, and sticky leadership should be
> ignored. That would make the rest of the system behave properly.
> The good thing about it is that sans failure, that is very
> predictable process, so anyone who is implementing leadership
> transfer has to implement this, otherwise, with sticky
> leadership, that won't work at all, as you noted.

Thanks Oren.

Just to clarify, this wasn't exactly my question. Obviously, if
the forced election succeeds, the leader will step down.

I was exploring what happens if the forced election fails, and
whether or not aborting the transfer on a timeout is the best
option in this case.

I suggested an alternative: convert the leader to a follower right
after sending TimeoutNow. That would eliminate the need for abort.

Henrik, in his reply, objected that the goal here is to minimize
the window during which the log is closed for new appends. I should
add that another goal is to not accidentally lose appended records
during a transfer. This is achievable if the leader first syncs up
with a quorum of replicas (not *one* replica), and only then
sends TimeoutNow.

Even if the transfer fails, the quorum will have the latest log,
and so will preserve it through the normal election that will surely
follow after a timeout. So it seemed a simpler alternative.
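
The "sync with a quorum first" condition can be sketched as follows. This is a minimal Python illustration under stated assumptions: `match_index` maps each follower to the highest log index known to be replicated on it, the leader counts itself toward the quorum, and the function names are hypothetical. The extra check that the target itself is caught up is an assumption of the sketch.

```python
from typing import Dict


def quorum_has_log(last_index: int, match_index: Dict[str, int],
                   cluster_size: int) -> bool:
    """True if a majority of the cluster, leader included, stores last_index."""
    replicas = 1 + sum(1 for m in match_index.values() if m >= last_index)
    return replicas > cluster_size // 2


def may_send_timeout_now(last_index: int, match_index: Dict[str, int],
                         cluster_size: int, target: str) -> bool:
    """Send TimeoutNow only once the target and a quorum are caught up."""
    return (match_index.get(target, 0) >= last_index
            and quorum_has_log(last_index, match_index, cluster_size))
```

In a five-node cluster whose leader is at index 10, followers at `{"B": 10, "C": 10, "D": 9, "E": 8}` satisfy the condition for a transfer to B, but not for a transfer to D.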

Meanwhile, I've found a reason to keep the ability to abort a
transfer on timeout. We might want to be able to do it anyway:
e.g. if the follower we tried to transfer leadership to happens to
be slow to sync the log, we're better off choosing another one.

Thanks again,

Oren Eini (Ayende Rahien)

Apr 19, 2021, 3:08:11 PM

to raft...@googlegroups.com

inline

On Mon, Apr 19, 2021 at 4:10 PM Konstantin Osipov <kos...@scylladb.com> wrote:

> * 'Oren Eini (Ayende Rahien)' via raft-dev <raft...@googlegroups.com> [21/04/16 22:54]:
>
> > The reason you need to have the leader stay until deposed is
> > that you are running the risk of the chosen successor failing.
> > The way to handle that is to have the new candidate announce
> > that this is a forced election, and sticky leadership should be
> > ignored. That would make the rest of the system behave properly.
> > The good thing about it is that sans failure, that is very
> > predictable process, so anyone who is implementing leadership
> > transfer has to implement this, otherwise, with sticky
> > leadership, that won't work at all, as you noted.
>
> Thanks Oren.
>
> Just to clarify, this wasn't exactly my question. Obviously, if
> forced election succeeds, the leader will step down.
>
> I was exploring what happens if the forced election fails, and
> whether or not aborting the transfer on a timeout is the best
> option in this case.

The only way this can fail is if the new candidate cannot get elected.

In that case, the old leader has already stepped down, so someone else will get the leadership.

There is no real need for a timeout, except to abort the election cycle when new votes are requested.

> I suggested an alternative: convert the leader to a follower right
> after sending TimeoutNow. That would eliminate the need for abort.

The timing here goes something like this.

The leader is stepping down. It chooses the most up-to-date follower and tells it to run a forced election.

Then it steps down.

The follower runs an immediate election, without doing a trial election first. That means that, usually, the election is done in 2-3 ping times.
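
The sequence above can be sketched as follows. This is a minimal Python illustration with hypothetical names (`pick_transferee`, `step_down_with_transfer`, and the message tuple layout are assumptions of the sketch); "most up to date" is read here as the follower with the highest replicated log index.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str, bool]  # (recipient, message type, forced election)


def pick_transferee(match_index: Dict[str, int]) -> str:
    """Return the follower with the highest replicated index (ties: lowest id)."""
    return min(match_index, key=lambda peer: (-match_index[peer], peer))


def step_down_with_transfer(match_index: Dict[str, int]) -> Tuple[str, List[Message]]:
    """Tell the best follower to run a forced election, then become a follower."""
    target = pick_transferee(match_index)
    outbox = [(target, "TimeoutNow", True)]
    return "follower", outbox
```

Because the old leader becomes a follower immediately, it needs no separate "transferring" state and no abort timer; if the target fails, the cluster falls back to an ordinary election.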


