trying to make sense of the leadership transfer extension


Shaya Potter

Jun 20, 2021, 10:57:55 AM
to raft-dev
The leadership transfer extension described in section 3.10 of the dissertation is driving me a little nuts as I try to understand it.

The basic premise, as I understand it: if we make sure a target node is up to date with all the committed entries (i.e. stop accepting new ones, and drop in-flight entries on the floor to prevent them from being committed), and then "time it out" so that it immediately initiates an election on its own, it should conceptually win that election, as it will be as up to date as it needs to be to win it.
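The leader-side half of that premise might be sketched roughly like this in Go. All names and types here are made up for illustration; this is not any particular library's API:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy leader state; names are illustrative, not from any real Raft library.
type Leader struct {
	lastLogIndex uint64
	matchIndex   map[string]uint64 // highest index known replicated per follower
	transferring bool
}

var ErrTransferInProgress = errors.New("leadership transfer in progress")

// Propose rejects new client entries while a transfer is running, so the
// target can actually catch up to a fixed last index.
func (l *Leader) Propose() error {
	if l.transferring {
		return ErrTransferInProgress
	}
	l.lastLogIndex++
	return nil
}

// TryTransfer reports whether TimeoutNow could be sent yet: only once the
// target's match index has reached our last log index.
func (l *Leader) TryTransfer(target string) bool {
	l.transferring = true
	return l.matchIndex[target] >= l.lastLogIndex
}

func main() {
	l := &Leader{lastLogIndex: 10, matchIndex: map[string]uint64{"X": 9}}
	fmt.Println(l.TryTransfer("X")) // false: X is one entry behind
	l.matchIndex["X"] = 10
	fmt.Println(l.TryTransfer("X")) // true: X caught up; send TimeoutNow
	fmt.Println(l.Propose())        // leadership transfer in progress
}
```

The key point the sketch captures is that `lastLogIndex` is frozen while `transferring` is set, so "up to date" is a fixed target rather than a moving one.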

However, this assumes that its RequestVote RPCs will be accepted. If the current leader is still heartbeating away, that won't happen.

Per section 4.2.4:

"If a leader can reliably send heartbeats to its own configuration, then neither it nor its followers will adopt a higher term: they will not time out to start any new elections, and they will ignore any RequestVote messages with a higher term from other servers. Thus, the leader will not be forced to step down"

So the only way to do this is for the leader to stop heartbeating for a period of time > election timeout. But if it did that, any node could start an election, and I'm unsure how TimeoutNow helps anything.

am I missing something?

Ozan T

Jun 20, 2021, 12:16:47 PM
to raft...@googlegroups.com
Shaya, 

I agree with you. I believe it says exactly what you described, but please also check out section 4.2.3, Disruptive Servers:

Raft’s solution uses heartbeats to determine when a valid leader exists. In Raft, a leader is considered active if it is able to maintain heartbeats to its followers (otherwise, another server will start an election). Thus, servers should not be able to disrupt a leader whose cluster is receiving heartbeats. We modify the RequestVote RPC to achieve this: if a server receives a RequestVote request within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. It can either drop the request, reply with a vote denial, or delay the request; the result is essentially the same. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. 

However, it helps avoid disruptions from servers not in Cnew: while a leader is able to get heartbeats to its cluster, it will not be deposed by larger term numbers. This change conflicts with the leadership transfer mechanism as described in Chapter 3, in which a server legitimately starts an election without waiting an election timeout. In that case, RequestVote messages should be processed by other servers even when they believe a current cluster leader exists. Those RequestVote requests can include a special flag to indicate this behavior (“I have permission to disrupt the leader—it told me to!”).


Here there is a suggestion for this problem: adding a special flag to the request so that followers will not reject the new candidate's RequestVote. Hope this helps.

Ozan.



Disclaimer

The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/raft-dev/07158aa1-709a-45d7-8b69-851d1dca9ed9n%40googlegroups.com.

Shaya Potter

Jun 20, 2021, 5:36:50 PM
to raft...@googlegroups.com
(I apologize for the disclaimer at the bottom; it was recently added by IT and I have expressed my displeasure.)

I had a different but related idea today. I dislike the RequestVote-with-flag approach, as it puts the trust in the node, when I feel trust should rest only with the leader. i.e. imagine we sent TimeoutNow to node X, but between receiving that request and acting on it, it hibernated. When it came back, it simply continued. It shouldn't be able to disrupt the cluster with an election.

So my idea is: imagine that, just like config-change log messages, there was a "leadership transfer" log message, i.e. "we are trying to transfer leadership to node X". What happens here is that no further log messages (besides heartbeats) will be sent by the leader at all, until some leader-transfer timeout is reached, which indicates the transfer failed, and then it continues processing as normal.

Due to raft's consensus rules, we know that node X will be up to date with the leader when it receives the commit of this log message, and that a majority of the nodes (including the current leader) will have received it.

For all nodes besides node X, when they receive the log message they will flag that node X is allowed to request votes even if they have a valid leader. They will remove this flag if the leader sends them another normal (non-heartbeat) log message (i.e. it timed out the transfer) or if another valid election happens in between (i.e. the heartbeat died).

On commit at node X, this would be the equivalent of a TimeoutNow operation as described in the paper: node X would declare an election right away, and since the flags are set on the other nodes (or at least a majority of them), it should be able to win. If it can't win, the leader will probably lose quorum, which will cause a normal election to happen; but in the "good" case, node X becomes leader (up to date, declared the election first, and deemed valid). On accepting the new leader, the existing nodes again remove the flag that allowed the leader transfer, as it is no longer necessary. In the bad case, some other node ends up leader (or possibly the original leader again), but that's also possible in the scheme the paper describes.

In addition, as long as the leader's heartbeat continues to operate, it should remain able to process read-only requests, as those won't generate log entries, but it will have to error out any write requests.
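Follower-side, the scheme above might look roughly like this (a sketch of the idea, not standard Raft; every name here is made up):

```go
package main

import "fmt"

// Hypothetical entry kinds for the log-entry-based transfer scheme.
type EntryKind int

const (
	Normal     EntryKind = iota
	TransferTo // "we are trying to transfer leadership to node X"
)

type Entry struct {
	Kind   EntryKind
	Target string
}

type Follower struct {
	id             string
	allowCandidate string // node allowed to campaign despite a live leader
}

// applyEntry sets the "node X may disrupt the leader" flag on a transfer
// entry, and clears it when any later normal entry arrives (meaning the
// leader timed the transfer out and resumed). If we ARE node X, the commit
// of the transfer entry is our cue to start an election.
func (f *Follower) applyEntry(e Entry) (startElection bool) {
	switch e.Kind {
	case TransferTo:
		if e.Target == f.id {
			return true // we are node X: campaign on commit
		}
		f.allowCandidate = e.Target
	case Normal:
		f.allowCandidate = ""
	}
	return false
}

func main() {
	f := &Follower{id: "B"}
	fmt.Println(f.applyEntry(Entry{Kind: TransferTo, Target: "X"})) // false; flag set
	fmt.Printf("%q\n", f.allowCandidate)                            // "X"
	f.applyEntry(Entry{Kind: Normal})
	fmt.Printf("%q\n", f.allowCandidate) // "": flag cleared on a normal entry
	x := &Follower{id: "X"}
	fmt.Println(x.applyEntry(Entry{Kind: TransferTo, Target: "X"})) // true: start election
}
```

A follower's RequestVote handler would then consult `allowCandidate` instead of a flag carried in the RPC itself, which is the trust shift being argued for here.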

The reason I like this is that it doesn't require any real additional programming beyond the raft algorithm itself.

Thoughts? Good/bad? Have others described the exact same thing before? (The nature of CS: we are always rediscovering the wheel.)


Ozan T

Jun 22, 2021, 3:15:56 AM
to raft...@googlegroups.com
If I understand correctly, you assume a majority of the cluster applied that entry to their state before, or at the same time as, node X. That may not be the case, e.g.:

1 - Leader replicates the transfer entry (say entry index = 100) to all followers.
2 - Leader decides it can commit it, and sends a new AppendEntries to followers with commit index = 100.
3 - Node X receives this AppendEntries before the others, applies entry index = 100 to its state, and starts an election, but the other nodes will not grant their votes because they haven't received the "commit index = 100" message yet. What should node X do now? Retry? For how long?

Another case: a config-change-style entry that takes effect immediately when it's written to the log. If you meant that, it solves the problem above, but now you have to write code for it: check each entry to see whether it is a transfer entry, etc.
Also, what happens when you replay logs? You have to handle that case and prevent node X from starting an election after every restart.

Either way, coupling the raft log and elections may cost you more work than you expect. Leader transfer is a very rare operation, and a disruption like the one you mentioned (hibernating during leader transfer) should be very rare as well. When you are that unlucky (it can happen at any time, not just during leader transfer), users already expect unavailability bounded by the configured election timeout.

If I misunderstood something, sorry about that. Your idea may work; I just feel you may end up with more cases to handle in your algorithm this way.

Ozan.


Shaya Potter

Jun 22, 2021, 4:40:31 AM
to raft...@googlegroups.com
I don't think it's an assumption. The raft protocol is that a message will not be committed by the leader until it gets confirmation from a majority of the cluster that they appended the message to their logs.

Node X will not start an election on its reception of the append; it will only start an election on its receipt of the commit of that log entry, i.e. once the leader has determined that the majority has acknowledged the append.

I don't think it has to retry, as it doesn't have to be perfect. Even the scheme in the dissertation acknowledges that it can't be perfect. That said, I think this is overall better than the scheme in the paper, as a misbehaving node in my scheme has no special way to misbehave; it is just a normal misbehaving node. i.e. raft is supposed not to care about clock skew between nodes, and I don't think a node requesting votes with a flag that says "treat me specially and ignore your current leader" shares that property anymore. I believe my scheme does.

To recap: per the paper, in order to do a leader transfer, you have to make sure the targeted node is up to date with the current leader. My scheme accomplishes that by refusing future log entries (at least until a timeout, when the current leader gives up on the transfer) while maintaining the heartbeat (so no one else will try to become leader). By waiting for the commit of this new, special log entry, combined with the fact that there won't be any log entries after it, we guarantee that the target will be up to date. I believe that even in the scheme described in the paper, one would have to refuse future log entries on the old/current leader until a timeout occurred (and this becomes irrelevant once leadership has transferred, as then it is inherent).

By only declaring the election on the targeted node after it receives the commit of the log entry from the leader (i.e. the majority has accepted, though not necessarily committed, the log entry), it knows that it "can" (though not for certain) win the election, and hence knows that it can declare one. I don't believe this log entry would have to be rolled back if it is never committed, as it only changes metadata state that is reset through normal raft actions that occur after it is set.

So what can happen when the leader appends this log entry to its log?

1) The leader never gets confirmation from a majority of the cluster and just hangs (no different from any other cluster losing quorum). It's possible that a new leader will then be elected within the majority partition the old leader can't reach, but that's normal raft, so nothing new.
2) The leader gets quorum on the entry and commits it.

What can then happen if it committed it?

1) The targeted node gets the commit message and declares an election.
2) The targeted node never gets the commit message; the leader times out the transfer and continues processing new entries. To the cluster, this just turned the entry into an effective no-op log entry with no new entries coming in for a period, no different from any normal cluster.

What happens if the targeted node gets the commit message and declares an election?

1) It has connectivity to the entire cluster (or the majority of the cluster that got the leader's transfer message), and hence will win the election.
2) It does not have connectivity to the entire cluster and fails to win the election, as it can't request votes from enough members that got the old leader's transfer log entry.

In case #2 it loses, it disrupted the cluster, and another election will be held and someone else will win. This isn't fundamentally different from what the paper describes can happen, and seems to be a natural outcome in raft. i.e. leader transfer doesn't have to guarantee that X becomes the leader; it has to provide an environment in which, in a healthy cluster, X should become the leader.

Am I missing something? To me this seems relatively simple and straightforward. In every case, either raft already handles the situation in a defined manner and my scheme doesn't change anything, or my scheme provides a simple success/failure model that also fits within normal raft.

Shaya Potter

Jun 22, 2021, 5:26:42 AM
to raft...@googlegroups.com
I reread your response and realize I didn't fully understand it the first time (you had inferred what I meant), but hopefully the recap clarified things.

I'm not sure there's a problem with replay of log entries. Why? The commit doesn't cause an immediate election; it treats the next timeout check as an actual timeout, and that occurs in the periodic function. Assuming the transfer log entry wasn't the last log entry, any future log entries committed after it during replay will wipe the transfer/timeout flag away, so the periodic function will never trigger an election, as I assume (perhaps incorrectly) that the periodic function won't be triggered during a replay.
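That replay argument can be sketched in a few lines of Go (hypothetical names; the point is only that the election trigger lives in the periodic tick, not in apply):

```go
package main

import "fmt"

// Sketch of the replay argument: because the election fires from the
// periodic tick rather than from apply, replaying a log in which the
// transfer entry is followed by later entries never starts an election.
type Node struct {
	electNow bool // set when our own transfer entry commits
}

func (n *Node) apply(isTransferToMe bool) {
	if isTransferToMe {
		n.electNow = true
	} else {
		n.electNow = false // any later entry wipes the pending trigger
	}
}

// tick runs periodically; assume it is not called between replayed entries.
func (n *Node) tick() bool {
	if n.electNow {
		n.electNow = false
		return true // start election
	}
	return false
}

func main() {
	n := &Node{}
	// Replay: transfer-to-me entry followed by a later committed entry.
	n.apply(true)
	n.apply(false)
	fmt.Println(n.tick()) // false: no spurious election after replay
	// Live case: transfer entry commits, next tick starts the election.
	n.apply(true)
	fmt.Println(n.tick()) // true
}
```

The remaining edge case is a transfer entry that happens to be the very last entry in the replayed log, which is where an extra safety check might still be wanted.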

It might be possible to put some other safety checks in, but I'm unsure they are necessary.



Ozan T

Jun 22, 2021, 11:33:38 PM
to raft...@googlegroups.com
I see you mention some implementation details of yours. The points I raised may not apply to your implementation, so I can't comment on that, and I'll be happy if my concerns are wrong. For either approach the worst-case scenario is the same: unavailability for a period determined by the election timeout. As you mention a "periodic function, not immediate elections" for your implementation, I guess you're already okay with an unavailability window. (The method in the raft PhD tries to finalize the transfer as fast as possible, with an immediate election on the TimeoutNow message.)

I just think you may end up with more code to maintain, more cases to think about, and higher failure rates compared to the method described in the thesis, because the thesis method is simpler: it doesn't use the log and state machine. Whenever you use the state machine for something, you have to consider timing and replay. On timing: in your case, followers must apply that special entry before the target node does, otherwise they won't grant it their votes. Failure is okay here, but I can see it happening often in a standard raft implementation; "immediate elections" would cause it a lot. This may not apply to your implementation, I don't know. On replay: you now always have to think about it; you may not need to write code for it in your implementation, I don't know that either. I just say that the method in the thesis avoids these issues (or possible issues).

It's often best to follow previous experience. Personally, I'd go with something like the thesis approach after checking out existing implementations, the majority of which I believe do something similar to the thesis. But experimenting is also good; hope it works for you :)
