candidate problem

Pedro Teixeira

Oct 24, 2014, 4:40:28 AM
to raft...@googlegroups.com
Hi,

I've read the paper some 10 times and I'm now implementing a node.js-only version of Raft, backed by LevelDB:
https://github.com/pgte/skiff

I've thought of a problematic scenario:

1. node A is a follower
2. node A becomes isolated from the network
3. because of that, node A's election timeout fires
4. node A becomes a candidate
5. node A increments currentTerm
6. because node A can't contact any other nodes, its election timeout fires again
7. go to 5, always incrementing currentTerm (according to the paper)
8. eventually node A gets connected to the network
9. node A currentTerm is much higher than any term of any other node, never accepting any AppendEntries call because of that

Because of this, node A might never get functional again.
I think the paper is clear about step 5, where the node increments the currentTerm when restarting another election cycle, or am I reading it wrong?


-Pedro

Hugues Evrard

Oct 24, 2014, 5:09:59 AM
to raft...@googlegroups.com, Pedro Teixeira
Hi Pedro,

On 10/24/14 10:40, Pedro Teixeira wrote:
> [...]
> 8. eventually node A gets connected to the network
> 9. node A currentTerm is much higher than any term of any other node,
> never accepting any AppendEntries call because of that
>
> Because of this, node A might never get functional again.
> I think the paper is clear about step 5, where the node increments the
> currentTerm when restarting another election cycle, or am I reading it
> wrong?
>

You may have missed these two details:
- all messages contain the term of the sender
- whenever a node receives a message, if the sender's term is higher
than its own term, then it updates its own term to the sender's
and converts to follower state.
(See Fig. 2 of the paper, "Rules for Servers" -> "All Servers", item 2 of the
list, and section 5.1, "Raft basics".)

Therefore, when node A gets connected to the network again, all nodes it
contacts will quickly update to node A's term.
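
The rule above can be sketched in a few lines of JavaScript. This is only an illustration, not code from skiff or from the paper: the `Node` class and its `startElection` and `onMessage` methods are hypothetical names, and the starting term (3) and the number of timeouts (40) are made-up values.

```javascript
// Minimal sketch of the "higher term wins" rule (Raft paper, Fig. 2, "All Servers").
// Node, startElection, and onMessage are hypothetical names for illustration.
class Node {
  constructor(id, currentTerm = 0) {
    this.id = id;
    this.currentTerm = currentTerm;
    this.state = 'follower';
  }

  // Election timeout fired: become a candidate and bump the term.
  startElection() {
    this.state = 'candidate';
    this.currentTerm += 1;
    return { type: 'RequestVote', term: this.currentTerm, from: this.id };
  }

  // Every incoming message carries the sender's term.
  onMessage(msg) {
    if (msg.term > this.currentTerm) {
      // Adopt the higher term and step down, whatever the current state is.
      this.currentTerm = msg.term;
      this.state = 'follower';
    }
  }
}

// Node A is isolated in term 3 and times out 40 times; B is the leader in term 3.
const a = new Node('A', 3);
const b = new Node('B', 3);
b.state = 'leader';
let lastRequest;
for (let i = 0; i < 40; i++) lastRequest = a.startElection();

// A reconnects: its RequestVote reaches B, which catches up to A's term.
b.onMessage(lastRequest);
console.log(b.currentTerm, b.state); // 43 follower
```

Once B has adopted term 43, node A's term is no longer ahead of the cluster, so the messages of whichever node wins the next election are not rejected as stale.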

--
Hugues Evrard
INRIA / LIG - Team CONVECS

Pedro Teixeira

Oct 24, 2014, 5:13:09 AM
to raft...@googlegroups.com, pedro.t...@gmail.com, hugues...@inria.fr
That's it, grokked it, thanks for the prompt response!

Hugues Evrard

Oct 24, 2014, 6:50:41 AM
to raft...@googlegroups.com
On 10/24/14 11:13, Pedro Teixeira wrote:
> That's it, grokked it, thanks for the prompt response!
>

You're welcome!

Your question made me wonder what would happen if a term counter
overflowed, which in turn made me wonder how long it would take
before it overflows.

If we consider an isolated node that keeps hitting election timeouts
and incrementing its counter, the number of years before its counter
overflows is:

( 2^(nb bits to store counter) * (election timeout delay in seconds) )
/ ( 60 * 60 * 24 * 365 )

For election timeouts of 0.15 seconds:

counter stored on 32 bits: 20.42887793 years =~ 20 years
counter stored on 64 bits: 87741362603.4 years =~ 87 billion years

So if you're putting Raft in a critical system whose nodes run on 32
bits and may run for decades, you may need a term resetting protocol to
spare your (grand-)children some debugging headaches ;-)
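
For anyone who wants to replay the arithmetic, here is a small JavaScript sketch of the formula above. `yearsUntilOverflow` is a hypothetical helper name, and the sketch assumes, as the formula does, an unsigned counter that is incremented exactly once per election timeout.

```javascript
// Years until a term counter of `bits` bits overflows, assuming one
// increment per election timeout. yearsUntilOverflow is a hypothetical helper.
const SECONDS_PER_YEAR = 60 * 60 * 24 * 365;

function yearsUntilOverflow(bits, electionTimeoutSeconds) {
  // 2 ** bits increments, each taking one election timeout.
  return (2 ** bits * electionTimeoutSeconds) / SECONDS_PER_YEAR;
}

console.log(yearsUntilOverflow(32, 0.15).toFixed(2));         // 20.43
console.log((yearsUntilOverflow(64, 0.15) / 1e9).toFixed(1)); // 87.7 (billion years)
```

A signed counter would halve these figures, since one bit is spent on the sign.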

Pedro Teixeira

Oct 24, 2014, 7:45:06 AM
to raft...@googlegroups.com, hugues...@inria.fr
Thankfully in JavaScript the maximum safe integer is 2^53 - 1, so that is not a problem... :)
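
A quick way to check this in node.js, as a sketch: `Number.MAX_SAFE_INTEGER` is the standard constant, while the 0.15-second election timeout is simply the figure reused from the earlier estimate.

```javascript
// Above Number.MAX_SAFE_INTEGER (2^53 - 1), adding 1 can be silently lost.
const limit = Number.MAX_SAFE_INTEGER;
console.log(limit === 2 ** 53 - 1);   // true
console.log(2 ** 53 + 1 === 2 ** 53); // true: the increment is lost

// Years for a term counter to reach that limit at one increment per 0.15 s.
const years = (limit * 0.15) / (60 * 60 * 24 * 365);
console.log(Math.round(years / 1e6)); // 43 (million years)
```

So a plain JavaScript number would stop incrementing correctly long before it "overflows" in the 64-bit sense, but only after tens of millions of years at this rate.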

you fu

Aug 28, 2017, 2:57:47 AM
to raft-dev, pedro.t...@gmail.com, hugues...@inria.fr
Then all nodes are followers and the next election begins. Because node A's log is not at least as up-to-date as the others', it can't win the election. Am I right?
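
The check this question refers to is the election restriction from section 5.4.1 of the paper: a node grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the term of the last log entry first and the log length second. A minimal sketch follows, where `isAtLeastAsUpToDate`, the object shape, and the sample values are hypothetical and chosen only for illustration.

```javascript
// Sketch of the voting check from the Raft paper (section 5.4.1).
// isAtLeastAsUpToDate and the { lastLogTerm, lastLogIndex } shape are hypothetical.
function isAtLeastAsUpToDate(candidate, voter) {
  // The log whose last entry has the later term is more up-to-date.
  if (candidate.lastLogTerm !== voter.lastLogTerm) {
    return candidate.lastLogTerm > voter.lastLogTerm;
  }
  // Same last term: the longer log is more up-to-date.
  return candidate.lastLogIndex >= voter.lastLogIndex;
}

// Node A missed entries while isolated; B kept replicating.
const nodeA = { lastLogTerm: 3, lastLogIndex: 10 };
const nodeB = { lastLogTerm: 3, lastLogIndex: 25 };

console.log(isAtLeastAsUpToDate(nodeA, nodeB)); // false: B refuses to vote for A
console.log(isAtLeastAsUpToDate(nodeB, nodeA)); // true: A can vote for B
```

Note that the comparison uses the term of the last log entry, not currentTerm, so node A's inflated currentTerm does not help it win. If no new entries were written while A was isolated, its log would still be as up-to-date as the others' and it could win the election.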