Designing a recurring event with timeouts using channels

148 views
Skip to first unread message

Arya

unread,
Apr 8, 2021, 2:15:27 PM4/8/21
to golang-nuts
For my work I need to make some modifications to an in-house raft library (https://raft.github.io/). Without delving too deep into the algorithm, one of the stages in the algorithm is a "candidate" goroutine that performs an assessment of  whether it should be able to transition to a "leader" state. 

There are some rules around how the candidate performs this assessment and one of the rules is to periodically run an "election" where the candidate seeks votes from its peers and if it receives the majority of votes, it declares itself a leader. The algorithm is a bit more nuanced to this but for the sake of simplification I am focusing just on the part where the election proceeds as follows:

1. Starts the election process.
2. Sets an election timer, which periodically expires, if the candidate has not received any votes during that period, it will reset the deadline and start the election all over again. 

With that in mind I modeled my implementation as follows:

func (f *candidate) startElection(ctx context.Context) <-chan struct{} {
       //Channel that signals to the caller when the election is done 
       electionCompletionChan := make(chan struct{})
        go func() {
              //Instead of busy-looping, running the poll logic at predefined intervals
              ticker := time.NewTicker(config.PollDuration * time.Second)
              defer ticker.Stop()
              //Reset the election deadline. SetDeadline takes the current time and sets a random offset                //in the future
              f.raftTimer.SetDeadline(time.Now())
              //callElection makes the RPC to its peers requesting votes and upon completion, 
              //returns its status to the voteStatusChan channel
              voteStatusChan := f.callElection()
             //Polling  
             for tick := range ticker.C {
              select {
                   //The caller called cancel on the context
                    case <-ctx.Done():
                           log.Println("Cancelling election")
                           //Send an empty struct signaling the end of the election process
                           electionCompletionChan <- struct{}{}
                           return
                   //Checking whether all the votes have been received 
                  case voteStatus := <-voteStatusChan:
                    {
                           switch voteStatus {
                               //Majority of peers agreed that the candidate should be the leader  
                               case voter.Leader:
                                       //Flip the state to leader
                                       f.leaderTrigger <- LeaderTrigger{}
                                       //Conclude the election process
                                       electionCompletionChan <- struct{}{}
                                       return
                               default:
                                      //Split vote or loss: Do nothing, re-election will be triggered
                           }
                     }
                  default:
                        //Check whether the election deadline was exceeded
                        if tick.After(f.raftTimer.GetDeadline()) {
                             //Reset the deadline
                              f.raftTimer.SetDeadline(tick)
                             //Once again make the callElection to redo the election
                              voteStatusChan = f.callElection()
                       }
            }
       }
    }()
    return electionCompletionChan
}


There are a few things that I am worried about in the above given snippet. The first thing is whether I am in alignment with golang's idioms while modeling this process using channels and the second thing is where I am resetting the voteStatusChan by creating channels within the poll loop. Something about creating new channels for every tick of the timer seems to be wasteful (I don't have a particularly nuanced understanding of how unbounded channel creations will tax the garbage collector or if there are other dangers lurking in the corner by taking this approach)

It will be great to get some feedback on what I may be doing wrong here.

Thanks

Jesper Louis Andersen

unread,
Apr 9, 2021, 10:14:35 AM4/9/21
to Arya, golang-nuts
On Thu, Apr 8, 2021 at 8:15 PM Arya <s...@kitengo.io> wrote:
There are a few things that I am worried about in the above given snippet. The first thing is whether I am in alignment with golang's idioms while modeling this process using channels and the second thing is where I am resetting the voteStatusChan by creating channels within the poll loop. Something about creating new channels for every tick of the timer seems to be wasteful (I don't have a particularly nuanced understanding of how unbounded channel creations will tax the garbage collector or if there are other dangers lurking in the corner by taking this approach)


While I have a somewhat overarching idea of the Raft protocol, I can't comment on its correctness because I'm not too well versed in the particular code base.

However, I can comment on the worries: a channel does cost you garbage, that will eventually require collection. However, if the GC pressure this provides is fairly small it is unlikely to have any impact. There's clearly a limit at which it becomes too expensive, but a few channels created per second is highly unlikely to be measurable. You should mostly be worried if the channel creation is in a tight loop and the tick resolution is lower than milliseconds (say). A large tick window, and situations where there is human interaction in the loop shouldn't pose too much of a worry. The other case is if a pathological reelection loop starts generating thousands of channels, but again, this is unlikely. In particular, one-shot channels are somewhat common where a channel is used for one single interaction, so generating a few shouldn't be a problem. Another way of looking at this is: how much GC pressure does the rest of the program provide? It might be that other parts of the program dominate GC pressure, and thus we can slightly modify Amdahl's law and argue that's where your effort should be.

In the above code, I don't really see any place channels are formed, except when you run an election. If memory serves, Raft doesn't run too many leader elections under normal operations.

As for the channel approach: channels have an advantage over a lot of "smarter" or "faster" approaches in that they are often easier to reason about in code, and for a consensus algorithm, you should probably worry about correctness before speed: it's easy to create a system which is very very very fast and also incorrect, to slightly paraphrase Joe Armstrong. Were I to create a Raft protocol from scratch in Go, I'd probably reach for a channel approach first (and I would definitely sit with a TLA+ implementation for reference).

The key thing I'd work on is to establish a frame of reference through some careful measurement. There is a quantitative point where you hit "good enough for our use case(tm)", and as long as you stay within that, you are likely to succeed, provided your solution is correct. If you have measurement, you know if you are operating within the boundaries of the frame or not. Count number of leader elections. Count time to elect a leader and dump it into some histogram approximation. Count the number of failed elections. Bumping a counter or updating a HDRHistogram should be cheap. And thinking in terms of system observability is generally valuable.

--
J.

Arya

unread,
Apr 9, 2021, 4:05:51 PM4/9/21
to golang-nuts
This was very helpful. I was somewhat apprehensive about using channels for this since I noticed a few other raft implementations have leveraged "shared-memory" instead of message passing to model the protocol. Validation of my approach really helps with making progress on this front. It didn't occur to me that I can use HDRHistogram to analyze some of the core metrics so I learnt something there. 

As far as the TLA+ spec for raft goes, I found a spec by one of the creators of the protocol (https://github.com/ongardie/raft.tla/blob/master/raft.tla). Hoping that will suffice with validating the correctness of the implementation.

Thanks for your thoughtful feedback.

Reply all
Reply to author
Forward
0 new messages