Handling slow I/O


Oren Eini (Ayende Rahien)

Jan 9, 2018, 7:49:11 AM
to raft...@googlegroups.com
I have a case where the leader is sending entries to followers.
The followers write the entries to stable storage and answer to the leader.
A follower may experience slow I/O, which leads the leader to time out after 300 ms and consider itself no longer the leader if enough followers hit this condition at the same time.
We handle this by having the follower let the leader know that it received the entries and that the write is in progress.
This way, during this particular period (writing entries to disk), we defer the timeout on both the follower and the leader.

Given the reason (slow I/O), this avoids elections that would just cause us to end up in the same place, but I wanted to know if this violates any of the safety properties.



Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

Karl Nilsson

Jan 9, 2018, 8:17:05 AM
to raft...@googlegroups.com
Not sure I follow why the leader would time out. Do you set a timer on the leader itself? Normally only followers and candidates set election timers. Or do you mean one of the other followers is timing out because the leader is blocking on an RPC call to some other follower?

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Oren Eini (Ayende Rahien)

Jan 9, 2018, 8:45:46 AM
to raft...@googlegroups.com
The leader has a timeout of its own: if it doesn't get positive responses from enough followers within that timeout, it steps down.
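
A minimal sketch of such a leader-side check, in Python. The class name `LeaderContactTracker`, its methods, and the injectable `clock` are hypothetical and not taken from any implementation discussed in this thread; the 300 ms default mirrors the timeout mentioned in the first message, and "enough followers" is assumed here to mean that the leader can no longer count a majority of the cluster, itself included.

```python
import time


class LeaderContactTracker:
    """Hypothetical helper: remembers when each follower last answered the leader."""

    def __init__(self, follower_ids, timeout_s=0.3, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Assumption: becoming leader counts as fresh contact with every follower.
        self.last_contact = {f: now for f in follower_ids}

    def record_response(self, follower_id):
        """Call whenever a positive response arrives from a follower."""
        self.last_contact[follower_id] = self.clock()

    def should_step_down(self):
        """True when the leader plus recently heard followers is not a majority."""
        now = self.clock()
        reachable = 1 + sum(  # the leader counts itself
            1 for t in self.last_contact.values() if now - t <= self.timeout_s
        )
        cluster_size = len(self.last_contact) + 1
        return reachable <= cluster_size // 2
```

In a three-node cluster this keeps the leader in place as long as at least one follower has answered within the timeout.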


Karl Nilsson

Jan 9, 2018, 9:42:45 AM
to raft...@googlegroups.com
What made you add this behaviour? If a leader isn't able to send RPCs to maintain its leadership, one of the followers will time out anyway and start an election. I don't see why a leader would voluntarily step down without having seen a higher term. That said, I don't see how it would affect safety, so it's probably OK.


Oren Eini (Ayende Rahien)

Jan 9, 2018, 9:47:13 AM
to raft...@googlegroups.com
I'm doing this so a leader that was isolated from the network will know that it is not the leader and won't try to accept new commands.


shanc...@gmail.com

Jan 9, 2018, 8:59:04 PM
to raft-dev
Use a separate thread to answer the heartbeat.

On Tuesday, January 9, 2018 at 8:49:11 PM UTC+8, Ayende Rahien wrote:

Oren Eini (Ayende Rahien)

Jan 10, 2018, 3:06:52 AM
to raft...@googlegroups.com
You can't answer the heartbeat until you have finished processing the incoming log entries.
Otherwise, the leader may commit them on the assumption that they have been persisted.
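
One way to reconcile a separate thread with this constraint is to let the second thread do the disk write while the handler reports only progress until the write completes. The following Python sketch assumes a hypothetical `persist(entries)` callback that blocks until the entries are on stable storage and a hypothetical `send(status)` callback that delivers a reply to the leader; the status strings and the interval are illustrative as well.

```python
import threading


def handle_append_entries(entries, persist, send, pending_interval_s=0.1):
    """Hypothetical follower handler: never reports success before the write is durable."""
    done = threading.Event()
    errors = []

    def write():
        try:
            persist(entries)  # blocks for as long as the disk needs
        except Exception as exc:  # remember the failure for the final reply
            errors.append(exc)
        finally:
            done.set()

    threading.Thread(target=write, daemon=True).start()

    # While the write is in flight, tell the leader we are alive but not done.
    while not done.wait(pending_interval_s):
        send("pending")

    # Only now may the leader count these entries as persisted on this follower.
    send("failure" if errors else "success")
```

The point of the design is that "pending" carries no claim about durability, so the leader has nothing it could commit on by mistake.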

jordan.h...@gmail.com

Jan 10, 2018, 3:18:06 PM
to raft...@googlegroups.com
This is a pretty common behavior, and the rationale is so a partitioned leader doesn’t continue accepting changes from a client that’s connected to it. If the leader is partitioned, even if another node times out and starts a new election that doesn’t mean the leader will learn about it. Instead, the old leader could continue to accept writes from a client after a new leader has been elected, whereas stepping down can force the client to search for a new leader. Pretty sure this is mentioned in Diego’s dissertation as well.

Anyways, as for my own thoughts on this: I suppose it depends on what the leader does with this information. If the follower notifies the leader simply as a means to reset the leader’s timers and prevent it from stepping down then I don’t see how that could violate safety properties so long as the leader still respects higher terms and doesn’t commit the replicated entries until the follower indicates they’ve been persisted. “Detecting” partitions on the leader and stepping down, after all, is not even a component of the safety proof.

Coincidentally, we’ve been running into some similar problems with slow I/O causing instability in Raft clusters. I’m curious what other measures you’ve taken to avoid various timeouts from slow I/O.

Archie Cobbs

Jan 10, 2018, 4:53:02 PM
to raft-dev
On Wednesday, January 10, 2018 at 2:18:06 PM UTC-6, Jordan Halterman (kuujo) wrote:
This is a pretty common behavior, and the rationale is so a partitioned leader doesn’t continue accepting changes from a client that’s connected to it. If the leader is partitioned, even if another node times out and starts a new election that doesn’t mean the leader will learn about it. Instead, the old leader could continue to accept writes from a client after a new leader has been elected, whereas stepping down can force the client to search for a new leader. Pretty sure this is mentioned in Diego’s dissertation as well.

What is the reason behind doing this?

It might be that I'm not clear on what you mean by "accepting changes from a client".

In other words, I thought that normally a client will wait for the server to confirm its change has been committed before proceeding.

But if the server is partitioned, then nothing will be committable, so the partitioned server will never "accept" the change... at least not by the definition of "accepted" meaning "committed".

-Archie

jordan.h...@gmail.com

Jan 10, 2018, 5:36:19 PM
to raft...@googlegroups.com
The problem is that the partitioned leader will presumably still attempt to commit a change sent by a client even when it can safely assume another leader has probably been elected (e.g. it hasn’t reached a majority of the cluster in several election timeouts). Whether that makes much of a difference probably depends on the specific implementation of the clients. If a client’s request is blocked until the leader commits, it will be waiting until the partition heals to find out the leader can’t commit the change. If the client’s request has a timeout, it will have to wait for that timeout before attempting to find another leader. But if the leader steps down, it can immediately reject any clients’ requests and force them to find a new leader.

The benefit of relying on the leader to indicate that it can’t commit a change is perhaps that the leader has much more information about whether a client’s change is likely to be committed. It can differentiate between high latency (it’s making progress but taking a while for the client’s change to be committed) and a loss of availability (it hasn’t made progress after an election timeout).
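
To illustrate the client side of this, here is a small Python sketch. Everything in it is hypothetical (`NotLeaderError`, `Node`, `submit_with_failover`), and appending to a local list merely stands in for real replication; the only point is that a node which has stepped down can fail fast, so the client moves on without waiting for a timeout.

```python
class NotLeaderError(Exception):
    """Hypothetical error raised by a node that knows it cannot commit."""


class Node:
    def __init__(self, name, is_leader):
        self.name = name
        self.is_leader = is_leader
        self.log = []

    def submit(self, command):
        if not self.is_leader:
            # A stepped-down leader rejects immediately instead of blocking.
            raise NotLeaderError(self.name)
        self.log.append(command)  # stand-in for replicating and committing
        return self.name


def submit_with_failover(nodes, command):
    """Try each known node in turn until one accepts the command."""
    for node in nodes:
        try:
            return node.submit(command)
        except NotLeaderError:
            continue  # fast rejection: go look for the new leader
    raise RuntimeError("no leader reachable")
```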

Oren Eini (Ayende Rahien)

Jan 10, 2018, 5:48:17 PM
to raft...@googlegroups.com
What I'm doing is sending a "pending" response back to the leader, and that just keeps the leader and follower timers reset while this is going on.
It doesn't modify any other state.

I'm doing this for InstallSnapshot as well as AppendEntries.
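
On the leader, that could look roughly like the Python sketch below. The names (`FollowerProgress`, `Leader`, `on_reply`) and the status strings are hypothetical. The commit rule is deliberately simplified: it takes the highest index stored on a majority and ignores Raft's requirement that a leader only counts replicas for entries from its current term.

```python
from dataclasses import dataclass, field


@dataclass
class FollowerProgress:
    match_index: int = 0       # highest index known to be persisted on the follower
    last_contact: float = 0.0  # when the follower last replied, in leader clock time


@dataclass
class Leader:
    followers: dict = field(default_factory=dict)  # follower id -> FollowerProgress
    last_log_index: int = 0
    commit_index: int = 0

    def on_reply(self, follower_id, status, now, match_index=None):
        progress = self.followers[follower_id]
        # Both kinds of reply defer the step-down timer.
        progress.last_contact = now
        if status == "success":
            # Only a success reply (which must carry match_index) moves replication state.
            progress.match_index = max(progress.match_index, match_index)
            self._advance_commit()
        # A "pending" reply changes nothing else.

    def _advance_commit(self):
        # Simplified: the leader's own log counts as one replica.
        indexes = sorted(
            [self.last_log_index] + [p.match_index for p in self.followers.values()],
            reverse=True,
        )
        majority = len(indexes) // 2 + 1
        self.commit_index = max(self.commit_index, indexes[majority - 1])
```

Because "pending" touches only `last_contact`, it cannot cause an entry to be committed before a follower has actually persisted it.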

Oren Eini (Ayende Rahien)

Jan 10, 2018, 5:49:16 PM
to raft...@googlegroups.com
The client will wait for the server to confirm that the change was committed. It may wait until a timeout.
Alternatively, the server can usually tell immediately that it is not going to proceed, and we want a fast error.


John Ousterhout

Jan 10, 2018, 5:49:49 PM
to raft...@googlegroups.com
One reason for this approach is that it allows the leader to process reads without going through the consensus protocol. The timeouts provide a form of lease for the leader; as long as the leader's lease is intact, it knows that no one else could have been elected leader for a new term, so it knows any data it stores is up-to-date. Thus, it can respond to read requests without checking with any other servers. Without leases, reads have to go through the consensus protocol to make sure that the supposed leader's data isn't stale.

Writes always have to go through the full consensus protocol.

-John-
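
A rough Python sketch of a lease-guarded read follows. The class `LeaseReader`, its parameters, and the `drift_margin_s` safety margin are hypothetical. The sketch also assumes that followers will not vote for another candidate within one election timeout of hearing from the leader, and that clock drift between servers is bounded; without those assumptions a lease like this is not safe.

```python
import time


class LeaseReader:
    """Hypothetical helper: serves local reads only while the leader's lease holds."""

    def __init__(self, cluster_size, election_timeout_s,
                 drift_margin_s=0.05, clock=time.monotonic):
        self.cluster_size = cluster_size
        self.election_timeout_s = election_timeout_s
        self.drift_margin_s = drift_margin_s
        self.clock = clock
        self.acks = {}  # follower id -> send time of the latest acknowledged heartbeat

    def record_ack(self, follower_id, sent_at):
        # Use the time the heartbeat was sent, not the time the reply arrived.
        self.acks[follower_id] = max(sent_at, self.acks.get(follower_id, sent_at))

    def lease_valid(self):
        needed = self.cluster_size // 2  # followers needed besides the leader
        if needed == 0:
            return True
        recent = sorted(self.acks.values(), reverse=True)
        if len(recent) < needed:
            return False
        lease_start = recent[needed - 1]  # oldest ack within the freshest majority
        deadline = lease_start + self.election_timeout_s - self.drift_margin_s
        return self.clock() < deadline

    def read(self, state, key):
        if not self.lease_valid():
            raise RuntimeError("lease expired; confirm leadership with a quorum first")
        return state.get(key)
```

When the lease has lapsed, the read falls back to whatever quorum check the implementation uses; writes are unaffected and still go through the full protocol.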
