Handling failures in a distributed database with Raft

Dan Palmer

Apr 18, 2014, 9:34:41 AM

to raft...@googlegroups.com

Hi everyone,

I have a question concerning the use of Raft to implement a distributed database, specifically the replication mechanism where a master node replicates data down to multiple secondary nodes.

I've encountered a possible failure case, and I'm trying to understand whether Raft itself can prevent it, or whether it’s a purely application-level issue that I need to deal with separately.

Consider the following order of events. I’m assuming that a write fails from the database client’s perspective if the master fails to commit on a majority of the nodes.

1. client sends a 'write' to the master node
2. master sends this in AppendEntries to secondaries, but it is not yet committed
3. master receives acknowledgements from a majority of secondaries
4. master commits
5. master sends AppendEntries to secondaries with the write now at or below its commit index
6. secondaries apply write and update commit index
7. master fails or a network partition occurs.
8. client's connection to master fails, or master fails to commit on other nodes due to a network partition, therefore 'failing' the write.
9. secondaries attempt to acknowledge AppendEntries, which fails
10. new master is elected from secondaries which have committed the write.

In this case, the client believes the write failed, but the new master has it committed and accepted into the database.
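The steps above can be traced with a small simulation. This is only a sketch under stated assumptions: the node names `A`, `B`, `C`, the in-memory `logs` dictionary, and the way the new master is picked are all made up for illustration, and no real RPCs or terms are modelled. The one Raft rule it leans on is the election restriction: a candidate can only win if its log is at least as up to date as the logs of a majority of voters.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


# Hypothetical three-node cluster; "A" starts as the master.
logs = {"A": [], "B": [], "C": []}
entry = "write x=1"

# Steps 1-3: the master appends the write and replicates it to the secondaries.
logs["A"].append(entry)
acks = 1  # the master counts itself
for secondary in ("B", "C"):
    logs[secondary].append(entry)
    acks += 1

# Step 4: the write is committed once a majority stores it.
committed = acks >= majority(len(logs))

# Steps 7-8: the master fails before the client hears back.
client_view = "failed"

# Step 10: a surviving node with the most complete log becomes the new master
# (a stand-in for Raft's election restriction).
survivors = {name: log for name, log in logs.items() if name != "A"}
new_master = max(survivors, key=lambda name: len(survivors[name]))
write_survived = entry in logs[new_master]
```

In this trace the client's view and the cluster's state disagree: `client_view` is `"failed"` while `committed` and `write_survived` are both true.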

I know my knowledge of Raft is relatively minimal so far, and I'm still in the process of learning the details, but I'm struggling to see where this scenario goes wrong. It seems like a failure case that should be handled by a distributed consensus algorithm, because committing a write that the client believes failed is an issue.

Thanks for any help you can provide.

Hugues Evrard

Apr 18, 2014, 10:11:53 AM

to raft...@googlegroups.com, Dan Palmer

Hi Dan,

I think the scenario you described is possible, and that the way to
handle such cases is described in the section "Client Interaction" of the
Raft paper [0].

In short, it requires the client to assign a serial number to each
command. The serial number is stored alongside the command in the log, and
is checked whenever a new request is made.

In your example, when the client believes the write failed, it will
request a write again with the same serial number (since the previous
failed). The newly elected leader will detect this serial number in its
log and reply directly, since the command is already committed.
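As a rough illustration of the serial-number idea (a minimal sketch, not the paper's implementation): the class `KVStateMachine`, its `sessions` table, and the client and key names are all hypothetical. It assumes each client has at most one command outstanding, so a retry always carries the most recently applied serial number.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Toy state machine that refuses to execute the same command twice."""

    data: dict = field(default_factory=dict)
    # client_id -> (serial of the last applied command, its cached response)
    sessions: dict = field(default_factory=dict)

    def apply(self, client_id, serial, key, delta):
        """Apply a committed command; a repeated serial returns the cached reply."""
        last = self.sessions.get(client_id)
        if last is not None and serial == last[0]:
            return last[1]  # duplicate of the last command: do not re-execute
        self.data[key] = self.data.get(key, 0) + delta
        response = self.data[key]
        self.sessions[client_id] = (serial, response)
        return response


sm = KVStateMachine()
first = sm.apply("client-1", 1, "balance", 10)  # original write, reply was lost
retry = sm.apply("client-1", 1, "balance", 10)  # client retries, same serial
```

Both calls return the same response and the balance is only incremented once, which is the behaviour the retry relies on.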

Note that I'm not the most familiar with Raft internals, so maybe someone
else can confirm this?

--
Hugues

Allen George

Apr 18, 2014, 10:39:51 AM

to Hugues Evrard, raft...@googlegroups.com, Dan Palmer

This is a standard problem in distributed systems (and I mean distributed in the broadest sense, i.e. a number of processes communicating to perform an action). Consider the following common case:

1. Client contacts a database server with an UPDATE.

2. Database server makes the UPDATE and sends back the response.

3. Connection between client and server dies and response is lost.

4. Client times out.

In this scenario the client believes the request has failed while the server does not.

There are a number of techniques you can use to deal with this and the one you choose depends on your application semantics and other tradeoffs. You can use request ids, idempotent operations...
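To make the "idempotent operations" option concrete, here is a minimal sketch. The functions `set_value` and `increment` and the `store` dictionary are hypothetical names used only for this example. It shows why blindly retrying after a lost response is harmless for one kind of operation and not for the other.

```python
def set_value(store, key, value):
    """Idempotent: applying it twice leaves the same state as applying it once."""
    store[key] = value


def increment(store, key, delta):
    """Not idempotent: every application changes the state again."""
    store[key] = store.get(key, 0) + delta


store = {"visits": 5}

# The client times out and retries an idempotent write: no harm done.
set_value(store, "visits", 6)
set_value(store, "visits", 6)
after_set_retry = store["visits"]

# The client times out and retries a non-idempotent write: it is applied twice.
increment(store, "visits", 1)
increment(store, "visits", 1)
after_increment_retry = store["visits"]
```

Operations like `increment` are the ones that need request ids (or some other deduplication) before a retry is safe.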

The Raft consensus algorithm (in fact, any consensus algorithm that works on an asynchronous network; someone fact-check me!) can give you false negatives. Due to message loss or delays, it's possible for a participant to believe that a write has failed when, in fact, it has been successfully committed. What it will _never_ do is give you false positives: you will never be told that a write is committed and then have it disappear on you.

Cheers,

Allen

--

Terminal Musings: http://www.allengeorge.com/
Raft in Java: https://github.com/allengeorge/libraft/

Twitter: https://twitter.com/allenageorge/

Hugues Evrard

Apr 18, 2014, 10:56:22 AM

to Dan Palmer, raft...@googlegroups.com

Hi Dan,

(re-adding raft-dev@ to the discussion)

On 04/18/14 16:15, Dan Palmer wrote:

> Yeah, this would work, but I think it’s more of a work-around rather than a fix. If the client made the write, which it thought had failed, and then chose not to re-attempt it, the database would still be inconsistent.

I agree. The solution proposed in the paper seems to assume that the client will keep retrying the command until it receives a response.

> Also, even if it did re-attempt, it would be committed in the database at the wrong time.

I'm not sure what you mean by "the wrong time". From the client's point of view, this command will be committed after its previous commands and before its next ones (Captain Obvious is not far away :) ).
I mean, if you rely on this client command to be committed between certain commands of *other* clients, then I think the problem is a synchronization of the clients.

> Thanks for the idea though, from an application point of view, this gives me something to think about. I’d be interested in hearing from anyone with a more detailed knowledge of Raft.

Yep, if anyone has better insight, I'd be interested too.

--
Hugues

Dan Palmer

Apr 18, 2014, 12:53:32 PM

to Hugues Evrard, raft...@googlegroups.com

On 18 Apr 2014, at 15:56, Hugues Evrard <hugues...@inria.fr> wrote:

> I mean, if you rely on this client command to be committed between certain commands of *other* clients, then I think the problem is a synchronization of the clients.

Good point, hadn’t really thought this bit through fully.

Diego Ongaro

Apr 21, 2014, 11:32:28 AM

to Dan Palmer, raft...@googlegroups.com

I agree with what others have said, especially Allen's point that this
is not at all specific to consensus but occurs in any distributed
system.

> On 04/18/14 16:15, Dan Palmer wrote:
>
> Yeah, this would work, but I think it’s more of a work-around rather than a
> fix. If the client made the write, which it thought had failed, and then
> chose not to re-attempt it, the database would still be inconsistent.

To throw in my two cents, the only outcomes that clients see are
"don't know" and "completed". There's only a small fraction of failure
cases where you know for certain that the request has not been and
will never be executed, so it's better to just lump that into "don't
know". A client that gets "don't know" can either pass that on up the
stack to the application or retry. It's pretty hard to guarantee that
a command will never be executed; that takes about as much work as
completing the command.
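A minimal sketch of what that looks like from the client side, under loud assumptions: `Outcome`, `FlakyServer`, and `write` are hypothetical names invented for this example, the "server" is an in-process stand-in that executes every request but loses its first few replies, and retries reuse the same request id so the command is not executed twice.

```python
import enum


class Outcome(enum.Enum):
    COMPLETED = "completed"
    DONT_KNOW = "don't know"


class FlakyServer:
    """Stand-in server: always executes the command, but may lose the reply."""

    def __init__(self, drop_first_n_replies):
        self.drops_left = drop_first_n_replies
        self.applied = {}  # request_id -> position of the command in the log
        self.log = []

    def submit(self, request_id, command):
        if request_id not in self.applied:  # execute each request id only once
            self.log.append(command)
            self.applied[request_id] = len(self.log)
        if self.drops_left > 0:
            self.drops_left -= 1
            raise TimeoutError("reply lost")
        return self.applied[request_id]


def write(server, request_id, command, max_attempts):
    """Return COMPLETED on a reply; otherwise DONT_KNOW (never 'failed')."""
    for _ in range(max_attempts):
        try:
            server.submit(request_id, command)
            return Outcome.COMPLETED
        except TimeoutError:
            continue  # retry with the same request id
    return Outcome.DONT_KNOW


server = FlakyServer(drop_first_n_replies=1)
no_retry = write(server, "req-1", "SET x=1", max_attempts=1)
with_retry = write(server, "req-1", "SET x=1", max_attempts=3)
```

Here the first call can only report "don't know" even though the command was executed, and the second call turns that into "completed" without the command being logged a second time.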

-Diego
