Hibernating Rhinos Ltd
Oren Eini l CEO l Mobile: + 972-52-548-6969
Office: +972-4-622-7811 l Fax: +972-153-4-622-7811
On Fri, Aug 21, 2015 at 7:54 PM, Oren Eini (Ayende Rahien)
<aye...@ayende.com> wrote:
> We have implemented much the same thing, see my comments below. I think that
> most people with a real-world implementation had to do that.
>
Hi Oren! Thanks, that is good to know. It's validating to hear this.
>> Persistent State:
>> databaseId
>
> Is that meant to be a node id or a cluster id?
> In our impl, we have a cluster id, which is generated as the first action of
> the initial node. Messages from a different cluster are rejected.
No, this is not a node id nor a cluster id.
While a real-world implementation probably would have some kind of
cluster id (LogCabin has one too, even if it is never mentioned in any
of the Raft papers), to me that's not an important part of Raft. It
basically falls under the heading of "authentication", and in
authentication lingo such an id would commonly be called a "shared
secret". But there could also be any other form of authentication,
such as cryptographic keys or even just username+password. That is
not what I'm talking about here.
The databaseId is there to identify the *database*: the data,
regardless of which server it is located on. So it should be stored as
part of the database, and should be part of backups, etc.
This is a bit unintuitive, but the gatekeeping happens at the AddServer
RPC. As discussed in this thread, a typical implementation (at least
for databases) would fail the AddServer RPC if the new server
contains data that is not a copy of the same database the cluster
has. A corner case, which just happens to be how I wrote it, is
that a node with an empty database will pass the AddServer RPC even
when it has a different databaseId. It will then get the correct
databaseId at this step in AppendEntries. But there is no data to be
deleted, so while this sounds like a scary step, it is not.
A real-world implementation might very well pass along the databaseId
in every call and check it every time with an assertion. That's
probably good programming practice. But from the algorithm's point of
view it is only needed in AddServer, and in the
AppendEntries/InstallSnapshot immediately following an AddServer. So
I've only added it in the places where it is necessary.
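To make the gatekeeping concrete, here is a minimal sketch of the AddServer check described above. All names (Leader, AddServerRequest, etc.) are illustrative assumptions, not from any real Raft library:

```python
# Hypothetical sketch of databaseId gatekeeping at the AddServer RPC.
# A server with data from a *different* database is rejected; an empty
# server is admitted and will adopt the cluster's databaseId with the
# first AppendEntries/InstallSnapshot.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AddServerRequest:
    server_id: str
    database_id: Optional[str]  # None if the joining server's database is empty
    has_data: bool

class Leader:
    def __init__(self, database_id: str):
        self.database_id = database_id
        self.members: set[str] = set()

    def handle_add_server(self, req: AddServerRequest) -> str:
        # Conservative behavior: overwriting a mismatched database
        # automatically would be a data-loss event, so fail instead.
        if req.has_data and req.database_id != self.database_id:
            return "REJECTED: databaseId mismatch"
        # An empty server is safe to admit; there is nothing to delete.
        self.members.add(req.server_id)
        return "OK"
```

A follow-up AppendEntries/InstallSnapshot handler would then assert the databaseId again, as suggested above for defensive programming.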
> but that is dangerous, because you basically force all servers to
> time out before a PreVote can succeed, in which case you are pretty much
> guaranteed to have conflicting candidates soliciting votes.
>
No. You need a majority to time out before the first PreVote will succeed.
This is also ok, but doesn't give you leader stickiness. That's not a
big deal, because Raft is already sticky in many situations (an
isolated node will often be behind in its log). But it doesn't achieve
stickiness in some of the corner cases I'm trying to cover.
What happens if a server S1 sends PreVote to all other servers, and
all other servers also respond to it, and then server S1 crashes? I
assume after some additional timeout the other nodes do try to become
candidates again?
You cannot reject actual RequestVotes. Once you see a RequestVote with
a higher term than the currentTerm, you must abandon the current
leader and respond to the RequestVote. This is a fundamental
property of Raft, not an optimization.
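The asymmetry between the two RPCs can be sketched as follows. This is an illustrative model, not code from a specific implementation; the log up-to-dateness and votedFor checks of full Raft are omitted for brevity:

```python
# Sketch of leader stickiness: a follower that recently heard from its
# leader may say "no" to a PreVote (PreVote is advisory), but it must
# still honor a real RequestVote carrying a higher term.
import time

class Follower:
    def __init__(self, current_term: int, election_timeout: float = 1.0):
        self.current_term = current_term
        self.election_timeout = election_timeout
        self.last_heartbeat = time.monotonic()
        self.leader_id = "s1"  # hypothetical current leader

    def heard_from_leader_recently(self) -> bool:
        return time.monotonic() - self.last_heartbeat < self.election_timeout

    def on_pre_vote(self, term: int) -> bool:
        # Stickiness: rejecting a PreVote is safe while we believe the
        # current leader is alive.
        if self.heard_from_leader_recently():
            return False
        return term >= self.current_term

    def on_request_vote(self, term: int) -> bool:
        # Fundamental Raft rule: a higher term forces us to abandon the
        # current leader and process the vote request normally.
        if term > self.current_term:
            self.current_term = term
            self.leader_id = None
            return True
        return False
```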
> Table 3 seems too complex to me.
>
> #3 - dangerous. You created data loss event here.
> If I called AddServer with "db1/orders" instead of "tst1/orders", I just
> lost all my orders.
> It is better to reject such a scenario entirely. If the admin wants to recreate
> the db, they can delete it and create a new one manually.
Correct. A typical database implementation would be conservative and
fail rather than automatically delete data. From Raft's point of view,
both are correct behaviors.
> #4 & #2 are the same thing. We have a backup of the cluster log, the leader
> shouldn't need to make a distinction between them.
> Why do you NEED to distinguish between them?
In #2 it is the same server that rejoins the cluster; in #4 it is a
brand-new server that was provisioned with a copy of the database.
From Raft's point of view these are actually the same (in both cases
you would just call AddServer), but from the DBA's point of view they
are of course completely different use cases.
> And this seems to put too much logic in the hands of the leader. The way we
> have it, when you add a new server, we send AppendEntries.
> It fails if there is no data (new server) or it is too old for our current log. We
> then send InstallSnapshot.
> If the admin restored from backup, the new node just accepts the
> AppendEntries and moves on.
>
> The clusterid issue is important, and I really don't like the way you set it
> up, it seems like it can cause problems.
We are trying to solve different problems. The cluster id is important,
but it is not what I'm discussing here. I hope my replies above have
clarified things.
> The way it works for us:
> All servers start as followers, and CANNOT become leaders until they have a
> cluster id set.
> The admin is responsible for letting one of the servers know that it can be
> a leader; it registers a database id and can then add other servers, which
> will get their cluster id from it, and only then can they become leaders.
I think LogCabin does something like this too. It has a
--bootstrap option.
However, this is again not how the Raft paper is actually written. In
the world of Ongaro's thesis, a single node is its own majority, and
*will* become its own leader after an election timeout. The paper
doesn't discuss any further initialization or boot activities.
>
> This avoids the issue where multiple new servers starting at the same time
> all assume that they can start accepting requests.
Probably sensible for a real-world implementation, yes. Otoh, the
opposite is also sensible: developers will often run a single-instance
database on their laptop, and it's not wrong to default to supporting
that.
> Leader stickiness in your impl means that you now can't wait for the first
> node to time out, but have to have _all_ the nodes time out first, which
> increases failover time.
To be precise: a majority has to time out.
Yes, this does increase failover time. For a "large cluster" it is
likely to be election timeout / 2. Otoh, the motivation for this
change is that it would allow you to configure the election timeout to
be shorter, since flip-flopping is now explicitly prevented. Whether
this is true is open to debate and remains to be proven. (And it
probably depends on the kind of network conditions you expect to
deploy clusters in.)
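The "election timeout / 2" estimate can be checked with a quick simulation. This assumes timeouts drawn uniformly from [T, 2T], as the Raft paper suggests; the cluster size and timing model are illustrative, not from any particular deployment:

```python
# Monte Carlo estimate of the extra failover delay when a majority of
# nodes (rather than just the first node) must time out before an
# election can proceed.
import random

def failover_delay(n: int, T: float, trials: int = 20000) -> tuple[float, float]:
    first_sum = majority_sum = 0.0
    majority = n // 2 + 1
    for _ in range(trials):
        timeouts = sorted(random.uniform(T, 2 * T) for _ in range(n))
        first_sum += timeouts[0]                # plain Raft: first node to time out
        majority_sum += timeouts[majority - 1]  # sticky variant: majority must time out
    return first_sum / trials, majority_sum / trials

first, major = failover_delay(n=9, T=1.0)
# For a 9-node cluster the majority timeout lands near the median of
# [T, 2T], i.e. roughly T/2 later than the first timeout.
```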