Cluster is not in a stable state. No leader was selected, but we require one for making a request using ReadFromLeaderWriteToLeader


Wallace Turner

Feb 19, 2018, 12:44:00 AM
to RavenDB - 2nd generation document database

We are getting this error from the client after running for 5 minutes:

>Cluster is not in a stable state. No leader was selected, but we require one for making a request using ReadFromLeaderWriteToLeader

We have 2 apps that both connect to RavenDB frequently; about 5 minutes after connecting, they both throw this error.


Our cluster appears to be set up correctly and otherwise works (up until the 5-minute mark).
After restarting the apps, we get another 5 minutes of normal operation.

Both apps connect using the raven-node-1 server URL



Wallace Turner

Feb 19, 2018, 1:36:07 AM
to RavenDB - 2nd generation document database
Server is build 35247.
Client is RavenDB-3.5.6-patch-35252.

Oren Eini (Ayende Rahien)

Feb 19, 2018, 3:25:51 AM
to ravendb
In a two-node cluster, _any_ server being offline (due to a GC pause, for example) can make the cluster unavailable.
You'll need to use a three-node cluster for stability.

Hibernating Rhinos Ltd
Oren Eini, CEO | Mobile: +972-52-548-6969
Office: +972-4-622-7811 | Fax: +972-153-4-622-7811



Wallace Turner

Feb 19, 2018, 5:13:52 AM
to RavenDB - 2nd generation document database
>_any_ server being offline (such as GC pause, for example),
A GC pause is considered offline?
Is this what you suspect is happening in our case? That is, after 5 minutes of constant use on a 2-node cluster, you expect errors? And if that is indeed the cause, then adding a 3rd node just makes the error happen *less*?

Can we stop this from occurring by changing the failover behaviour to FailoverBehavior.ReadFromAllWriteToLeaderWithFailovers?





Wallace Turner

Feb 19, 2018, 5:21:40 AM
to RavenDB - 2nd generation document database
Oren, can you perhaps explain this a little further? We are using 2-node SQL Server clustering and we don't experience this problem.

If you are saying you need 3 nodes for stability, then effectively you need a minimum of 4: if you *do* have 3 and one goes down, the system doesn't work well with 2. So the minimum for any sort of decent redundancy would be 4, if I understand you correctly.




Oren Eini (Ayende Rahien)

Feb 19, 2018, 8:09:42 AM
to ravendb
No, that isn't the case at all.

The problem is the particular deployment and defaults that you have. It might take a bit of time to explain; I'm sorry, it is a complex topic.

In a distributed system, if you want to ensure consensus, there are a few ways to do it. By far the most popular is the notion of a quorum, which relies on a majority vote.
So in a cluster of N members, if a majority of them vote on something, that is known to be valid. This is how consensus algorithms such as Paxos and Raft work.
A majority, in this case, is defined as (N/2) + 1.

If you have 2 nodes in your cluster, your majority is (2/2) + 1 = 2.
If you have 3 nodes in your cluster, your majority is (3/2) + 1 = 2.
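
To make the arithmetic concrete, here is a small illustrative snippet (my own, not RavenDB code) computing the majority size and the number of node failures each cluster size can survive:

    using System;

    class QuorumMath
    {
        static void Main()
        {
            for (var n = 1; n <= 5; n++)
            {
                var majority = n / 2 + 1;     // integer division: (N / 2) + 1
                var tolerated = n - majority; // failures survivable while keeping a majority
                Console.WriteLine($"{n} node(s): majority = {majority}, tolerated failures = {tolerated}");
            }
            // 2 nodes: majority = 2, tolerated failures = 0
            // 3 nodes: majority = 2, tolerated failures = 1
        }
    }

Note that going from 2 to 3 nodes does not raise the majority, but it does buy you one node's worth of fault tolerance.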

Now, with RavenDB 3.5 the default replication option requires you to talk to the leader: ReadFromLeaderWriteToLeader.
But how do you select a leader? All the nodes vote on that, and they _continuously_ verify that the leader is still in effect.
The default interval for that is around 300ms, IIRC.

Now, under load, especially if you are doing a lot of writes / indexing / etc., you will likely generate a lot of garbage for the GC to clean up, which can cause the GC to pause the process.
If that happens at the wrong moment, we may miss the regular checkup call that validates that the leader is still the leader. Remember, this is at the millisecond scale, so that matters.
Because the GC paused us, and because there are only 2 nodes, a hiccup on any node will force the cluster to run re-elections.
That is fine and actually expected, but the problem is that the default ReadFromLeaderWriteToLeader requires a write to go to the leader, and the leader may change or not be known during that specific time frame.
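
A minimal Raft-style sketch of that liveness check, with assumed timeout values (illustrative only, not RavenDB's actual implementation):

    using System;

    class LeaderLivenessSketch
    {
        // Assumed election timeout, per the ~300ms figure above.
        static readonly TimeSpan ElectionTimeout = TimeSpan.FromMilliseconds(300);

        private DateTime _lastHeartbeatFromLeader = DateTime.UtcNow;

        public void OnHeartbeatReceived() => _lastHeartbeatFromLeader = DateTime.UtcNow;

        // Each follower runs this periodically. A GC pause that delays either the
        // heartbeat or this check past the timeout makes the node assume the leader
        // is gone and call for re-election, leaving the cluster leaderless meanwhile.
        public void CheckLeaderStillAlive()
        {
            if (DateTime.UtcNow - _lastHeartbeatFromLeader > ElectionTimeout)
                StartElection();
        }

        private void StartElection()
        {
            // Request votes from the other nodes; with only 2 nodes, both must
            // respond before a majority (2) exists and a leader can be chosen.
        }
    }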


What would change when you have 3 nodes? In that case, one server doing GC will simply fail over to the other nodes; unless two nodes run blocking GC at the exact same time (unlikely), you will always have enough nodes for a majority, and therefore a leader.

In addition to that, using ReadFromLeaderWriteToLeaderWithFailovers as your failover strategy is probably better, because you usually don't care who the leader is.
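
For reference, a sketch of setting that convention on the 3.5 client (the URL and database name are placeholders; the same conventions API appears in the code later in this thread):

    // Placeholder URL and database; conventions configured before Initialize().
    var store = new DocumentStore { Url = "http://raven-node-1:8080" };
    store.DefaultDatabase = "TestDb";
    store.Conventions.FailoverBehavior =
        FailoverBehavior.ReadFromLeaderWriteToLeaderWithFailovers;
    store.Initialize();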

SQL Server clustering works very differently: Windows is involved, and it uses an active/passive mode.

RavenDB 3.5 can also run in this mode; see the Windows Clustering support.

Hopefully this explains things.



Wallace Turner

Feb 19, 2018, 7:27:33 PM
to RavenDB - 2nd generation document database
OK, thanks for clarifying.
I don't believe what is going on in our situation is purely GC. It fails consistently after exactly 5 minutes; if you restart the process halfway through, it again takes 5 minutes to reach the failing state.
Also, store.Conventions.FailoverBehavior is being reset back to `ReadFromLeaderWriteToLeader`.


Here is the code:

    var store = new DocumentStore { Url = "http://foo-domain:8080" };
    store.DefaultDatabase = "TestDb";
    store.Initialize();
    store.Conventions.FailoverBehavior = FailoverBehavior.ReadFromAllWriteToLeaderWithFailovers;
    var sw = new Stopwatch();
    sw.Start();
    while (true)
    {
        using (var session = store.OpenSession())
        using (var repository = new SystemRepository(session, new DateTimeService()))
        {
            try
            {
                Console.WriteLine(store.Conventions.FailoverBehavior + " Running for " + sw.ElapsedMilliseconds / 1000 + " CurrentTradeDate: " + repository.GetCurrentTradeDate());
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
            }
            Thread.Sleep(500);
        }
    }

So after 298 seconds (I have run this 5+ times) I get the aforementioned exception:
>Cluster is not in a stable state. No leader was selected, but we require one for making a request using ReadFromLeaderWriteToLeader


Note I am logging store.Conventions.FailoverBehavior to the console. It prints ReadFromAllWriteToLeaderWithFailovers; however, when the exception is caught, the value of this field has flipped back to ReadFromLeaderWriteToLeader!

Wallace Turner

Feb 19, 2018, 10:03:32 PM
to RavenDB - 2nd generation document database
Oren, there is something fishy (a race condition) in ClusterAwareRequestExecuter.cs.

It's not necessary to assign store.Conventions.FailoverBehavior either.

Instead, change this:

    store.Conventions.TimeToWaitBetweenReplicationTopologyUpdates = TimeSpan.FromSeconds(60);

so that you don't have to wait 5 minutes (and assuming your server's 'Client failover behaviour' is 'Read from leader write to leader').

Then run any code that selects any document every second (as above); you should see a failure when `UpdateTopologyAsync` runs (e.g. after 60 seconds).



The exception thrown is `Failover behavior is ReadFromLeaderWriteToLeader, waited for 10 seconds and no leader was selected.` (line 248), but it is the events leading up to that point that cause the error.

It seems you are updating the topology but then immediately setting leaderNode back to null:

    clusterAwareRequestExecuter.UpdateTopology(this, new OperationMetadata(Url, PrimaryCredentials, topology.ClusterInformation), topology, serverHash, prevLeader);

    // when the leader is not responsive to its follower but clients may still communicate to the leader node we have
    // a problem, we will send requests to the leader and they will fail, we must fetch the topology from all nodes
    // to make sure we have the latest one, since our primary may be a non-responsive leader.

    await clusterAwareRequestExecuter.UpdateReplicationInformationIfNeededAsync(this, force: force).ConfigureAwait(false);


You can see this in the DEBUG-enabled NLog file:

>2018-02-20 10:58:40.5026,Raven.Client.Connection.Request.ClusterAwareRequestExecuter,Debug,,10,Leader node is changing from null to http://foo:8080/databases/TestDbIsLeader=True,
>2018-02-20 10:58:42.3753,Raven.Client.Connection.Request.ClusterAwareRequestExecuter,Debug,,10,Leader node is changing from http://foo:8080/databases/TestDbIsLeader=True to null.,

Then what I believe happens is that a normal query executes `ExecuteWithinClusterInternalAsync`, sees node == null, and calls:
>UpdateReplicationInformationIfNeededAsync(serverClient, force: true); (line 223)

I believe this is where the race condition is. I can show this in action if it is not reproducible on your end.

Oren Eini (Ayende Rahien)

Feb 19, 2018, 10:32:56 PM
to ravendb
In order to reproduce, can you provide some details about your setup?
In particular, you have just two nodes running 3.5 as a cluster, correct? 
You are doing pure reads (one per second) from one of the databases, with ReadFromAllWriteToLeaderWithFailovers set.

No other configuration?
While this is going on, are both servers up?

Wallace Turner

Feb 19, 2018, 11:14:29 PM
to RavenDB - 2nd generation document database
1) Two-node setup (both nodes running); in global configuration, the 'client failover behaviour' is the default (read from leader, write to leader).
2) Run the following code (you do not need to set ReadFromAllWriteToLeaderWithFailovers):

    var store = new DocumentStore { Url = "http://raven-node-1.fex-sydney.local:8080" };
    store.DefaultDatabase = "TestDb";
    store.Initialize();
    store.Conventions.TimeToWaitBetweenReplicationTopologyUpdates = TimeSpan.FromSeconds(30);
    var sw = new Stopwatch();
    sw.Start();
    while (true)
    {
        using (var session = store.OpenSession())
        {
            Console.WriteLine(DateTime.Now + " " + session.Load<Alerts>("Raven/Alerts"));
            Thread.Sleep(1000);
        }
    }

...and wait 30 seconds...

I've looked into this further. This is the problem line (line 635):

    var majorityOfNodesAgreeThereIsLeader = Nodes.Count == 1 || hasLeaderCount > (newestTopology?.Task.Result.Destinations.Count + 1) / 2;

This line never returns true for my 2 nodes. You can see in the log output that it keeps fetching the topology over and over. Meanwhile the ManualResetEventSlim (leaderNodeSelected) is still not signalled, and thus client calls block or throw exceptions.

It's not until UpdateTopologyAsync is called again (30 seconds later) that this ManualResetEventSlim is Set() again.
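
The gating pattern described here works roughly like this (an illustrative sketch, not the actual ClusterAwareRequestExecuter source; the 10-second wait and the error message are taken from the exception quoted above):

    using System;
    using System.Threading;

    class LeaderGateSketch
    {
        // Starts unsignalled: no leader known yet.
        private readonly ManualResetEventSlim leaderNodeSelected = new ManualResetEventSlim();

        // Topology update path: when no majority agrees on a leader, the gate
        // is reset and every incoming request starts blocking on it.
        public void OnTopologyUpdated(bool majorityOfNodesAgreeThereIsLeader)
        {
            if (majorityOfNodesAgreeThereIsLeader)
                leaderNodeSelected.Set();
            else
                leaderNodeSelected.Reset();
        }

        // Request path: wait up to 10 seconds for the gate, then fail.
        public void ExecuteRequest()
        {
            if (!leaderNodeSelected.Wait(TimeSpan.FromSeconds(10)))
                throw new InvalidOperationException(
                    "Failover behavior is ReadFromLeaderWriteToLeader, " +
                    "waited for 10 seconds and no leader was selected.");
            // ...send the request to the leader...
        }
    }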


See attached log file:
app.log

Oren Eini (Ayende Rahien)

Feb 19, 2018, 11:28:58 PM
to ravendb
Okay, interesting. I think we just never considered this to be a valid scenario, which is why you get this.

Wallace Turner

Feb 19, 2018, 11:33:10 PM
to RavenDB - 2nd generation document database
Just hold up, my topology is looking a little odd.

Wallace Turner

Feb 20, 2018, 1:45:02 AM
to RavenDB - 2nd generation document database
I think you can remove that issue.
The issue lies in the cluster setup :(

Just for anyone else who runs into this problem: the cluster was set up by someone who was familiar with SQL clustering and used 3 domain names (node1, node2 and active) instead of just two domain names.
I.e. it looked like this:

I don't know how they did this, as they did not go in and manually change the replication settings.

Sorry about the hassles. Well, I had a nice tour of your source code for half a day...

Oren Eini (Ayende Rahien)

Feb 20, 2018, 5:08:17 AM
to ravendb
Okay, that is a relief; I was worried about how something like that could have slipped through.
That said, the recommendation of three nodes is still in effect.