Client API, automatic failover and replication

Petr Sereda

Jul 28, 2011, 9:01:43 AM

to rav...@googlegroups.com

Hello,

I've looked through the documentation and this newsgroup, but I'm still unclear about the following scenario:

Imagine we have two instances of RavenDB, say A and B, with replication set up between them. The documentation states that the Client API, "When that instance is down, will automatically shift to the other instances." And it really does behave that way.

But if instance A comes back online and instance B then suddenly goes offline, the Client API doesn't switch back to instance A (which is alive at that point) and instead throws the exception "Attempted to conect to master and all replicas has failed, giving up".

How would you recommend handling such a situation? Re-creating the IDocumentSession doesn't help.

Thanks

Ayende Rahien

Jul 28, 2011, 9:42:19 AM

to rav...@googlegroups.com

Petr,

Our failover handling is designed to keep performance good even when a node has failed.

On the very first failure, we immediately retry the master; only when two consecutive requests have failed do we mark the master as failed.

When the master is detected as failed, we fail over to the secondary, as expected.

On the next request we still try the master, and we continue to do so until the 10th consecutive failure.

After that, we only try the master on every 10th request.

In other words, only after 11 failures do we start backing off from the master.

If the master is still not up after 90 additional failures, we back off again.

At this point, we are at 911 requests after the very first failure.

Now we only query the master once every 100 requests. If the master is still failing, we let another 900 requests fail before we back off even more.

At this point, we are at 9911 failures for the master. From now on we check it only once every 1000 requests, and it stays that way (there are no further backoff stages).
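
The stages above can be sketched roughly as follows. This is an illustrative Python model, not the client's actual code: the names `check_interval` and `should_try_master` are made up for the example, and the exact stage boundaries (10, 100, 1000 consecutive failures) are an approximation of the numbers quoted above.

```python
def check_interval(consecutive_failures: int) -> int:
    """Return how often (in requests) a failing master is retried.

    Hypothetical helper; thresholds approximate the stages described above.
    """
    if consecutive_failures <= 10:
        return 1       # still try the master on every request
    if consecutive_failures <= 100:
        return 10      # first backoff stage
    if consecutive_failures <= 1000:
        return 100     # second backoff stage
    return 1000        # final stage, no further backoff


def should_try_master(consecutive_failures: int, request_number: int) -> bool:
    """Try the master only when the request number lands on the interval."""
    return request_number % check_interval(consecutive_failures) == 0
```

With this model, a master that has failed 50 times in a row is tried on request 20 but skipped on request 21, and one that has failed 5000 times is tried only on every 1000th request.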

In addition to all of that, every 5 minutes we check the health of all of the nodes, regardless of whether we believe them to be sick or not.

All of that is global state for the Document _Store_, not the document session.

Your scenario seems to fall into the "What did I do, Oh Murphy?!" category: you had enough failures to trigger the backoff, followed by an immediate switch in which server was available. After that many failures (~10,000 or more) we practically ignore that node, and the only thing you can do is either wait for the remapping operation that happens automatically every 5 minutes, or trigger it manually by recreating the document _store_.
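
To make that concrete, here is a small Python simulation of the situation. It is only a sketch under assumptions: `NodeState`, its fields, and the thresholds are hypothetical and stand in for whatever bookkeeping the document store really keeps; only the 5-minute interval and the rough failure counts come from the description above.

```python
from dataclasses import dataclass

HEALTH_CHECK_SECONDS = 5 * 60  # the "every 5 minutes" check described above


@dataclass
class NodeState:
    """Hypothetical per-node state, shared by all sessions of one store."""
    failures: int = 0
    last_health_check: float = 0.0

    def interval(self) -> int:
        # Approximate thresholds taken from the numbers in this thread.
        for limit, every in ((10, 1), (100, 10), (1000, 100)):
            if self.failures <= limit:
                return every
        return 1000

    def should_try(self, request_number: int, now: float) -> bool:
        if now - self.last_health_check >= HEALTH_CHECK_SECONDS:
            self.last_health_check = now  # periodic check ignores the backoff
            return True
        return request_number % self.interval() == 0


# Node A accumulated ~10,000 failures while it was down, then came back.
node_a = NodeState(failures=10_000, last_health_check=0.0)

# Ten seconds later B dies: A is not tried on any of the next 999 requests.
tried_soon = sum(node_a.should_try(n, now=10.0) for n in range(1, 1000))

# Once five minutes have passed, the periodic check probes A regardless.
tried_later = node_a.should_try(1, now=301.0)
```

In this model `tried_soon` is 0 and `tried_later` is `True`. Because the state belongs to the store rather than to a session, a new session sees the same counters, while a new store starts from a fresh `NodeState`.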

How did this happen to you? And did you break any mirrors lately?

Petr Sereda

Jul 28, 2011, 11:02:26 AM

to rav...@googlegroups.com

Ayende, thank you for the explanation.

Actually, it is not a real scenario but a synthetic test case that I ran to try out Raven's failover and replication features.

The cause of my error was the very short interval between instance A coming back online and instance B going offline (I was manually starting and shutting down the server consoles). Once I waited longer, about 5 minutes as you said, the Client API picked up instance A again and I was able to kill instance B painlessly.

If the RavenDB server runs as a console app, the exact moment when the Client API detects instance A again can be seen very clearly: the replication messages in A's console are instantly replaced by messages about GET and POST requests, and vice versa in instance B's console.

Ayende Rahien

Jul 28, 2011, 11:04:43 AM

to rav...@googlegroups.com

Okay, glad that you weren't haunted by Murphy, then.

Is there anything else that we can do to assist?

Petr Sereda

Jul 28, 2011, 11:19:46 AM

to rav...@googlegroups.com

Not at the moment, thank you.

The only thing to mention is that, imho, it would be great to update the Replication documentation on ravendb.net to include a brief discussion of FailoverBehavior for IDocumentSession. When you face this problem for the first time, it is not very clear to a newbie why write operations don't fail over to instance B.

Itamar Syn-Hershko

Jul 28, 2011, 11:41:03 AM

to rav...@googlegroups.com

We are already on it...
