Petr,
The way we handle failover is designed to keep performance good even when failures occur.
On the very first failure, we immediately retry against the master, and only when two consecutive requests have failed do we mark the master as failed.
When the master is marked as failed, we fail over to the secondary, as expected.
On the next request, we still try the master, and continue to do so until the 10th consecutive failure.
After that, we only try to hit the master on every 10th request.
In other words, only after 11 failures do we start backing off from the master.
If, after 90 additional failures, the master is still not up, we back off again.
At this time, we are at 911 requests after the very first failure.
Now we only query the master once every 100 requests. If the master is still failing, we let another 90 of those checks fail (9,000 more requests) before we back off even more.
At this point, we are at 9911 requests after the very first failure. Now we will only check the master every 1000 requests. And it stays that way (we have no additional backoff stages).
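To make the arithmetic behind the 911 and 9911 figures easier to follow, here is a rough Python sketch. It is not the actual client code. The names (`STAGES`, `backoff_milestones`, `probe_interval`) are made up for illustration, and it assumes the stages are read as: 11 failed checks while trying the master on every request, then 90 failed checks at every 10th request, then 90 failed checks at every 100th request.

```python
# Hypothetical model of the backoff stages described above.
# Each stage is (failed master checks in this stage, requests per check).
STAGES = [(11, 1), (90, 10), (90, 100)]
FINAL_INTERVAL = 1000  # after the last stage there is no further backoff


def backoff_milestones(stages=STAGES):
    """Return (total failed checks, total requests) at the end of each stage."""
    failures = 0
    requests = 0
    milestones = []
    for failed_checks, interval in stages:
        failures += failed_checks
        requests += failed_checks * interval
        milestones.append((failures, requests))
    return milestones


def probe_interval(failed_checks, stages=STAGES):
    """How many requests pass between master checks after this many failed checks."""
    remaining = failed_checks
    for count, interval in stages:
        if remaining < count:
            return interval
        remaining -= count
    return FINAL_INTERVAL


if __name__ == "__main__":
    print(backoff_milestones())  # [(11, 11), (101, 911), (191, 9911)]
    print(probe_interval(191))   # 1000
```

Under that reading, the master has been checked only 191 times by request 9911, which is why a node that comes back right after the last stage can go unnoticed for a long while.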
In addition to all of that, every 5 minutes, we check the health of all of the nodes, regardless of whether we know them to be sick or not.
All of that is global state for the Document _Store_, not the document session.
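As a sketch of what "global state on the store plus a periodic health check" could look like, here is a small Python model. It is only an illustration, not the real implementation: `StoreFailoverState`, `record_failure`, `check_all_if_due` and the `probe` callback are hypothetical names, and it assumes that a successful health check resets a node's failure count.

```python
import time

HEALTH_CHECK_INTERVAL_SECONDS = 5 * 60  # the 5 minute check mentioned above


class StoreFailoverState:
    """Failure bookkeeping owned by the store and shared by every session."""

    def __init__(self, nodes, probe, clock=time.monotonic):
        self._probe = probe    # hypothetical callable: node -> True if healthy
        self._clock = clock    # injectable so the sketch can be tested
        self.failures = {node: 0 for node in nodes}
        self._last_check = clock()

    def record_failure(self, node):
        """Called whenever a request to `node` fails, from any session."""
        self.failures[node] += 1

    def check_all_if_due(self):
        """Probe every node, sick or not, once the interval has elapsed."""
        now = self._clock()
        if now - self._last_check < HEALTH_CHECK_INTERVAL_SECONDS:
            return False
        self._last_check = now
        for node in self.failures:
            # Assumption: a healthy probe clears the accumulated failures.
            if self._probe(node):
                self.failures[node] = 0
        return True
```

Because the counters live on the one shared object, opening a new session does not clear them. In this model only the periodic check, or building a fresh state object, does.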
Your scenario seems to fall into the "What did I do, Oh Murphy?!" category: you had enough failures to trigger the backoff, followed by an immediate switch in which servers were available. After so many failures (~10,000 or more) we practically ignore that node, and the only thing you can do is either wait for the remapping operation that happens automatically every 5 minutes, or trigger it manually by recreating the document _store_.
How did this happen to you? And did you break any mirrors lately?