Can we decrease the MongoDB failover time? (Currently it takes ~30 sec)


Sanjay D Raju

Aug 14, 2015, 8:28:13 AM
to mongodb-dev
Hi All,

We have a 3-node MongoDB setup with a primary, a secondary and an arbiter.
If the primary node goes down, the election process to choose a new primary takes around 30 seconds.
During this period all Create and Update operations fail. Is there any way to reduce this failover time so that the client application can resume normal operations sooner?

What is the ideal way to handle Create and Update failures during the failover time? Should they be queued and retried? If the number of operations is high, the queue size will not be enough.
Please suggest a solution for this problem. If the failover time can be reduced, that would solve it.

We tried reducing the heartbeat interval in the replica set configuration, but that did not help.

Regards
Sanjay 

Oleg Rekutin

Aug 14, 2015, 3:31:44 PM
to mongo...@googlegroups.com
No, you cannot decrease failover time. There is already an issue to fix it; please vote for it and watch it:
 
 
The ideal way to handle this depends on your application. In general, queue and retry is the only practical choice. You can spill the excess events to disk and reprocess them from disk once the servers resume. You can also send a back-pressure signal of some kind to your source of operations and ask it to either pause sending (with potential data loss) or queue the events itself.
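
The queue-and-retry and spill-to-disk ideas can be sketched roughly as follows. This is a minimal illustration, not a driver feature: `TransientWriteError`, `write_with_retry` and `SpillQueue` are hypothetical names, and the default backoff schedule is only an assumption chosen so that the total wait (about 31.5 seconds) outlasts a failover of roughly 30 seconds.

```python
import collections
import json
import os
import time


class TransientWriteError(Exception):
    """Hypothetical stand-in for the error a driver raises during failover."""


def write_with_retry(op, attempts=8, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call op() until it succeeds, backing off exponentially between tries."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientWriteError:
            if attempt == attempts:
                raise  # give up; the caller decides whether to queue the event
            sleep(delay)
            delay = min(delay * 2, max_delay)


class SpillQueue:
    """Bounded in-memory FIFO that appends overflow to a JSON-lines file."""

    def __init__(self, max_in_memory, spill_path):
        self.max_in_memory = max_in_memory
        self.spill_path = spill_path
        self.memory = collections.deque()

    def put(self, event):
        # Once anything has spilled, keep spilling so arrival order is preserved.
        spilled = os.path.exists(self.spill_path)
        if len(self.memory) < self.max_in_memory and not spilled:
            self.memory.append(event)
        else:
            with open(self.spill_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(event) + "\n")

    def drain(self):
        """Yield events in arrival order: memory first, then the spill file."""
        while self.memory:
            yield self.memory.popleft()
        if os.path.exists(self.spill_path):
            with open(self.spill_path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)
            os.remove(self.spill_path)
```

An application would wrap each Create or Update in `write_with_retry`, push the event into a `SpillQueue` if the retries are exhausted, and call `drain()` once a new primary is available. Events must be JSON-serializable for the spill file to work, and replaying them is only safe if the operations are idempotent.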
 
- Oleg

Eric Milkie

Aug 14, 2015, 4:09:02 PM
to mongodb-dev
Hi Sanjay,
For the 3.2 release of MongoDB, we are retooling the election protocol for replica sets.  By upgrading your replica set to the new protocol in 3.2, you will be able to configure a replica set such that failover time from a failed primary to a newly elected primary should be under two seconds.
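
As a rough sketch of what such a configuration change might look like, the helper below prepares a replica set config document with a shorter election timeout. It assumes the field names from the replica set configuration that shipped with 3.2 (`protocolVersion` and `settings.electionTimeoutMillis`); the helper name `with_election_timeout`, the 2000 ms value and the server address in the usage comment are illustrative only.

```python
import copy


def with_election_timeout(config, timeout_ms):
    """Return a copy of a replica set config with a new election timeout."""
    new_config = copy.deepcopy(config)
    # A reconfig is only accepted if the version number increases.
    new_config["version"] = config["version"] + 1
    # electionTimeoutMillis applies to the newer election protocol.
    new_config["protocolVersion"] = 1
    new_config.setdefault("settings", {})["electionTimeoutMillis"] = timeout_ms
    return new_config


# Usage with PyMongo against a live replica set primary (not executed here):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")  # hypothetical address
#   current = client.admin.command("replSetGetConfig")["config"]
#   client.admin.command("replSetReconfig", with_election_timeout(current, 2000))
```

A shorter timeout detects a dead primary sooner but also makes elections more likely during brief network hiccups, so the value should be tested against the actual deployment.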

In your current installation, are you running version 3.0 of MongoDB?  We implemented several enhancements in that version that accelerated elections.  It is still possible to have failover time take longer than 30 seconds in version 3.0, but you would need to have multiple failures of primary nodes in order to experience that.
-Eric

Sanjay D Raju

Aug 17, 2015, 3:01:09 AM
to mongodb-dev
Thanks Eric. We are using version 3.0.5, and we are seeing that a non-graceful failure takes more than 30 seconds to fail over.
When can we expect 3.2 GA release of MongoDB?

Thanks
Sanjay  

David Murphy

Aug 17, 2015, 5:34:20 AM
to mongo...@googlegroups.com
Sanjay,

I think it's worth pointing out that if you issue an rs.stepDown(), it should be faster; otherwise you need enough heartbeats to fail to trigger an election. If people were to lower the timeout on heartbeats, there could be a lower election time, but more false alarms and needless elections could result. This is why 3.2 is targeting a new pattern for handling elections/failovers that should be a bit better. I think your request will still always be somewhat problematic, as there is an amount of time needed to detect that a machine is down, versus hung for a second or two. Finding the balance for that in a generic way will never give you millisecond-based detection while still properly guarding against flapping or false positives.

PS - I only made the heartbeat comment as this is a dev list. Unless you know what you're doing, it is a horrifically bad idea to tinker with it ;)


Hopefully this helps at least frame the complexity of the issue.

David
