mongo failover


tetlika

Sep 5, 2012, 9:12:53 AM
to mongodb-user

We use PyMongo 2.1.1. I've noticed the following behavior in a replica set
(it can be easily reproduced):



scenario 1:

1) the master host is shut down with the "shutdown -r now" command, or by a
hardware failure
2) the slave becomes master
3) the application fails, saying it can't establish a connection to the db
4) when a host with the master's IP appears on the network and is
pingable (even without mongod on it), things become ok

scenario 2:

1) only mongod is shut down on the master host
2) the slave becomes master
3) the application handles the failover and things are fine


Is that normal behavior?

Stephan

Sep 5, 2012, 11:05:17 AM
to mongod...@googlegroups.com
Actually, in Java I always see the first behavior, no matter whether the host is really down or only the mongod.
If there is a write during the failover window (a couple of seconds), there is an error "can't find master". After that, it works here...

Strange that you see different behavior...

tetlika

Sep 5, 2012, 11:08:40 AM
to mongodb-user
In the second scenario, of course, the failover does not happen
immediately; it takes 10-15 seconds or so.

But in scenario 1 the failover does not happen for at least 30-40
minutes.

tetlika

Sep 5, 2012, 12:14:19 PM
to mongodb-user
Can anyone help me with that?

A. Jesse Jiryu Davis

Sep 6, 2012, 1:42:40 PM
to mongod...@googlegroups.com
What operating system are you using on the client (where Python runs) and on the MongoDB servers? What Python version? Can you provide the exact code you use to create the PyMongo ReplicaSetConnection? I need to see precisely what options you pass to ReplicaSetConnection.

Thanks.

tetlika

Sep 6, 2012, 1:49:00 PM
to mongodb-user
CentOS 5.6 x64 everywhere
Python 2.7.3

Wait a bit; I hope to provide the code.

tetlika

Sep 6, 2012, 1:55:10 PM
to mongodb-user
Sorry, the previous message was badly formatted.

connection = mongokit.connection.Connection(settings.MONGODB[db_id]['master'][0], settings.MONGODB[db_id]['master'][1])

where ['master'][0] and ['master'][1] are defined in the config as the slave and
master hosts of the replica set

On Sep 6, 20:51, tetlika <tetl...@gmail.com> wrote:
>   connection = mongokit.connection.Connection(settings.MONGODB[db_id]['master'][0],
>                                                 settings.MONGODB[db_id]['master'][1])

tetlika

Sep 6, 2012, 2:04:53 PM
to mongodb-user
A small correction:

['master'][0] is an array of the slave and master hosts defined in the
config, like:

QUEUE_HOST = replaqueue.internal.com, replbqueue.internal.com

and ['master'][1] is a port.

A. Jesse Jiryu Davis

Sep 6, 2012, 2:05:28 PM
to mongod...@googlegroups.com
Thanks, and what are the values of settings.MONGODB[db_id]['master'][0] and settings.MONGODB[db_id]['master'][1]? Are you configuring any timeouts in PyMongo or MongoKit, or using the default timeouts?

tetlika

Sep 6, 2012, 2:15:14 PM
to mongodb-user
In [8]: print settings.MONGODB['queue']['master']
(['replaqueue.bill1.ninternal.com', 'replbqueue.bill1.ninternal.com'], 27017)

A. Jesse Jiryu Davis

Sep 6, 2012, 2:16:05 PM
to mongod...@googlegroups.com
OK, thanks. So your code is equivalent to this?:

mongokit.connection.Connection(['replaqueue.internal.com', 'replbqueue.internal.com'], 27017)

tetlika

Sep 6, 2012, 2:17:53 PM
to mongodb-user
yes, that's correct

tetlika

Sep 6, 2012, 2:24:43 PM
to mongodb-user
P.S.

The default values in PyMongo and MongoKit are used.
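
Why the timeout defaults matter can be shown with a sketch that uses only the Python standard library, not PyMongo's internals. It assumes that "default" here means no read timeout on the socket, in which case a blocking read from a peer that has silently gone away waits until the operating system gives up. The names `start_silent_server` and `read_with_timeout` are hypothetical, and the silent local server merely stands in for an unresponsive primary.

```python
import socket


def start_silent_server():
    """Hypothetical stand-in for an unresponsive peer.

    The listening socket lets the kernel complete incoming connections,
    but nothing ever sends a reply.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    return server


def read_with_timeout(host, port, timeout_seconds):
    """Connect and wait for one byte, giving up after timeout_seconds."""
    sock = socket.create_connection((host, port), timeout=timeout_seconds)
    try:
        data = sock.recv(1)
        return "got data" if data else "closed by peer"
    except socket.timeout:
        # Without a timeout this recv() would keep blocking instead.
        return "timed out"
    finally:
        sock.close()


if __name__ == "__main__":
    server = start_silent_server()
    host, port = server.getsockname()
    print(read_with_timeout(host, port, 0.5))
    server.close()
```

With a finite timeout the call returns after the chosen interval instead of hanging. Whether and how MongoKit 0.7.2 passes such a setting through to PyMongo would need to be checked in its documentation.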

A. Jesse Jiryu Davis

Sep 6, 2012, 2:26:40 PM
to mongod...@googlegroups.com
I might understand the problem. Is your application single-threaded or multi-threaded?

tetlika

Sep 6, 2012, 2:27:40 PM
to mongodb-user
single

A. Jesse Jiryu Davis

Sep 6, 2012, 2:33:05 PM
to mongod...@googlegroups.com
OK, I need to test a theory; stay tuned. Are your machines in EC2 or your own data center or what?

tetlika

Sep 6, 2012, 2:33:24 PM
to mongodb-user
on ec2

A. Jesse Jiryu Davis

Sep 6, 2012, 3:54:31 PM
to mongod...@googlegroups.com
I can't reproduce your issue. In EC2 I set up two instances, one with a mongod which I forced to be primary, and one with a secondary and an arbiter. With PyMongo 2.1.1:

>>> c.host # Which instance has the primary right now?

>>> c.test.test.find_one()
{u'_id': ObjectId('5048f365923450f553000000')}

>>> # In AWS console, I stop the instance with the primary. The instance with secondary
>>> # and arbiter is still up, so secondary becomes new primary.
>>> c.test.test.find_one() # Expect first try to fail
Traceback (most recent call last)
  pymongo/collection.py in find_one(self, spec_or_id, *args, **kwargs)
  pymongo/cursor.py in next(self)
  pymongo/cursor.py in _refresh(self)
  pymongo/cursor.py in __send_message(self, message)
  pymongo/connection.py in _send_message_with_response(self, message, _must_use_master, **kwargs)
  pymongo/connection.py in __socket(self)
  pymongo/connection.py in get_socket(self, host, port)
  pymongo/connection.py in connect(self, host, port)
  /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.pyc in meth(name, self, *args)
timeout: timed out

>>> c.test.test.find_one() # Second try works as expected
{u'_id': ObjectId('5048f365923450f553000000')}

>>> c.host # New primary's instance

After PyMongo throws a timeout error, it reconnects and detects the new primary during the next operation. So we expect only one timeout after you stop an instance in EC2.
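
The pattern described above (one failed operation, then success on the next) can be wrapped in a small retry helper. This is a sketch, not part of PyMongo: `call_with_retry` is a hypothetical name, and the fallback `AutoReconnect` class exists only so the example runs where PyMongo is not installed. It also catches `socket.timeout` because that is the exception shown in the traceback above.

```python
import socket
import time

try:
    from pymongo.errors import AutoReconnect
except ImportError:
    class AutoReconnect(Exception):
        """Stand-in so this sketch runs without PyMongo installed."""


def call_with_retry(operation, attempts=3, delay_seconds=1.0):
    """Run operation(); on a connection-level error, wait and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (AutoReconnect, socket.timeout) as error:
            last_error = error
            if attempt < attempts - 1:
                # Give the replica set a moment to elect a new primary.
                time.sleep(delay_seconds)
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error


# Possible usage with the connection from the session above:
# doc = call_with_retry(lambda: c.test.test.find_one())
```

Retrying is safest for reads. A write that timed out may already have been applied, so retrying it blindly can apply it twice.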

tetlika

Sep 6, 2012, 4:00:01 PM
to mongodb-user
Very weird. Before submitting I reproduced it 4 times: once we
faced it in production, and 3 times in the staging environment.
mongod is 2.0.6

Could you please try with the exact same connection:

mongokit.connection.Connection(settings.MONGODB[db_id]['master'][0], settings.MONGODB[db_id]['master'][1])

A. Jesse Jiryu Davis

Sep 6, 2012, 4:32:42 PM
to mongod...@googlegroups.com
Do you create a single Connection instance for the lifetime of the application, or more frequently than that? Can you provide a complete test script please?

tetlika

Sep 7, 2012, 3:21:05 AM
to mongodb-user
Thanks for the heads-up, Jesse!

We'll dig into this more deeply with the dev team.

Michael Korbakov

Sep 7, 2012, 6:01:34 AM
to mongod...@googlegroups.com
Hi Jesse! Thank you for helping us.

I'm working with tetlika and can answer your question. We're using MongoKit 0.7.2 and create one Connection instance per document model. That was done quite a long time ago to work around connection pooling issues. I think we can change these multiple Connection instances to a single DB connection quite easily. Do you think that could be the root of the problem we're facing?

On Thursday, September 6, 2012, 23:32:42 UTC+3, A. Jesse Jiryu Davis wrote:

A. Jesse Jiryu Davis

Sep 7, 2012, 11:10:40 AM
to mongod...@googlegroups.com
You should create a single Connection instance used by all operations in all threads for the lifetime of the application process. This will maximize performance, minimize the total number of open sockets, and minimize the time required to recover from an event like a database server shutting down. This last point is a little subtle -- a Connection doesn't know that a database server went down until it tries an operation. Then it raises an AutoReconnect exception, and on the next operation, it resets the connection, connecting to the correct server. So if you have 10 distinct Connection instances, each of them must fail, raise an AutoReconnect, and reset. If you have one Connection, you need to go through this process only once.
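
One way to follow this advice is to build the connection lazily in a single place and hand the same object to every caller. The sketch below is an assumption about how an application might be structured, not MongoKit or PyMongo API: `SharedConnection` is a hypothetical name, and the factory is whatever callable creates the real connection.

```python
import threading


class SharedConnection(object):
    """Create one connection on first use and return it to every caller."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds the connection
        self._lock = threading.Lock()    # guards creation across threads
        self._connection = None

    def get(self):
        with self._lock:
            if self._connection is None:
                self._connection = self._factory()
            return self._connection


# Possible usage, with the hosts and port discussed earlier in this thread:
# shared = SharedConnection(
#     lambda: mongokit.connection.Connection(
#         ['replaqueue.internal.com', 'replbqueue.internal.com'], 27017))
# Each document model would then call shared.get() instead of creating
# its own Connection.
```

Because every model shares one object, a failover has to be detected and recovered from only once per process rather than once per model.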

A. Jesse Jiryu Davis

Sep 12, 2012, 3:01:46 PM
to mongod...@googlegroups.com
Michael, tetlika, are you still experiencing this issue? Have you updated your code to use a single Connection, persisting for the lifetime of the process, for all operations?

tetlika

Sep 12, 2012, 3:15:31 PM
to mongodb-user
Hi Jesse!

We're still working on that. As soon as we make the changes, we'll
post an update here.

tetlika

Sep 23, 2012, 6:50:21 AM
to mongod...@googlegroups.com
Jesse, it looks like it was fixed after we implemented the change you suggested.
Thanks!

On Wednesday, September 12, 2012, 22:01:46 UTC+3, A. Jesse Jiryu Davis wrote:

A. Jesse Jiryu Davis

Sep 23, 2012, 5:44:11 PM
to mongod...@googlegroups.com
Glad it worked, thanks for the update!