Experiencing Mongo::OperationTimeout every 20mins-2 hours

Alon Burg

Jan 18, 2012, 11:43:15 AM

to mongod...@googlegroups.com

I am seeing a Mongo::OperationTimeout roughly every 20 minutes to 1 hour.

My stack:

Rails 3.1.3
Mongoid 3 (git edge)
Unicorn 4.1.1
2 x MongoDB 2.0.2 (which should already have the correct KeepAlive default), configured as a ReplicaSet
Ubuntu m1.large EC2

Following http://www.mongodb.org/display/DOCS/Amazon+EC2 I set KeepAlive on EC2 to 300, but that did not help.

I have also tried running with a single primary instead of the ReplicaSet, but this did not help either.
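One way to confirm that the keepalive value actually took effect on a given machine is to read it back from the kernel. This is a minimal sketch, assuming Linux and the standard procfs path; the helper name `keepalive_ok?` is hypothetical and not part of any library.

```ruby
# Standard Linux procfs location for the TCP keepalive idle time (seconds).
KEEPALIVE_PATH = "/proc/sys/net/ipv4/tcp_keepalive_time"

# Hypothetical helper: true when the configured keepalive idle time is at
# or below max_seconds (300 is the value recommended in the MongoDB docs).
# The path is a parameter so the check can be pointed at another file.
def keepalive_ok?(path = KEEPALIVE_PATH, max_seconds = 300)
  Integer(File.read(path).strip) <= max_seconds
end

# Usage on a Linux host:
#   puts "keepalive ok? #{keepalive_ok?}"
```

Running the check on both the mongod hosts and the app servers shows whether every machine involved really has the lower value.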

Below is mongoid.conf:
production:
  database: cloud
  op_timeout: 10
  read_secondary: true
  max_retries_on_connection_failure: 3
  identity_map_enabled: true
  allow_dynamic_fields: false
  hosts:
    - - ip-XXX.ec2.internal
      - 27017
    - - ip-XXX.ec2.internal
      - 27017

Also posted on Stack Overflow: http://stackoverflow.com/questions/8913867/experiencing-mongooperationtimeout-every-20mins-2-hours

Eliot Horowitz

Jan 18, 2012, 6:17:49 PM

to mongod...@googlegroups.com

Have you looked in the mongod log for slow queries?

Do you have a socket timeout configured in the code?

Alon Burg

Jan 18, 2012, 6:22:21 PM

to mongod...@googlegroups.com

We do not see any slow queries in the mongod log.

We do have a socket timeout configured, via op_timeout: 10 in mongoid.yml.

Alon Burg

Jan 18, 2012, 6:44:00 PM

to mongod...@googlegroups.com

Our DB is not in production yet (still testing) and is only about 200 MB, so there really shouldn't be any large queries.

Kyle Banker

Jan 18, 2012, 10:19:31 PM

to mongodb-user

I'd recommend setting the op_timeout to something much higher, perhaps 60.

Alon Burg

Jan 19, 2012, 4:01:35 PM

to mongod...@googlegroups.com

More people are experiencing the same situation: https://groups.google.com/forum/#!topic/mongoid/I9foayTJ5Wo

Another related question:

In http://www.mongodb.org/display/DOCS/Troubleshooting#Troubleshooting-Socketerrorsinshardedclustersandreplicasets

it says that the keepalive should be reduced to 300 seconds on the mongod machines.

Shouldn't this actually be set on the mongo client (frontend) machines?

Kyle Banker

Jan 20, 2012, 10:04:32 AM

to mongod...@googlegroups.com

Alon,

Please try either raising the op_timeout value or eliminating the timeout altogether by setting it to nil.

To answer your other question, you should set the keepalive on all machines. The mongod machines do communicate with other parts of the shard cluster (config servers, etc.).

Kyle

Chuck Remes

Jan 20, 2012, 10:26:11 AM

to mongod...@googlegroups.com

In mongo ruby driver 1.5.2 it is *not possible* to eliminate the :op_timeout by setting it to nil as seen in this code.

https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/repl_set_connection.rb#L472

A nil setting will cause the default of 30 to be set.
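The behavior described here is what the common `value || default` idiom produces. The sketch below is illustrative only and does not reproduce the driver's actual source; the method names and the `DEFAULT_OP_TIMEOUT` constant are hypothetical. It contrasts that idiom with `Hash#fetch`, which would let an explicit nil through.

```ruby
DEFAULT_OP_TIMEOUT = 30 # seconds; the default value mentioned above

# With `||`, an explicit nil is indistinguishable from a missing key,
# so the default always wins and the timeout cannot be disabled.
def timeout_with_or(opts)
  opts[:op_timeout] || DEFAULT_OP_TIMEOUT
end

# With `fetch`, the default applies only when the key is absent,
# so a caller can pass nil to mean "no timeout".
def timeout_with_fetch(opts)
  opts.fetch(:op_timeout, DEFAULT_OP_TIMEOUT)
end
```

Under the first pattern, passing `:op_timeout => nil` still yields 30, which matches the report that the timeout cannot be switched off.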

BTW, I think this thread is uncovering the same issue as the earlier thread "Re: When upgrade to ruby mongo 1.5.2, server is very slow". I have also seen some hard crashes with the driver when a replica secondary goes offline. I will capture the backtrace next time I see it and open an issue.

cr

Kyle Banker

Jan 20, 2012, 11:03:43 AM

to mongod...@googlegroups.com

Thanks, Chuck. We'll try to reproduce these issues and get some fixes out ASAP.

Alon Burg

Jan 20, 2012, 11:16:50 AM

to mongod...@googlegroups.com

After some group thinking, here are some points we came up with regarding our situation:

We are using Mongoid 3.0 with op_timeout: 30 (Mongoid 2.3 and earlier did not have op_timeout enabled), which is what actually surfaces the OperationTimeout. Many other users may be hitting the same problem without seeing it in their logs, and instead just end up with stuck Unicorn workers.
We are using Unicorn, which spawns worker processes ahead of time and keeps them waiting, unlike Passenger, which scales dynamically. Since we are currently only in test mode and have no real traffic, many of the workers probably sit idle long enough for their mongo connections to go stale. Most people probably never reach this state, but might experience it every now and then.
The Linux KeepAlive setting described at www.mongodb.org/display/DOCS/Troubleshooting#Troubleshooting-Socketerrorsinshardedclustersandreplicasets does not seem to help.
For now, I have created a dummy Rack middleware that runs an initial mongo query and handles the exception if needed. Here's the code: https://gist.github.com/1647879
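For readers who cannot open the gist, the general shape of such a workaround might look like the following. This is a rough sketch and not the code from the gist: the class name `MongoPing` and its `:ping`, `:error_class`, and `:retries` options are hypothetical. The ping is passed in as a callable so the sketch does not depend on any particular Mongoid or driver API.

```ruby
# Hypothetical Rack middleware: runs a cheap "ping" before each request and
# retries it a limited number of times if it raises the given error class.
class MongoPing
  def initialize(app, options = {})
    @app         = app
    @ping        = options.fetch(:ping)                      # callable issuing a trivial query
    @error_class = options.fetch(:error_class, StandardError)
    @retries     = options.fetch(:retries, 1)
  end

  def call(env)
    attempts = 0
    begin
      @ping.call
    rescue @error_class
      attempts += 1
      retry if attempts <= @retries
      # Out of retries: fall through and let the request proceed anyway.
    end
    @app.call(env)
  end
end

# Usage in config.ru (the ping body is up to you, for example any trivial query):
#   use MongoPing, :ping => lambda { ... }, :error_class => Mongo::OperationTimeout
```

The idea is that a stale connection fails on the throwaway ping rather than on the real request, so the retry happens before the application code runs.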

I hope that later today I'll be able to get to the bottom of why the connection is not kept alive.
