We are encountering odd behavior with our sharded MongoDB cluster running on AWS, and I wonder if anyone else has seen it.
Essentially we are seeing a "beat" pattern across all clients with a period on the order of 10 seconds. For about 10 seconds the clients all run well: maybe 2,000 queries/second, 50-100 inserts (via findAndModify)/second, another 100-200 updates/second, and so forth. Then all the app servers stall, sometimes dropping as low as 0 operations in a 1-2 second window. Queries, updates, commands, everything. We can see this from watching the queues, from mongostat on the mongos running on each app server, and from mongostat running on the shard primaries.
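For what it's worth, here is a rough sketch of how we flag the stall windows when post-processing mongostat samples (the threshold and the shape of the samples are assumptions, not anything official):

```python
# Hypothetical sketch: flag "stall" seconds in per-second op-count samples
# taken from mongostat. Each sample is a tuple of
# (queries/sec, inserts/sec, updates/sec). The threshold is an assumption.

def find_stalls(samples, threshold=10):
    """Return indices of samples where total ops/sec drops below threshold."""
    return [i for i, s in enumerate(samples) if sum(s) < threshold]

# Example: ~10 seconds of healthy traffic followed by a 2-second stall.
healthy = [(2000, 75, 150)] * 10
stalled = [(0, 0, 0), (3, 0, 1)]
print(find_stalls(healthy + stalled))  # -> [10, 11]
```

The pattern we see looks exactly like this: long healthy runs, then a short window where every op type goes to near zero at once.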
We have 3 shards running. The primary shard runs Amazon Linux (CentOS-based) with 15 GB RAM and 400 GB of EBS (2,000 PIOPS). The two smaller shards run the same OS with 200 GB of EBS (1,000 PIOPS) on m1.large instances. Just two collections are sharded: one small collection (8 million documents) and one a little larger (30 million documents). So we are definitely not running a large system here.
Each mongod has 350+ connections from the 5-6 app servers. Could we simply have too many connections?
Thanks for any guidance.
Mike