@Suhail;
I would like to clarify the issue here, because I honestly think that
you've simply hit a capacity limit.
Server #1: 8GB of RAM:
"storageSize" : 5,028,424,960 (5.0GB)
"totalIndexSize" : 7,831,079,536 (7.8GB)
Server #2: 8GB of RAM:
"storageSize" : 4,180,220,928 (4.2GB)
"totalIndexSize" : 6,100,879,984 (6.1GB)
Problem #1: IOWait Spikes on Server #1
-----------
The index size on Server #1 (7.8GB) is essentially equal to the RAM
(8GB). This means that, at any given time, there's almost no actual
data in RAM. The RAM is mostly busy just keeping the index around.
As you get more documents, this problem is just going to get worse
because there is less and less "data" in RAM and more and more
indexes.
In this sense you have a capacity problem. You have a total of 14GB of
indexes with only 16GB of available RAM. Our general rule for good
performance is to keep all indexes in RAM.
*Solution #1*: It's time to add a new shard. The indexes on your data
are roughly the size of your total RAM, so the IO spikes are just
going to get worse.
*Solution #1a*: Shrink the indexes. What's the purpose of key_1? How
does "key" differ from "_id"?
Problem #2: Uneven query distribution / balancing problems
-----------
Shard #1:
693 queries/s
"count" : 43,504,321 (43.5M)
Shard #2:
183 queries/s
"count" : 35,162,459 (35.1M)
Ideal balance for this much data is ~39.3M documents / node
(= (43.5M + 35.1M) / 2). As Eliot stated, we're within about 20% here,
so that distribution is not bad.
Your queries, however, are *not* balanced. Look at "queries per second
per million documents":
Shard #1: 15.9 q/s/M
Shard #2: 5.2 q/s/M
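(Those numbers are just queries/s divided by millions of documents:
693 / 43.5 ~= 15.9 and 183 / 35.1 ~= 5.2. If you want to re-derive
them yourself, here's a rough sketch to run against each shard, with
"users" as a placeholder collection:)

  // Sample the query opcounter twice, 10 seconds apart.
  // Note: the shell's sleep() takes milliseconds.
  var a = db.serverStatus().opcounters.query;
  sleep(10 * 1000);
  var b = db.serverStatus().opcounters.query;
  var qps = (b - a) / 10;
  print(qps.toFixed(1) + " q/s, " +
        (qps / (db.users.count() / 1e6)).toFixed(1) + " q/s/M");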
So Shard #1 is getting 3x the number of queries even after we
normalize for the number of documents. For whatever reason, the users
that are active on Shard #1 are just *more* active. That means the
server that is most short on RAM is also the server that's fielding
the most queries.
MongoDB does *not* shard "by activity"; it shards "by key region". You
say that the load on each shard should be even, but clearly you have
more activity on one node than the other.
*Solution #2*: It's time to add a new shard. You'll need to
redistribute that load and get more RAM into the equation.
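Mechanically, adding the shard is a single command from a mongos (the
hostname below is made up):

  // Run from a mongos; "shard3.example.com" is a placeholder.
  db.adminCommand({ addshard : "shard3.example.com:27018" });
  // Confirm the new shard is registered:
  db.adminCommand({ listshards : 1 });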
*Solution #2a*: (see solution #1a)
*Solution #2b*: manual re-chunking. This is the *very* last thing that
you want to do. But you may have to manually move some of the 'hot'
zones from #1 to #2. I personally advise against this. In most cases,
adding a new server is much cheaper than trying to make this happen.
Plus, in general, you want to trust Mongo to do this correctly.
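For completeness, a manual move looks like this (the db, collection,
key value, and target shard name are all placeholders), but again: let
the balancer do its job if you possibly can.

  // Run from a mongos. Moves the chunk containing the given key
  // value to the named shard.
  db.adminCommand({
      moveChunk : "mydb.users",
      find      : { key : "some-hot-key" },
      to        : "shard0001"
  });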
Conclusion
----------
So there are solutions to both of your problems.
Based purely on the numbers, it looks like you have some capacity
issues. It also looks like the data is not evenly distributed in terms
of query load.
The easiest solution to both problems is simply to add a new shard. I
highly suggest that you add that shard during a period of low IOWait
so that you don't overload the existing servers during the chunking
process.
To reiterate, adding a shard is going to slow down everyone else
until that shard is "caught up".
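You can watch the balancer's progress from a mongos; once the chunk
counts are roughly even across shards, you're caught up. A sketch
(the shard name is a placeholder):

  // Chunk counts per shard, plus a direct count from the config db.
  db.printShardingStatus();
  db.getSiblingDB("config").chunks.count({ shard : "shard0000" });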
Regards;
Gates