I am running MongoDB 2.6.4 with 4 servers and 8 shards (each server has four shards, and +1 replication). I have several mongos routers and three config servers.
While balancing a collection I ended up with this (it was not pre-sharded):
{ "_id" : "rty", "partitioned" : true, "primary" : "rty" }
rty.users
shard key: { "_id" : "hashed" }
chunks:
rty-sh4 880
rty-sh5 880
rty-sh7 880
rty-sh6 880
rty-sh8 880
rty-sh1 881
rty-sh3 880
rty 881
As you can see, the data distribution looks very wrong:
Totals
data : 218.57GiB docs : 631746259 chunks : 7042
Shard rty contains 19.7% data, 19.21% docs in cluster, avg obj size on shard : 380B
Shard rty-sh1 contains 40.58% data, 45.32% docs in cluster, avg obj size on shard : 332B
Shard rty-sh3 contains 5.86% data, 4.98% docs in cluster, avg obj size on shard : 437B
Shard rty-sh4 contains 7.39% data, 6.89% docs in cluster, avg obj size on shard : 398B
Shard rty-sh5 contains 5.89% data, 5.04% docs in cluster, avg obj size on shard : 434B
Shard rty-sh6 contains 7.32% data, 6.73% docs in cluster, avg obj size on shard : 404B
Shard rty-sh7 contains 5.89% data, 5.04% docs in cluster, avg obj size on shard : 434B
Shard rty-sh8 contains 7.33% data, 6.75% docs in cluster, avg obj size on shard : 403B
Going through the logs, I found the following:
27019/ded1326:27017:1409603898:1804289383', sleeping for 30000ms
2014-10-08T06:02:07.087-0500 [conn474] SyncClusterConnection connecting to [
10.30.1.162:27019]
2014-10-08T06:02:07.088-0500 [conn474] SyncClusterConnection connecting to [
10.30.1.166:27019]
2014-10-08T06:02:07.088-0500 [conn474] SyncClusterConnection connecting to [
10.30.1.78:27019]
2014-10-08T06:02:07.112-0500 [conn475] ChunkManager: time to load chunks for rty.users: 24ms sequenceNumber: 2265652 version: 3523|1||5406025d06417b544f8da8a4 based on: 3522|1||5406025d06417b544f8da8a4
2014-10-08T06:02:07.566-0500 [Balancer] couldn't find database [rty_test] in config db
2014-10-08T06:02:07.735-0500 [Balancer] warning: could not move chunk min: { _id: MinKey } max: { _id: -9223075039127251768 }, continuing balancing round :: caused by :: 10181 not sharded:rty_test.users_test
2014-10-08T06:02:08.661-0500 [Balancer] moveChunk result: { ok: 0.0, errmsg: "ns not found, should be impossible" }
2014-10-08T06:02:08.662-0500 [Balancer] balancer move failed: { ok: 0.0, errmsg: "ns not found, should be impossible" } from: rty to: rty-sh3 chunk: min: { _id: MinKey } max: { _id: -8849721541900570023 }
2014-10-08T06:02:08.852-0500 [conn477] ChunkManager: time to load chunks for rty.users: 104ms sequenceNumber: 2265653 version: 3523|1||5406025d06417b544f8da8a4 based on: (empty)
2014-10-08T06:02:08.872-0500 [Balancer] distributed lock 'balancer/srv0001:27017:1409603898:1804289383' unlocked.
After that, there are numerous entries about various moves failing with the same error message. Can someone shed some light on how this can happen, what to do to balance it (if there are options other than dump annd re-import because there is over 300M documents) and how to prevent this from happening in the future.
Thanks!