I added a third shard to my cluster last week and no data has moved to
it at all. I've been digging for answers and I'm finding related
things but nothing that quite addresses the issue -- my first thought
was that there is a stale lock somewhere, but I don't see it. Any
help would be much appreciated, because I've GOT to get this sorted
soon -- the RS1 shard is about 80% on disk.
DB is v2.0; we have been using v2.0 for several months now and have
not made the jump to 2.2 yet.
Below is everything I think might help someone figure this out. Sorry
for the wall o' text; I wasn't sure how much to include.
---------------------------------------------------------------------------
[The following done from the RS1 primary unless otherwise indicated,
if that matters]
mongos> ObjectId().getTimestamp()
ISODate("2012-09-03T21:03:09Z")
mongos> use admin
switched to db admin
mongos> db.printShardingStatus()
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "RS1", "host" : "RS1/xxx:27018" }
{ "_id" : "RS2", "host" : "RS2/xxx:27018" }
{ "_id" : "RS3", "host" : "RS3/xxx:27018" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" :
"config" }
{ "_id" : "demo", "partitioned" : false, "primary" :
"RS1" }
{ "_id" : "cm_prod", "partitioned" : true, "primary" :
"RS1" }
cm_prod.profile_daily_stats chunks:
RS2 16
RS1 16
too many chunks to print, use verbose if you
want to force print
cm_prod.profile_raw_stats chunks:
RS2 160
RS1 157
too many chunks to print, use verbose if you
want to force print
cm_prod.video_latest_stats chunks:
RS1 152
RS2 153
too many chunks to print, use verbose if you
want to force print
cm_prod.video_raw_stats chunks:
RS1 3360
RS2 3364
too many chunks to print, use verbose if you
want to force print
{ "_id" : "test", "partitioned" : false, "primary" :
"RS2" }
{ "_id" : "admin; db", "partitioned" : false, "primary" :
"RS2" }
{ "_id" : "system", "partitioned" : false, "primary" :
"RS2" }
{ "_id" : "geoff", "partitioned" : false, "primary" :
"RS2" }
{ "_id" : "cm_demo_data", "partitioned" : false,
"primary" : "RS2" }
{ "_id" : "cm_develop_data", "partitioned" : false,
"primary" : "RS2" }
{ "_id" : "cm", "partitioned" : false, "primary" : "RS2" }
As you can see, there are only 4 sharded collections, all in the
cm_prod DB. Of those, video_raw_stats completely dominates the size:
mongos> use cm_prod
switched to db cm_prod
mongos> db.video_raw_stats.totalSize()
558973318352 [ == 520G]
mongos> use config
switched to db config
There is only one balancer lock outstanding:
mongos> db.locks.find({ _id : 'balancer'})
{ "_id" : "balancer", "process" : "RS3:27017:1346462580:1804289383",
"state" : 2, "ts" : ObjectId("50451aea4bbb07ad98706071"), "when" :
ISODate("2012-09-03T21:02:34.033Z"), "who" :
"RS3:27017:1346462580:1804289383:Balancer:846930886", "why" : "doing
balance round" }
All existing locks (included for completeness):
mongos> db.locks.find()
{ "_id" : "balancer", "process" : "RS3:27017:1346462580:1804289383",
"state" : 2, "ts" : ObjectId("50451b814bbb07ad9870607f"), "when" :
ISODate("2012-09-03T21:05:05.089Z"), "who" :
"RS3:27017:1346462580:1804289383:Balancer:846930886", "why" : "doing
balance round" }
{ "_id" : "cm_prod.video_latest_stats", "process" :
"RS2:27018:1346463433:367275570", "state" : 0, "ts" :
ObjectId("50451b8828145fb74ca6281a"), "when" :
ISODate("2012-09-03T21:05:12.887Z"), "who" :
"RS2:27018:1346463433:367275570:conn19:96364934", "why" : "migrate-
{ _id: \"-OkLQldVbhI\" }" }
{ "_id" : "cm_prod.video_descriptions", "process" :
RS1:27017:1343957738:1804289383", "state" : 0, "ts" :
ObjectId("5022feeaf7448e409c940963"), "when" :
ISODate("2012-08-09T00:06:02.618Z"), "who" :
"RS1:27017:1343957738:1804289383:conn9530:1058323149", "why" :
"drop" }
{ "_id" : "cm_prod.video_raw_stats", "process" :
"RS2:27018:1346463433:367275570", "state" : 0, "ts" :
ObjectId("50451b8a28145fb74ca6281c"), "when" :
ISODate("2012-09-03T21:05:14.050Z"), "who" :
"RS2:27018:1346463433:367275570:conn19:96364934", "why" : "migrate-
{ vt: \"--RsEGRDz6g\", h: 1343228523 }" }
{ "_id" : "cm_prod.profile_raw_stats", "process" :
"RS2:27018:1346463433:367275570", "state" : 0, "ts" :
ObjectId("50451b8928145fb74ca6281b"), "when" :
ISODate("2012-09-03T21:05:13.469Z"), "who" :
"RS2:27018:1346463433:367275570:conn19:96364934", "why" : "migrate-
{ u: MinKey, h: MinKey }" }
{ "_id" : "cm_prod.profile_daily_stats", "process" :
"RS1:27018:1346463419:1483632176", "state" : 0, "ts" :
ObjectId("50451b7d43ad3b2149370fe0"), "when" :
ISODate("2012-09-03T21:05:01.805Z"), "who" :
"RS1:27018:1346463419:1483632176:conn19:
2127675310", "why" : "migrate-
{ u: \"lionsgatelive\", h: 1345939325 }" }
{ "_id" : "cm_prod.channel_engagements", "process" :
"RS1:27017:1342673255:1804289383", "state" : 0, "ts" :
ObjectId("5007927048e98d9778948bb9"), "when" :
ISODate("2012-07-19T04:52:00.791Z"), "who" :
"RS1:27017:1342673255:1804289383:conn2:972605807", "why" : "drop" }
{ "_id" : "cm_prod.channel_video_grouped", "process" :
"RS1:27017:1343540831:1804289383", "state" : 0, "ts" :
ObjectId("501adff96dac5fc19aab36e4"), "when" :
ISODate("2012-08-02T20:15:53.401Z"), "who" :
"RS1:27017:1343540831:1804289383:conn1093:258764218", "why" : "drop" }
On the RS2 machine:
[RS2 ~]# tail -n 30 /var/log/mongo/mongod.log
Mon Sep 3 21:26:41 [conn19] command admin.$cmd command: { moveChunk:
"cm_prod.video_latest_stats", from: "RS2/xxx:27018", to: "RS3/xxx:
27018", fromShard: "RS2", toShard: "RS3", min: { _id: "-OkLQldVbhI" },
max: { _id: "-amRGWbej2g" }, maxChunkSizeBytes: 67108864, shardId:
"cm_prod.video_latest_stats-_id_"-OkLQldVbhI"", configdb:
"dbconf1:27019,dbconf2:27019,dbconf3:27019", secondaryThrottle:
false } ntoreturn:1 reslen:199 680ms
Mon Sep 3 21:26:41 [conn19] received moveChunk request: { moveChunk:
"cm_prod.profile_raw_stats", from: "RS2/xxx:27018", to: "RS3/xxx:
27018", fromShard: "RS2", toShard: "RS3", min: { u: MinKey, h:
MinKey }, max: { u: "000christoph000", h: 1339286523 },
maxChunkSizeBytes: 67108864, shardId: "cm_prod.profile_raw_stats-
u_MinKeyh_MinKey", configdb:
"dbconf1:27019,dbconf2:27019,dbconf3:27019", secondaryThrottle:
false }
Mon Sep 3 21:26:41 [conn19] created new distributed lock for
cm_prod.profile_raw_stats on dbconf1:27019,dbconf2:27019,dbconf3:27019
( lock timeout : 900000, ping interval : 30000, process : 0 )
Mon Sep 3 21:26:41 [conn19] distributed lock
'cm_prod.profile_raw_stats/RS2:27018:1346463433:367275570' acquired,
ts : 5045209128145fb74ca6290c
Mon Sep 3 21:26:41 [conn19] about to log metadata event: { _id:
"RS2-2012-09-03T21:26:41-228109", server: "RS2", clientAddr:
"xxx.xxx.xxx.xxx (RS3):58137", time: new Date(1346707601262), what:
"moveChunk.start", ns: "cm_prod.profile_raw_stats", details: { min:
{ u: MinKey, h: MinKey }, max: { u: "000christoph000", h:
1339286523 }, from: "RS2", to: "RS3" } }
Mon Sep 3 21:26:41 [conn19] moveChunk request accepted at version 52|
7
Mon Sep 3 21:26:41 [conn19] moveChunk number of documents: 1241
Mon Sep 3 21:26:41 [conn19] distributed lock
'cm_prod.profile_raw_stats/RS2:27018:1346463433:367275570' unlocked.
Mon Sep 3 21:26:41 [conn19] about to log metadata event: { _id:
"RS2-2012-09-03T21:26:41-228110", server: "RS2", clientAddr:
"xxx.xxx.xxx.xxx (RS3):58137", time: new Date(1346707601534), what:
"moveChunk.from", ns: "cm_prod.profile_raw_stats", details: { min:
{ u: MinKey, h: MinKey }, max: { u: "000christoph000", h:
1339286523 }, step1: 0, step2: 279, note: "aborted" } }
Mon Sep 3 21:26:41 [conn19] command admin.$cmd command: { moveChunk:
"cm_prod.profile_raw_stats", from: "RS2/xxx:27018", to: "RS3/xxx:
27018", fromShard: "RS2", toShard: "RS3", min: { u: MinKey, h:
MinKey }, max: { u: "000christoph000", h: 1339286523 },
maxChunkSizeBytes: 67108864, shardId: "cm_prod.profile_raw_stats-
u_MinKeyh_MinKey", configdb:
"dbconf1:27019,dbconf2:27019,dbconf3:27019", secondaryThrottle:
false } ntoreturn:1 reslen:199 511ms
Mon Sep 3 21:26:41 [conn19] received moveChunk request: { moveChunk:
"cm_prod.video_raw_stats", from: "RS2/xxx:27018", to: "RS3/xxx:27018",
fromShard: "RS2", toShard: "RS3", min: { vt: "--RsEGRDz6g", h:
1343228523 }, max: { vt: "--u6Xjj9524", h: "1340809383" },
maxChunkSizeBytes: 67108864, shardId: "cm_prod.video_raw_stats-vt_"--
RsEGRDz6g"h_1343228523", configdb:
"dbconf1:27019,dbconf2:27019,dbconf3:27019", secondaryThrottle:
false }
Mon Sep 3 21:26:41 [conn19] created new distributed lock for
cm_prod.video_raw_stats on dbconf1:27019,dbconf2:27019,dbconf3:27019
( lock timeout : 900000, ping interval : 30000, process : 0 )
Mon Sep 3 21:26:41 [conn195] remove cm_prod.video_latest_stats query:
{ _id: "Fh-DHjnZ1Q8" } 125ms
Mon Sep 3 21:26:41 [conn177] remove cm_prod.video_latest_stats query:
{ _id: "NTo1BEPzXBs" } 118ms
Mon Sep 3 21:26:41 [conn191] remove cm_prod.video_latest_stats query:
{ _id: "0lM78b-hM8k" } 162ms
Mon Sep 3 21:26:41 [conn19] distributed lock 'cm_prod.video_raw_stats/
RS2:27018:1346463433:367275570' acquired, ts :
5045209128145fb74ca6290d
Mon Sep 3 21:26:41 [conn19] about to log metadata event: { _id:
"RS2-2012-09-03T21:26:41-228111", server: "RS2", clientAddr:
"xxx.xxx.xxx.xxx (RS3):58137", time: new Date(1346707601742), what:
"moveChunk.start", ns: "cm_prod.video_raw_stats", details: { min:
{ vt: "--RsEGRDz6g", h: 1343228523 }, max: { vt: "--u6Xjj9524", h:
"1340809383" }, from: "RS2", to: "RS3" } }
Mon Sep 3 21:26:41 [conn19] moveChunk request accepted at version
1542|437
Mon Sep 3 21:26:42 [conn177] remove cm_prod.video_latest_stats query:
{ _id: "75B0x5bT87U" } 140ms
Mon Sep 3 21:26:42 [conn157] query cm_prod.video_raw_stats query:
{ orderby: { h: -1 }, query: { vt: "4_V-NB2v6zI", hs: { $in:
[ 1346198400, 1345852800, 1345248000, 1343865600, 1338681600 ] } } }
ntoreturn:10 nscanned:10 scanAndOrder:1 nreturned:6 reslen:704 101ms
Mon Sep 3 21:26:43 [conn19] moveChunk number of documents: 68064
Mon Sep 3 21:26:43 [conn19] distributed lock 'cm_prod.video_raw_stats/
RS2:27018:1346463433:367275570' unlocked.
Mon Sep 3 21:26:43 [conn19] about to log metadata event: { _id:
"RS2-2012-09-03T21:26:43-228112", server: "RS2", clientAddr:
"xxx.xxx.xxx.xxx (RS3):58137", time: new Date(1346707603436), what:
"moveChunk.from", ns: "cm_prod.video_raw_stats", details: { min: { vt:
"--RsEGRDz6g", h: 1343228523 }, max: { vt: "--u6Xjj9524", h:
"1340809383" }, step1: 0, step2: 268, note: "aborted" } }
Mon Sep 3 21:26:43 [conn19] command admin.$cmd command: { moveChunk:
"cm_prod.video_raw_stats", from: "RS2/xxx:27018", to: "RS3/xxx:27018",
fromShard: "RS2", toShard: "RS3", min: { vt: "--RsEGRDz6g", h:
1343228523 }, max: { vt: "--u6Xjj9524", h: "1340809383" },
maxChunkSizeBytes: 67108864, shardId: "cm_prod.video_raw_stats-vt_"--
RsEGRDz6g"h_1343228523", configdb:
"dbconf1:27019,dbconf2:27019,dbconf3:27019", secondaryThrottle:
false } ntoreturn:1 reslen:199 2036ms
Mon Sep 3 21:26:44 [conn193] remove cm_prod.video_latest_stats query:
{ _id: "PXRPX2yZeBo" } 107ms
Mon Sep 3 21:26:44 [conn158] remove cm_prod.video_latest_stats query:
{ _id: "RhvB4mPPYr0" } 152ms
Mon Sep 3 21:26:44 [conn187] remove cm_prod.video_latest_stats query:
{ _id: "B2CtCrT3tXU" } 128ms
Mon Sep 3 21:26:44 [conn196] remove cm_prod.video_latest_stats query:
{ _id: "AlKQTBRJ3nk" } 121ms
Mon Sep 3 21:26:44 [conn159] remove cm_prod.video_latest_stats query:
{ _id: "HczAI7Hgf1Q" } 121ms
Mon Sep 3 21:26:44 [conn193] remove cm_prod.video_latest_stats query:
{ _id: "N5WjWgatUm8" } 146ms
On the RS3 machine:
[RS3 ~]# tail -n 30 /var/log/mongo/mongod.log
Mon Sep 3 20:37:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:37:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:37:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:39:37 [initandlisten] connection accepted from
127.0.1.1:43404 #192 (74 connections now open)
Mon Sep 3 20:40:35 [initandlisten] connection accepted from <RS1
machine private IP>:46020 #193 (75 connections now open)
Mon Sep 3 20:42:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:42:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:42:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:47:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:47:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:47:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:52:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:52:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:52:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:57:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:57:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 20:57:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:02:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:02:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:02:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:03:40 [conn140] end connection "<RS2 machine private IP>:
45967 (74 connections now open)
Mon Sep 3 21:07:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:07:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:07:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:12:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:12:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:12:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:17:06 [conn4] command admin.$cmd command:
{ writebacklisten: ObjectId('50416343edfa40550746b1b8') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:17:14 [conn10] command admin.$cmd command:
{ writebacklisten: ObjectId('5041633c0774eaaaa46d50d3') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms
Mon Sep 3 21:17:37 [conn14] command admin.$cmd command:
{ writebacklisten: ObjectId('504163334bbb07ad9870119d') } ntoreturn:1
keyUpdates:0 reslen:44 300000ms