Balancing does not work

Balancing does not work Daniel Schlegel 8/30/12 3:20 AM
Hello
I upgraded to MongoDB 2.2.0 yesterday. Everything went like a charm, and I was able to shard my collection.
In the documentation I read:
Both splits and migrates are performed automatically.

Unfortunately, this doesn't work on our setup.
I get error messages like these:

Thu Aug 30 12:16:42 [Balancer]  ns: production.people going to move { _id: "production.people-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('503e9d5ef940d75c2de07f8e'), ns: "production.people", min: { _id: MinKey }, max: { _id: 304836 }, shard: "s1" } from: s1 to: s2 tag []

Thu Aug 30 12:16:42 [Balancer] moving chunk ns: production.people moving ( ns:production.people at: s1:s1/mongo11.example.com:20011,mongo12.example.com:20012 lastmod: 1|0||000000000000000000000000 min: { _id: MinKey } max: { _id: 304836 }) s1:s1/mongo11.example.com:20011,mongo12.example.com:20012 -> s2:s2/mongo21.example.com:20021,mongo22.example.com:20022

Thu Aug 30 12:16:43 [Balancer] moveChunk result: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 }

Thu Aug 30 12:16:43 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: s1 to: s2 chunk:  min: { _id: MinKey } max: { _id: MinKey }


After 10 hours of waiting, all the data is still on the first shard and the second is empty.
I'm thinking about splitting manually, but since it's not easy to find the right split point in our database, I haven't done this yet.
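(For the record, if manual splitting does become necessary, sh.splitFind() asks the server to split the chunk containing a matching document at its median, so you don't have to compute the split point yourself. A sketch from a mongos shell; the _id value is made up for illustration, and you should check the shell reference for your version:)

```javascript
// Illustrative only -- run from a mongos shell against a live cluster.
// Split the chunk that contains { _id: 150000 } at its median point:
sh.splitFind("production.people", { _id: 150000 });

// Or, if you already know the exact boundary you want:
sh.splitAt("production.people", { _id: 150000 });
```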

Thanks for your help
Dani
Re: Balancing does not work Adam C 8/30/12 3:32 AM
Dani,

You might have a stale lock lying around, or the mongos may have a stale view of things - can you do a couple of things for me:

1. Bounce (restart) all of your mongos processes
2. Once the bounce is complete, log into a mongos and run:

use config;
db.locks.find();

And post the results here.

Thanks,

Adam
Re: Balancing does not work Daniel Schlegel 8/30/12 4:37 AM
Hi Adam
Here's the output:

mongos> db.locks.find();

{ "_id" : "admin-movePrimary", "process" : "Web1:27069:1343571761:1804289383", "state" : 0, "ts" : ObjectId("501555d74439248d85dc8867"), "when" : ISODate("2012-07-29T15:25:11.099Z"), "who" : "Web1:27069:1343571761:1804289383:conn122:1714636915", "why" : "Moving primary shard of admin" }

{ "_id" : "example_production-movePrimary", "process" : "Web1:27069:1343571761:1804289383", "state" : 0, "ts" : ObjectId("501553614439248d85dc885a"), "when" : ISODate("2012-07-29T15:14:41.616Z"), "who" : "Web1:27069:1343571761:1804289383:conn1:1681692777", "why" : "Moving primary shard of example_production" }

{ "_id" : "example_production_vanity-movePrimary", "process" : "Web1:27069:1343571761:1804289383", "state" : 0, "ts" : ObjectId("501552fb4439248d85dc8855"), "when" : ISODate("2012-07-29T15:12:59.598Z"), "who" : "Web1:27069:1343571761:1804289383:conn1:1681692777", "why" : "Moving primary shard of example_production_vanity" }

{ "_id" : "balancer", "process" : "web1:27069:1346283357:314909341", "state" : 2, "ts" : ObjectId("503f4f2a3c113ffbd8e4a7e9"), "when" : ISODate("2012-08-30T11:31:54.320Z"), "who" : "web1:27069:1346283357:314909341:Balancer:1842493053", "why" : "doing balance round" }

{ "_id" : "example_production.people", "process" : "mongo11:20011:1346282264:758785138", "state" : 0, "ts" : ObjectId("503f4f2ce69a6c2009e22331"), "when" : ISODate("2012-08-30T11:31:56.182Z"), "who" : "mongo11:20011:1346282264:758785138:conn37208:1670912857", "why" : "migrate-{ _id: MinKey }" }

{ "_id" : "example_production.new_coll", "process" : "web1:27069:1346283357:314909341", "state" : 0, "ts" : ObjectId("503f36eb3c113ffbd8e4a6a1"), "when" : ISODate("2012-08-30T09:48:27.208Z"), "who" : "web1:27069:1346283357:314909341:conn37665:149759223", "why" : "drop" }

{ "_id" : "example_production_vanity.metrics", "process" : "web1:27069:1346283357:314909341", "state" : 0, "ts" : ObjectId("503f37833c113ffbd8e4a6aa"), "when" : ISODate("2012-08-30T09:50:59.474Z"), "who" : "web1:27069:1346283357:314909341:conn37665:149759223", "why" : "drop" }


Thanks!

Dani


On Thursday, August 30, 2012, 12:32:06 UTC+2, Adam C wrote:
Re: Balancing does not work Daniel Schlegel 8/30/12 4:37 AM
example_production.people is the important collection.

On Thursday, August 30, 2012, 13:37:02 UTC+2, Daniel Schlegel wrote:
Re: Balancing does not work Adam C 8/30/12 4:48 AM
You have several movePrimary locks dating back to last month, including the admin database, which should *always* be on "config" - did you drain a shard, or use movePrimary for some other reason?

Can you also post the output of:

use config;
db.databases.find();

Adam.
Re: Balancing does not work Daniel Schlegel 8/30/12 5:17 AM
Yes, I did drain a shard because I had a problem removing a replica:
1. I was not able to remove a replica, but had to move the server.
2. So I added a new shard, drained the old one, and removed the old shard.

Here's the discussion about it:
https://groups.google.com/forum/?fromgroups=#!searchin/mongodb-user/remove$20old$20replica/mongodb-user/Goi5cIqC6hQ/XmQF9NODUfMJ

mongos> db.databases.find();
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "silp_production", "partitioned" : true, "primary" : "s1" }

Thank you so much
Dani

On Thursday, August 30, 2012, 13:48:51 UTC+2, Adam C wrote:
Re: Balancing does not work Adam C 8/30/12 7:36 AM
Well, let's clear up the old entries first, then we can bounce the mongos and see if that helps things.  It may help if you restart the mongods too, to clear out anything stale on their side.  Each primary can only take part in a single migration at a time, so if they believe they are actually doing a migration, that may be part of the problem too.

To remove the stale entries:

use config;
db.locks.remove({"_id" : "admin-movePrimary"});
db.locks.remove({"_id" : "example_production-movePrimary"});
db.locks.remove({"_id" : "example_production_vanity-movePrimary"});

Then bounce the mongos and see if the migrations kick off.  If not, bounce the mongods as well to clear out anything stale there.
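(As an aside, the way to tell a live entry from a stale one in that output is the state field: 0 means unlocked, 1 means being acquired, and 2 means held. A tiny Python sketch, purely illustrative and not part of any MongoDB tooling, of filtering db.locks.find()-shaped documents for entries that claim to be held but look old:)

```python
from datetime import datetime, timedelta

# Lock state codes used by documents in config.locks:
# 0 = unlocked, 1 = being acquired, 2 = held.
HELD = 2

def stale_held_locks(locks, now, max_age=timedelta(hours=1)):
    """Return locks that claim to be held (state 2) but whose
    "when" timestamp is older than max_age."""
    return [lock for lock in locks
            if lock["state"] == HELD and now - lock["when"] > max_age]

# Documents shaped like the db.locks.find() output above
locks = [
    {"_id": "balancer", "state": 2,
     "when": datetime(2012, 8, 30, 11, 31, 54)},
    {"_id": "admin-movePrimary", "state": 0,
     "when": datetime(2012, 7, 29, 15, 25, 11)},
]
print([l["_id"] for l in stale_held_locks(locks, datetime(2012, 8, 31))])
```

A lock that is merely old is not necessarily broken (the balancer re-takes its lock every round), so treat anything this flags as a candidate to investigate, not something to delete blindly.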

Adam
Re: Balancing does not work Daniel Schlegel 8/30/12 8:26 AM
Hi Adam
I removed all the entries and restarted both the mongos and the mongods.
The problem is still there:

Thu Aug 30 17:18:34 [conn502] resetting shard version of example_production.people on mongo22.example.com:20022, version is zero
Thu Aug 30 17:18:34 [conn503] resetting shard version of example_production.people on mongo22.example.com:20022, version is zero
...

Thu Aug 30 17:18:41 [Balancer]  ns: example_production.people going to move { _id: "example_production.people-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('503e9d5ef940d75c2de07f8e'), ns: "example_production.people", min: { _id: MinKey }, max: { _id: 304836 }, shard: "s1" } from: s1 to: s2 tag []
Thu Aug 30 17:18:41 [Balancer] moving chunk ns: example_production.people moving ( ns:example_production.people at: s1:s1/mongo11.example.com:20011,mongo12.example.com:20012 lastmod: 1|0||000000000000000000000000 min: { _id: MinKey } max: { _id: 304836 }) s1:s1/mongo11.example.com:20011,mongo12.example.com:20012 -> s2:s2/mongo21.example.com:20021,mongo22.example.com:20022
...
Thu Aug 30 17:18:44 [Balancer] moveChunk result: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 }
Thu Aug 30 17:18:44 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: s1 to: s2 chunk:  min: { _id: MinKey } max: { _id: MinKey }

mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
{  "_id" : "s1",  "host" : "s1/mongo11.example.com:20011,mongo12.example.com:20012" }
{  "_id" : "s2",  "host" : "s2/mongo21.example.com:20021,mongo22.example.com:20022" }
  databases:
{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
{  "_id" : "example_production",  "partitioned" : true,  "primary" : "s1" }
example_production.people chunks:
s1 3079
too many chunks to print, use verbose if you want to force print

These locks appeared again (I think this is how it should be):

mongos> db.locks.find();
{ "_id" : "example_production.people", "process" : "mongo11:20011:1346339905:604514035", "state" : 2, "ts" : ObjectId("503f85a8b61cd58923b1fcea"), "when" : ISODate("2012-08-30T15:24:24.524Z"), "who" : "mongo11:20011:1346339905:604514035:conn46:460093839", "why" : "migrate-{ _id: MinKey }" }
{ "_id" : "balancer", "process" : "worker:27069:1346339673:1804289383", "state" : 2, "ts" : ObjectId("503f85a740d7cd22f64415dc"), "when" : ISODate("2012-08-30T15:24:23.583Z"), "who" : "worker:27069:1346339673:1804289383:Balancer:846930886", "why" : "doing balance round" }

I really appreciate your help; I'm pretty stuck with my current know-how.
Dani

On Thursday, August 30, 2012, 16:36:23 UTC+2, Adam C wrote:
Re: Balancing does not work Adam C 8/30/12 8:43 AM
What it is saying is that the shard it is attempting to migrate to, namely the primary from:


Thinks it is already taking part in a migration (as I mentioned, it can only take part in one at a time, hence the error).  Are you sure those mongod instances were restarted?

Can you show me rs.status() from that s2 replica set, and a db.currentOp() from the primary?

In addition - are there any migration messages (or other errors) visible in the logs for the primary for that set?

Adam
Re: Balancing does not work Daniel Schlegel 8/30/12 9:04 AM
Im pretty sure i restarted all the mongod's in our system. I made it again for the s2 mongod's.
Not sure if it was exactly on point of restarting the mongod's but the exception seam to have disappeared. 
Is it possible, that i had to bounce first the primary of s1 and then the primary of s2? I did it in reverse order.

Here are the outputs:
s2:PRIMARY> rs.status()
{
    "set" : "s2",
    "date" : ISODate("2012-08-30T15:52:40Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 57,
            "optime" : Timestamp(1346341931000, 3),
            "optimeDate" : ISODate("2012-08-30T15:52:11Z"),
            "lastHeartbeat" : ISODate("2012-08-30T15:52:40Z"),
            "pingMs" : 0,
            "errmsg" : "syncing to: mongo22.example.com:20022"
        },
        {
            "_id" : 1,
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 57,
            "lastHeartbeat" : ISODate("2012-08-30T15:52:40Z"),
            "pingMs" : 0
        },
        {
            "_id" : 2,
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 57,
            "optime" : Timestamp(1346341931000, 3),
            "optimeDate" : ISODate("2012-08-30T15:52:11Z"),
            "self" : true
        }
    ],
    "ok" : 1
}


s2:PRIMARY> db.currentOp()
{
    "inprog" : [
        {
            "opid" : 748,
            "active" : true,
            "secs_running" : 2,
            "op" : "getmore",
            "ns" : "local.oplog.rs",
            "query" : {
            },
            "client" : "6.23.38.199:46472",
            "desc" : "conn35",
            "threadId" : "0x7fad895e7700",
            "connectionId" : 35,
            "waitingForLock" : false,
            "numYields" : 0,
            "lockStats" : {
                "timeLockedMicros" : {
                    "r" : NumberLong(97),
                    "w" : NumberLong(0)
                },
                "timeAcquiringMicros" : {
                    "r" : NumberLong(11),
                    "w" : NumberLong(0)
                }
            }
        },
        {
            "opid" : 24,
            "active" : false,
            "op" : "none",
            "ns" : "",
            "query" : {
            },
            "desc" : "repl writer worker 1",
            "threadId" : "0x7fddb837b700",
            "waitingForLock" : false,
            "numYields" : 0,
            "lockStats" : {
                "timeLockedMicros" : {
                    "r" : NumberLong(0),
                    "w" : NumberLong(33)
                },
                "timeAcquiringMicros" : {
                }
            }
        }
    ]
}

The really good-looking part of it is the sharding status :)
mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
{  "_id" : "s1",  "host" : "s1/mongo11.example.com:20011,mongo12.example.com:20012" }
{  "_id" : "s2",  "host" : "s2/mongo21.example.com:20021,mongo22.example.com:20022" }
  databases:
{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
{  "_id" : "example_production",  "partitioned" : true,  "primary" : "s1" }
example_production.people chunks:
s2 1
s1 3078
too many chunks to print, use verbose if you want to force print

It's been about 10 minutes and no further chunk has appeared on s2, but it seems much better.

Thank you very much for your help. I will get back to you if the sharding doesn't keep going.

Dani


On Thursday, August 30, 2012, 17:43:50 UTC+2, Adam C wrote:
Re: Balancing does not work Adam C 8/30/12 9:12 AM
The primaries are the ones responsible for the migrations, at the behest of the mongos.  I would be surprised if the order mattered, but at least the chunks have started moving.  Let us know if it breaks down or has issues again.

Adam
Re: Balancing does not work Geoff L 8/31/12 6:58 PM
I've got a similar problem with my database. I had 2 shards, I added a 3rd, and now the third one won't accept any data. The mongos logs show the same "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress" message that Daniel got. I've tried bouncing just the mongos processes, then both the mongos and mongod processes, but I still get the same message. I also get:

[Balancer] distributed lock 'balancer/dbs3a:27017:1346462580:1804289383' unlocked.

but I think that's just the balancer giving up its lock.

Any advice would be great and greatly appreciated.

Geoff
Re: Balancing does not work Adam C 9/1/12 5:46 AM
Geoff, are you also using 2.2 or are you on the 2.0 branch?

Can you post the output of the information I requested above on your system? In particular, from the mongos:

use config;
db.locks.find();

Thanks,

Adam
Re: Balancing does not work Adam C 9/3/12 3:27 PM
For reference, it seems this is a consequence of this bug:


The upgrade sequence needed to ensure this does not happen is as follows:

1) turn off balancer
2) upgrade all mongos processes
3) upgrade the config databases
4) upgrade each shard (using the procedure for replica sets or standalone, whichever applies)
5) turn on balancer
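(For steps 1 and 5, the balancer flag lives in the config database. One common way to flip it in the 2.0/2.2-era shell is sketched below; verify against the documentation for your version before running it:)

```javascript
// Illustrative only -- run from a mongos shell.
use config
// Step 1: stop the balancer (upsert the settings document)
db.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true);

// ... perform upgrade steps 2-4 ...

// Step 5: turn the balancer back on
db.settings.update({ _id: "balancer" }, { $set: { stopped: false } }, true);
```

After stopping it, wait for any in-flight balance round to finish (check that the "balancer" entry in db.locks is not held) before upgrading.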

Adam
Re: Balancing does not work Geoff L 9/4/12 3:37 PM
The system is running 2.0; the output from db.locks.find() is:

{ "_id" : "balancer", "process" : "dbs3a:27017:1346462580:1804289383", "state" : 0, "ts" : ObjectId("5046807a4bbb07ad987074bb"), "when" : ISODate("2012-09-04T22:28:10.373Z"), "who" : "dbs3a:27017:1346462580:1804289383:Balancer:846930886", "why" : "doing balance round" }
{ "_id" : "cm_prod.video_latest_stats", "process" : "dbs2a:27018:1346463433:367275570", "state" : 0, "ts" : ObjectId("5046808228145fb74ca67041"), "when" : ISODate("2012-09-04T22:28:18.167Z"), "who" : "dbs2a:27018:1346463433:367275570:conn19:96364934", "why" : "migrate-{ _id: \"-OkLQldVbhI\" }" }
{ "_id" : "cm_prod.video_descriptions", "process" : "dbs1a:27017:1343957738:1804289383", "state" : 0, "ts" : ObjectId("5022feeaf7448e409c940963"), "when" : ISODate("2012-08-09T00:06:02.618Z"), "who" : "dbs1a:27017:1343957738:1804289383:conn9530:1058323149", "why" : "drop" }
{ "_id" : "cm_prod.video_raw_stats", "process" : "dbs1a:27018:1346463419:1483632176", "state" : 0, "ts" : ObjectId("5046807843ad3b2149394704"), "when" : ISODate("2012-09-04T22:28:08.185Z"), "who" : "dbs1a:27018:1346463419:1483632176:conn19:2127675310", "why" : "migrate-{ vt: MinKey, h: MinKey }" }
{ "_id" : "cm_prod.profile_raw_stats", "process" : "dbs2a:27018:1346463433:367275570", "state" : 0, "ts" : ObjectId("5046808228145fb74ca67042"), "when" : ISODate("2012-09-04T22:28:18.682Z"), "who" : "dbs2a:27018:1346463433:367275570:conn19:96364934", "why" : "migrate-{ u: MinKey, h: MinKey }" }
{ "_id" : "cm_prod.profile_daily_stats", "process" : "dbs1a:27018:1346463419:1483632176", "state" : 0, "ts" : ObjectId("5046807643ad3b2149394703"), "when" : ISODate("2012-09-04T22:28:06.602Z"), "who" : "dbs1a:27018:1346463419:1483632176:conn19:2127675310", "why" : "migrate-{ u: \"lionsgatelive\", h: 1345939325 }" }
{ "_id" : "cm_prod.channel_engagements", "process" : "dbs1a:27017:1342673255:1804289383", "state" : 0, "ts" : ObjectId("5007927048e98d9778948bb9"), "when" : ISODate("2012-07-19T04:52:00.791Z"), "who" : "dbs1a:27017:1342673255:1804289383:conn2:972605807", "why" : "drop" }
{ "_id" : "cm_prod.channel_video_grouped", "process" : "dbs1a:27017:1343540831:1804289383", "state" : 0, "ts" : ObjectId("501adff96dac5fc19aab36e4"), "when" : ISODate("2012-08-02T20:15:53.401Z"), "who" : "dbs1a:27017:1343540831:1804289383:conn1093:258764218", "why" : "drop" }

Thanks,

Geoff
Re: Balancing does not work David Hows 9/16/12 10:30 PM
Hi Geoff,

Were you able to update to 2.2 following Adam's steps?

Did this resolve your issue?

Cheers,

David
Re: Balancing does not work Adam C 10/6/12 9:28 AM
For anyone else running into similar issues here on 2.2.0, the root cause is very likely SERVER-7003, which will be fixed in 2.2.1.

There is some evidence that the bad state on the target shard can be eventually resolved and restarts may help clear it out, but the fix in 2.2.1 is the real cure here.

Adam.
Re: Balancing does not work Vasyl 2/9/13 1:39 PM
Thanks Adam, this answers my questions about the following:

<code>
Sat Feb  9 04:26:19 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: replicaSet2 to: replicaSet1 chunk:  min: { User: MinKey } max: { User: MinKey }
Sat Feb  9 04:26:59 [Balancer] moveChunk result: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 }
Sat Feb  9 04:27:01 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: replicaSet2 to: replicaSet1 chunk:  min: { User: MinKey } max: { User: MinKey }
Sat Feb  9 04:27:36 [Balancer] moveChunk result: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 }
Sat Feb  9 04:27:38 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: replicaSet2 to: replicaSet1 chunk:  min: { UserID: MinKey } max: { UserID: MinKey }
Sat Feb  9 04:28:16 [Balancer] moveChunk result: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 }
Sat Feb  9 04:28:22 [Balancer] balancer move failed: { cause: { errmsg: "migrate already in progress", ok: 0.0 }, errmsg: "moveChunk failed to engage TO-shard in the data transfer: migrate already in progress", ok: 0.0 } from: replicaSet2 to: replicaSet1 chunk:  min: { OutgoingMessagesID: MinKey } max: { OutgoingMessagesID: MinKey }</code>

Also, we recently got:

DBException in process: socket exception [SEND_ERROR] for 127.0.0.1:41725 (bug https://jira.mongodb.org/browse/SERVER-7008)

which breaks the mongos connection permanently, with the cursor timing out on the application side (timeout: 30000, time left: 0:0, status: 0).

The full log shows that the socket exception [SEND_ERROR] for localhost occurs once the ChunkManager with autosplitting kicks in:

<code>Sat Feb  9 09:41:49 [conn7]  authenticate db: que_db { authenticate: 1, user: "user", nonce: "xxxx", key: "xxx" }
Sat Feb  9 09:41:49 [conn7] resetting shard version of que_db.exp on m1.10001, version is zero
Sat Feb  9 11:01:38 [conn7] SyncClusterConnection connecting to [conf1:10003]
Sat Feb  9 11:01:38 [conn7] SyncClusterConnection connecting to [conf2:10003]
Sat Feb  9 11:01:39 [conn7] SyncClusterConnection connecting to [conf3:10003]
Sat Feb  9 11:01:41 [conn7] ChunkManager: time to load chunks for que_db.exp: 1ms sequenceNumber: 9 version: 1|40||50f5369065d3598b74ff2841 based on: 1|38||50f5369065d3598b74ff2841
Sat Feb  9 11:01:41 [conn7] autosplitted que_db.exp shard: ns:que_db.exp at: replicaSet2:replicaSet2/m2:10001,m2rpl:10001 lastmod: 1|24||000000000000000000000000 min: { UserID: 21596 } max: { UserID: 21744 } on: { UserID: 21608 } (splitThreshold 67108864)
Sat Feb  9 11:02:36 [conn7] DBException in process: socket exception [SEND_ERROR] for 127.0.0.1:41725
Sat Feb  9 11:02:36 [conn7] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:41725]
</code>

Thanks
Vasyl

On Saturday, October 6, 2012, 19:28:43 UTC+3, Adam C wrote: