mongodb moveChunk fails, the primary waits for the secondary!! HELP!!


yyya...@gmail.com

Mar 6, 2014, 8:02:29 AM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
version: 2.2.0
shards: 32
replica set members per shard: 4


What I am doing is pre-splitting an empty collection.
When I split a chunk and then move it to a shard, the mongo shell hangs.
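For the record, the split-and-move commands look roughly like this (run against a mongos; the namespace and shard key come from the logs below, the split point shown is the chunk boundary from those logs, and the target shard name is just a placeholder):

    // pre-split an empty collection and move one of the resulting chunks to another shard
    sh.splitAt("archive.archive", { uid: -2130706432 });
    sh.moveChunk("archive.archive", { uid: -2130706432 }, "shard1_1");   // "shard1_1": placeholder shard name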



On the TO (recipient) shard, I can see this in the log:
Thu Mar  6 20:14:50 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 53184087:5

On the FROM (donor) shard, I can see this in the log:
Thu Mar  6 18:54:37 [conn4773458] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0


About the TO-shard log above, I looked up the optime it is waiting for in the oplog:
shard1_1:PRIMARY> db.oplog.rs.find({ts:Timestamp(1394098311000, 5)}).sort({$natural:-1}).limit(1)
{ "ts" : Timestamp(1394098311000, 5), "h" : NumberLong("-1059036026990639681"), "op" : "i", "ns" : "archive.system.indexes", "o" : { "v" : 1, "key" : { "uid" : 1, "cate" : 1 }, "ns" : "archive.archive", "name" : "uid_1_cate_1" } }

However, when I check its secondaries:
shard1_1:PRIMARY> db.printSlaveReplicationInfo()
         syncedTo: Thu Mar 06 2014 20:18:08 GMT+0800 (CST)
                 = 3 secs ago (0hrs)
source:   10.1.2.12:13003
         syncedTo: Thu Mar 06 2014 20:18:08 GMT+0800 (CST)
                 = 3 secs ago (0hrs)
source:   192.168.115.223:13004
         syncedTo: Thu Mar 06 2014 17:31:51 GMT+0800 (CST)
                 = 9980 secs ago (2.77hrs)

Yes, 192.168.115.223:13004 is delayed by 2.77 hours, but that is intentional: I have always configured it as a delayed member with a 7200-second delay. And I have split and moved chunks many times over the past year without this problem.
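For reference, the delayed member is configured roughly like this in a 2.2-era replica set (the member index here is illustrative, not the real one):

    cfg = rs.conf()
    cfg.members[3].priority = 0        // a delayed member must not be electable
    cfg.members[3].hidden = true       // usually also hidden from clients
    cfg.members[3].slaveDelay = 7200   // stay 7200 seconds (2 hours) behind the primary
    rs.reconfig(cfg)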




The moveChunk finally gave up after 10 hours, and in the end the chunk was never moved to the TO shard~~~ why??


yyya...@gmail.com

Mar 6, 2014, 8:08:22 AM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
From the 2.2.0 source code I can see that the migration is hanging here:

d_migrate.cc:
            {
                // pause to wait for replication
                // this will prevent us from going into critical section until we're ready
                Timer t;
                while ( t.minutes() < 600 ) {
                    if ( flushPendingWrites( lastOpApplied ) )
                        break;
                    sleepsecs(1);
                }
            }
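So the recipient retries flushPendingWrites(lastOpApplied) once a second for up to 600 minutes (10 hours), which matches how long my moveChunk ran before giving up. A rough way to see how far each member's replication has reached while it is stuck (just a sketch; compare the printed optimes with the one in the warning by eye):

    // run against the TO shard's primary; field names are as printed by a 2.2-era shell
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  optime: " + tojson(m.optime));
    });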





On Thursday, March 6, 2014 at 9:02:29 PM UTC+8, yyya...@gmail.com wrote:

Linda Qin

Mar 6, 2014, 6:31:28 PM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
Hi,

Could you please upgrade from MongoDB version 2.2.0 to the latest 2.2 release (currently 2.2.7) or the latest 2.4 release (2.4.9) and try moving chunks again? There are many bug fixes between 2.2.0 and 2.2.7, including SERVER-5351, which could cause the issue you are seeing.

To upgrade a sharded cluster from MongoDB version 2.2 to 2.4, you can follow the steps in the following document to do the upgrade:

If you still have the same issue after upgrade, could you run the following commands on the primary of the target shard and paste the results?
  • db.getSiblingDB("local").slaves.find()
  • rs.conf()
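For example, a rough cross-check along these lines (just a sketch; the exact layout of local.slaves documents varies a little between versions, so treat the field names as assumptions) will flag entries for hosts that are no longer in the replica set configuration:

    // run on the primary of the target shard
    var inConf = {};
    rs.conf().members.forEach(function (m) { inConf[m.host] = true; });
    db.getSiblingDB("local").slaves.find().forEach(function (s) {
        // the host may be stored as s.host or under s.config.host, depending on version
        var host = s.host || (s.config && s.config.host);
        print((inConf[host] ? "ok     " : "stale? ") + tojson(s));
    });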

Thanks,
Linda
Message has been deleted

yyya...@gmail.com

Mar 7, 2014, 1:09:56 AM3/7/14
to mongod...@googlegroups.com, yaoyi...@126.com
Thanks, Linda.

I read SERVER-5351 carefully and found this in one of our earlier logs:
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for 'post.post' { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
Yes: our replica set has 4 members, so 3 secondaries, so it should wait for 3/2 + 1 = 2 slaves, as in that old log, and not the 3 slaves it is waiting for now.
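Assuming the wait count really is derived from the number of tracked slaves as n/2 + 1 with integer division (as in the formula above), a single stale record would explain the difference between the two logs:

    function slavesToWaitFor(n) { return Math.floor(n / 2) + 1; }
    slavesToWaitFor(3)   // 2 -> the old "waiting for 2 slaves" log (3 real secondaries)
    slavesToWaitFor(4)   // 3 -> the current "waiting for 3 slaves" log (3 real secondaries + 1 stale record)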


By the way, the problem shard has had some issues before. Do you know what circumstances may trigger this bug?

On the other hand, is there any way I can fix the problem without upgrading our cluster??

Thank you


On Friday, March 7, 2014 at 7:31:28 AM UTC+8, Linda Qin wrote:

Linda Qin

Mar 16, 2014, 1:28:10 AM3/16/14
to mongod...@googlegroups.com, yaoyi...@126.com
You can check the local.slaves collection on the primary of the target shard to see if there are stale records in it.

The local.slaves collection contains information about each member of the replica set and the latest point in time to which that member has synced. It is just a reflection of an in-memory cache. Since this is a sharded cluster, a stale record can throw off the majority calculation during chunk migration; that majority-calculation issue is fixed in 2.2.5 and 2.4.5 by SERVER-5351.

You can drop this collection (it lives in the local database) and it will be re-generated automatically soon. You may need to drop it and check it again a few times until the stale records have gone away; a sketch of the sequence is below.
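In the shell, the check-and-reset sequence is roughly this (a sketch, run on the primary of the target shard):

    db.getSiblingDB("local").slaves.find()   // look for entries that do not match current members
    db.getSiblingDB("local").slaves.drop()   // drop the cached collection; it is rebuilt automatically
    db.getSiblingDB("local").slaves.find()   // check again after a short while; repeat if a stale entry reappears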

We still highly recommend that you upgrade to the latest version, since there are many bug fixes between 2.2.0 and 2.2.7.

Thanks,
Linda