mongodb moveChunk fails, the primary waits for the secondary!! HELP!!


yyya...@gmail.com

Mar 6, 2014, 8:02:29 AM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
version: 2.2.0
shards: 32
replica set members per shard: 4


What I am doing is pre-splitting an empty collection.
When I split a chunk and then move it to a shard, the mongo shell hangs.
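For the record, the split-and-move commands look roughly like this (run against a mongos; the namespace and shard key come from the logs below, the split point shown is the chunk boundary from those logs, and the target shard name is just a placeholder):

    // pre-split an empty collection and move one of the resulting chunks to another shard
    sh.splitAt("archive.archive", { uid: -2130706432 });
    sh.moveChunk("archive.archive", { uid: -2130706432 }, "shard1_1");   // "shard1_1": placeholder shard name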



On the TO (recipient) shard, I can see this in the log:
Thu Mar  6 20:14:50 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 53184087:5

On the FROM (donor) shard, I can see this in the log:
Thu Mar  6 18:54:37 [conn4773458] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0


About the TO-shard log above, I looked up the optime it is waiting for in the oplog:
shard1_1:PRIMARY> db.oplog.rs.find({ts:Timestamp(1394098311000, 5)}).sort({$natural:-1}).limit(1)
{ "ts" : Timestamp(1394098311000, 5), "h" : NumberLong("-1059036026990639681"), "op" : "i", "ns" : "archive.system.indexes", "o" : { "v" : 1, "key" : { "uid" : 1, "cate" : 1 }, "ns" : "archive.archive", "name" : "uid_1_cate_1" } }

However, when I check its secondaries:
shard1_1:PRIMARY> db.printSlaveReplicationInfo()
         syncedTo: Thu Mar 06 2014 20:18:08 GMT+0800 (CST)
                 = 3 secs ago (0hrs)
source:   10.1.2.12:13003
         syncedTo: Thu Mar 06 2014 20:18:08 GMT+0800 (CST)
                 = 3 secs ago (0hrs)
source:   192.168.115.223:13004
         syncedTo: Thu Mar 06 2014 17:31:51 GMT+0800 (CST)
                 = 9980 secs ago (2.77hrs)

Yes, 192.168.115.223:13004 is delayed by 2.77 hours, but that is intentional: I have always configured it as a delayed member with a 7200-second delay. And I have split and moved chunks many times over the past year without this problem.
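For reference, the delayed member is configured roughly like this in a 2.2-era replica set (the member index here is illustrative, not the real one):

    cfg = rs.conf()
    cfg.members[3].priority = 0        // a delayed member must not be electable
    cfg.members[3].hidden = true       // usually also hidden from clients
    cfg.members[3].slaveDelay = 7200   // stay 7200 seconds (2 hours) behind the primary
    rs.reconfig(cfg)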




The moveChunk finally gave up after 10 hours, and in the end the chunk was never moved to the TO shard~~~ why??


yyya...@gmail.com

Mar 6, 2014, 8:08:22 AM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
From the 2.2.0 source code I can see that the migration is hanging here:

d_migrate.cc:
            {
                // pause to wait for replication
                // this will prevent us from going into critical section until we're ready
                Timer t;
                while ( t.minutes() < 600 ) {
                    if ( flushPendingWrites( lastOpApplied ) )
                        break;
                    sleepsecs(1);
                }
            }
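So the recipient retries flushPendingWrites(lastOpApplied) once a second for up to 600 minutes (10 hours), which matches how long my moveChunk ran before giving up. A rough way to see how far each member's replication has reached while it is stuck (just a sketch; compare the printed optimes with the one in the warning by eye):

    // run against the TO shard's primary; field names are as printed by a 2.2-era shell
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  optime: " + tojson(m.optime));
    });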





On Thursday, March 6, 2014 at 9:02:29 PM UTC+8, yyya...@gmail.com wrote:

Linda Qin

Mar 6, 2014, 6:31:28 PM3/6/14
to mongod...@googlegroups.com, yaoyi...@126.com
Hi,

Could you please upgrade from MongoDB version 2.2.0 to the latest 2.2 release (currently 2.2.7) or the latest 2.4 release (2.4.9) and try moving chunks again? There are many bug fixes between 2.2.0 and 2.2.7, including SERVER-5351, which could cause the issue you are seeing.

To upgrade a sharded cluster from MongoDB version 2.2 to 2.4, you can follow the steps in the following document to do the upgrade:

If you still have the same issue after upgrade, could you run the following commands on the primary of the target shard and paste the results?
  • db.getSiblingDB("local").slaves.find()
  • rs.conf()
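For example, a rough cross-check along these lines (just a sketch; the exact layout of local.slaves documents varies a little between versions, so treat the field names as assumptions) will flag entries for hosts that are no longer in the replica set configuration:

    // run on the primary of the target shard
    var inConf = {};
    rs.conf().members.forEach(function (m) { inConf[m.host] = true; });
    db.getSiblingDB("local").slaves.find().forEach(function (s) {
        // the host may be stored as s.host or under s.config.host, depending on version
        var host = s.host || (s.config && s.config.host);
        print((inConf[host] ? "ok     " : "stale? ") + tojson(s));
    });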

Thanks,
Linda
Message has been deleted

yyya...@gmail.com

Mar 7, 2014, 1:09:56 AM3/7/14
to mongod...@googlegroups.com, yaoyi...@126.com
Thanks, Linda.

I read SERVER-5351 carefully and found this in one of our earlier logs:
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for 'post.post' { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
Yes: our replica set has 4 members, so 3 secondaries, so it should wait for 3/2 + 1 = 2 slaves, as in that old log, and not the 3 slaves it is waiting for now.
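Assuming the wait count really is derived from the number of tracked slaves as n/2 + 1 with integer division (as in the formula above), a single stale record would explain the difference between the two logs:

    function slavesToWaitFor(n) { return Math.floor(n / 2) + 1; }
    slavesToWaitFor(3)   // 2 -> the old "waiting for 2 slaves" log (3 real secondaries)
    slavesToWaitFor(4)   // 3 -> the current "waiting for 3 slaves" log (3 real secondaries + 1 stale record)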


By the way, the problem shard has had some issues before. Do you know what circumstances may trigger this bug?

On the other hand, is there any way I can fix the problem without upgrading our cluster??

Thank you


On Friday, March 7, 2014 at 7:31:28 AM UTC+8, Linda Qin wrote:

Linda Qin

Mar 16, 2014, 1:28:10 AM3/16/14
to mongod...@googlegroups.com, yaoyi...@126.com
You can check the local.slaves collection on the primary of the target shard to see if there are stale records in it.

The local.slaves collection contains information about each member of the replica set and the latest point in time to which that member has synced. It is just a reflection of an in-memory cache. Since this is a sharded cluster, a stale record can throw off the majority calculation during chunk migration; that majority-calculation issue is fixed in 2.2.5 and 2.4.5 by SERVER-5351.

You can drop this collection (it lives in the local database) and it will be re-generated automatically soon. You may need to drop it and check it again a few times until the stale records have gone away; a sketch of the sequence is below.
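In the shell, the check-and-reset sequence is roughly this (a sketch, run on the primary of the target shard):

    db.getSiblingDB("local").slaves.find()   // look for entries that do not match current members
    db.getSiblingDB("local").slaves.drop()   // drop the cached collection; it is rebuilt automatically
    db.getSiblingDB("local").slaves.find()   // check again after a short while; repeat if a stale entry reappears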

We still highly recommend that you upgrade to the latest version, since there are many bug fixes between 2.2.0 and 2.2.7.

Thanks,
Linda