Hi,
There is a situation in our MongoDB cluster. In one of the replica sets (say
machines A and B), B went down due to a server failure and required a repair.
Unfortunately, right before B finished repairing, A went down due to a
segfault:
Mon Nov 12 16:37:24 [conn233105274] Uncaught std::exception:
St9bad_alloc, terminating
Mon Nov 12 16:37:24 dbexit:
Mon Nov 12 16:37:24 Backtrace:
0x8ad399 0x8ad970 0x367ee0eb70 0x2279a90
mongod(_ZN5mongo10abruptQuitEi+0x399) [0x8ad399]
mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0x8ad970]
/lib64/libpthread.so.0 [0x367ee0eb70]
[0x2279a90]
We then tried to restart A while B was master. At that point, A went
into the ROLLBACK state, as expected.
In mongostat, we see "UNK" for A and "M" for B.
In the log, we see entries such as:
Mon Nov 12 17:53:18 [replica set sync] replSet info rollback of
renameCollection is slow in this version of mongod
Mon Nov 12 17:53:18 [replica set sync] replSet WARNING ignoring op on
rollback no _id TODO : xs.system.indexes { ts: Timestamp 1351583022000|218,
h: 5274814664110145128, op: "i", ns: "xs.system.indexes", o: { ns:
"xs.tmp.mr.profile_tmp.mrs.profile_1351583022_86493_1139205_inc", key: { 0:
1 }, name: "0_1", v: 0 } }
However, for quite some time now, we have only seen:
Mon Nov 12 18:41:04 [initandlisten] connection accepted from
10.28.120.169:44420 #154
Mon Nov 12 18:41:04 [conn154] end connection 10.28.120.169:44420
Mon Nov 12 18:41:16 [conn153] end connection 10.28.6.91:55265
Mon Nov 12 18:41:16 [initandlisten] connection accepted from
10.28.6.91:55271 #155
and no more ROLLBACK log entries. We have also checked currentOp, and it
shows entries like the one below:
{
"opid" : "rs_c:1265572266",
"active" : false,
"waitingForLock" : false,
"op" : "none",
"ns" : "?xs.profile",
"query" : {
},
"client_s" : "(NONE)",
"desc" : "replica set sync"
},
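In case it helps anyone reproduce the check, here is a minimal Python sketch of how currentOp output like the above could be filtered for the sync thread. The helper names (`find_sync_ops`, `sync_looks_idle`) are hypothetical, and the `inprog` wrapper is an assumption about the shape of the saved output. An idle result only means the thread is not reporting work at that instant; it does not by itself prove the rollback is stuck.

```python
# Sketch only: operates on a currentOp result already loaded as a Python dict.
# Assumption: the entries sit under an "inprog" list.

def find_sync_ops(current_op):
    """Return the entries whose "desc" marks them as the replica set sync thread."""
    return [op for op in current_op.get("inprog", [])
            if op.get("desc") == "replica set sync"]

def sync_looks_idle(current_op):
    """True if a sync entry exists and none is active or waiting for a lock."""
    ops = find_sync_ops(current_op)
    return bool(ops) and all(
        not op.get("active") and not op.get("waitingForLock") for op in ops
    )

# The entry shown above, wrapped in an assumed "inprog" list.
sample = {
    "inprog": [
        {
            "opid": "rs_c:1265572266",
            "active": False,
            "waitingForLock": False,
            "op": "none",
            "ns": "?xs.profile",
            "query": {},
            "client_s": "(NONE)",
            "desc": "replica set sync",
        }
    ]
}

print(sync_looks_idle(sample))
```

Running this against several snapshots taken a few minutes apart would show whether the entry ever becomes active.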
So, is the system still in the process of rolling back, or is it stuck?
*The DB cluster is running v1.8.3.
Thanks.