RepSet Two primaries then fail rollback


Anthony Peyron

Jan 21, 2014, 7:55:20 AM
to mongod...@googlegroups.com
Hi all,

We had an issue with MongoDB 2.2.0 in a replica set of three servers.

Two of our secondary servers couldn't start because of a duplicate key error while syncing.

Because we have three servers in total (two secondaries and one primary), the primary could not remain primary once no other replica set member was alive: without a visible majority it stepped down. So in the end we had two mongod instances down and one secondary up.

I've read a previous topic saying that a network issue could result in two primaries being elected transiently (https://groups.google.com/forum/#!topic/mongodb-user/fmMtkz-9KYE).

In my case, I believe the primary (MONGO1) became unreachable for the two secondaries, so one secondary became primary (MONGO3) and the other secondary (MONGO2) started replicating from that newly elected primary (MONGO3).

But it looks like when the network came back for the first primary (MONGO1), the other two mongod instances (MONGO2 and MONGO3) both tried to step back down to secondary and roll back, but failed with a duplicate key error.

Is this a known bug?

Here is the log from one replica set member (MONGO2).


// MONGO3 PRIMARY at that stage

Tue Jan 21 10:49:31 [rsHealthPoll] couldn't connect to MONGO1                                       
Tue Jan 21 10:49:32 [rsHealthPoll] replSet member MONGO1 is now in state UNKNOWN                                         

//MONGO1 UNKNOWN

Tue Jan 21 10:49:34 [rsHealthPoll] replSet member MONGO1 is now in state PRIMARY                                         

//MONGO1 PRIMARY

Tue Jan 21 10:49:34 [rsMgr] replSet info two primaries (transiently)

//MONGO1 AND MONGO3 BOTH PRIMARY 

Tue Jan 21 10:49:34 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: MONGO3  
Tue Jan 21 10:49:34 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: MONGO3     


Tue Jan 21 10:49:48 [rsBackgroundSync] replSet we are ahead of the primary, will try to roll back
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 0
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet ROLLBACK
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 1
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet info rollback our last optime:   Jan 21 10:50:15:a
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet info rollback their last optime: Jan 21 10:49:46:2a
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet info rollback diff in end of log times: 29 seconds
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback found matching events at Jan 21 10:49:08:2
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback findcommonpoint scanned : 984
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet replSet rollback 3 fixup
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 3.5
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 4 n:64
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet minvalid=Jan 21 10:49:47 52de50cb:16
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 4.6
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 4.7
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 5 d:0 u:64
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 6
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback 7
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet rollback done
Tue Jan 21 10:49:48 [rsBackgroundSync] replSet RECOVERING

Tue Jan 21 10:49:48 [rsSync] replSet still syncing, not yet to minValid optime 52de50cb:16
Tue Jan 21 10:49:48 [repl writer worker 6] ERROR: writer worker caught exception: E11000 duplicate key error index: production.table.$_id_  dup key: { : "EDO9wSyAV_1zY5QmDcZVuGA" } on: { ts: Timestamp 1390301410000|9, h: 7393169436717622182, op: "u", ns: "production.table", o2: { _id: "EDO9wSyAV_1zY5QmDcZVuGA"}, $set: { Version: 151 } } }
Tue Jan 21 10:49:48 [repl writer worker 6]   Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f18517aee9a 0x7f1850ac44bd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f18517aee9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f1850ac44bd]
Tue Jan 21 10:49:48 [repl writer worker 6] 

***aborting after fassert() failure


Tue Jan 21 10:49:48 Got signal: 6 (Aborted).

Tue Jan 21 10:49:48 Backtrace:
0xade6e1 0x5582d9 0x7f1850a084c0 0x7f1850a08445 0x7f1850a0bbab 0x802e3e 0x64f77d 0x77d3dd 0x7c3659 0x7f18517aee9a 0x7f1850ac44bd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364c0) [0x7f1850a084c0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f1850a08445]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f1850a0bbab]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f18517aee9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f1850ac44bd]

Thanks,

Ger Hartnett

Feb 12, 2014, 9:05:38 AM
to mongod...@googlegroups.com
There were several replication idempotency bugs in the early 2.2 releases, so the best thing to do would be to upgrade, back up the data from the secondaries that were previously primary, and then resync them from the primary. If you need any data from the window when those secondaries were primary, prior to the rollback, you could stand up a new instance from the backup and retrieve it from there.
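To illustrate the idempotency point, here is a toy sketch in plain Python (this is not MongoDB's code; the function names and documents are made up): after a rollback, a member re-applies oplog entries from the common point onward, so any entry may be applied more than once, and only operations that are safe to re-apply converge to the right state.

```python
# Toy model of oplog application (illustrative only, not MongoDB internals).

def apply_set(doc, fields):
    """A $set-style op: idempotent -- applying it twice gives the same doc."""
    doc.update(fields)
    return doc

def apply_inc(doc, field, n):
    """An $inc-style op as written by a client: NOT idempotent --
    replaying it after a rollback would double-count."""
    doc[field] = doc.get(field, 0) + n
    return doc

doc = {"_id": "doc1", "Version": 150}
apply_set(doc, {"Version": 151})
apply_set(doc, {"Version": 151})   # replayed after rollback: state unchanged
assert doc["Version"] == 151

counter = {"_id": "c"}
apply_inc(counter, "hits", 1)
apply_inc(counter, "hits", 1)      # replayed after rollback: wrong result
assert counter["hits"] == 2        # the op was counted twice
```

As I understand it, this is why the oplog records updates in an idempotent form (e.g. the resulting $set rather than the original $inc); the early 2.2 bugs Ger mentions were cases where re-applying entries broke that guarantee.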

Regards,
/ger.