Re: Crash while performing initial sync (2.2)

345 views
Skip to first unread message

Braden Evans

unread,
Sep 3, 2012, 11:55:36 PM9/3/12
to mongod...@googlegroups.com
2nd attempt has failed in the same way, sync fails with "ERROR: exception: Can no longer connect to initial sync source: 192.168.1.40:27017" and then an endless cascade of these duplicate key assertions. 

I am guessing now there are two issues here, one is that for some reason sync is interrupted and two that it can not resume. 


Log of the 2nd failure:


Mon Sep  3 20:13:56 [repl writer worker 5] repl: couldn't connect to server 192.168.1.40:27017
Mon Sep  3 20:13:56 [repl writer worker 5] ERROR: exception: Can no longer connect to initial sync source: 192.168.1.40:27017 on: { ts: Timestamp 1346717498000|261, h: -4332373531262227304, op: "u", ns: "[collection]", o2: { _id: ObjectId('500e8550003c0b0b380cd9e9') }, o: { $set: { st: 1, la: new Date(1346717775888) } } }
Mon Sep  3 20:13:56 [repl writer worker 5]   Fatal Assertion 16361
0xade6e1 0x802e03 0x6505b6 0x77d3dd 0x7c3659 0x7f39b0e75efc 0x7f39b020759d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x1d6) [0x6505b6]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f39b0e75efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f39b020759d]
Mon Sep  3 20:13:56 [repl writer worker 5]


***aborting after fassert() failure


Mon Sep  3 20:13:56 Got signal: 6 (Aborted).

Mon Sep  3 20:13:56 Backtrace:
0xade6e1 0x5582d9 0x7f39b015a420 0x7f39b015a3a5 0x7f39b015db0b 0x802e3e 0x6505b6 0x77d3dd 0x7c3659 0x7f39b0e75efc 0x7f39b020759d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7f39b015a420]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f39b015a3a5]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f39b015db0b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
 /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x1d6) [0x6505b6]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f39b0e75efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f39b020759d]


On Monday, September 3, 2012 3:35:06 PM UTC-7, Braden Evans wrote:
We just updated our secondary to 2.2 and while performing the inital sync (near the end while replaying the oplog) we get an fassert:

Mon Sep  3 15:00:38 [repl writer worker 7] ERROR: writer worker caught exception: E11000 duplicate key error index: [collection].$_id_  dup key: { : ObjectId('5044fe2d293c0b02a893576b') } on: { ts: Timestamp 1346698527000|3315, h: -2145615260524114526, op: "u", ns: "[collection]", o2: { _id: ObjectId('5044fe2d293c0b02a893576b'), pc: { $size: 35 } }, o: { $push: { pc: { u: ObjectId('0000000000000000002283a9'), c: [ 12, 20, 43, 51, 73, 5, 24, 31, 48, 67, 1, 23, 34, 58, 69, 3, 30, 36, 53, 71, 11, 19, 39, 59, 68 ] } } } }
Mon Sep  3 15:00:38 [repl writer worker 7]   Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f4a859c1efc 0x7f4a84d5359d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f4a859c1efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4a84d5359d]
Mon Sep  3 15:00:38 [repl writer worker 7]

***aborting after fassert() failure

Is there any way to work around this problem? Manually deleting the record (on the secondary) does nothing, I'm not sure how to proceed except resyncing and hoping it doesn't happen again. 

This collection is updated frequently, I wonder if it is this case: https://jira.mongodb.org/browse/SERVER-6671 



William Zola

unread,
Sep 21, 2012, 8:15:39 PM9/21/12
to mongod...@googlegroups.com
Hi Braden!

Thanks for the bug report.  As you already know, this has been reported as Jira Issue SERVER-5961.

The good news is that this bug has been fixed for the upcoming release of MongoDB 2.4.   The workaround is to use a copy of the data files from the master to perform the initial sync.

 -William

William Zola

unread,
Sep 21, 2012, 8:17:38 PM9/21/12
to mongod...@googlegroups.com
Hi Tim!

This problem is different from the one that Braden reported.  If you examine the stack trace, you'll notice that the addresses and symbols are different from the ones in his report.

There should be some additional lines in the log file ahead of these: can you post those as well?

 -William


On Wednesday, September 12, 2012 7:58:53 AM UTC-7, Tim wrote:

Experiencing  the exact same problem can anyone give us a resolution?

0xade6e1 0x802e03 0x5e29f6 0x6538aa 0x65394a 0x653d58 0x7c3659 0x39fec07851 0x39fe8e811d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x46) [0x5e29f6]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x6a) [0x6538aa]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2a) [0x65394a]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0xa8) [0x653d58]
 /usr/bin/mongod() [0x7c3659]
 /lib64/libpthread.so.0() [0x39fec07851]
 /lib64/libc.so.6(clone+0x6d) [0x39fe8e811d]
Wed Sep 12 15:43:35 [rsSync]
***aborting after fassert() failure

Rotem

unread,
Oct 3, 2012, 6:21:25 AM10/3/12
to mongod...@googlegroups.com
So 2.2.0 has a bug that crashes replica set secondaries and it will be fixed only in 2.4 ?
Nice.

William Zola

unread,
Oct 3, 2012, 1:22:07 PM10/3/12
to mongod...@googlegroups.com
Bug fixes regularly get backported to earlier versions.  I wouldn't be surprised if this one does as well.

 -William

Andy O'Neill

unread,
Nov 14, 2012, 9:59:48 PM11/14/12
to mongod...@googlegroups.com


On Wednesday, October 3, 2012 1:22:08 PM UTC-4, William Zola wrote:
Bug fixes regularly get backported to earlier versions.  I wouldn't be surprised if this one does as well.

 -William


Looks like it didn't make it into 2.2.1 because I just ran into this during initial sync (https://gist.github.com/4076327). Are there any useful tips from people who have run into this, or any suggestion of when this might be fixed in a production version? Bear in mind

1) rsyncing the data over before startup is not an option because our data needs to be compacted, which is done by bringing up new servers (much quicker and until now more reliable than compact or repairing the db).
2) I am unwilling to use non-released or development versions of code for a production database.

Is it worth trying a re-sync or will I just hit the same issue again?

As it stands I can't bring up new 2.2 servers and have to remain on 2.0, despite the fact that the bug is probably present on 2.0 anyway. How is just aborting during initial sync helping? It's too bad because from my initial testing 2.2 is way more performant.

Eric Milkie

unread,
Nov 15, 2012, 3:09:10 PM11/15/12
to mongod...@googlegroups.com
Hi Andy,
Sorry about the lack of clarity regarding this bug.  SERVER-5961 was mismarked as "Fixed in 2.3" when in fact it was superseded by two other SERVER tickets mentioned at the bottom of it.  The fixes for those tickets were actually committed to version 2.2.1, but there are some caveats to the fixes which you have personally run into.

It turns out that the fix is not competely live until the primary is running 2.2.1, since it is the primary that is generating the non-idempotent operations.  To upgrade fully to 2.2.1 (or 2.2.2, which is in rc now and will be released in 2 weeks), you will need to sync a secondary on 2.2.1 and then make that the primary while you upgrade the other members of the replica set.
To do so, we have some options:
1- stop writes on the primary while the secondary is cloning databases (after the first clone stage, writes can proceed again)
2- restart all members of the replica set onto 2.2.1 after taking a backup
3- start a secondary using a copy of the datafiles from the primary, rather than allowing initial sync to clone the data

There is also one fix in 2.2.2 that makes this upgrade more likely to succeed, as it restores some of the 2.0.x behavior if a secondary is syncing from a primary of that version.
-Eric

Andy O'Neill

unread,
Nov 15, 2012, 3:24:07 PM11/15/12
to mongod...@googlegroups.com

To do so, we have some options:
1- stop writes on the primary while the secondary is cloning databases (after the first clone stage, writes can proceed again)
2- restart all members of the replica set onto 2.2.1 after taking a backup
3- start a secondary using a copy of the datafiles from the primary, rather than allowing initial sync to clone the data

There is also one fix in 2.2.2 that makes this upgrade more likely to succeed, as it restores some of the 2.0.x behavior if a secondary is syncing from a primary of that version.
-Eric


May thanks Eric. In case anyone cares, we may try option (3) since I'm reluctant to move the primary to 2.2 until we have tested it as a secondary. Otherwise we'll be waiting for 2.2.2 in the hopes that running it as a secondary is more reliable.
Reply all
Reply to author
Forward
0 new messages