Message from discussion
Crash while performing initial sync (2.2)
Received: by 10.66.84.168 with SMTP id a8mr639928paz.25.1349284937789;
Wed, 03 Oct 2012 10:22:17 -0700 (PDT)
X-BeenThere: mongodb-user@googlegroups.com
Received: by 10.68.197.163 with SMTP id iv3ls7114698pbc.9.gmail; Wed, 03 Oct
2012 10:22:08 -0700 (PDT)
Received: by 10.68.194.133 with SMTP id hw5mr1479988pbc.17.1349284928583;
Wed, 03 Oct 2012 10:22:08 -0700 (PDT)
Date: Wed, 3 Oct 2012 10:22:07 -0700 (PDT)
From: William Zola <william.z...@10gen.com>
To: mongodb-user@googlegroups.com
Message-Id: <4ed8b44a-0d51-42b7-af6d-40d2c5525920@googlegroups.com>
In-Reply-To: <c87ec043-3340-436c-935a-9a5b6abc8031@googlegroups.com>
References: <09042996-fd83-4369-b3a7-9df02f6a5736@googlegroups.com>
<4b9bc55c-0bfc-44bc-9bee-2fa0cddf15a9@googlegroups.com>
<dfe384da-ed03-41c2-b89c-d0ef451ad2c9@googlegroups.com>
<c87ec043-3340-436c-935a-9a5b6abc8031@googlegroups.com>
Subject: Re: Crash while performing initial sync (2.2)
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_228_16139004.1349284927951"
------=_Part_228_16139004.1349284927951
Content-Type: multipart/alternative;
boundary="----=_Part_229_27769124.1349284927951"
------=_Part_229_27769124.1349284927951
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Bug fixes regularly get backported to earlier versions. I wouldn't be
surprised if this one does as well.
-William
On Wednesday, October 3, 2012 3:21:25 AM UTC-7, Rotem wrote:
>
> So 2.2.0 has a bug that crashes replica set secondaries and it will be
> fixed only in 2.4 ?
> Nice.
>
>
> On Saturday, September 22, 2012 3:15:40 AM UTC+3, William Zola wrote:
>>
>> Hi Braden!
>>
>> Thanks for the bug report. As you already know, this has been reported
>> as Jira Issue SERVER-5961 <https://jira.mongodb.org/browse/SERVER-5961>.
>>
>> The good news is that this bug has been fixed for the upcoming release of
>> MongoDB 2.4. The workaround is to use a copy of the data files from the
>> master to perform the initial sync.
>>
>> -William
>>
>> On Monday, September 3, 2012 8:55:36 PM UTC-7, Braden Evans wrote:
>>>
>>> 2nd attempt has failed in the same way, sync fails with "ERROR:
>>> exception: Can no longer connect to initial sync source:
>>> 192.168.1.40:27017" and then an endless cascade of these duplicate key
>>> assertions.
>>>
>>> I am guessing now there are two issues here, one is that for some reason
>>> sync is interrupted and two that it can not resume.
>>>
>>>
>>> Log of the 2nd failure:
>>>
>>>
>>> Mon Sep 3 20:13:56 [repl writer worker 5] repl: couldn't connect to
>>> server 192.168.1.40:27017
>>> Mon Sep 3 20:13:56 [repl writer worker 5] ERROR: exception: Can no
>>> longer connect to initial sync source: 192.168.1.40:27017 on: { ts:
>>> Timestamp 1346717498000|261, h: -4332373531262227304, op: "u", ns:
>>> "[collection]", o2: { _id: ObjectId('500e8550003c0b0b380cd9e9') }, o: {
>>> $set: { st: 1, la: new Date(1346717775888) } } }
>>> Mon Sep 3 20:13:56 [repl writer worker 5] Fatal Assertion 16361
>>> 0xade6e1 0x802e03 0x6505b6 0x77d3dd 0x7c3659 0x7f39b0e75efc
>>> 0x7f39b020759d
>>> /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
>>> /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
>>> /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x1d6)
>>> [0x6505b6]
>>> /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
>>> /usr/bin/mongod() [0x7c3659]
>>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f39b0e75efc]
>>> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f39b020759d]
>>> Mon Sep 3 20:13:56 [repl writer worker 5]
>>>
>>>
>>> ***aborting after fassert() failure
>>>
>>>
>>> Mon Sep 3 20:13:56 Got signal: 6 (Aborted).
>>>
>>> Mon Sep 3 20:13:56 Backtrace:
>>> 0xade6e1 0x5582d9 0x7f39b015a420 0x7f39b015a3a5 0x7f39b015db0b 0x802e3e
>>> 0x6505b6 0x77d3dd 0x7c3659 0x7f39b0e75efc 0x7f39b020759d
>>> /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
>>> /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
>>> /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7f39b015a420]
>>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f39b015a3a5]
>>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f39b015db0b]
>>> /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
>>> /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x1d6)
>>> [0x6505b6]
>>> /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
>>> /usr/bin/mongod() [0x7c3659]
>>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f39b0e75efc]
>>> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f39b020759d]
>>>
>>>
>>> On Monday, September 3, 2012 3:35:06 PM UTC-7, Braden Evans wrote:
>>>>
>>>> We just updated our secondary to 2.2 and while performing the inital
>>>> sync (near the end while replaying the oplog) we get an fassert:
>>>>
>>>> Mon Sep 3 15:00:38 [repl writer worker 7] ERROR: writer worker caught
>>>> exception: E11000 duplicate key error index: [collection].$_id_ dup key: {
>>>> : ObjectId('5044fe2d293c0b02a893576b') } on: { ts: Timestamp
>>>> 1346698527000|3315, h: -2145615260524114526, op: "u", ns: "[collection]",
>>>> o2: { _id: ObjectId('5044fe2d293c0b02a893576b'), pc: { $size: 35 } }, o: {
>>>> $push: { pc: { u: ObjectId('0000000000000000002283a9'), c: [ 12, 20, 43,
>>>> 51, 73, 5, 24, 31, 48, 67, 1, 23, 34, 58, 69, 3, 30, 36, 53, 71, 11, 19,
>>>> 39, 59, 68 ] } } } }
>>>> Mon Sep 3 15:00:38 [repl writer worker 7] Fatal Assertion 16360
>>>> 0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f4a859c1efc
>>>> 0x7f4a84d5359d
>>>> /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
>>>> /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
>>>> /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d)
>>>> [0x64f77d]
>>>> /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
>>>> /usr/bin/mongod() [0x7c3659]
>>>> /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f4a859c1efc]
>>>> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4a84d5359d]
>>>> Mon Sep 3 15:00:38 [repl writer worker 7]
>>>>
>>>> ***aborting after fassert() failure
>>>>
>>>> Is there any way to work around this problem? Manually deleting the
>>>> record (on the secondary) does nothing, I'm not sure how to proceed except
>>>> resyncing and hoping it doesn't happen again.
>>>>
>>>> This collection is updated frequently, I wonder if it is this case:
>>>> https://jira.mongodb.org/browse/SERVER-6671
>>>>
>>>>
>>>>
>>>>
------=_Part_229_27769124.1349284927951
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Bug fixes regularly get backported to earlier versions. I wouldn't be=
surprised if this one does as well.<br><br> -William <br><br>On Wedne=
sday, October 3, 2012 3:21:25 AM UTC-7, Rotem wrote:<blockquote class=3D"gm=
ail_quote" style=3D"margin: 0;margin-left: 0.8ex;border-left: 1px #ccc soli=
d;padding-left: 1ex;">So 2.2.0 has a bug that crashes replica set secondari=
es and it will be fixed only in 2.4 ?<br>Nice.<br><br><br>On Saturday, Sept=
ember 22, 2012 3:15:40 AM UTC+3, William Zola wrote:<blockquote class=3D"gm=
ail_quote" style=3D"margin:0;margin-left:0.8ex;border-left:1px #ccc solid;p=
adding-left:1ex">Hi Braden!<br><br>Thanks for the bug report. As you =
already know, this has been reported as Jira Issue <a href=3D"https://jira.=
mongodb.org/browse/SERVER-5961" target=3D"_blank">SERVER-5961</a>.<br><br>T=
he good news is that this bug has been fixed for the upcoming release of Mo=
ngoDB 2.4. The workaround is to use a copy of the data files fr=
om the master to perform the initial sync.<br><br> -William<br><br>On =
Monday, September 3, 2012 8:55:36 PM UTC-7, Braden Evans wrote:<blockquote =
class=3D"gmail_quote" style=3D"margin:0;margin-left:0.8ex;border-left:1px #=
ccc solid;padding-left:1ex"><div>2nd attempt has failed in the same way, sy=
nc fails with "ERROR: exception: Can no longer connect to initial sync sour=
ce: <a href=3D"http://192.168.1.40:27017" target=3D"_blank">192.168.1.40:27=
017</a>" and then an endless cascade of these duplicate key assertions.&nbs=
p;<br></div><div><br></div><div>I am guessing now there are two issues here=
, one is that for some reason sync is interrupted and two that it can not r=
esume. </div><div><br></div><div><br></div><div>Log of the 2nd failure=
:</div><div><br></div><div><br></div><div><div>Mon Sep 3 20:13:56 [re=
pl writer worker 5] repl: couldn't connect to server <a href=3D"http://192.=
168.1.40:27017" target=3D"_blank">192.168.1.40:27017</a></div><div>Mon Sep =
3 20:13:56 [repl writer worker 5] ERROR: exception: Can no longer con=
nect to initial sync source: <a href=3D"http://192.168.1.40:27017" target=
=3D"_blank">192.168.1.40:27017</a> on: { ts: Timestamp 1346717498000|261, h=
: -4332373531262227304, op: "u", ns: "[collection]", o2: { _id: ObjectId('<=
wbr>500e8550003c0b0b380cd9e9') }, o: { $set: { st: 1, la: new Date(13467177=
75888) } } }</div><div>Mon Sep 3 20:13:56 [repl writer worker 5] &nbs=
p; Fatal Assertion 16361</div><div>0xade6e1 0x802e03 0x6505b6 0x77d3dd 0x7c=
3659 0x7f39b0e75efc 0x7f39b020759d</div><div> /usr/bin/mongod(_<wbr>ZN=
5mongo15printStackTraceERSo+<wbr>0x21) [0xade6e1]</div><div> /usr/bin/=
mongod(_<wbr>ZN5mongo13fassertFailedEi+<wbr>0xa3) [0x802e03]</div><div>&nbs=
p;/usr/bin/mongod(_<wbr>ZN5mongo7replset21multiInitial<wbr>SyncApplyERKSt6v=
ectorINS_<wbr>7BSONObjESaIS2_EEPNS0_<wbr>8SyncTailE+0x1d6) [0x6505b6]</div>=
<div> /usr/bin/mongod(_<wbr>ZN5mongo10threadpool6Worker4lo<wbr>opEv+0x=
26d) [0x77d3dd]</div><div> /usr/bin/mongod() [0x7c3659]</div><div>&nbs=
p;/lib/x86_64-linux-gnu/<wbr>libpthread.so.0(+0x7efc) [0x7f39b0e75efc]</div=
><div> /lib/x86_64-linux-gnu/libc.<wbr>so.6(clone+0x6d) [0x7f39b020759=
d]</div><div>Mon Sep 3 20:13:56 [repl writer worker 5]</div></div><di=
v><br></div><div><div><br></div><div>***aborting after fassert() failure</d=
iv><div><br></div><div><br></div><div>Mon Sep 3 20:13:56 Got signal: =
6 (Aborted).</div><div><br></div><div>Mon Sep 3 20:13:56 Backtrace:</=
div><div>0xade6e1 0x5582d9 0x7f39b015a420 0x7f39b015a3a5 0x7f39b015db0b 0x8=
02e3e 0x6505b6 0x77d3dd 0x7c3659 0x7f39b0e75efc 0x7f39b020759d</div><div>&n=
bsp;/usr/bin/mongod(_<wbr>ZN5mongo15printStackTraceERSo+<wbr>0x21) [0xade6e=
1]</div><div> /usr/bin/mongod(_<wbr>ZN5mongo10abruptQuitEi+0x399) [0x5=
582d9]</div><div> /lib/x86_64-linux-gnu/libc.<wbr>so.6(+0x36420) [0x7f=
39b015a420]</div><div> /lib/x86_64-linux-gnu/libc.<wbr>so.6(gsignal+0x=
35) [0x7f39b015a3a5]</div><div> /lib/x86_64-linux-gnu/libc.<wbr>so.6(a=
bort+0x17b) [0x7f39b015db0b]</div><div> /usr/bin/mongod(_<wbr>ZN5mongo=
13fassertFailedEi+<wbr>0xde) [0x802e3e]</div><div> /usr/bin/mongod(_<w=
br>ZN5mongo7replset21multiInitial<wbr>SyncApplyERKSt6vectorINS_<wbr>7BSONOb=
jESaIS2_EEPNS0_<wbr>8SyncTailE+0x1d6) [0x6505b6]</div><div> /usr/bin/m=
ongod(_<wbr>ZN5mongo10threadpool6Worker4lo<wbr>opEv+0x26d) [0x77d3dd]</div>=
<div> /usr/bin/mongod() [0x7c3659]</div><div> /lib/x86_64-linux-g=
nu/<wbr>libpthread.so.0(+0x7efc) [0x7f39b0e75efc]</div><div> /lib/x86_=
64-linux-gnu/libc.<wbr>so.6(clone+0x6d) [0x7f39b020759d]</div></div><div><b=
r></div><br>On Monday, September 3, 2012 3:35:06 PM UTC-7, Braden Evans wro=
te:<blockquote class=3D"gmail_quote" style=3D"margin:0;margin-left:0.8ex;bo=
rder-left:1px #ccc solid;padding-left:1ex">We just updated our secondary to=
2.2 and while performing the inital sync (near the end while replaying the=
oplog) we get an fassert:<div><br></div><div><div>Mon Sep 3 15:00:38=
[repl writer worker 7] ERROR: writer worker caught exception: E11000 dupli=
cate key error index: [collection].$_id_ dup key: { : ObjectId('<wbr>=
5044fe2d293c0b02a893576b') } on: { ts: Timestamp 1346698527000|3315, h: -21=
45615260524114526, op: "u", ns: "[collection]", o2: { _id: ObjectId('<wbr>5=
044fe2d293c0b02a893576b'), pc: { $size: 35 } }, o: { $push: { pc: { u: Obje=
ctId('<wbr>0000000000000000002283a9'), c: [ 12, 20, 43, 51, 73, 5, 24, 31, =
48, 67, 1, 23, 34, 58, 69, 3, 30, 36, 53, 71, 11, 19, 39, 59, 68 ] } } } }<=
br></div><div>Mon Sep 3 15:00:38 [repl writer worker 7] Fatal =
Assertion 16360</div><div>0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f=
4a859c1efc 0x7f4a84d5359d</div><div> /usr/bin/mongod(_<wbr>ZN5mongo15p=
rintStackTraceERSo+<wbr>0x21) [0xade6e1]</div><div> /usr/bin/mongod(_<=
wbr>ZN5mongo13fassertFailedEi+<wbr>0xa3) [0x802e03]</div><div> /usr/bi=
n/mongod(_<wbr>ZN5mongo7replset14multiSyncApp<wbr>lyERKSt6vectorINS_<wbr>7B=
SONObjESaIS2_EEPNS0_<wbr>8SyncTailE+0x12d) [0x64f77d]</div><div> /usr/=
bin/mongod(_<wbr>ZN5mongo10threadpool6Worker4lo<wbr>opEv+0x26d) [0x77d3dd]<=
/div><div> /usr/bin/mongod() [0x7c3659]</div><div> /lib/x86_64-li=
nux-gnu/<wbr>libpthread.so.0(+0x7efc) [0x7f4a859c1efc]</div><div> /lib=
/x86_64-linux-gnu/libc.<wbr>so.6(clone+0x6d) [0x7f4a84d5359d]</div><div>Mon=
Sep 3 15:00:38 [repl writer worker 7]</div><div><br></div><div>***ab=
orting after fassert() failure</div><div><br></div><div>Is there any way to=
work around this problem? Manually deleting the record (on the secondary) =
does nothing, I'm not sure how to proceed except resyncing and hoping it do=
esn't happen again. <br></div><div><br></div><div><div>This collection=
is updated frequently, I wonder if it is this case: <a href=3D"https:=
//jira.mongodb.org/browse/SERVER-6671" target=3D"_blank">https://jira.mongo=
db.<wbr>org/browse/SERVER-6671</a> </div><br></div><div><br></div><div=
><br></div></div></blockquote></blockquote></blockquote></blockquote>
------=_Part_229_27769124.1349284927951--
------=_Part_228_16139004.1349284927951--