Message from discussion rs member died because other member crashed?

Date: Wed, 26 Sep 2012 12:43:54 -0700 (PDT)
From: Shaun <shaun.ve...@10gen.com>
To: mongodb-user@googlegroups.com
Message-Id: <9cbc49b7-adf7-4988-8c10-9d8fb11f1d3d@googlegroups.com>
In-Reply-To: <2c5cfae9-8501-45cc-b41f-8f15ad5542b0@googlegroups.com>
References: <c03d3521-599b-4e3c-ae2c-873c034db790@googlegroups.com>
 <2c5cfae9-8501-45cc-b41f-8f15ad5542b0@googlegroups.com>
Subject: Re: rs member died because other member crashed?
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_222_26347376.1348688634769"

------=_Part_222_26347376.1348688634769
Content-Type: multipart/alternative; 
	boundary="----=_Part_223_2525290.1348688634769"

------=_Part_223_2525290.1348688634769
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Hi Daniel,

It looks like one of your config servers went down halfway through a chunk 
migration.  When fewer than all 3 config servers are available, the config 
metadata goes read-only.
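
To make that rule concrete, here is a toy model of it in Python. This is not MongoDB code: the names ConfigServer, metadata_writable and commit_chunk_move are made up for illustration, and the only things taken from the thread are the three config host names and the "all 3 must be up" condition. It just shows why a migration commit has nowhere to go once one config server is down.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConfigServer:
    """Hypothetical stand-in for one config server and its reachability."""
    host: str
    reachable: bool


def metadata_writable(servers: List[ConfigServer], required: int = 3) -> bool:
    """Metadata is writable only while all `required` config servers are up."""
    up = sum(1 for s in servers if s.reachable)
    return up >= required


def commit_chunk_move(servers: List[ConfigServer]) -> str:
    """Pretend to commit a chunk migration to the config metadata."""
    if not metadata_writable(servers):
        # Reads still work in this model, but the commit cannot be recorded.
        return "read-only: commit cannot be written"
    return "committed"


# Host names as they appear in the quoted log; config3 is the one that failed.
servers = [
    ConfigServer("config1:27019", True),
    ConfigServer("config2:27019", True),
    ConfigServer("config3:27019", False),
]
print(commit_chunk_move(servers))
```

With config3 marked unreachable the commit is refused; flip it to reachable and the same call goes through.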

It does look like something that should be made more robust, or at least be 
more clearly defined.  Could you post all of your logfiles from when this 
happened?

Thanks,
-Shaun

On Wednesday, September 26, 2012 4:11:19 AM UTC-7, Daniel wrote:
>
> This "ERROR: moveChunk commit failed: version is at" just happened again. 
> This time in a 3 member replicaset where 2 members were in status STARTUP. 
> :(
>
>
>
> On Tuesday, September 25, 2012 2:04:45 PM UTC+2, Daniel wrote:
>>
>> Hi,
>>
>> setup: 9 servers, 3 shards with 3 rs members each.  MongoDB 2.2
>>
>> One member (member1) in a set crashed, because of a server crash. This 
>> server also runs 1 of 3 config servers. After that another member (member2) 
>> crashed because it couldn't reach the crashed server? This is the message 
>> on member2:
>>
>> Tue Sep 25 13:34:56 [conn538] DBClientCursor::init call() failed
>>> Tue Sep 25 13:34:56 [conn538] scoped connection to 
>>> config1:27019,config2:27019,config3:27019 not being returned to the pool
>>> Tue Sep 25 13:34:56 [conn538] warning: 13104 
>>> SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: 
>>> transport error: config3:27019 ns: admin.$cmd query: { fsync: 1 } 
>>> config3:27019:{}
>>> Tue Sep 25 13:34:56 [conn538] warning: moveChunk commit outcome ongoing: 
>>> { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: 
>>> "db.coll1-uuid_"38f9dbbe-86ec-444b-9e6a-483eab0f9bb2"_id_ObjectId('50444151e4b0c4a3a8c5cf74')", 
>>> lastmod: Timest$
>>> Tue Sep 25 13:34:57 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:34:59 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:01 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:03 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:05 [rsHealthPoll] couldn't connect to member1:27018: 
>>> couldn't connect to server member1:27018
>>> Tue Sep 25 13:35:06 [conn538] ERROR: moveChunk commit failed: version is 
>>> at907|1||000000000000000000000000 instead of 908|1||50604b9fb961dd917fdc2316
>>> Tue Sep 25 13:35:06 [conn538] ERROR: TERMINATING
>>> Tue Sep 25 13:35:06 dbexit:
>>> Tue Sep 25 13:35:06 [conn538] shutdown: going to close listening 
>>> sockets...
>>> Tue Sep 25 13:35:06 [conn538] closing listening socket: 6
>>> Tue Sep 25 13:35:06 [conn538] closing listening socket: 7
>>> Tue Sep 25 13:35:06 [conn538] shutdown: going to flush diaglog...
>>> Tue Sep 25 13:35:06 [conn538] shutdown: going to close sockets...
>>> Tue Sep 25 13:35:06 [conn538] shutdown: waiting for fs preallocator...
>>> Tue Sep 25 13:35:06 [conn538] shutdown: lock for final commit...
>>> Tue Sep 25 13:35:06 [conn538] shutdown: final commit...
>>> Tue Sep 25 13:35:06 [conn1] end connection member2_IP:41925 (21 
>>> connections now open)
>>> Tue Sep 25 13:35:06 [initandlisten] now exiting
>>> Tue Sep 25 13:35:06 dbexit: ; exiting immediately
>>
>>
>> This is really weird, because the redundancy of 3 servers should provide 
>> some kind of failover, right? But if one member drags down another member, 
>> then that's really ugly.
>>
>> Is this a bug?
>>
>> Thanks & regards
>>
>> Daniel 
>>
>
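
When comparing log files from several members, it can help to pull the two chunk versions out of the "moveChunk commit failed" line in the quoted log automatically. The sketch below is only a log-parsing helper, not anything from MongoDB itself. It assumes the versions are printed as major|minor||epoch, which is my reading of the line above; ChunkVersion and parse_commit_failure are hypothetical names.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ChunkVersion(NamedTuple):
    """Assumed layout of a version printed as major|minor||epoch."""
    major: int
    minor: int
    epoch: str


# Digits, "|", digits, "||", then a hex epoch string.
VERSION_RE = re.compile(r"(\d+)\|(\d+)\|\|([0-9a-f]+)")


def parse_commit_failure(line: str) -> Optional[Tuple[ChunkVersion, ChunkVersion]]:
    """Return (actual, expected) versions, or None if the line has no such pair."""
    found = VERSION_RE.findall(line)
    if len(found) != 2:
        return None
    actual, expected = (ChunkVersion(int(a), int(b), e) for a, b, e in found)
    return actual, expected


# The failing line from the log, re-joined onto one line.
line = (
    "Tue Sep 25 13:35:06 [conn538] ERROR: moveChunk commit failed: version is "
    "at907|1||000000000000000000000000 instead of 908|1||50604b9fb961dd917fdc2316"
)
actual, expected = parse_commit_failure(line)
print("actual:  ", actual)
print("expected:", expected)
```

For this line the version that was found is one major version behind the expected one and has an all-zero epoch, which is the mismatch that made the shard terminate.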
------=_Part_223_2525290.1348688634769--

------=_Part_222_26347376.1348688634769--