Message from discussion
network failure resiliency
Date: Tue, 17 Jan 2012 11:21:37 -0800 (PST)
Subject: Re: network failure resiliency
From: Grant <gr...@brewster.com>
To: elasticsearch <elasticsearch@googlegroups.com>
As an aside: after talking with our provider, it turns out that while all
our nodes are on different physical hosts, six of the eight sit in the same
huddle, so they share a switch. My suspicion is that the switch was either
rebooted or flapping.
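
For reference, here is a rough sketch of the quorum arithmetic behind the discovery.zen.minimum_master_nodes suggestion quoted further down. It assumes all eight nodes are master-eligible and uses the usual majority rule (n / 2 + 1); the partition sizes are hypothetical, and a failing switch may not split the cluster this cleanly. The function names are made up for illustration and are not part of Elasticsearch.

```python
def quorum(master_eligible: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def sides_able_to_elect(partition_sizes, minimum_master_nodes):
    """Return the partition sizes that still meet minimum_master_nodes."""
    return [size for size in partition_sizes if size >= minimum_master_nodes]


NODES = 8  # assumption: all eight nodes are master-eligible

# Majority needed so that at most one side of any split can elect a master.
print("quorum for", NODES, "nodes:", quorum(NODES))

# Hypothetical 6/2 split with the current setting of 3.
print("6/2 split, minimum 3:", sides_able_to_elect([6, 2], 3))

# Hypothetical 4/4 split: with 3, both halves qualify (split brain is possible).
print("4/4 split, minimum 3:", sides_able_to_elect([4, 4], 3))

# Same split with the majority value: neither half qualifies.
print("4/4 split, minimum quorum:", sides_able_to_elect([4, 4], quorum(NODES)))
```

Under these assumptions a setting of 3 still lets two sides of an even split each elect a master, whereas the majority value would not.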
On Jan 17, 12:58 pm, Grant <gr...@brewster.com> wrote:
> I still have logs if you'd be interested in having a look. Let me grab
> them...
>
> On Jan 17, 12:28 pm, Shay Banon <kim...@gmail.com> wrote:
>
> > Then next time it happens, can you dropbox the logs of the nodes?
>
> > On Tue, Jan 17, 2012 at 2:29 PM, Grant <gr...@brewster.com> wrote:
> > > Hi Shay!
>
> > > Believe it or not, we already run with minimum master nodes set to
> > > 3...
>
> > > On Jan 17, 4:51 am, Shay Banon <kim...@gmail.com> wrote:
> > > > > I suggest you set discovery.zen.minimum_master_nodes to a higher value,
> > > > > in your case something like 2 or 3. Then, if a node loses connection to
> > > > > other nodes, it will not "form its own cluster", but will try to rejoin
> > > > > and form a cluster with that minimum specified.
>
> > > > On Mon, Jan 16, 2012 at 10:16 PM, Grant <gr...@brewster.com> wrote:
> > > > > > We're using unicast now (Rackspace doesn't allow multicast traffic).
>
> > > > > > Here's a sample of what's in the logs during the issues. This kind of
> > > > > > thing was streaming pretty much continuously:
>
> > > > > [2012-01-16 02:52:41,711][WARN ][indices.cluster =A0 =A0 =A0 =A0 =
=A0] [prod-es-
> > > > > r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkY=
ASSg-
> > > > > TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started,=
but
> > > > > shard have not been created, mark shard as failed
> > > > > [2012-01-16 02:52:41,711][WARN ][cluster.action.shard =A0 =A0 ] [=
prod-es-
> > > > > r03] sending failed shard for [contact_documents-527859-0][0],
> > > > > node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [pr=
od-es-
> > > > > r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked sh=
ard
> > > > > as started, but shard have not been created, mark shard as failed=
]
> > > > > [2012-01-16 02:52:41,880][WARN ][indices.cluster =A0 =A0 =A0 =A0 =
=A0] [prod-es-
> > > > > r03] [contact_documents-194054-1322678627][0] master [[prod-es-r0=
6]
> > > > > [IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard=
as
> > > > > started, but shard have not been created, mark shard as failed
> > > > > [2012-01-16 02:52:41,880][WARN ][cluster.action.shard =A0 =A0 ] [=
prod-es-
> > > > > r03] sending failed shard for [contact_documents-194054-132267862=
7]
> > > > > [0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [maste=
r
> > > > > [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]
> > > > > marked shard as started, but shard have not been created, mark sh=
ard
> > > > > as failed]
> > > > > [2012-01-16 02:52:41,894][WARN ][indices.cluster =A0 =A0 =A0 =A0 =
=A0] [prod-es-
> > > > > r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkY=
ASSg-
> > > > > TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started,=
but
> > > > > shard have not been created, mark shard as failed
> > > > > [2012-01-16 02:52:41,894][WARN ][cluster.action.shard =A0 =A0 ] [=
prod-es-
> > > > > r03] sending failed shard for [contact_documents-527859-0][0],
> > > > > node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [pr=
od-es-
> > > > > r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked sh=
ard
> > > > > as started, but shard have not been created, mark shard as failed=
]
>
> > > > > > On Jan 16, 2:57 pm, Ævar Arnfjörð Bjarmason <ava...@gmail.com> wrote:
> > > > > > > You might want to try switching from multicast to unicast just to
> > > > > > > eliminate a variable.
>
> > > > > > Some networks don't treat multicast traffic very well.
>
> > > > > > > It's also useful to look at the logs for the ES nodes during these