membase 1.7.1 nodes going up and down

fred

Sep 19, 2011, 12:25:15
to membase
Hello,

I'm using Membase Server 1.7.1 on Linux, installed on a 3-node ring
(replica of 1). When I perform writes to the ring (in this case they
are increments), the ring state keeps changing: within the admin UI
of each node, other nodes keep being reported as down
and then come back up. The nodes are all within the same network. I
am trying to move forward with using Membase but this is blocking me
from doing so right now. What could be causing this?

Thanks!

Chad Kouse

Sep 19, 2011, 12:54:00
to mem...@googlegroups.com
I've seen this happen when nodes are under a heavy load.  How is your CPU load on those nodes?
--chad

Perry Krug

Sep 19, 2011, 13:10:28
to mem...@googlegroups.com
As the nodes are going up and down according to our UI, is the application actually impacted or is this isolated to the Membase UI?

Perry Krug
Solutions Architect
direct: 831-824-4123
email: pe...@couchbase.com

Aliaksey Kandratsenka

Sep 19, 2011, 13:18:10
to mem...@googlegroups.com
On Mon, Sep 19, 2011 at 7:54 PM, Chad Kouse <chad....@gmail.com> wrote:
> I've seen this happen when nodes are under a heavy load.  How is your CPU load on those nodes?
> --chad

More likely it is a huge IO load. I've seen this cause the kernel to evict executable pages of memcached and the Erlang VM. You can verify that by watching major page faults in top. That number should not be increasing for a running memcached or beam.smp (or beam if your machine is a uniprocessor).
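
For anyone who wants to track this over time rather than watch top, here is a minimal Python sketch. It is not part of Membase: it assumes a Linux node with /proc mounted and a kernel that exposes /proc/<pid>/comm, and the function names (`parse_majflt`, `find_pids`, `watch`) are made up for illustration. It reads the cumulative major page fault counter (field 12 of /proc/<pid>/stat) for memcached and the Erlang VM and prints how much it grew between samples.

```python
import os
import time


def parse_majflt(stat_line):
    """Return the cumulative major page fault count from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so cut after the LAST ')' before splitting.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 12 (majflt) sits at index 9.
    return int(rest[9])


def find_pids(names):
    """Map pid -> command name for running processes whose name is in `names`."""
    found = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if name in names:
            found[int(entry)] = name
    return found


def read_majflt(pid):
    with open(f"/proc/{pid}/stat") as f:
        return parse_majflt(f.read())


def watch(names=("memcached", "beam.smp", "beam"), interval=5.0, samples=12):
    """Print the growth in major faults per process between samples."""
    last = {}
    for pid, name in find_pids(set(names)).items():
        try:
            last[pid] = (name, read_majflt(pid))
        except OSError:
            pass  # process went away before the first reading
    for _ in range(samples):
        time.sleep(interval)
        for pid, (name, before) in list(last.items()):
            try:
                now = read_majflt(pid)
            except OSError:
                del last[pid]
                continue
            print(f"{name} pid={pid} majflt +{now - before} (total {now})")
            last[pid] = (name, now)
```

Calling `watch()` on each node while the increments are running should show a steady `+0` if executable pages are not being evicted.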

Chad Kouse

Sep 19, 2011, 13:27:00
to mem...@googlegroups.com
Yes you're right-- huge IO contention.  This often presents itself as a high CPU load on my nodes as well.
--chad

fred

Sep 20, 2011, 13:34:32
to membase
Hi Perry,

The application seems to be running fine (no errors at least from the
client writing to membase) but I'll need to do some data integrity
checks. Is the UI not a good source of ring state? I do notice that
the UI event log has a slew of events reporting members going down and
back up. I'm guessing these events are from the actual ring
coordination happening in the erlang bits. Somewhat related I also
noticed total key counts for buckets sometimes decreasing and then
going back up (maybe because of the other nodes being detected as
down). Overall still a mystery to me.

Thanks,
-Robert

On Sep 19, 10:10 am, Perry Krug <pe...@couchbase.com> wrote:
> As the nodes are going up and down according to our UI, is the application
> actually impacted or is this isolated to the Membase UI?
>
> Perry Krug
> Solutions Architect
> direct: 831-824-4123
> email: pe...@couchbase.com
>
> On Mon, Sep 19, 2011 at 9:54 AM, Chad Kouse <chad.ko...@gmail.com> wrote:
> > I've seen this happen when nodes are under a heavy load.  How is your CPU
> > load on those nodes?
> > --chad
>

fred

Sep 20, 2011, 13:36:28
to membase
Thanks for the tips everyone, but CPU has been under 10% for each
node. Page faults have not been increasing either.

On Sep 19, 10:27 am, Chad Kouse <chad.ko...@gmail.com> wrote:
> Yes you're right-- huge IO contention.  This often presents itself as a high
> CPU load on my nodes as well.
> --chad
>
> On Mon, Sep 19, 2011 at 1:18 PM, Aliaksey Kandratsenka <alkondrate...@gmail.com> wrote:
>
> > On Mon, Sep 19, 2011 at 7:54 PM, Chad Kouse <chad.ko...@gmail.com> wrote:
>
> >> I've seen this happen when nodes are under a heavy load.  How is your CPU
> >> load on those nodes?
> >> --chad
>
> > More likely is huge IO load. I've seen this to cause kernel to throw out
> > executable pages of memcached and erlang VM. You can verify that by
> > observing major page faults via top. This number should not be increasing
> > for running memcached and beam.smp (or beam in case your machine is
> > uniprocessor).
>

Perry Krug

Sep 20, 2011, 14:39:22
to mem...@googlegroups.com
Is this running in Amazon or on your own hardware?

Yes, the UI is a valid place for looking at this data, but it's only one perspective.  If the nodes are unable to communicate with each other, that doesn't necessarily mean that your clients are also unable to (hence my question about your application).  We've seen a few cases where some delay in the cluster management side of things can cause this, but does not impact the actual functioning of the software.  While we certainly want to resolve that, the process of diagnosis changes from "why is this server malfunctioning" (which it's not) to "why is the communication being lost between these two".

Alk, what info would be useful to look at to see why Erlang might be losing these connections?

Perry Krug
Solutions Architect
direct: 831-824-4123
email: pe...@couchbase.com

Aliaksey Kandratsenka

Sep 20, 2011, 15:02:48
to mem...@googlegroups.com
On Tue, Sep 20, 2011 at 9:39 PM, Perry Krug <pe...@couchbase.com> wrote:
> Is this running in Amazon or on your own hardware?
>
> Yes, the UI is a valid place for looking at this data, but it's only one perspective.  If the nodes are unable to communicate with each other, that doesn't necessarily mean that your clients are also unable to (hence my question about your application).  We've seen a few cases where some delay in the cluster management side of things can cause this, but does not impact the actual functioning of the software.  While we certainly want to resolve that, the process of diagnosis changes from "why is this server malfunctioning" (which it's not) to "why is the communication being lost between these two".
>
> Alk, what info would be useful to look at to see why Erlang might be losing these connections?

Very weird stuff. Just in case, let's grab the iptables-save output, plus information about traffic, which can be gathered by sampling ifconfig output. I'd also like to get queue sizes, but I'm not aware of an out-of-the-box way to do that.
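
One possible way to do the traffic sampling, sketched in Python. Note that this swaps in /proc/net/dev for ifconfig: on Linux that file carries per-interface byte, error, and drop counters of the kind ifconfig reports, and it is easier to parse. The function names (`parse_net_dev`, `diff`, `sample`) are illustrative only, and the sketch does not cover queue sizes.

```python
import time


def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, data = line.partition(":")
        if not sep:
            continue
        fields = data.split()
        if len(fields) < 16:
            continue
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_bytes": int(fields[8]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


def diff(before, after):
    """Per-interface counter increase between two samples."""
    return {
        iface: {key: after[iface][key] - before[iface][key] for key in after[iface]}
        for iface in after
        if iface in before
    }


def sample(interval=1.0, count=10, path="/proc/net/dev"):
    """Print the counter deltas for every interface once per interval."""
    with open(path) as f:
        prev = parse_net_dev(f.read())
    for _ in range(count):
        time.sleep(interval)
        with open(path) as f:
            cur = parse_net_dev(f.read())
        for iface, delta in diff(prev, cur).items():
            print(iface, delta)
        prev = cur
```

Running `sample()` on each node during the increments and saving the output gives a per-second view of throughput, errors, and drops that can be attached alongside the iptables-save output.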

fred

Sep 21, 2011, 22:50:08
to membase
Hi Aliaksey,

Is there a way for me to remsh into the membase erlang node? Wanted
to run etop to get more insight. I couldn't find a way.

How should I send you the iptables-save and ifconfig output?

Thanks!

On Sep 20, 12:02 pm, Aliaksey Kandratsenka <alkondrate...@gmail.com> wrote:
> On Tue, Sep 20, 2011 at 9:39 PM, Perry Krug <pe...@couchbase.com> wrote:
> > Is this running in Amazon or on your own hardware?
>
> > Yes, the UI is a valid place for looking at this data, but it's only one
> > perspective.  If the nodes are unable to communicate with each other, that
> > doesn't necessarily mean that your clients are also unable to (hence my
> > question about your application).  We've seen a few cases where some delay
> > in the cluster management side of things can cause this, but does not impact
> > the actual functioning of the software.  While we certainly want to resolve
> > that, the process of diagnosis changes from "why is this server
> > malfunctioning" (which it's not) to "why is the communication being lost
> > between these two".
>
> > *Alk*, what info would be useful to look at to see why Erlang might be
> > losing these connections?
>
> Very weird stuff. Just in case lets grab iptables-save output. And
> information about traffic. Which can be done by sampling ifconfig output.
> I'd like to get info for queue sizes, but not aware of out of the box way.
>
> > Perry Krug
> > Solutions Architect
> > direct: 831-824-4123
> > email: pe...@couchbase.com

fred

Sep 21, 2011, 22:51:34
to membase
This is our own hardware.

On Sep 20, 11:39 am, Perry Krug <pe...@couchbase.com> wrote:
> Is this running in Amazon or on your own hardware?
>
> Yes, the UI is a valid place for looking at this data, but it's only one
> perspective.  If the nodes are unable to communicate with each other, that
> doesn't necessarily mean that your clients are also unable to (hence my
> question about your application).  We've seen a few cases where some delay
> in the cluster management side of things can cause this, but does not impact
> the actual functioning of the software.  While we certainly want to resolve
> that, the process of diagnosis changes from "why is this server
> malfunctioning" (which it's not) to "why is the communication being lost
> between these two".
>
> *Alk*, what info would be useful to look at to see why Erlang might be
> losing these connections?
>
> Perry Krug
> Solutions Architect

Matt Ingenthron

Sep 21, 2011, 23:06:42
to mem...@googlegroups.com
The otpcookie is available from the REST interface if you authenticate with the admin password. You can use that to connect.
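
A rough Python sketch of that approach, under stated assumptions: it assumes the cookie shows up as an `otpCookie` field next to `otpNode` in each entry of the `nodes` list returned by `/pools/default` on the admin port (8091) when called with admin credentials. Those field names are an assumption and should be checked against the actual response from your version. The local node name `debug@127.0.0.1` is a placeholder. The `erl` flags used (`-name`, `-setcookie`, `-remsh`) are standard Erlang options.

```python
import base64
import json
import urllib.request


def fetch_pool_info(host, user, password, port=8091):
    """GET /pools/default with HTTP basic auth and return the parsed JSON."""
    request = urllib.request.Request(f"http://{host}:{port}/pools/default")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def otp_nodes(pool_info):
    """Return (otpNode, otpCookie) pairs for nodes that expose a cookie."""
    # ASSUMPTION: the field names "nodes", "otpNode" and "otpCookie" are not
    # confirmed by this thread; verify them against your server's response.
    return [
        (node["otpNode"], node["otpCookie"])
        for node in pool_info.get("nodes", [])
        if "otpNode" in node and "otpCookie" in node
    ]


def remsh_command(target_node, cookie, local_name="debug@127.0.0.1"):
    """Build an erl command line that opens a remote shell on target_node."""
    # local_name is a hypothetical placeholder; when connecting from another
    # machine it needs a host part that the target node can reach.
    return ["erl", "-name", local_name, "-setcookie", cookie, "-remsh", target_node]
```

For example, `otp_nodes(fetch_pool_info("10.0.0.1", "Administrator", "secret"))` would list the node names and cookies (host and credentials here are made up), and the list returned by `remsh_command` can be passed to `subprocess.call` or pasted into a terminal. Once attached, etop can be started from that shell.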

fred <lave...@gmail.com> wrote:


Hi Aliaksey,

Is there a way for me to remsh into the membase erlang node? Wanted
to run etop to get more insight. I couldn't find a way.

How should I send you the iptables and ifconfig save output?

Thanks!

On Sep 20, 12:02 pm, Aliaksey Kandratsenka <alkondrate...@gmail.com> wrote:
> On Tue, Sep 20, 2011 at 9:39 PM, Perry Krug <pe...@couchbase.com> wrote:
> > Is this running in Amazon or on your own hardware?
>
> > Yes, the UI is a valid place for looking at this data, but it's only one
> > perspective. If the nodes are unable to communicate with each other, that
> > doesn't necessarily mean that your clients are also unable to (hence my
> > question about your application). We've seen a few cases where some delay
> > in the cluster management side of things can cause this, but does not impact
> > the actual functioning of the software. While we certainly want to resolve
> > that, the process of diagnosis changes from "why is this server
> > malfunctioning" (which it's not) to "why is the communication being lost
> > between these two".
>
> > *Alk*, what info would be useful to look at to see why Erlang might be
> > losing these connections?
>

> Very weird stuff. Just in case lets grab iptables-save output. And
> information about traffic. Which can be done by sampling ifconfig output.
> I'd like to get info for queue sizes, but not aware of out of the box way.

Aliaksey Kandratsenka

Sep 22, 2011, 02:03:14
to mem...@googlegroups.com
On Thu, Sep 22, 2011 at 5:50 AM, fred <lave...@gmail.com> wrote:
> How should I send you the iptables and ifconfig save output?

File a bug on jira.couchbase.com and attach them, if possible.
 