Re: [percona-group] Percona Xtradb crash after [ERROR] WSREP: exception from gcomm, backend must be restarted

Alex Yurchenko

Oct 9, 2012, 12:47:09 PM
to percona-d...@googlegroups.com
On 2012-10-09 17:11, Abdel Said wrote:
> See the logs especially node-1-mysql-error.log at line 320.
>
> This is not supposed to happen, node 2 and node 3 tried to sync and
> node 1 tried to take over but crashed. Any idea what's going on here?

Hi,

1) Your servers seem to be silently crashing from time to time. I'd
look into the system logs around the times when you have lines such as:

121002 15:03:43 mysqld_safe Number of processes running now: 0

But this is not the cause of the situation you encountered. It was a
result of misconfiguration:

2) node1 seems to have wsrep_cluster_address=gcomm:// so
- every time it crashes, other nodes forget it.
- every time it is restarted it starts a new cluster.

So you've been routinely running 2 disjoint clusters, one consisting of
a single node1 and another consisting of nodes2 and 3. And it was
perfectly fine except that they of course became inconsistent with each
other (but that's another story).

Until one day node3 silently crashed.

Since it is a split-brain situation, node2 could not form a majority and
started to try to reconnect to the last node it saw: node3.

At the same time node3 was automatically restarted by mysqld_safe.
Since it had wsrep_cluster_address=node1 it connected to node1.

And then node2 connected to node3, since it was trying to reconnect.

This way two nodes from different primary components saw each other in
one cluster. And that's what caused an exception, because Galera
detected inconsistency - and stopped operation to prevent data loss. So
it is not a bug, in fact it is a very valuable feature. Now you can
properly decide which data set is more representative - the one from
node1 or the one from node2.

This story of three nodes once again reminds us how automatic recovery
is inherently evil and can punish you any time. Especially if you have
your cluster misconfigured.
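
To illustrate why node2 could not form a majority once node3 was gone, here is a minimal sketch of a strict-majority quorum check. It is a simplification and not Galera's actual code: the function name is made up, all nodes count equally, and cases such as nodes that leave gracefully are ignored.

```python
def has_quorum(previous_members, reachable_members):
    """Return True if the reachable nodes form a strict majority of the
    previous primary component (simplified: every node counts as one)."""
    previous = set(previous_members)
    # Only nodes that belonged to the previous component count towards quorum.
    reachable = set(reachable_members) & previous
    return 2 * len(reachable) > len(previous)


# The two-node cluster from this thread: node3 crashes and node2 is alone.
print(has_quorum({"node2", "node3"}, {"node2"}))

# A properly formed three-node cluster that loses one node.
print(has_quorum({"node1", "node2", "node3"}, {"node1", "node2"}))
```

With only two members, losing either one leaves exactly half, which is not a majority. With three members in one cluster, the remaining two still are.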

Thanks,
Alex

Vadim Tkachenko

Oct 9, 2012, 1:29:28 PM
to percona-d...@googlegroups.com
Abdel,

5.5.24 is known to have crashing issues.
Can you please try 5.5.27 ?

Thanks,
Vadim



On Tue, Oct 9, 2012 at 7:11 AM, Abdel Said <said....@gmail.com> wrote:
> See the logs especially node-1-mysql-error.log at line 320.
>
> This is not supposed to happen, node 2 and node 3 tried to sync and node 1
> tried to take over but crashed. Any idea what's going on here?



--
Vadim Tkachenko, CTO, Percona Inc.
Phone +1-925-400-7377, Skype: vadimtk153
Schedule meeting: http://tungle.me/VadimTkachenko

Looking for Replication with Data Consistency?
Try Percona XtraDB Cluster!

Abdel Said

Oct 24, 2012, 10:20:43 AM
to percona-d...@googlegroups.com
Thanks Alex for your reply. Unfortunately that's the standard Percona configuration. Can you point me to the right configuration?

Alex Yurchenko

Oct 24, 2012, 12:36:31 PM
to percona-d...@googlegroups.com
On 2012-10-24 17:20, Abdel Said wrote:
> Thanks Alex for your reply. Unfortunately that's the standard Percona
> configuration. Can you point me to the right configuration?

You should never leave wsrep_cluster_address=gcomm:// on a running
node.
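
As a rough way to catch this before a restart, one could scan the option file for an empty gcomm:// address. The sketch below is only an illustration: the function name and the sample file contents are hypothetical, and it only looks at the text it is given (it does not follow include directives or command-line options).

```python
import re

# Matches wsrep_cluster_address (or the dashed spelling) and captures its value.
ADDRESS_RE = re.compile(r"^\s*wsrep[_-]cluster[_-]address\s*=\s*(?P<value>\S*)")


def bootstrap_address_lines(cnf_text):
    """Return (line_number, line) pairs where the address is an empty gcomm:// URL."""
    hits = []
    for number, line in enumerate(cnf_text.splitlines(), start=1):
        match = ADDRESS_RE.match(line)  # commented-out lines do not match
        if not match:
            continue
        value = match.group("value").strip("\"'")
        if value == "gcomm://":
            hits.append((number, line.strip()))
    return hits


# Hypothetical option file contents, for illustration only.
sample = """\
[mysqld]
wsrep_cluster_name=example
wsrep_cluster_address=gcomm://
"""

for number, line in bootstrap_address_lines(sample):
    print(f"line {number}: {line} would start a new cluster on restart")
```

Any hit means that the next automatic restart of that node would start a new cluster instead of rejoining the existing one.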
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Abdel Said

Oct 24, 2012, 12:42:00 PM
to percona-d...@googlegroups.com
What do you suggest? Set it to the IP of the second node after start?

Abdel Said

Oct 24, 2012, 12:43:06 PM
to percona-d...@googlegroups.com
Is there any way to avoid using wsrep_cluster_address at all? To list the IPs of the 3 nodes and let the system do the rest?

Alex Yurchenko

Oct 24, 2012, 1:10:12 PM
to percona-d...@googlegroups.com
On 2012-10-24 19:43, Abdel Said wrote:
> Is there any way to avoid using wsrep_cluster_address at all? To list
> the IPs of the 3 nodes and let the system do the rest?

To an extent: check our recent 2.2 RC2 and
http://www.codership.com/wiki/doku.php?id=galera_url

If, however, you need to start a cluster from scratch, or the primary
component is lost, you will have to (re)bootstrap the PC manually, by

mysql> SET GLOBAL wsrep_provider_options="pc.bootstrap=1";

It is your responsibility though to make sure that there is no more
than 1 PC at a time.
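
One possible way to check for more than one PC is to compare what each node reports in its wsrep_cluster_status and wsrep_cluster_size status variables (for example from SHOW STATUS LIKE 'wsrep_cluster%'). The sketch below assumes those values have already been collected into a dictionary. The function name and the sample data are hypothetical, and it does not connect to any server.

```python
def find_cluster_problems(status_by_node):
    """status_by_node maps a node name to a dict of its wsrep status variables.

    Returns a list of human-readable problems; an empty list means every
    node is Primary and sees all of the expected nodes.
    """
    expected = len(status_by_node)
    problems = []
    for node, status in sorted(status_by_node.items()):
        size = int(status.get("wsrep_cluster_size", 0))
        if status.get("wsrep_cluster_status") != "Primary":
            problems.append(f"{node}: not in a primary component")
        elif size != expected:
            # Primary, but part of a smaller cluster than expected.
            problems.append(f"{node}: sees {size} of {expected} nodes")
    return problems


# Hypothetical values resembling the two disjoint clusters described earlier.
observed = {
    "node1": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "1"},
    "node2": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2"},
    "node3": {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2"},
}

for problem in find_cluster_problems(observed):
    print(problem)
```

If every node claims to be Primary but none of them sees all three members, the nodes are split into separate clusters even though each one looks healthy on its own.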

Abdel Said

Nov 10, 2012, 2:20:02 PM
to percona-d...@googlegroups.com
Thanks Alex. This "You should never leave wsrep_cluster_address=gcomm:// on a running node." seems to have fixed the issue. 