> See the logs especially node-1-mysql-error.log at line 320.
> This is not supposed to happen, node 2 and node 3 tried to sync and node 1
> tried to take over but crashed. Any idea what's going on here?
Hi,
1) your servers seem to be silently crashing from time to time. I'd look into the system logs around the times when you see lines like:
121002 15:03:43 mysqld_safe Number of processes running now: 0
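One way to find those moments is to scan the error log for that exact mysqld_safe message and collect the timestamps, which can then be compared against the system logs. The following is only a minimal sketch: the function name and the sample log text are made up for illustration, and only the "Number of processes running now: 0" line comes from the log quoted above.

```python
import re
from datetime import datetime

# Matches lines such as:
#   121002 15:03:43 mysqld_safe Number of processes running now: 0
CRASH_LINE = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) mysqld_safe Number of processes running now: 0"
)


def find_silent_crashes(log_text):
    """Return the timestamps at which mysqld_safe found no running mysqld."""
    crashes = []
    for line in log_text.splitlines():
        match = CRASH_LINE.match(line.strip())
        if match:
            # The log uses a two-digit year: YYMMDD HH:MM:SS.
            crashes.append(datetime.strptime(match.group(1), "%y%m%d %H:%M:%S"))
    return crashes


# Hypothetical log excerpt; only the second line is taken from the thread.
sample = """\
121002 15:03:40 [Note] WSREP: some unrelated line
121002 15:03:43 mysqld_safe Number of processes running now: 0
121002 15:03:43 mysqld_safe mysqld restarted
"""

for ts in find_silent_crashes(sample):
    print(ts.isoformat())
```

In practice the text would be read from the node's error log file instead of the inline sample.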
But this is not the cause of the situation you encountered. It was a result of misconfiguration:
2) node1 seems to have wsrep_cluster_address=gcomm:// so
- every time it crashes, other nodes forget it.
- every time it is restarted it starts a new cluster.
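A quick way to catch this misconfiguration is to check each node's configuration text for an empty gcomm:// address. This is a rough sketch rather than a full my.cnf parser: the function name is hypothetical, and it assumes the setting appears as a plain `wsrep_cluster_address=...` line and that the last occurrence wins.

```python
import re

# Matches a wsrep_cluster_address setting. Dashes and underscores are
# interchangeable in MySQL option names, so both spellings are accepted.
ADDRESS_LINE = re.compile(r"^\s*wsrep[_-]cluster[_-]address\s*=\s*(\S*)")


def bootstraps_new_cluster(cnf_text):
    """Return True if the effective wsrep_cluster_address is an empty gcomm://."""
    value = None
    for line in cnf_text.splitlines():
        match = ADDRESS_LINE.match(line)
        if match:
            # Keep the last value seen and drop optional surrounding quotes.
            value = match.group(1).strip("\"'")
    return value == "gcomm://"


print(bootstraps_new_cluster("[mysqld]\nwsrep_cluster_address=gcomm://\n"))
print(bootstraps_new_cluster("[mysqld]\nwsrep_cluster_address=gcomm://node1\n"))
```

A node for which the check returns True will start a new cluster on every restart, which is the behaviour described for node1.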
So you've been routinely running 2 disjoint clusters, one consisting of node1 alone and another consisting of node2 and node3. And it was perfectly fine, except that they of course became inconsistent with each other (but that's another story).
Until one day node3 silently crashed.
Since it is a split-brain situation, node2 could not form a majority and kept trying to reconnect to the last node it saw: node3.
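The majority rule behind this can be illustrated with a few lines of code. This is a simplified sketch, not Galera's implementation: it assumes every node has equal weight and that node3 disappeared without leaving gracefully, and the function name is made up.

```python
def has_quorum(reachable, last_primary_members):
    """True if the reachable nodes are a strict majority of the last primary component."""
    present = set(reachable) & set(last_primary_members)
    return 2 * len(present) > len(last_primary_members)


# The two-node cluster from the thread: node2 and node3.
last_primary = {"node2", "node3"}

print(has_quorum({"node2", "node3"}, last_primary))  # both nodes up
print(has_quorum({"node2"}, last_primary))           # node3 has crashed
```

With only two members, one surviving node is exactly half of the component, not more than half, so node2 alone cannot decide that it is the primary component.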
At the same time node3 was automatically restarted by mysqld_safe. Since it had wsrep_cluster_address=node1 it connected to node1.
And then node2 connected to node3, since it was trying to reconnect.
This way two nodes from different primary components saw each other in one cluster. And that's what caused an exception, because Galera detected the inconsistency and stopped operation to prevent data loss. So it is not a bug; in fact it is a very valuable feature. Now you can properly decide which data set is more representative - the one from node1 or the one from node2.
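The idea of refusing to merge nodes with different histories can be sketched as follows. This is purely illustrative and is not how Galera implements the check: the component labels "A" and "B", the exception class and the function are all invented for the example.

```python
class ConflictingHistory(Exception):
    """Raised when members of one view come from different primary components."""


def check_view(last_primary_of):
    """Validate a membership view.

    last_primary_of maps a node name to a label for the primary component
    that node last belonged to.
    """
    by_component = {}
    for node, component in last_primary_of.items():
        by_component.setdefault(component, []).append(node)
    if len(by_component) > 1:
        raise ConflictingHistory(
            f"members come from different primary components: {by_component}"
        )
    return sorted(last_primary_of)


# node3 rejoined node1's cluster ("A"); node2 still carries the history of
# the old node2/node3 cluster ("B").
view = {"node1": "A", "node3": "A", "node2": "B"}
try:
    check_view(view)
except ConflictingHistory as err:
    print("refusing to continue:", err)
```

Stopping here, rather than picking a winner automatically, leaves the choice between the two data sets to the operator.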
This story of three nodes once again reminds us how automatic recovery is inherently evil and can punish you at any time. Especially if you have your cluster misconfigured.
5.5.24 is known to have crashing issues.
Can you please try 5.5.27?
Thanks,
Vadim
On Tue, Oct 9, 2012 at 7:11 AM, Abdel Said <said.ab...@gmail.com> wrote:
> See the logs especially node-1-mysql-error.log at line 320.
> This is not supposed to happen, node 2 and node 3 tried to sync and node 1
> tried to take over but crashed. Any idea what's going on here?
On Tuesday, October 9, 2012 12:47:19 PM UTC-4, Alexey Yurchenko wrote:
> [...]
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011
On Wednesday, October 24, 2012 12:36:34 PM UTC-4, Alexey Yurchenko wrote:
> On 2012-10-24 17:20, Abdel Said wrote:
> > Thanks Alex for your reply. Unfortunately that's the standard Percona
> > configuration. Can you point me to the right configuration?
> You should never leave wsrep_cluster_address=gcomm:// on a running node.
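As a rough illustration of what a non-empty address could look like, the helper below builds a per-node value that lists the other cluster members. It assumes a Galera version that accepts a comma-separated list of peers after gcomm://; the helper name is hypothetical and the host names node1, node2 and node3 are simply the labels used in this thread. The one-time bootstrap of the very first node is not covered here.

```python
def cluster_address(all_nodes, this_node):
    """Build a wsrep_cluster_address value listing every node except this one."""
    peers = [node for node in all_nodes if node != this_node]
    return "gcomm://" + ",".join(peers)


nodes = ["node1", "node2", "node3"]
for node in nodes:
    # Line that would go into this node's configuration file.
    print(f"{node}: wsrep_cluster_address={cluster_address(nodes, node)}")
```

With such a value in place, a node that is restarted after a crash tries to rejoin its peers instead of starting a cluster of its own.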