Strange desync/crash of a node


Laurent MINOST

May 10, 2012, 4:56:36 AM5/10/12
to codersh...@googlegroups.com
Hi,

This morning I had a node that desynced by itself or crashed, I don't really know which. It automatically tried to resync from another Synced node in the cluster, but the resync stalled, probably because of the currently open bug on IST (https://bugs.launchpad.net/galera/+bug/985747). I have collected the logs from each node and attached them here: is it possible to find the reason for this desync/crash?

Based on this problem, I have some questions about the behaviour of Galera:
- What is the real status of a node in Donor state: is it really read-only? I saw that wsrep_last_committed seems to stay in sync with the last Synced node in the cluster, so replication seems to still be enabled/active while the node is in wsrep_local_state 2?
wsrep_local_state       2
wsrep_local_state_comment       Donor (+)
For dispatching requests with an HAProxy load balancer in front of the Galera cluster, can the node serving as Donor stay in the load-balancing pool as a serving node (I'm not able to distinguish READ from WRITE requests), or do I need to exclude it entirely during node synchronisation, leaving only 1 node out of 3 to serve requests/traffic during that time? Is there any way to keep the Donor node available for reads and writes without it being blocked, so that 2 of the 3 nodes stay available for serving requests?
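For what it's worth, what I have in mind on the HAProxy side is a state-aware health check along these lines (just a sketch of my own, not from the Galera docs; the mysql query and the wiring to HAProxy's httpchk are assumptions): answer HTTP 200 only when wsrep_local_state is 4 (Synced), so a Donor node (state 2) would be pulled out of rotation automatically.

```shell
#!/bin/sh
# Sketch of a state-aware health check for HAProxy's "option httpchk".
# Answer 200 only when wsrep_local_state is 4 (Synced); anything else
# (including 2 = Donor) answers 503 so HAProxy stops routing to the node.

state_to_http() {
    # $1 is the value of wsrep_local_state
    if [ "$1" = "4" ]; then
        printf 'HTTP/1.1 200 OK\r\n\r\n'
    else
        printf 'HTTP/1.1 503 Service Unavailable\r\n\r\n'
    fi
}

# In the real check the state would come from the node itself, e.g.:
# STATE=$(mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
# state_to_http "$STATE"
```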

- What exactly is the meaning of wsrep_provider_version? I cannot find any information on this parameter at http://www.codership.com/wiki/doku.php?id=mysql_options_0.8
I noticed a different wsrep_provider_version on one of my nodes, which seems pretty strange to me, as I remember using the same binaries for MySQL-wsrep (mysql-5.5.23_wsrep_23.5-linux) and the same source file for the Galera lib (galera-23.2.0-src.tar.gz). Could this have an impact on cluster stability?
   node1 : wsrep_provider_version  2.1dev(rXXXX)
   node2 : wsrep_provider_version  2.0(rXXXX)
   node3 : wsrep_provider_version  2.0(rXXXX)

- Is it possible to temporarily disable IST on all nodes until the IST bug is fixed, and if so, how?

Thanks.
Regards,

Laurent
strange_crash_node1.txt
strange_crash_node2.txt
strange_crash_node3.txt

Alex Yurchenko

May 10, 2012, 6:29:02 AM5/10/12
to codersh...@googlegroups.com
Hi Laurent,

On 2012-05-10 11:56, Laurent MINOST wrote:
> Hi,
>
> This morning I had a node that unsync by itself or crash, I don't
> really
> know why, It automatically tried to resync from another Synced node
> on the
> cluster but resync stalled probably because of the current opened bug
> on
> IST (https://bugs.launchpad.net/galera/+bug/985747), so I have
> collected
> all the logs from each node and provide them here to know if it's
> possible
> to find the reason of this desync/crash ?

Thanks for your report. According to the logs it was not a crash but a
network partitioning. The "crashed" node became unresponsive and was
kicked out by the other two:

> 120510 9:30:09 [Warning] WSREP: last inactive check more than PT1.5S
> ago, skipping check
> 120510 9:30:10 [Note] WSREP: (468c977e-9995-11e1-0800-6f132903562b,
> 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive
> peers: tcp://192.168.0.1:4567 tcp://192.168.0.5:4567
> 120510 9:30:10 [Warning] WSREP: last inactive check more than PT1.5S
> ago, skipping check
> 120510 9:30:11 [Note] WSREP: (468c977e-9995-11e1-0800-6f132903562b,
> 'tcp://0.0.0.0:4567') turning message relay requesting off
> 120510 9:30:11 [Note] WSREP:
> view(view_id(NON_PRIM,468c977e-9995-11e1-0800-6f132903562b,13) memb {
> 468c977e-9995-11e1-0800-6f132903562b,
> } joined {
> } left {
> } partitioned {
> 5d8c2fc7-976d-11e1-0800-2052394e626e,
> 9f9b7649-976d-11e1-0800-659e792e93dd,
> })

1) During IST the donor is not blocked at all, so it keeps on serving
clients as usual.

2) For non-blocking IST you should be using Percona XtraDB Cluster or
just take xtrabackup SST script from them.

> - What is exactly the meaning of wsrep_provider_version please ? I
> cannot
> find any information on this parameter from
> http://www.codership.com/wiki/doku.php?id=mysql_options_0.8
> I saw that on my cluster I have a different wsrep_provider_version on
> one
> of my node, that seems pretty strange to me as I remember I used the
> same
> binaries for MySQL-wsrep : mysql-5.5.23_wsrep_23.5-linux and source
> file
> for Galera lib : galera-23.2.0-src.tar.gz ? Maybe it can have an
> impact on
> the cluster stability ?
> node1 : wsrep_provider_version 2.1dev(rXXXX)
> node2 : wsrep_provider_version 2.0(rXXXX)
> node3 : wsrep_provider_version 2.0(rXXXX)

This is just for informational purposes.

> - Is it possible to disable IST totally for all nodes temporarly
> until the
> bug with IST is fixed please and how ?

1) Yes, you should unset wsrep_node_address and ist.recv_addr
variables.
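
E.g. make sure nothing like the following is left set in my.cnf (the addresses here are just placeholders for whatever you have configured):

```ini
# If either of these is set, comment it out / remove it so the joiner
# falls back to full SST instead of IST:
# wsrep_node_address     = 192.168.0.5
# wsrep_provider_options = "ist.recv_addr=192.168.0.5:4568"
```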

2) I have just pushed a fix for that bug in revision 127, you may want
to try that now.

Regards,
Alex


> Thanks.
> Regards,
>
> Laurent

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Laurent MINOST

May 10, 2012, 8:05:41 AM5/10/12
to codersh...@googlegroups.com


On Thursday, May 10, 2012 at 12:29:02 PM UTC+2, Alexey Yurchenko wrote:
Hi Laurent,

On 2012-05-10 11:56, Laurent MINOST wrote:
> Hi,
Hi Alexey,
 
>
> This morning I had a node that unsync by itself or crash, I don't
> really
> know why, It automatically tried to resync from another Synced node
> on the
> cluster but resync stalled probably because of the current opened bug
> on
> IST (https://bugs.launchpad.net/galera/+bug/985747), so I have
> collected
> all the logs from each node and provide them here to know if it's
> possible
> to find the reason of this desync/crash ?

Thanks for your report. According to the logs it was not a crash but a
network partitioning. The "crashed" node became unresponsive and was
kicked out by the other two:


Oh OK, thanks a lot. As this node is a virtual server, I checked the hosting server's graphs to see whether something like high load could explain it at that time, but found nothing... never mind. Thanks for your explanation. Could you also explain the meaning of PT1.5S? It seems to be the unit for Galera's internal health check?
OK, so does that mean that:
- the donor is blocked if there is an SST?
- this donor node can accept and answer READ queries like SELECT *AND* also WRITEs (INSERT, UPDATE, DELETE...), and the integrity of the data/answers served is the same as if the same request were run on a node in state 4 (Synced)?
 
2) For non-blocking IST you should be using Percona XtraDB Cluster or
just take xtrabackup SST script from them.

I have some difficulty understanding the beginning of your sentence compared to your previous answer: "For non-blocking IST" vs. "During IST the donor is not blocked at all"? I have probably misunderstood something somewhere (and my limited English vocabulary certainly shares the responsibility, lol) :)
If I understand properly, then with Percona xtrabackup we can have a Galera cluster whose nodes resync with each other, via IST or even SST, without "downtime", because while resyncing with another node they keep serving read/write requests to clients. Is that it?



> - What is exactly the meaning of wsrep_provider_version please ? I
> cannot
> find any information on this parameter from
> http://www.codership.com/wiki/doku.php?id=mysql_options_0.8
> I saw that on my cluster I have a different wsrep_provider_version on
> one
> of my node, that seems pretty strange to me as I remember I used the
> same
> binaries for MySQL-wsrep : mysql-5.5.23_wsrep_23.5-linux and source
> file
> for Galera lib : galera-23.2.0-src.tar.gz ? Maybe it can have an
> impact on
> the cluster stability ?
>    node1 : wsrep_provider_version  2.1dev(rXXXX)
>    node2 : wsrep_provider_version  2.0(rXXXX)
>    node3 : wsrep_provider_version  2.0(rXXXX)

This is just for informational purposes.

Thanks for this one. After giving it some more thought, I had wondered whether it was related to the x64 build used on node1 versus i686 on nodes 2 and 3...

> - Is it possible to disable IST totally for all nodes temporarly
> until the
> bug with IST is fixed please and how ?

1) Yes, you should unset wsrep_node_address and ist.recv_addr
variables.

OK, thanks a lot. This was just to avoid running into odd behaviour related to this problem, and thus avoid bothering this group needlessly...
 
2) I have just pushed a fix for that bug in revision 127, you may want
to try that now.

Perfect, I will give it a try as soon as I find time and will let you know on the Launchpad bug entry! ;)

Regards,
Alex


Thanks for your answers and your time!

Alex Yurchenko

May 10, 2012, 11:17:26 AM5/10/12
to codersh...@googlegroups.com
On 2012-05-10 15:05, Laurent MINOST wrote:
> Oh ok thks a lot, as this node is a virtual server, I've checked the
> hosting server graphs to see if I found an explanation such as a high
> load
> or others at this time but found nothing ... nevermind. Thanks for
> your
> explanation, could you please explain what is the meaning of PT1.5S
> please
> ? It seems to be the unit for galera internal healthcheck ?

Sort of. It checks the liveness of other nodes, and PT1.5S (ISO 8601
duration notation for 1.5 seconds) means that it could not do so for more
than 1.5 seconds. On real hardware that usually means heavy swapping/IO.
In VMs something else may add to this, especially if your VMs compete for
CPU cores.

>>
>> 1) During IST the donor is not blocked at all, so it keeps on
>> serving
>> clients as usual.
>>
>
> Ok, so does it mean that :
> - the donor is blocked if there is an SST please ?

If it is a blocking SST, like mysqldump or rsync, yes. xtrabackup SST is
non-blocking.

> - this donor node can accept and answer to READ queries like SELECT
> *AND*
> also WRITES (INSERT, UPDATE, DELETE ...) and that the integrity of
> data/answers served are the same as if the same request were done on
> the
> last node which is in state 4 (Synced (6)) please ?

Yes, but there may be significant delay on commit time, and the
probability of cluster-wide conflict increases.

>> 2) For non-blocking IST you should be using Percona XtraDB Cluster
>> or
>> just take xtrabackup SST script from them.
>>
>
> I have some difficulties to understand your the beginning of
> your sentence compared to your previous answer : "For non-blocking
> IST" <>
> "During IST the donor is not blocked at all" ? I have probably lost
> something somewhere in the sense ?! (and my english lack of
> vocabulary is
> certainly having a responsability in it lol) :)

That was just a typo. Of course I meant SST there.

> If I understand properly, then with Percona xtrabackup we can have a
> Galera
> cluster with nodes that will sync back each other when needing IST or
> even
> SST without "downtime" because they will go on, during the same time
> that
> it resyncs with a node, to serve requests to clients Read/Writes,
> that's it
> ?

Yes.

Regards,
Alex

Laurent MINOST

May 11, 2012, 3:35:56 AM5/11/12
to codersh...@googlegroups.com
OK, thanks a lot for your answers, Alexey!

Laurent MINOST

May 11, 2012, 7:26:40 AM5/11/12
to codersh...@googlegroups.com
Hi Alexey,

I'm coming back to your earlier suggestion of taking the wsrep_sst_xtrabackup script from Percona XtraDB Cluster and putting it into the stock Galera/MySQL-wsrep setup that I'm currently using. I tried this by copying the script into the bin/ subdirectory; I also installed Percona XtraBackup v2.00 from their website so that it is available in the PATH.
Then I changed the configuration in my.cnf on each node from:
   wsrep_sst_method                       = rsync
to
   wsrep_sst_method                       = xtrabackup

I shut the nodes down one by one so the whole cluster was down, then restarted a first node and started a second one to see how it would behave with the xtrabackup SST method. The node was not able to resync with the cluster, so I checked the logs (one of my attempts is attached) and saw what seems to be an error about uuid:seqno:

120511 13:11:00 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup 'joiner' '192.168.0.5' 'sst:5T13wPid' '/opt/mysql-galera/data/' '/etc/my-galera.cnf' '22019' 2>sst.err: 2 (No such file or directory)
120511 13:11:00 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
120511 13:11:00 [ERROR] WSREP: SST failed: 2 (No such file or directory)

I have probably forgotten something somewhere but have no clue what, for the moment; the log of the first node didn't help me.
Could you please tell me what this error means?

Thanks,

Laurent


Laurent MINOST

May 11, 2012, 7:27:56 AM5/11/12
to codersh...@googlegroups.com
Sorry, I forgot to attach the logs... luckily it's the end of the week today :)
enable_xtrabackup_sst_logs_node1.txt
enable_xtrabackup_sst_logs_node2.txt

Alex Yurchenko

May 11, 2012, 1:03:02 PM5/11/12
to codersh...@googlegroups.com
On 2012-05-11 14:27, Laurent MINOST wrote:
> Sorry, forgot to attach the logs ... hopefully this is the end of
> week
> today :)

Alas, you forgot sst.err from node2. That could shed some light. IIRC you
might need to change some trivial detail in wsrep_sst_xtrabackup, but I
don't remember which exactly.
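
In the meantime, the usual suspects behind that kind of "No such file or directory" are a binary missing from mysqld's PATH (innobackupex, nc) or the SST script itself not being found/executable. A quick sanity check along these lines might help (just a sketch; the dependency list is my guess at what the xtrabackup SST script needs, adjust for your setup):

```shell
#!/bin/sh
# Report which of the given commands are missing from PATH.
# Prints nothing when everything is present.
check_deps() {
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    echo "$missing" | sed 's/^ //'
}

# Example: check_deps innobackupex nc
# Also verify the script is executable where mysqld expects it, e.g.:
# ls -l /opt/mysql-galera/bin/wsrep_sst_xtrabackup
```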