wsrep_ready or wsrep_local_state_comment to determine Galera node health ?

admin extremeshok.com

Feb 20, 2012, 2:47:08 AM
to codersh...@googlegroups.com
Hi

Which is more correct to use to determine the health of a
Galera node?

wsrep_local_state_comment : Synced (6)
or
wsrep_ready : ON

Thanks

Henrik Ingo

Feb 20, 2012, 3:32:12 AM
to ad...@extremeshok.com, codersh...@googlegroups.com
Ah, one of my favorite questions.

The answer is: when checking for error scenarios, check for the very
thing you actually want to know. In this case you want to know whether
Galera is in a state where your application can read and write data in
tables. Your check should simply be "SELECT
* FROM someinnodbtable WHERE id=1;"

3 different results are possible:
- You get the row with id=1 (node is healthy)
- Unknown error (node is online but Galera is not connected/synced
with the cluster)
- Connection error (node is not online)

It's as simple as that. Polling anything other than what you are
actually interested in is a failure waiting to happen. As an example,
the semantics of wsrep_local_state_comment changed in the 2.0 version,
so something that worked with the 1.0 version wouldn't work anymore. The
above test always works because it tests the very thing you want to
know.
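
The three outcomes above could be told apart in a health check roughly as follows. This is only a sketch under assumptions: `run_query` is a hypothetical callable that executes the probe query and, on failure, raises an exception whose first argument is the MySQL error number; the "Unknown error" case is assumed to surface as error 1047, the number quoted elsewhere in this thread; the 2xxx values are the standard MySQL client connection errors.

```python
# Error 1047 (ER_UNKNOWN_COM_ERROR) is the number quoted in this thread.
ER_UNKNOWN_COM_ERROR = 1047
# Standard MySQL client-side connection errors:
# 2002/2003 cannot connect, 2006 server gone away, 2013 connection lost.
CONNECTION_ERRNOS = {2002, 2003, 2006, 2013}


def classify_probe(run_query):
    """Run the probe and map the outcome to one of the three results.

    run_query is a hypothetical zero-argument callable (an assumption,
    not part of any driver API) that runs the probe query and raises an
    exception carrying the MySQL error number as its first argument.
    """
    try:
        run_query()
    except Exception as exc:
        errno = exc.args[0] if exc.args else None
        if errno == ER_UNKNOWN_COM_ERROR:
            return "not_synced"   # node is up but not usable in the cluster
        if errno in CONNECTION_ERRNOS:
            return "offline"      # node cannot be reached at all
        return "other_error"      # anything else needs a closer look
    return "healthy"
```

In a real check, `run_query` would wrap the driver call that executes the SELECT against the node being tested.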

henrik

--
henri...@avoinelama.fi
+358-40-8211286 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://www.linkedin.com/profile/view?id=9522559

Simon Balz

Feb 20, 2012, 4:13:08 AM
to henri...@avoinelama.fi, ad...@extremeshok.com, codersh...@googlegroups.com
Hi All

My test query is even simpler than Henrik's:
SELECT 1 FROM DUAL;
--> returns 1 column with 1 row when your node is online and synced.
In case of a DESYNCED state, you get an "Unknown command" error (MySQL error number 1047, http://dev.mysql.com/doc/refman/5.5/en/error-messages-server.html#error_er_unknown_com_error).
Since this query is very lightweight and doesn't need an existing table or row, it is a nice way to poll the cluster state quite often.

Note: During an SST, this test query will still work on the donor in state 2, and with a blocking SST all your writes to the donor will be held until the SST finishes.
If your application fails in such cases (for example because of a query timeout), consider checking the 'wsrep_local_state' status variable directly.

Simon

Alex Yurchenko

Feb 20, 2012, 7:45:02 AM
to codersh...@googlegroups.com
Hi,

In a sense Henrik is right: cluster partition can happen any time, so
even if your previous SHOW STATUS command showed wsrep_ready = ON, your
next query can easily return 'Unknown command'.

However, until we come out with an unambiguous error code, the 'Unknown
command' may really be an unknown command. So it makes sense to check
the wsrep status when your application gets such an error.
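
A rough sketch of that follow-up check, under assumptions: `fetch_wsrep_ready` is a hypothetical callable that returns the current value of the wsrep_ready status variable (for example by running SHOW STATUS LIKE 'wsrep_ready' on the same node), and 1047 is the error number quoted earlier in this thread.

```python
ER_UNKNOWN_COM_ERROR = 1047  # "Unknown command", as quoted in this thread


def interpret_error(errno, fetch_wsrep_ready):
    """Narrow down what an error from the node most likely means.

    fetch_wsrep_ready is a hypothetical zero-argument callable (an
    assumption for this sketch) returning the wsrep_ready value.
    """
    if errno != ER_UNKNOWN_COM_ERROR:
        return "unrelated_error"
    # Only consult the wsrep status once the ambiguous error has occurred.
    ready = str(fetch_wsrep_ready()).strip().upper()
    if ready == "ON":
        return "genuine_unknown_command"
    return "node_not_ready"
```

Because the node state can change between the failed query and the status check, the result is a hint rather than a guarantee.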

Another situation where you may actually want to look at the
wsrep_local_state value is polling for a load balancer. Anything but 4
is a reason to divert connections to another node. The thing is that a
donor node will still have wsrep_ready = ON, but since it may be blocked,
is busy with state transfer anyway, and is short on I/O capacity, it may
be a good idea not to send new connections to it.
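
A load balancer check along these lines might look like the sketch below. It assumes the status has already been fetched as (name, value) pairs, for example from SHOW STATUS LIKE 'wsrep_%'; the function name and the input shape are illustrative, not part of any Galera tool.

```python
SYNCED = 4  # the wsrep_local_state value named above as the only good one


def accepts_new_connections(status_rows):
    """Decide whether a load balancer should send new connections here.

    status_rows is an iterable of (variable_name, value) pairs; this
    input shape is an assumption made for the sketch.
    """
    status = {name.lower(): str(value) for name, value in status_rows}
    if status.get("wsrep_ready", "").upper() != "ON":
        return False
    try:
        # A donor still reports wsrep_ready = ON, so the state is checked too.
        return int(status.get("wsrep_local_state", "")) == SYNCED
    except ValueError:
        # Missing or non-numeric state: treat the node as not usable.
        return False
```

Treating a missing or unparsable value as "do not route" keeps the check on the safe side when the status query itself misbehaves.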

Henrik, what exactly do you mean by

> As an example, the semantics of wsrep_local_state_comment changed in
> 2.0 version so something that worked with 1.0 version wouldn't work
> anymore.

? I'm not aware of anything like that. It must be a bug then.

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Henrik Ingo

Feb 21, 2012, 2:49:19 AM
to Alex Yurchenko, codersh...@googlegroups.com
On Mon, Feb 20, 2012 at 2:45 PM, Alex Yurchenko
<alexey.y...@codership.com> wrote:
>> As an example, the semantics of wsrep_local_state_comment changed in 2.0
>> version so something that worked with 1.0 version wouldn't work anymore.
>
>
> ? I'm not aware of anything like that. It must be a bug then.

The test wasn't done by me personally, and it was against XtraDB
Cluster, but it seems to me that previously the "Donor" state would
mean that the node is blocked, whereas in 2.0, if you use xtrabackup, it
is actually available, and when the node is donor to an IST
operation it is also fully available. Hence the Donor status of this
variable doesn't tell us whether the node is usable or not.

It's a good example of why a load balancer or application should
handle the actual errors, not check some proxy value. If there is an
error, there is an error.

Your point that such checks also inherently include a race condition
is good too, although such designs typically accept that transactions
will be lost for some number of seconds until failover happens. (I.e., it
is a master-slave or primary-secondary paradigm, imposing limitations
that Galera does not have.)

henrik
