Hi,
I'm planning to use Galera for an HA scenario and am thinking about how to handle monitoring and automated cluster restart and recovery.
I'm looking for comments on my setup, and I have some questions below.
Background
Three-node cluster, not using garbd. wsrep version 25.3.9.
Low volume writes, mostly reads, relatively small database size and simple schema. Small transactions. During normal operations all writes are preferably sent to one node.
Clients use an in-application list of MySQL nodes, each marked with a writable/readable status. A particular requirement is that availability matters most: we can even lose a few seconds' worth of transactions in favor of recovering automatically at startup, as long as we recover to some point-in-time consistent state between transactions.
We monitor wsrep_local_state == 4 (Synced) on each node and use this to dynamically update the list of writable nodes used by the applications. The list of readable nodes is currently determined by whether a MySQL connection to the node succeeds, but will probably include a health-check query later.
We use the wsrep_dirty_reads setting so reads can still be served when only a single node is up. We can cope with stale data in our apps and prefer it over being unavailable.
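The monitoring logic above can be factored into a small decision function, sketched here in shell. The function name and inputs are my own; the actual connectivity and wsrep_local_state values would be fetched per node via the mysql client, as hinted in the comments:

```shell
#!/bin/sh
# Sketch of the node-list classification described above.
# Inputs are assumed to be fetched elsewhere, e.g. (placeholder host/credentials):
#   state=$(mysql -h "$host" -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
# A node is writable only when Synced (state 4); a reachable but non-Synced
# node can still serve (possibly stale/dirty) reads.
classify_node() {
    reachable=$1   # 1 if a mysql connect to the node succeeded, else 0
    state=$2       # wsrep_local_state, or "" if unknown
    if [ "$reachable" != "1" ]; then
        echo down
    elif [ "$state" = "4" ]; then
        echo writable
    else
        echo readable
    fi
}
```

Keeping the decision separate from the fetch makes it easy to test the classification without a live cluster.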
Config is default, meaning that pc.recovery = true. The only wsrep setting defined is wsrep_cluster_address = %(other_nodes).
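For reference, a minimal my.cnf fragment matching this setup might look like the following. This is a sketch: the provider path and hostnames are placeholders, and pc.recovery is only spelled out to document the default mentioned above:

```ini
[mysqld]
# Galera provider; the path varies by distribution (placeholder)
wsrep_provider         = /usr/lib/galera/libgalera_smm.so
# Each node lists the other nodes (we currently filter out the node's own address)
wsrep_cluster_address  = gcomm://db2.example.com,db3.example.com
# Allow reads on a node that has lost quorum (stale data is acceptable for us)
wsrep_dirty_reads      = ON
# pc.recovery defaults to true in this wsrep version; stated only for clarity
wsrep_provider_options = "pc.recovery=true"
```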
Questions
1. Is wsrep_local_state == 4 a good way to determine whether a node is eligible for write transactions? Should we additionally check SELECT @@global.read_only?
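If the read_only check were added, the combined write-eligibility test could look like this sketch (the decision is factored out of the fetch so it can be tested in isolation; function name is mine):

```shell
#!/bin/sh
# A node is write-eligible when it is Synced AND not read_only.
# The two inputs would be fetched per node, e.g. (placeholder host/credentials):
#   state=$(mysql -h "$host" -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
#   ro=$(mysql -h "$host" -N -B -e "SELECT @@global.read_only")
write_eligible() {
    state=$1       # wsrep_local_state
    read_only=$2   # @@global.read_only (0 or 1)
    [ "$state" = "4" ] && [ "$read_only" = "0" ]
}
```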
2. Is it advisable to have an identical wsrep_cluster_address on all nodes, or should we filter out each node's own address (we do the latter now)? It seems self-address detection can have bugs?
3. Is it possible to automatically recover the cluster, in a safe way, from a state where all nodes are stopped and one node is completely unresponsive or has been reinstalled (only a correct my.cnf left, all data lost)? In the event of a partition, should two stopped-but-reachable nodes out of three be enough to start the cluster and form a primary component?
To achieve (3), I started writing a start script, used in combination with the mysql service on each node being set to start on boot. It is run on every start of the application to ensure mysql is up, and can also be run manually by an operator to start any stopped nodes, perform the bootstrap at initial setup, and recover from an outage. The script should not require any particular knowledge of Galera internals and should try its best to form a running cluster after an outage. I am unsure whether this is really such a smart thing to do in an automated fashion; comments are welcome. We want to simplify cluster operations.
Pseudocode for mysql start scripts
A fourth machine acts as a control node and uses ssh to provision the mysql nodes.
start_cluster:
    running_hosts = hosts reachable over ssh with a running mysqld process
    stopped_hosts = hosts reachable over ssh with a stopped mysqld process
    unknown_hosts = hosts not reachable over ssh
    unclean_shutdown_recoverable = hosts with gvwstate.dat present
                                   and seqno == -1 in grastate.dat

    if #unknown_hosts > 1:
        abort 'cannot proceed automatically'
    if #running_hosts > 1:
        run_in_parallel 'service mysql start' on stopped_hosts
    else if #unclean_shutdown_recoverable == #all_hosts:
        run_in_parallel 'service mysql start' on stopped_hosts
    else:
        bootstrap_cluster

bootstrap_cluster:
    bootstrap_node = select_bootstrap_node over all hosts
    'service mysql bootstrap' on bootstrap_node
    run_in_parallel 'service mysql start' on joiner_nodes

seqno:
    if mysqld is running:
        SELECT wsrep_cluster_state_uuid, wsrep_last_committed (via SHOW STATUS)
    else:
        run mysqld --wsrep-recover and look for 'WSREP: Recovered position:'

select_bootstrap_node:
    seqno_from_hosts = seqno function run over all hosts
    filter out unresponsive hosts from seqno_from_hosts
    assert that cluster_state_uuid is the same across all hosts, otherwise abort
    sort hosts by wsrep_last_committed descending, preferring hosts
        with a running mysqld
    pick the first in the sorted list as bootstrap node
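Two of the pieces above, extracting the seqno from mysqld --wsrep-recover output and picking the bootstrap node with the highest seqno, can be sketched as shell functions. The log-line format is as emitted by mysqld; the function names and the "host seqno" input format for pick_bootstrap are my own:

```shell
#!/bin/sh
# Extract "<uuid>:<seqno>" from mysqld --wsrep-recover output, which logs
# a line like: "... WSREP: Recovered position: <uuid>:<seqno>"
recovered_position() {
    sed -n 's/.*WSREP: Recovered position: *//p'
}

recovered_seqno() {
    recovered_position | awk -F: '{print $NF}'
}

# Given "host seqno" pairs on stdin, print the host with the highest
# seqno (-1 sorts last, as desired for unrecovered nodes).
pick_bootstrap() {
    sort -k2,2nr | head -n 1 | awk '{print $1}'
}

# Usage sketch:
#   seqno=$(mysqld --wsrep-recover 2>&1 | recovered_seqno)
```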
Am I introducing too much complexity? Should/can this type of recovery from complete loss of a single node be handled automatically?
Should gvwstate.dat also help with starting the cluster after a clean shutdown of all nodes, which would make my "seqno == -1 in grastate.dat (unclean shutdown)" check wrong? I want to avoid re-bootstrapping when it is not needed.
Thanks in advance for any answers/comments/opinions,
Ragnar