Hi,
I'm planning to use Galera for an HA scenario and am thinking about how to handle monitoring and automated cluster restart and recovery.
I'm looking for comments on my setup, and I have some questions below.
Background
Three-node cluster, not using garbd. wsrep version 25.3.9.
Low volume writes, mostly reads, relatively small database size and simple schema. Small transactions. During normal operations all writes are preferably sent to one node.
Clients use an in-application list of MySQL nodes, each marked with a writable/readable status. A particular requirement is that availability matters most: we can even lose a few seconds' worth of transactions in favor of recovering automatically at startup, as long as we recover to some point-in-time consistent state between transactions.
We monitor wsrep_local_state == 4 (Synced) on each node and use this to dynamically update the list of writable nodes used by the applications. The list of readable nodes is currently determined by whether a MySQL connection to the node succeeds, but will probably include a health-check query later.
We use the wsrep_dirty_reads setting so reads can still be served when only a single node is up. We can cope with stale data in our apps and prefer it over being unavailable.
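The monitoring logic above can be factored into a small decision function, sketched here in shell. The function name and inputs are my own; the actual connectivity and wsrep_local_state values would be fetched per node via the mysql client, as hinted in the comments:

```shell
#!/bin/sh
# Sketch of the node-list classification described above.
# Inputs are assumed to be fetched elsewhere, e.g. (placeholder host/credentials):
#   state=$(mysql -h "$host" -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
# A node is writable only when Synced (state 4); a reachable but non-Synced
# node can still serve (possibly stale/dirty) reads.
classify_node() {
    reachable=$1   # 1 if a mysql connect to the node succeeded, else 0
    state=$2       # wsrep_local_state, or "" if unknown
    if [ "$reachable" != "1" ]; then
        echo down
    elif [ "$state" = "4" ]; then
        echo writable
    else
        echo readable
    fi
}
```

Keeping the decision separate from the fetch makes it easy to test the classification without a live cluster.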
Config is default, meaning that pc.recovery = true. The only wsrep setting defined is wsrep_cluster_address = %(other_nodes).
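For reference, a minimal my.cnf fragment matching this setup might look like the following. This is a sketch: the provider path and hostnames are placeholders, and pc.recovery is only spelled out to document the default mentioned above:

```ini
[mysqld]
# Galera provider; the path varies by distribution (placeholder)
wsrep_provider         = /usr/lib/galera/libgalera_smm.so
# Each node lists the other nodes (we currently filter out the node's own address)
wsrep_cluster_address  = gcomm://db2.example.com,db3.example.com
# Allow reads on a node that has lost quorum (stale data is acceptable for us)
wsrep_dirty_reads      = ON
# pc.recovery defaults to true in this wsrep version; stated only for clarity
wsrep_provider_options = "pc.recovery=true"
```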
Questions
1. Is wsrep_local_state == 4 a good way to determine whether a node is eligible for write transactions? Should we additionally check SELECT @@global.read_only?
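If the read_only check were added, the combined write-eligibility test could look like this sketch (the decision is factored out of the fetch so it can be tested in isolation; function name is mine):

```shell
#!/bin/sh
# A node is write-eligible when it is Synced AND not read_only.
# The two inputs would be fetched per node, e.g. (placeholder host/credentials):
#   state=$(mysql -h "$host" -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
#   ro=$(mysql -h "$host" -N -B -e "SELECT @@global.read_only")
write_eligible() {
    state=$1       # wsrep_local_state
    read_only=$2   # @@global.read_only (0 or 1)
    [ "$state" = "4" ] && [ "$read_only" = "0" ]
}
```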
2. Is it advisable to have an identical wsrep_cluster_address on all nodes, or should we filter out each node's own address (we do the latter now)? It seems self-address detection can have bugs?
3. Is it possible to automatically recover the cluster, in a safe way, from a state where all nodes are stopped and one node is completely unresponsive or has been reinstalled (only a correct my.cnf left, all data lost)? In the event of a partition, should two stopped-but-reachable nodes out of three be enough to start the cluster and form a primary component?
To achieve (3), I started writing a start script, used in combination with the mysql service on each node being set to start on boot. It is run on every start of the application to ensure mysql is up, and can also be run manually by an operator to start any stopped nodes, perform the bootstrap at initial setup, and recover from an outage. The script should not require any particular knowledge of Galera internals and should try its best to form a running cluster after an outage. I am unsure whether this is really such a smart thing to do in an automated fashion; comments are welcome. We want to simplify cluster operations.
Pseudocode for mysql start scripts
A fourth machine acts as a control node and uses ssh to provision the mysql nodes.
start_cluster:
    running_hosts = hosts reachable over ssh with a running mysqld process
    stopped_hosts = hosts reachable over ssh with a stopped mysqld process
    unknown_hosts = hosts not reachable over ssh
    unclean_shutdown_recoverable = hosts with gvwstate.dat present
                                   and seqno == -1 in grastate.dat

    if #unknown_hosts > 1:
        abort 'cannot proceed automatically'
    if #running_hosts > 1:
        run_in_parallel 'service mysql start' on stopped_hosts
    else if #unclean_shutdown_recoverable == #all_hosts:
        run_in_parallel 'service mysql start' on stopped_hosts
    else:
        bootstrap_cluster

bootstrap_cluster:
    bootstrap_node = select_bootstrap_node over all hosts
    'service mysql bootstrap' on bootstrap_node
    run_in_parallel 'service mysql start' on joiner_nodes

seqno:
    if mysqld is running:
        SELECT wsrep_cluster_state_uuid, wsrep_last_committed (via SHOW STATUS)
    else:
        run mysqld --wsrep-recover and look for 'WSREP: Recovered position:'

select_bootstrap_node:
    seqno_from_hosts = seqno function run over all hosts
    filter out unresponsive hosts from seqno_from_hosts
    assert that cluster_state_uuid is the same across all hosts, otherwise abort
    sort hosts by wsrep_last_committed descending, preferring hosts
        with a running mysqld
    pick the first in the sorted list as bootstrap node
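Two of the pieces above, extracting the seqno from mysqld --wsrep-recover output and picking the bootstrap node with the highest seqno, can be sketched as shell functions. The log-line format is as emitted by mysqld; the function names and the "host seqno" input format for pick_bootstrap are my own:

```shell
#!/bin/sh
# Extract "<uuid>:<seqno>" from mysqld --wsrep-recover output, which logs
# a line like: "... WSREP: Recovered position: <uuid>:<seqno>"
recovered_position() {
    sed -n 's/.*WSREP: Recovered position: *//p'
}

recovered_seqno() {
    recovered_position | awk -F: '{print $NF}'
}

# Given "host seqno" pairs on stdin, print the host with the highest
# seqno (-1 sorts last, as desired for unrecovered nodes).
pick_bootstrap() {
    sort -k2,2nr | head -n 1 | awk '{print $1}'
}

# Usage sketch:
#   seqno=$(mysqld --wsrep-recover 2>&1 | recovered_seqno)
```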
Am I introducing too much complexity? Should/can this type of recovery from complete loss of a single node be handled automatically?
Should gvwstate.dat also help with starting the cluster after a clean shutdown of all nodes, which would make my "seqno == -1 in grastate.dat (unclean shutdown)" check wrong? I want to avoid re-bootstrapping when it is not needed.
Thanks in advance for any answers/comments/opinions,
Ragnar