Hello,
Synopsis: We have a few 3-member MySQL CE 8.0.34 Galera clusters. We patch servers automatically every week using dnf-automatic install timers, which includes a reboot if applicable. For Galera cluster members, we space server reboots 15 minutes apart. We do not automate updates of wsrep-mysql-server or galera-4; we only perform those manually.
Problem: We've noticed that our Galera cluster members do not rejoin during these clean sequential reboots. /var/log/mysqld reports the 1st member attempting an rsync SST from the cluster, but it times out and the member fails to start. The 2nd member then does the same. The 3rd member then reboots gracefully and writes safe_to_bootstrap: 1 to grastate.dat. This isn't desirable, as we'd expect the 1st and 2nd members to have rejoined the primary view before the 3rd rebooted. Our server VMs consistently take 1-1.5 minutes to reboot. I suspect that as each member goes down to reboot, the "surviving" cluster members evict it. Once it's evicted, the member comes back up and is no longer allowed to sync back due to some condition or value in place.
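To compare what each member recorded at shutdown, something like the following could read grastate.dat on every node. This is only a minimal sketch: the function names are made up for illustration, and the sample contents (uuid and seqno) are invented rather than taken from our servers.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# GALERA saved state" comment header.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_bootstrap_candidate(state):
    """True if this node recorded safe_to_bootstrap: 1 at shutdown."""
    return state.get("safe_to_bootstrap") == "1"


# Illustrative file contents only; the uuid and seqno are invented.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1234
safe_to_bootstrap: 1
"""

if __name__ == "__main__":
    state = parse_grastate(SAMPLE)
    print(state["seqno"], is_bootstrap_candidate(state))
```

Running this against the real file on all three members after a failed cycle would show which one claims to be safe to bootstrap and what seqno each one saved.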
Sample Error:
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at /home/buildbot/buildbot/build/gcomm/src/pc.cpp:connect():160
[ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_core.cpp:gcs_core_open():222: Failed to open backend connection: -110 (Connection timed out)
[ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs.cpp:gcs_open():1670: Failed to open channel 'dev-mysql-8' at 'gcomm://<REDACTED - this is a list of Galera servers>': -110 (Connection timed out)
[ERROR] WSREP: gcs connect failed: Connection timed out
[ERROR] WSREP: wsrep::connect(gcomm://<REDACTED - this is a list of Galera servers>) failed: 7
Here are our Galera configuration options in /etc/my.cnf:
wsrep_cluster_address = "gcomm://<REDACTED - This is the FQDNs of the 3 servers in a comma separated list>"
wsrep_cluster_name = dev-mysql-8
wsrep_node_address = <REDACTED - This is the IP address of this member server>
wsrep_node_name = <REDACTED - this is the FQDN>
wsrep_on = ON
wsrep_provider = /usr/lib64/galera-4/libgalera_smm.so
wsrep_provider_options = "socket.ssl=1;socket.ssl_key=/etc/pki/tls/private/mysql.key;socket.ssl_cert=/etc/pki/tls/certs/mysql.crt;socket.ssl_cipher=TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_CCM_SHA256:TLS_AES_128_CCM_8_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384;socket.ssl_ca=/etc/pki/tls/certs/mysql-ca.crt;evs.keepalive_period=PT3S;evs.suspect_timeout=PT45S;evs.inactive_timeout=PT2M;evs.install_timeout=PT2M"
wsrep_sst_auth = root:<REDACTED - This is the SST Password>
wsrep_sst_method = rsync
Note: We do use TLS for Galera, and enforce a list of stronger ciphers.
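To make the timing relationship concrete, here is a rough sketch that pulls the evs.*_timeout values out of a wsrep_provider_options string and compares them with our 1-1.5 minute reboot window. The helper names are hypothetical, and the options string is shortened to the evs settings from our config. It only does the arithmetic; it does not show that these timeouts cause the failure.

```python
import re

# Matches the simple ISO 8601 durations used in our options, e.g. PT45S, PT2M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def duration_seconds(value):
    """Convert a duration such as 'PT2M' or 'PT1M30S' to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def timeout_seconds(provider_options):
    """Return {option: seconds} for every evs.*_timeout in the options string."""
    timeouts = {}
    for item in provider_options.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep and key.startswith("evs.") and key.endswith("_timeout"):
            timeouts[key] = duration_seconds(value.strip())
    return timeouts


def expires_during(timeouts, outage_seconds):
    """List the timeouts that are shorter than an outage of the given length."""
    return sorted(k for k, v in timeouts.items() if v < outage_seconds)


# Shortened copy of our wsrep_provider_options (TLS settings left out).
OPTIONS = (
    "socket.ssl=1;evs.keepalive_period=PT3S;evs.suspect_timeout=PT45S;"
    "evs.inactive_timeout=PT2M;evs.install_timeout=PT2M"
)

if __name__ == "__main__":
    timeouts = timeout_seconds(OPTIONS)
    for outage in (60, 90):  # our observed reboot time range, in seconds
        print(outage, expires_during(timeouts, outage))
```

With our values, a 60-90 second reboot is longer than evs.suspect_timeout (45 seconds) but shorter than evs.inactive_timeout and evs.install_timeout (120 seconds each).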
Current Solution: We recover the cluster manually, then wait for it to occur again. For Production, we've disabled automated patching for stability and patch as needed.
Questions:
- Is there a set of configuration options that allows a Galera cluster to be rebooted in a methodical way?
- Is there an issue with our current configuration options?
- Does Galera recognize a scheduled reboot through systemd, and prevent rejoining members from performing an rsync SST?
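As an illustration of what "methodical" could mean for the first question, one possible gate in front of each reboot is to refuse to proceed unless the node sees a full, healthy cluster. The sketch below assumes the caller has already collected the wsrep status variables (for example from SHOW GLOBAL STATUS LIKE 'wsrep_%') into a dict of strings; the function name and the wiring into dnf-automatic are hypothetical, and this is not something we run today.

```python
def reboot_blockers(status, expected_size=3):
    """Return the reasons this node should not be rebooted yet (empty if none).

    `status` maps wsrep status variable names to their string values.
    """
    blockers = []
    if status.get("wsrep_cluster_status") != "Primary":
        blockers.append("node is not in the primary component")
    size = status.get("wsrep_cluster_size", "0")
    if int(size) < expected_size:
        blockers.append(f"cluster size is {size}, expected {expected_size}")
    if status.get("wsrep_local_state_comment") != "Synced":
        blockers.append("node is not Synced")
    if status.get("wsrep_ready") != "ON":
        blockers.append("wsrep_ready is not ON")
    return blockers


if __name__ == "__main__":
    # Illustrative values for a healthy 3-member cluster.
    healthy = {
        "wsrep_cluster_status": "Primary",
        "wsrep_cluster_size": "3",
        "wsrep_local_state_comment": "Synced",
        "wsrep_ready": "ON",
    }
    print(reboot_blockers(healthy))
    # A cluster where an earlier member has not rejoined yet.
    print(reboot_blockers({**healthy, "wsrep_cluster_size": "2"}))
```

A check like this would at least stop the 2nd and 3rd members from rebooting while an earlier member is still out of the cluster, instead of relying on the fixed 15-minute spacing.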
Thanks for any help! We've hunted down this issue for a while, but wanted to get some ideas from the community.