Galera Cluster Sequential Reboot Questions


Tanner

Aug 29, 2023, 12:19:04 AM
to codership
Hello,

Synopsis: We have a few 3-member MySQL CE 8.0.34 Galera Clusters. We patch servers automatically every week using dnf-automatic install timers, which includes a reboot if applicable. For Galera cluster members, we space server reboots 15 minutes apart. We do not automate updates of wsrep-mysql-server or galera-4; those are only performed manually.

Problem: We've noticed that our Galera cluster members do not rejoin during these clean sequential reboots. /var/log/mysqld reports the 1st member attempting an rsync SST from the cluster, but it times out and mysqld fails to start. The 2nd member then does the same. The 3rd member reboots gracefully and writes safe_to_bootstrap: 1 to grastate.dat. This isn't desirable, as we'd expect the 1st and 2nd members to have rejoined the primary view before the 3rd rebooted. Our server VMs consistently take 1-1.5 minutes to reboot. I suspect that as each member goes down to reboot, the "surviving" cluster members evict it. When the evicted member comes back up, it is no longer allowed to sync back due to some condition or value in place.
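For anyone wanting to check that flag on each member after a reboot, here is a minimal sketch that reads safe_to_bootstrap from the contents of grastate.dat. It assumes the usual "key: value" layout of that file; the sample text and its uuid are made up for illustration, and the path in the usage note is an assumption based on the default datadir.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict.

    Comment lines (starting with '#') and blank lines are skipped.
    All values are returned as strings.
    """
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(text):
    """Return True only if the file marks this node as safe to bootstrap."""
    return parse_grastate(text).get("safe_to_bootstrap") == "1"


# Illustrative file contents (hypothetical uuid and seqno, not from our servers).
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   1234
safe_to_bootstrap: 1
"""
```

On a real member this would be called with the file contents, for example `is_safe_to_bootstrap(open("/var/lib/mysql/grastate.dat").read())`, adjusting the path to the actual datadir.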

Sample Error:
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at /home/buildbot/buildbot/build/gcomm/src/pc.cpp:connect():160
[ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_core.cpp:gcs_core_open():222: Failed to open backend connection: -110 (Connection timed out)
[ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs.cpp:gcs_open():1670: Failed to open channel 'dev-mysql-8' at 'gcomm://<REDACTED - this is a list of Galera servers>': -110 (Connection timed out)
[ERROR] WSREP: gcs connect failed: Connection timed out
[ERROR] WSREP: wsrep::connect(gcomm://<REDACTED - this is a list of Galera servers>) failed: 7

Here are our Galera configuration options in /etc/my.cnf:
wsrep_cluster_address = "gcomm://<REDACTED - This is the FQDNs of the 3 servers in a comma separated list>"
wsrep_cluster_name = dev-mysql-8
wsrep_node_address = <REDACTED - This is the IP address of this member server>
wsrep_node_name = <REDACTED - this is the FQDN>
wsrep_on = ON
wsrep_provider = /usr/lib64/galera-4/libgalera_smm.so
wsrep_provider_options = "socket.ssl=1;socket.ssl_key=/etc/pki/tls/private/mysql.key;socket.ssl_cert=/etc/pki/tls/certs/mysql.crt;socket.ssl_cipher=TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_CCM_SHA256:TLS_AES_128_CCM_8_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384;socket.ssl_ca=/etc/pki/tls/certs/mysql-ca.crt;evs.keepalive_period=PT3S;evs.suspect_timeout=PT45S;evs.inactive_timeout=PT2M;evs.install_timeout=PT2M"
wsrep_sst_auth = root:<REDACTED - This is the SST Password>
wsrep_sst_method = rsync

Note: We do use TLS for Galera, and enforce a list of stronger ciphers.
Note 2: The evs.* config keys were added on the recommendation of the WAN replication KB, with some adjusted values: https://galeracluster.com/library/kb/wan-replication.html
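To make the evs.* values easier to compare against the 1-1.5 minute reboot window, here is a minimal sketch that splits a wsrep_provider_options string and converts the ISO 8601 durations to seconds. The function names are hypothetical, and the parser only handles the simple PT..H..M..S form used above. Whether these timeouts are actually involved in a clean shutdown is not something this snippet can establish.

```python
import re

# Matches simple ISO 8601 time durations such as PT3S, PT45S, PT2M, PT1H30M.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_provider_options(raw):
    """Split a 'key=value;key=value' provider options string into a dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def duration_seconds(value):
    """Convert a PT..H..M..S duration string to seconds as a float."""
    match = _DURATION.match(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def evs_timeouts(raw):
    """Return the evs.* duration options from the string, in seconds."""
    return {
        key: duration_seconds(value)
        for key, value in parse_provider_options(raw).items()
        if key.startswith("evs.") and value.startswith("PT")
    }
```

Passing the evs.* portion of the configuration above to `evs_timeouts` gives the keepalive, suspect, inactive, and install values as plain numbers of seconds, which can then be lined up against the measured reboot time.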

Current Solution: We recover the cluster manually, then wait for the failure to occur again. For Production, we've disabled the automated patching for stability and patch as needed.

Questions
  • Is there a set of configuration options that allows a Galera cluster to be rebooted in a methodical way?
  • Is there an issue with our current configuration options at present?
  • Does Galera recognize a scheduled reboot through systemd, and prevent rejoining members from performing an rsync SST?
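To illustrate what "methodical" could mean in the first question, here is a sketch of a pre-reboot gate that only allows a member to reboot while the cluster is whole and this node is synced. It is an idea to discuss, not a confirmed fix. The function name and the dict-based input are assumptions; the keys are the standard wsrep status variables as returned by SHOW GLOBAL STATUS LIKE 'wsrep_%'.

```python
def safe_to_reboot(status, expected_size=3):
    """Decide whether this member may reboot, given its wsrep status values.

    `status` maps status variable names to their string values, e.g. the
    rows of SHOW GLOBAL STATUS LIKE 'wsrep_%' collected into a dict.
    The node must be in a Primary component, fully synced, ready, and
    see every expected member before a reboot is allowed.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
        and int(status.get("wsrep_cluster_size", 0)) == expected_size
    )
```

A patching job could call this before each reboot and skip the reboot when it returns False, so a second member never goes down while the first is still missing from the cluster.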
Thanks for any help! We've hunted down this issue for a while, but wanted to get some ideas from the community.

Matt Horwood

Feb 19, 2025, 5:37:46 AM
to codership
Hello Tanner,

Did you find a solution to this?

We are looking at rolling out an automated update system, but we can't control when an instance gets updated and rebooted.

Tanner Smith

Feb 19, 2025, 8:58:48 AM
to Matt Horwood, codership
Hey Matt - We opted to turn dnf-automatic timers off and patch the servers manually. I also no longer support this environment, so I'm unable to test any new settings or tell you if any minor patch release helps.
