Node restarting causes cluster to crash


ben shalev

Mar 5, 2023, 8:23:07 AM
to codership
Recently we have had a couple of problems with our Galera cluster. We added a third region with 3 more nodes (we used to have 3 nodes across 2 regions, plus 1 garbd in one of those regions).

A few days ago the physical machine one of the VMs was running on crashed. When the node came back up, it brought the cluster down with SST problems: the cluster went read-only and had to be bootstrapped.

We are using:
Galera 26.4.4
MariaDB 10.4.13

The configuration is as follows and is the same on all nodes (apart from the ist.recv_bind IP and wsrep_node_address):

my.cnf:
```
[galera]
wsrep_on=ON
wsrep_cluster_name="powerdns"
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://<9 ips of nodes>
wsrep_notify_cmd=/usr/bin/get-status.sh

wsrep_provider_options="gmcast.segment=<segment>; ist.recv_bind=<ip>; socket.ssl_cert=/etc/ssl/mysql/server-cert.pem;socket.ssl_key=/etc/ssl/mysql/server-key.pem;socket.ssl_ca=/etc/ssl/mysql/ca-cert.pem"
wsrep_dirty_reads=ON
wsrep-sync-wait=0
wsrep_node_address="<node_ip>"

[mysqld]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/server-key.pem
ssl-cert = /etc/ssl/mysql/server-cert.pem

[client]
ssl-ca = /etc/ssl/mysql/ca-cert.pem
ssl-key = /etc/ssl/mysql/client-key.pem
ssl-cert = /etc/ssl/mysql/client-cert.pem
```
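For reference, cluster membership and each node's state can be inspected with the stock wsrep status variables (a sketch; connection flags and credentials are environment-specific and omitted here):

```shell
# Report cluster size, this node's sync state, and whether the
# component is PRIMARY, using standard Galera status variables.
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_size', 'wsrep_local_state_comment', 'wsrep_cluster_status');"
```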

The logs we see on the node that causes the crash (the JOINER node):
```
WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
WSREP: Shifting PRIMARY -> JOINER (TO: 59319)
WSREP: Requesting state transfer: success, donor: 6
WSREP: forgetting f46bc950-abe6 (ssl://<ip>:4567)
version= 6,
component = PRIMARY,
conf_id = 75
members = 6/7 (joined/total),
act_id = 59324
last_appl. = 59214
protocols = 2/10/4 (gcs/repl/appl),
[Warning] WSREP: Donor f46bc950-9d7f-11ed-abe6-57fe7b2de322 is no longer in the group. State transfer cannot be completed, need to abort. Aborting
WSREP: /usr/bin/mysql: Terminated
systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
mysqld: Terminated
WSREP_SST: [INFO] Joined cleanup. rsync PID:4389
rsyncd[4389]: sent 0 bytes received 0 bytes total size 0
mysql: WSREP_SST:[INFO] Joined cleanup done.
Failed to start MariaDB 10.4.13
```

The logs we see on the donor:
```
WSREP: Member 7.1 (db-<region-1>-1) request state transfer from '*any*'. Selected 6.1 (db-<region-1>-2)(SYNCED) as donor.
Shifting SYNCED -> DONOR/DESYNCED (TO: 59319)
WSREP: Detected STR version: 1, req_len: 120, req: STRv1
Cert index preload: 59215 -> 59319
IST sender using ssl
[ERROR] WSREP: Failed to process action STATE_REQUEST, g:59319, l:5187, ptr:0x7f6322974e78, size: 120: IST sender, failed to connect 'ssl://<server_ip>:4568': connect: No route to host: 113 (No route to host)
```
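The "No route to host" error suggests the donor cannot reach the joiner's IST receive address on port 4568. As a sketch, reachability can be verified from the donor host (`<joiner_ip>` is a placeholder for the address configured in ist.recv_bind):

```shell
# From the donor host, test whether the joiner's IST port (4568)
# accepts TCP connections; -z scans without sending data, -w sets a timeout.
nc -zv -w 5 <joiner_ip> 4568
```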

Then the node continued down the "line" of donors, crashing each one in turn, until it reached one it didn't crash (the one we bootstrapped from).

This has already happened to us twice and causes a lot of problems and downtime. What is the cause of this? Why does it sometimes happen?

Why does the node sometimes succeed and sync, while other times it goes through the nodes one by one and crashes them?
Ty :)

ben shalev

Mar 5, 2023, 10:56:52 AM
to codership
We were able to replicate it in the prep environment: we took the compute server down for 1 hour, then started it back up.

The rejoining node crashed the first server it connected to (prep-db-<region-1>-2): it requested a state transfer from it, but the donor crashed with the same logs as above.

After that it tried another DB in another region, succeeded, and stayed up. In this case it completed the SST; in other cases (as happened to us) it crashed the whole DB, caused connection timeouts on all DBs, and thus brought down the whole cluster.

Is there a way to "force" an SST so we can replicate this without shutting down the compute server for an hour?
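For the record, one common way to force a full SST on a single node is to invalidate its local Galera state file so the node cannot use IST on restart (a sketch, assuming the default data directory /var/lib/mysql and that the rest of the cluster stays up):

```shell
# Stop MariaDB on the node that should re-join via SST.
systemctl stop mariadb

# Removing grastate.dat discards the node's recorded local state,
# so on the next start it must request a full state snapshot (SST).
rm /var/lib/mysql/grastate.dat

# Start the node again; it will select a donor and run SST.
systemctl start mariadb
```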

Ty :)

On Sunday, March 5, 2023 at 3:23:07 PM UTC+2, ben shalev wrote:

ben shalev

Mar 14, 2023, 6:52:08 AM
to codership
Hey, any updates on this?

On Sunday, March 5, 2023 at 5:56:52 PM UTC+2, ben shalev wrote: