Galera cluster goes down every 5-10 days

François Delpierre

Oct 10, 2022, 2:59:20 AM

to codership

Hi,

I have had a problem for several months that I need to understand. Configuration:

3 nodes (galera1,galera2,galera3)
Server version: 10.6.7-MariaDB-2ubuntu1.1-log Ubuntu 22.04
Running in LXC / Proxmox VE

The load balancer is IPVS (TCP) with a higher weight on the first node, and it has properly removed the failing node:

root@burns2:~# ipvsadm --list --tcp-service "192.168.232.1:3306"
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP burns2:mysql rr persistent 180
-> 192.168.10.51:mysql          Masq    4      0          34
-> 192.168.10.53:mysql          Masq    1      0          32

Problem, about every 5-10 days:

The second node (galera2) hangs, but Galera does not evict it. Instead, the entire cluster switches to read-only. More details:

The galera2 instance becomes unresponsive (hangs).
The LB in front of the cluster properly removes galera2 from the backends. But on the clients connecting to the 2 remaining nodes, the logs show:
WordPress database error Lock wait timeout exceeded; try restarting transaction for query UPDATE `wp_options` SET `option_value` = ...
I can connect from localhost via the socket on galera2, but even a SHOW STATUS hangs.
I cannot connect to the failing node from a remote network (Access denied for user...)
A client on galera2 connecting to localhost over TCP gets a more explicit error:
ERROR 1226 (42000): User 'exporter' has exceeded the 'max_user_connections' resource (current value: 5)
I can connect to the 2 other nodes over the network, but they are read-only.
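
Because even a local SHOW STATUS hangs on the failing node, any health check against it needs its own timeout, otherwise the checker hangs as well. A minimal Python sketch of such a probe follows. The function name `probe`, the 5-second default, and the example `mysql` invocation in the note below are illustrative assumptions, not part of the setup described above.

```python
import subprocess
import sys


def probe(cmd, timeout_s=5.0):
    """Run a health-check command and classify the result.

    Returns "ok" if the command exits with status 0, "error" if it
    fails or cannot be started, and "hung" if it does not finish
    within timeout_s seconds (the child process is killed then).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The symptom described above: the query never returns.
        return "hung"
    except OSError:
        # For example, the client binary is not installed.
        return "error"
    return "ok" if result.returncode == 0 else "error"
```

It could be called as `probe(["mysql", "--protocol=socket", "-e", "SHOW STATUS LIKE 'wsrep_ready'"])`. Treating "hung" as a separate result from "error" is a deliberate choice here, since a hung node and a node that refuses connections may call for different actions.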

On the failing node (galera2):

/var/log/mysql/mysql.log.1 only shows regular connection entries such as:

221007 9:32:39 153380 Connect exporter@localhost on using TCP/IP

To recover galera2:

The DB won't stop (systemctl restart or stop hangs).
I need to SIGKILL (kill -9) the mariadbd process and then run `systemctl restart mariadb`.
The cluster then immediately goes back online, until the next crash...
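
The recovery steps above (try a normal stop, fall back to SIGKILL, then start again) can be sketched in Python as follows. This is only an illustration under assumptions: the function name `force_restart`, the 60-second stop timeout, and the injectable `run` parameter are hypothetical, and the unit name `mariadb` is taken from the `systemctl restart mariadb` command quoted above.

```python
import subprocess


def force_restart(unit="mariadb", stop_timeout_s=60, run=subprocess.run):
    """Stop a systemd unit, SIGKILL it if the stop hangs, then start it.

    Returns the list of systemctl commands that were issued.
    """
    issued = []

    stop = ["systemctl", "stop", unit]
    issued.append(stop)
    try:
        run(stop, timeout=stop_timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # The normal stop did not return in time: kill the unit's processes.
        kill = ["systemctl", "kill", "--signal=SIGKILL", unit]
        issued.append(kill)
        run(kill, check=False)

    start = ["systemctl", "start", unit]
    issued.append(start)
    run(start, check=True)
    return issued
```

Passing `run` as a parameter makes it possible to dry-run the sequence with a stub before pointing it at a real node.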

I need to figure out:

Why does Galera not evict (fence) the failing galera2 node?
Why does MariaDB on galera2 hang?

Screenshot 2022-10-07 at 11-24-16 Linux Hosts Metrics Base - Dashboards - Grafana.png

Screenshot 2022-10-07 at 11-31-10 Galera_MariaDB - Overview - Dashboards - Grafana.png

Screenshot 2022-10-07 at 11-26-49 Linux Hosts Metrics Base - Dashboards - Grafana.png

François Delpierre

Oct 14, 2022, 12:47:11 PM

to codership

Got a similar freeze / read-only cluster this morning.

Just shutting down the failing node (always the same node, #2) fixes the problem. The cluster goes back to RW and performs normally. Galera is just not evicting the failing node.

I still have the logs for investigation, but decided to reinstall node #2 from scratch... Let's wait and see...

François Delpierre

Oct 21, 2022, 6:31:56 AM

to codership

Reinstallation of the node did not fix the problem. Sequence of events:

- A snapshot backup freezes the server for about 50s.

- The cluster continues to run for some 4-5 hours.

- After 4-5 hours, MariaDB seems to partly freeze, and this message appears on all 3 nodes when a user connects: "Got an error reading communication packets".

- The other nodes stay up, but turn read-only instead of evicting the faulty node.

- When I notice the problem, I just isolate the network of the faulty node, and it quickly gets evicted: WSREP: forgetting 0e02d3a3-8df4 (tcp://192.168.10.52:4567)
- The 2 remaining nodes then go back to normal operation (RW mode).

- Rebooting the faulty node takes a long time, likely because MariaDB has to be killed after a timeout.

Conclusion:

- Some queue or log may be building up for 4-5 hours before the cluster goes down completely, rejecting connections with "Got an error reading communication packets".

- Galera fails to evict a faulty node as long as it still answers. The Prometheus MySQL exporter, however, correctly identifies the node as down (it cannot connect to it as a regular user). Galera's node failure detection needs some improvement.

- Galera/MariaDB does not properly support snapshot backups, even ones that last less than 1 minute.

- I'll switch to power-off backups and give up on snapshot backups of a Galera cluster for good.
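
The suspicion above that something queues up for hours before the cluster stalls could be checked from the SQL side by watching a few Galera status variables on each node. The Python sketch below parses tab-separated `name<TAB>value` lines (the format `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"` prints) and flags suspicious values. The status variable names are real Galera ones; the function names and both thresholds are assumptions chosen for illustration, and nothing in this thread confirms that these variables were abnormal during the incidents.

```python
def parse_status(text):
    """Turn 'name<TAB>value' lines into a dict of strings."""
    status = {}
    for line in text.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2:
            status[parts[0].strip()] = parts[1].strip()
    return status


def assess(status, max_recv_queue=100, max_paused=0.1):
    """Return a list of human-readable problems; empty means healthy.

    max_recv_queue and max_paused are arbitrary example thresholds.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("not in primary component")
    if status.get("wsrep_ready") != "ON":
        problems.append("wsrep_ready is not ON")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not Synced")
    # Write-sets received from the cluster but not yet applied locally.
    if int(status.get("wsrep_local_recv_queue", 0)) > max_recv_queue:
        problems.append("receive queue is backing up")
    # Fraction of time replication was paused by flow control.
    if float(status.get("wsrep_flow_control_paused", 0.0)) > max_paused:
        problems.append("replication paused by flow control")
    return problems
```

Since SHOW STATUS itself hangs on the failing node once it is fully frozen, a check like this is mainly useful on the surviving nodes, or on node #2 during the hours between the snapshot and the freeze.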