Reinstallation of the node did not fix the problem. Sequence of events:
- A snapshot backup freezes the server for about 50s.
- The cluster continues to run for some 4-5 hours.
- After 4-5 hours, MariaDB on that node seems to partly freeze, and this message appears on all 3 nodes when a user connects: "Got an error reading communication packets".
- The other nodes stay up, but switch to read-only mode instead of evicting the faulty node.
- When I notice the problem, I isolate the faulty node from the network, and it is quickly evicted:
WSREP: forgetting 0e02d3a3-8df4 (tcp://192.168.10.52:4567)
- The 2 remaining nodes then return to normal operation (RW mode).
- Rebooting the faulty node takes a long time; MariaDB probably has to be killed after the shutdown timeout.
Conclusion:
- There might be some queue/log building up for some 4-5 hours before the cluster goes down completely, rejecting connections with "Got an error reading communication packets".
- Galera fails to evict a faulty node as long as it still answers. The
Prometheus mysql exporter, however, correctly reports the node as down (it
can't connect to it with a regular user). Some improvements are needed in
Galera node failure detection.
- Galera/MariaDB does not properly support snapshot backups, even if the freeze lasts less than 1 minute.
- I'll switch to power-off backups, and definitely forget about snapshot backups on a galera cluster.
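For anyone who wants an external check like the one the exporter gave me, here is a minimal sketch in Python. It is my own illustration, not something Galera or the exporter ships. The names classify_node, probe and fetch_status are hypothetical, and the receive-queue threshold of 100 is an arbitrary assumption. fetch_status is whatever function you supply that connects as a regular user (with a connect/read timeout) and returns the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%' as a dict. A node that cannot be queried is treated as down, even if it still answers on the replication port.

```python
from typing import Callable, Mapping


def classify_node(status: Mapping[str, str], max_recv_queue: int = 100) -> str:
    """Return 'down', 'degraded' or 'healthy' from wsrep status values.

    max_recv_queue is an arbitrary illustrative threshold, not a Galera default.
    """
    if not status:
        return "down"
    # wsrep_ready = OFF means the node refuses queries.
    if status.get("wsrep_ready") != "ON":
        return "down"
    # A node outside the primary component must not serve traffic.
    if status.get("wsrep_cluster_status") != "Primary":
        return "down"
    # Donor/Desynced, Joiner, Joined, ... : reachable but not fully in sync.
    if status.get("wsrep_local_state_comment") != "Synced":
        return "degraded"
    # A growing receive queue suggests the node is falling behind.
    if int(status.get("wsrep_local_recv_queue", "0")) > max_recv_queue:
        return "degraded"
    return "healthy"


def probe(fetch_status: Callable[[], Mapping[str, str]]) -> str:
    """Run the caller-supplied query; any connection error or timeout means 'down'."""
    try:
        return classify_node(fetch_status())
    except Exception:
        return "down"
```

A cron job or monitoring agent could call probe() against each node and fence the node (for example by cutting its network, as I did by hand) when it reports "down" several times in a row.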
Regards,