Certification lag

Dmitriy Ugnich

Oct 26, 2024, 5:36:01 AM

to codership

Hello.

I have a Galera cluster with 5 nodes. These are 5 similar servers, each with an AMD EPYC 9654 96-core CPU, 755 GiB of RAM and NVMe drives. I am migrating websites to them at the moment. The web servers are still on separate old servers, but I am trying to move MySQL to the new servers first. Previously these sites used standalone MariaDB servers.

While trying to balance connections across the cluster, I noticed a lot of update queries waiting for certification on all nodes for 10-20 seconds.

After that I tried sending all connections to each node in turn to see whether any one of them is particularly slow, and found that 3 nodes work perfectly fine without any lag, while 2 are quite slow: if connections are routed there, queries get stuck in "Waiting for certification" for 20-30 seconds most of the time. This makes the websites slow in everything that requires update queries. As a result, the recv queue grows and triggers flow control almost immediately.
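To make the per-node comparison less manual, something like the following could be used. This is only a minimal sketch: it assumes the counters have already been collected on each node with `SHOW GLOBAL STATUS LIKE 'wsrep_%'` and put into a dictionary. The node names, the sample numbers and the `factor` threshold are made up for illustration and are not measurements from my cluster.

```python
from statistics import median

# Galera status counters related to the receive queue and flow control.
COUNTERS = (
    "wsrep_local_recv_queue_avg",
    "wsrep_flow_control_paused",
)

def find_outliers(status_by_node, factor=3.0):
    """Return {node: {counter: value}} for values above factor * cluster median."""
    outliers = {}
    for counter in COUNTERS:
        # Status values come back from the server as strings, so convert them.
        values = {
            node: float(status[counter])
            for node, status in status_by_node.items()
            if counter in status
        }
        if not values:
            continue
        med = median(values.values())
        for node, value in values.items():
            if value > 0 and value > factor * med:
                outliers.setdefault(node, {})[counter] = value
    return outliers

# Hypothetical sample values, for illustration only.
sample = {
    "node1": {"wsrep_local_recv_queue_avg": "0.1", "wsrep_flow_control_paused": "0.0"},
    "node2": {"wsrep_local_recv_queue_avg": "0.2", "wsrep_flow_control_paused": "0.0"},
    "node3": {"wsrep_local_recv_queue_avg": "0.1", "wsrep_flow_control_paused": "0.01"},
    "node4": {"wsrep_local_recv_queue_avg": "45.0", "wsrep_flow_control_paused": "0.6"},
    "node5": {"wsrep_local_recv_queue_avg": "60.0", "wsrep_flow_control_paused": "0.7"},
}

result = find_outliers(sample)
for node in sorted(result):
    print(node, result[node])
```

Comparing against the cluster median rather than a fixed limit is just one possible choice; it only highlights nodes that stand out from the rest and says nothing about why they do.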

What was done so far during investigation:
1. Updated MariaDB from 10.11 to 11.0. Originally I wanted 11.4, but for some reason it was getting "broken pipe" on attempts to receive SST, so I ended up with 11.0, which worked fine.
2. Updated the kernel to the same version on all nodes, because I noticed a couple of them were running an older one. Currently it is 4.18.0-513.24.1.el8_9.x86_64 on Rocky Linux 8.9.
3. Tested disks, RAM and CPU. All 5 servers showed similar numbers in all tests, and a slow node is not overloaded in any way while the issue is happening: disks are mostly idle, and CPU load and RAM consumption are low. According to the sensors it is not overheating, and nothing is logged to dmesg.
4. Made sure that sysctl.conf and the ulimit settings are identical.
5. MariaDB settings are also identical. The config from one of the nodes is attached. Some items there may seem wrong, I know, but it was taken from the old server. It worked fine there and works fine on 3 of the 5 nodes. I've tried tuning those settings, but the best I could achieve was queries getting stuck for 10-15 seconds instead of 20-30. So that only addresses the symptoms, not the core issue, which is still unknown.
6. Checked the network connection between the nodes. It is a 100 Gbit LAN and works absolutely fine; the connection is fast and stable between all nodes.
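
The "settings are identical" checks in items 4 and 5 could also be done against the running servers rather than the config files. Below is a rough sketch, assuming the output of `SHOW GLOBAL VARIABLES` from each node has been loaded into a dictionary. The function name, the ignore list and the sample values are hypothetical and only illustrate the idea.

```python
# Variables that are expected to differ between nodes.
EXPECTED_TO_DIFFER = ("hostname", "server_id", "wsrep_node_name", "wsrep_node_address")

def diff_settings(vars_by_node, ignore=EXPECTED_TO_DIFFER):
    """Return {variable: {node: value}} for variables that differ across nodes."""
    names = set()
    for variables in vars_by_node.values():
        names.update(variables)
    diffs = {}
    for name in sorted(names):
        if name in ignore:
            continue
        # A variable missing on a node shows up as None and counts as a difference.
        values = {node: variables.get(name) for node, variables in vars_by_node.items()}
        if len(set(values.values())) > 1:
            diffs[name] = values
    return diffs

# Hypothetical sample values, for illustration only.
sample_vars = {
    "node1": {"wsrep_node_name": "node1", "wsrep_slave_threads": "32", "innodb_flush_log_at_trx_commit": "2"},
    "node2": {"wsrep_node_name": "node2", "wsrep_slave_threads": "32", "innodb_flush_log_at_trx_commit": "2"},
    "node4": {"wsrep_node_name": "node4", "wsrep_slave_threads": "1", "innodb_flush_log_at_trx_commit": "2"},
}

differences = diff_settings(sample_vars)
for name, values in differences.items():
    print(name, values)
```

This would catch a value that was changed at runtime with `SET GLOBAL` or overridden by another config file, which a comparison of the main config files alone would miss.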

I would appreciate any suggestions about what else I can check or any other insights.

Thanks in advance!