Hi.
I get a fair few deadlocks in my application and handle them by retrying transactions. This works well, except when recovery fails. Then I find that not one node, but the entire cluster becomes useless.
It's rare (like once a month) but it happened today. A deadlock occurred at the COMMIT phase of a transaction. The application tried to replay it, but it hung for nearly a minute and ultimately failed again.
The real problem was that all other connections from all nodes then began getting error 1205 (Lock wait timeout exceeded). Effectively this brought down the website. The nodes didn't crash, but were unresponsive. The cluster could not recover without manual intervention.
Is this to be expected, that a failed transaction would never recover itself? Is the error simply a build up of uncommitted transactions while the first node was hanging?
Pointers in avoiding this would be much appreciated. I am currently unable to guarantee the uptime of my site, and may have to revert to standalone MySQL.
I am running MariaDB 10.2.6 (InnoDB 5.7.14) with a Galera cluster of 3 nodes. Connections from the application are via PHP's mysqli extension.
Here's my config:
https://pastebin.com/raw/1e2bcrks
Thanks in advance