Retrying deadlocked transaction causes Lock Wait Timeout Exceeded

Tim

unread,

Aug 2, 2017, 10:59:42 AM8/2/17

to codership

Hi.

I get a fair few deadlocks in my application and handle them by retrying transactions. This works well, except when recovery fails. Then I find that not one node, but the entire cluster becomes useless.

It's rare (like once a month) but it happened today. A deadlock occurred at the COMMIT phase of a transaction. The application tried to replay it, but it hung for nearly a minute and ultimately failed again.

The real problem was that all other connections from all nodes then began getting error 1205 (Lock wait timeout exceeded). Effectively this brought down the website. The nodes didn't crash, but were unresponsive. The cluster could not recover without manual intervention.

Is this to be expected, that a failed transaction would never recover itself? Is the error simply a build up of uncommitted transactions while the first node was hanging?

Pointers in avoiding this would be much appreciated. I am currently unable to guarantee the uptime of my site, and may have to revert to standalone MySQL.

I am running MariaDB 10.2.6 (InnoDB 5.7.14) with a Galera cluster of 3 nodes. Connections from the application are via PHP's mysqli extension.
Here's my config: https://pastebin.com/raw/1e2bcrks

Thanks in advance

igre...@labattfood.com

unread,

Aug 2, 2017, 2:49:08 PM8/2/17

to codership

I'm not sure it's related, but we had a similar issue with using xtrabackup for a backup. It would lock the entire cluster briefly, and occasionally would not unlock the cluster. Using xtrabackup with the nolock option fixed our issue.

Tim

unread,

Aug 2, 2017, 4:15:30 PM8/2/17

to codership

Thanks for the suggestion. I'll look into the nolock option.

I'm using xtrabackup for SST, but I was under the impression this was only used for initial sync, not for normal operation.

Do you think it's possible that SST kicked in due to node replication falling too far behind? That may explain it if my SST is locking.

igre...@labattfood.com

unread,

Aug 2, 2017, 4:24:00 PM8/2/17

to codership

I'm not sure. I haven't experienced a cluster node getting too far behind. I would think it would be hard to fall too far behind unless a node was disconnected from the others for a long while.

Nathan Shirlberg

unread,

Aug 2, 2017, 10:05:40 PM8/2/17

to codership

We had the same issue a few months back with the xtrabackup causing a lock that caused every write to hang on all cluster nodes. Manually stopping all nodes and starting back up was the only way to recover. Since switching to the "no lock" option we have not had that issue again.

Tim

unread,

Aug 17, 2017, 2:04:03 PM8/17/17

to codership

I couldn't find a SST config option to pass "no lock" to xtrabackup. However, I don't think this is SST related.

I had the same problem today. It seemed to be aggravated by a client sending multi-threaded requests. So while one thread is recovering the deadlock, another gets into the same state.

Regardless of the exact cause, which I still don't understand, this spirals into an endless "Lock wait timeout exceeded" for all clients, effectively bringing the site offline.

Even if I can't avoid the original error, I can't have this bring the site down every time it happens. Is there any way to avoid the chain of events that causes this permanent lock?

Reply all

Reply to author

Forward