Entire MariaDB 10.1 Galera cluster locks - wsrep slave threads stuck on comitted (?)

1,296 views

Skip to first unread message

Niels Hendriks

unread,

May 4, 2017, 8:35:34 AM5/4/17

to codership

Hi,

We have a 3-node MariaDB 10.1 Galera cluster where we had 2 incidents of the entire cluster locking. When we check the processlist, we see a lot of comitted <id> threads that have been busy a very long time, I think these are the wsrep_slave_threads as there are 12 of them (which we have configured):

All the other queries are stuck on query_end across all 3 nodes. Manually killing queries kept them stuck in "killed" state. The only way to restart mysql was to do a kill -9 on it. Once we did so though on one of the nodes everything went fine again and we could restart the other node back up with an IST just fine.

The nodes have 24 CPU threads and 12 cores, so we set the wsrep_slave_threads to 12. The data is on an nvme disk and we use 10Gbit NICs / cables to connect the nodes with each other.

We have the following galera related settings (obviously without user/pass for sst)

wsrep_on=ON

wsrep_provider=/usr/lib/galera/libgalera_smm.so

wsrep_node_address=10.0.0.3

wsrep_cluster_address='gcomm://10.0.0.1,10.0.0.2,10.0.0.3'

wsrep_sst_method=xtrabackup-v2

wsrep_retry_autocommit=4

wsrep_slave_threads=12

wsrep_provider_options="gcache.size=10G"

binlog_format=ROW

innodb_autoinc_lock_mode=2

innodb_doublewrite=1

innodb_log_file_size=512M

innodb_flush_log_at_trx_commit=0

enforce_storage_engine=innodb

sql_mode="NO_AUTO_CREATE_USER"

I'm not sure what causes it, are there any things I can check now or do to prevent this? Or things I should check in particular if it happens again?

Thanks,

Niels

John Daniels

unread,

May 4, 2017, 10:14:13 AM5/4/17

to codership

This was happening to us for quite a while. For us this turned out to be an innodb gap deadlock.

With log error enabled you can verify if it's the same thing by checking if you get some log lines that end like:

lock_mode X locks rec but not gap

Fixed by changing isolation level off default to READ-COMMITTED

That being said, my troubleshooting method when it happened was something like:

show ENGINE innodb status;

show status like '%';

show processlist;

Show status like "Threads%";

Select substring_index(host, ':', 1) as host_name,state,count(*) from information_schema.processlist group by state,host_name;

Show global variables like '%';

- John

Niels Hendriks

unread,

May 4, 2017, 4:17:43 PM5/4/17

to codership

Thanks John, I checked our logs and I do not see anything regarding the errors you mention ("lock_mode X locks rec but not gap"). In fact, when it went unresponsive I did not see any logging from mysql at all to indicate the possible cause.

I'll use the commands you mentioned to see if I can determine the cause the next time it happens. I also saw wsrep has a debug option as well so I am thinking of giving that a try as well to see if it gives any additional information.

Cheers,

Niels

Niels Hendriks

unread,

Jun 23, 2017, 1:41:32 AM6/23/17

to codership

This just happened again for us, it's still a pain to debug since it's so rare. I've got output from the commands mentioned by John but I don't really see a clear indicator in there. I also enabled wsrep_debug as it happened and waited 1-2 minutes but no special messages that give a clear indication at that point either.

What I did notice was that the last 2 times it happened it was caused when our backup software ran - which also runs the query "flush tables" (on a single node in the cluster). Could that be the cause?

The processlist looks like this - changed the usernames / db names a bit since it's also some personal information:

+---------+----------------------+-----------------+-----------------------+---------+--------+-----------------------------+------------------------------------------------------------------------------------------------------+----------+

| 2345718 | usera1234 | 127.0.0.1:46503 | useradb123 | Query | 65721 | Sending data | SELECT <query> | 0.000 |

| 2580876 | backup | 127.0.0.1:44012 | NULL | Query | 6826 | Waiting for table flush | flush tables | 0.000 |

| 2581798 | usera1234 | 127.0.0.1:44698 | useradb123 | Query | 6577 | Waiting for table flush | SELECT <query> | 0.000 |

| 2584284 | usera1234 | 127.0.0.1:46188 | useradb123 | Query | 5892 | Waiting for table flush | SELECT <query> | 0.000 |

| 2587531 | usera1234 | 127.0.0.1:48241 | useradb123 | Query | 5001 | Waiting for table flush | SELECT <query> | 0.000 |

| 2590650 | usera1234 | 127.0.0.1:50285 | useradb123 | Query | 4138 | Waiting for table flush | SELECT <query> | 0.000 |

| 2594425 | usera1234 | 127.0.0.1:53210 | useradb123 | Query | 3121 | Waiting for table flush | SELECT <query> | 0.000 |

| 2597340 | usera1234 | 127.0.0.1:54767 | useradb123 | Query | 2309 | Waiting for table flush | SELECT <query> | 0.000 |

| 2599707 | usera1234 | 127.0.0.1:56133 | useradb123 | Query | 1649 | Waiting for table flush | SELECT <query> | 0.000 |

| 2602109 | userab12345678912345 | 127.0.0.1:57396 | userabb12345678912345 | Query | 969 | query end | UPDATE <query> | 0.000 |

| 2602118 | userc12345 | 127.0.0.1:57399 | userdb12345 | Query | 966 | query end | UPDATE <query> | 0.000 |

| 2602160 | userd1234567111 | 127.0.0.1:57420 | userdb1234567111 | Query | 956 | query end | UPDATE <query> | 0.000 |

| 2602193 | usere12345111 | 127.0.0.1:57429 | usere1d2345111 | Query | 945 | query end | DELETE <query> | 0.000 |

Another thing I noticed is that the wsrep threads (system user in processlist) during normal operation are pretty much always showing "committed <id>". Is this an indication that I do more queries than that can be committed? Do I need to increase the wsrep slave threads in that case?

Cheers,

Niels

Reply all

Reply to author

Forward

0 new messages