This just happened again for us. It's still a pain to debug because it's so rare. I've collected the output from the commands John mentioned, but I don't see a clear indicator in there. I also enabled wsrep_debug while it was happening and waited 1-2 minutes, but no messages appeared that gave a clear indication either.
What I did notice is that the last two times it happened, it coincided with our backup software running, which also issues a "flush tables" query (on a single node in the cluster). Could that be the cause?
The processlist looks like this (I changed the user and database names a bit, since they contain some personal information):
+---------+----------------------+-----------------+-----------------------+---------+--------+-----------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+---------+----------------------+-----------------+-----------------------+---------+--------+-----------------------------+------------------------------------------------------------------------------------------------------+----------+
| 1 | system user | | NULL | Sleep | 660096 | wsrep aborter idle | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 970 | applied write set 185104091 | NULL | 0.000 |
| 3 | system user | | NULL | Sleep | 970 | applied write set 185104095 | NULL | 0.000 |
| 4 | system user | | NULL | Sleep | 970 | Waiting for table flush | NULL | 0.000 |
| 5 | system user | | NULL | Sleep | 970 | applied write set 185104090 | NULL | 0.000 |
| 6 | system user | | NULL | Sleep | 970 | applied write set 185104089 | NULL | 0.000 |
| 7 | system user | | NULL | Sleep | 970 | applied write set 185104092 | NULL | 0.000 |
| 9 | system user | | NULL | Sleep | 970 | applied write set 185104086 | NULL | 0.000 |
| 10 | system user | | NULL | Sleep | 970 | applied write set 185104088 | NULL | 0.000 |
| 11 | system user | | NULL | Sleep | 970 | applied write set 185104087 | NULL | 0.000 |
| 12 | system user | | NULL | Sleep | 971 | Waiting for table flush | NULL | 0.000 |
| 13 | system user | | NULL | Sleep | 970 | Waiting for table flush | NULL | 0.000 |
| 14 | system user | | NULL | Sleep | 970 | applied write set 185104094 | NULL | 0.000 |
| 2345718 | usera1234 | 127.0.0.1:46503 | useradb123 | Query | 65721 | Sending data | SELECT <query> | 0.000 |
| 2580876 | backup | 127.0.0.1:44012 | NULL | Query | 6826 | Waiting for table flush | flush tables | 0.000 |
| 2581798 | usera1234 | 127.0.0.1:44698 | useradb123 | Query | 6577 | Waiting for table flush | SELECT <query> | 0.000 |
| 2584284 | usera1234 | 127.0.0.1:46188 | useradb123 | Query | 5892 | Waiting for table flush | SELECT <query> | 0.000 |
| 2587531 | usera1234 | 127.0.0.1:48241 | useradb123 | Query | 5001 | Waiting for table flush | SELECT <query> | 0.000 |
| 2590650 | usera1234 | 127.0.0.1:50285 | useradb123 | Query | 4138 | Waiting for table flush | SELECT <query> | 0.000 |
| 2594425 | usera1234 | 127.0.0.1:53210 | useradb123 | Query | 3121 | Waiting for table flush | SELECT <query> | 0.000 |
| 2597340 | usera1234 | 127.0.0.1:54767 | useradb123 | Query | 2309 | Waiting for table flush | SELECT <query> | 0.000 |
| 2599707 | usera1234 | 127.0.0.1:56133 | useradb123 | Query | 1649 | Waiting for table flush | SELECT <query> | 0.000 |
| 2602109 | userab12345678912345 | 127.0.0.1:57396 | userabb12345678912345 | Query | 969 | query end | UPDATE <query> | 0.000 |
| 2602118 | userc12345 | 127.0.0.1:57399 | userdb12345 | Query | 966 | query end | UPDATE <query> | 0.000 |
| 2602160 | userd1234567111 | 127.0.0.1:57420 | userdb1234567111 | Query | 956 | query end | UPDATE <query> | 0.000 |
| 2602193 | usere12345111 | 127.0.0.1:57429 | usere1d2345111 | Query | 945 | query end | DELETE <query> | 0.000 |
<many more lines in state "query end">
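To make a listing like this easier to read, the states can be tallied with a small script. The following is only a sketch in Python, with a handful of rows copied by hand from the output above; the tuple layout and the `summarize` helper are made up for illustration and are not part of any MariaDB or Galera tooling. It counts threads per state and picks out the longest-running statement that is not itself waiting on the flush, which may be worth a look as a possible blocker.

```python
from collections import Counter

# Hypothetical layout: (id, time_in_seconds, state, info).
# The values are a few rows copied from the processlist above.
ROWS = [
    (4, 970, "Waiting for table flush", None),
    (2345718, 65721, "Sending data", "SELECT <query>"),
    (2580876, 6826, "Waiting for table flush", "flush tables"),
    (2581798, 6577, "Waiting for table flush", "SELECT <query>"),
    (2599707, 1649, "Waiting for table flush", "SELECT <query>"),
    (2602109, 969, "query end", "UPDATE <query>"),
]

FLUSH_WAIT = "Waiting for table flush"


def summarize(rows):
    """Return (threads per state, oldest statement not waiting on the flush)."""
    counts = Counter(state for _, _, state, _ in rows)
    # Only consider threads that run a statement and are not flush waiters.
    running = [r for r in rows if r[3] is not None and r[2] != FLUSH_WAIT]
    oldest = max(running, key=lambda r: r[1], default=None)
    return counts, oldest


counts, oldest = summarize(ROWS)
print(counts.most_common())
print("oldest statement not waiting on the flush:", oldest)
```

With the full processlist fed in instead of the sample rows, the same tally would show how many threads are stuck in each state at the moment of the hang.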
Another thing I noticed is that during normal operation the wsrep threads (system user in the processlist) pretty much always show "committed <id>". Does this indicate that I'm running more queries than can be committed? If so, do I need to increase the number of wsrep slave threads?
Cheers,
Niels