Increased Deadlocks?


igre...@labattfood.com

Jul 28, 2017, 12:31:11 PM
to codership
Hello,

We recently upgraded from MariaDB 10.1.17 to MariaDB 10.1.25.  The upgrade went smoothly in our development environment, so we pushed it to production as well.  However, we quickly realized that we were seeing a significant increase in the number of deadlocks.  Both dev and prod are 3-node clusters; each node is in a different city, with a 10ms ping time between nodes (triangle).  We have been running successfully in this WAN setup for about a year now.  To the best of my knowledge we did not make any configuration or code changes during the upgrade.

I have been trying to diagnose this issue for about a week now, but I have run out of ideas to try.  I may have to downgrade, but I would like to figure out what is going on first.  The patterns below, which worked fine before, are what seem to be triggering the new deadlock behavior.  I notice that most of the errors occur on the transaction commit (rather than on statements outside of transactions).

Do you have any thoughts or ideas on how to reduce the deadlocks?  I appreciate your help.  For the time being I am putting try/catch protections around these spots, but I would like to understand and fix the real underlying issue.

1) Purging logs - We have a logs table whose primary key is a time-prefixed UUID in char(36) format (examples: 015d84c3-fb6d-4976-9d50-e97815dc62fe, 015d84c4-3fdc-4012-be41-c804ffb3d1bc).
A loop runs every ~10 minutes and deletes at most 1000 log messages.
Pseudocode:

start_transaction()
for log in connection.exec("SELECT * FROM logs WHERE timestamp < oldest_timestamp LIMIT 1000"):
    log.delete()
commit_transaction()

This code never caused a deadlock before, but now it deadlocks 50% or more of the time.  It does not involve any other actors: the logs being deleted are old and not used by any other threads or active connections, so I'm not sure how it could end up in a deadlock.  Only one purge pass runs at any given time.
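
In case it helps, here is roughly what the purge looks like in real code.  This is only a sketch assuming a Python DB-API style connection; the column names ("id", "timestamp") are hypothetical stand-ins for our schema, and the actual implementation differs in details.

def purge_logs_once(conn, oldest_timestamp):
    # Delete at most 1000 old log rows in a single transaction, by primary key.
    cur = conn.cursor()
    cur.execute(
        "SELECT id FROM logs WHERE timestamp < %s LIMIT 1000",
        (oldest_timestamp,),
    )
    ids = [row[0] for row in cur.fetchall()]
    for log_id in ids:
        cur.execute("DELETE FROM logs WHERE id = %s", (log_id,))
    conn.commit()  # this commit is where the deadlock error now shows up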

2) Saving audits
Pseudocode:
start_transaction()
connection.exec("INSERT INTO audits ...")
for line_item in line_items:
    connection.exec("INSERT INTO audit_line_items ...")            <----- often fails here with a deadlock
commit_transaction()

Once again, this code creates new audits and line items that shouldn't be referenced by any other connections.  The primary keys are UUIDs.  This one tends to fail on the insert of the line items.
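
For reference, the same pattern in rough code form.  Again a DB-API style sketch with hypothetical column names, not our actual implementation:

def save_audit(conn, audit, line_items):
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO audits (id, created_at, detail) VALUES (%s, %s, %s)",
        (audit["id"], audit["created_at"], audit["detail"]),
    )
    # Every line item is inserted in the same transaction as the parent audit row.
    cur.executemany(
        "INSERT INTO audit_line_items (id, audit_id, detail) VALUES (%s, %s, %s)",
        [(li["id"], audit["id"], li["detail"]) for li in line_items],
    )
    conn.commit()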

3) Updating task lock:

We have a table that we use as a locking mechanism.  I expect this table to be highly contended and to cause occasional deadlocks, but until now it has not been a problem.  I updated the code to handle this case (with try/catch), but that feels like working around the problem rather than fixing it.
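
For context, the lock update is essentially this shape (hypothetical task_locks table and column names; our real schema differs):

def try_acquire_lock(conn, task_name, owner):
    cur = conn.cursor()
    cur.execute(
        "UPDATE task_locks SET owner = %s, acquired_at = NOW() "
        "WHERE task = %s AND owner IS NULL",
        (owner, task_name),
    )
    conn.commit()
    return cur.rowcount == 1  # True only if we won the lock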

Here is a count of the deadlocks we saw before and after the update.

Do you have any thoughts or suggestions on what might be causing this, or on how to fix it?

Thank you very much.  We appreciate your help.


[Inline image: deadlock counts before and after the upgrade]

igre...@labattfood.com

Jul 28, 2017, 12:49:52 PM
to codership
My apologies regarding case #2.  After further digging, it appears the deadlock is actually occurring on the transaction commit rather than on the line-item insert.

igre...@labattfood.com

Jul 28, 2017, 12:52:18 PM
to codership
[Attachment: deadlocks.png]

igre...@labattfood.com

Jul 28, 2017, 12:55:26 PM
to codership
Maybe the issue is the large number of certification failures?
 
MariaDB [(none)]> show status like 'wsrep%';
+------------------------------+-----------------------------------------------------------+
| Variable_name                | Value                                                     |
+------------------------------+-----------------------------------------------------------+
| wsrep_apply_oooe             | 0.042225                                                  |
| wsrep_apply_oool             | 0.010026                                                  |
| wsrep_apply_window           | 1.047500                                                  |
| wsrep_causal_reads           | 755841                                                    |
| wsrep_cert_deps_distance     | 24.651175                                                 |
| wsrep_cert_index_size        | 90                                                        |
| wsrep_cert_interval          | 0.309751                                                  |
| wsrep_cluster_conf_id        | 66                                                        |
| wsrep_cluster_size           | 3                                                         |
| wsrep_cluster_state_uuid     | 06b8cd86-bd6b-11e6-bb84-c2ba754df11a                      |
| wsrep_cluster_status         | Primary                                                   |
| wsrep_commit_oooe            | 0.000000                                                  |
| wsrep_commit_oool            | 0.001090                                                  |
| wsrep_commit_window          | 1.007495                                                  |
| wsrep_connected              | ON                                                        |
| wsrep_desync_count           | 0                                                         |
| wsrep_evs_delayed            |                                                           |
| wsrep_evs_evict_list         |                                                           |
| wsrep_evs_repl_latency       | 0.0100922/0.0500437/0.674073/0.116579/518                 |
| wsrep_evs_state              | OPERATIONAL                                               |
| wsrep_flow_control_paused    | 0.000343                                                  |
| wsrep_flow_control_paused_ns | 1001237222                                                |
| wsrep_flow_control_recv      | 5                                                         |
| wsrep_flow_control_sent      | 0                                                         |
| wsrep_gcomm_uuid             | 715892a7-73ae-11e7-a174-63b615fe3b57                      |
| wsrep_incoming_addresses     | 172.16.13.114:3306,172.16.23.114:3306,172.16.220.114:3306 |
| wsrep_last_committed         | 133495251                                                 |
| wsrep_local_bf_aborts        | 3                                                         |
| wsrep_local_cached_downto    | 133442233                                                 |
| wsrep_local_cert_failures    | 619                                                       |
| wsrep_local_commits          | 32567                                                     |
| wsrep_local_index            | 0                                                         |
| wsrep_local_recv_queue       | 0                                                         |
| wsrep_local_recv_queue_avg   | 0.009360                                                  |
| wsrep_local_recv_queue_max   | 9                                                         |
| wsrep_local_recv_queue_min   | 0                                                         |
| wsrep_local_replays          | 0                                                         |
| wsrep_local_send_queue       | 0                                                         |
| wsrep_local_send_queue_avg   | 0.003092                                                  |
| wsrep_local_send_queue_max   | 4                                                         |
| wsrep_local_send_queue_min   | 0                                                         |
| wsrep_local_state            | 4                                                         |
| wsrep_local_state_comment    | Synced                                                    |
| wsrep_local_state_uuid       | 06b8cd86-bd6b-11e6-bb84-c2ba754df11a                      |
| wsrep_protocol_version       | 7                                                         |
| wsrep_provider_name          | Galera                                                    |
| wsrep_provider_vendor        | Codership Oy <in...@codership.com>                         |
| wsrep_provider_version       | 25.3.20(r3703)                                            |
| wsrep_ready                  | ON                                                        |
| wsrep_received               | 20620                                                     |
| wsrep_received_bytes         | 77166908                                                  |
| wsrep_repl_data_bytes        | 104486655                                                 |
| wsrep_repl_keys              | 272349                                                    |
| wsrep_repl_keys_bytes        | 2942650                                                   |
| wsrep_repl_other_bytes       | 0                                                         |
| wsrep_replicated             | 33187                                                     |
| wsrep_replicated_bytes       | 143744401                                                 |
| wsrep_thread_count           | 5                                                         |
+------------------------------+-----------------------------------------------------------+
58 rows in set (0.01 sec)
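
A rough back-of-the-envelope from the counters above (assuming they have all been accumulating since the same server start):

cert_failures = 619      # wsrep_local_cert_failures
local_commits = 32567    # wsrep_local_commits
bf_aborts = 3            # wsrep_local_bf_aborts
failure_rate = cert_failures / (local_commits + cert_failures + bf_aborts)
print(f"{failure_rate:.1%}")  # about 1.9% of local transactions failed certification

If I understand correctly, certification failures are reported back to the client as deadlock errors, so that rate would line up with the commit-time deadlocks we are seeing.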

igre...@labattfood.com

Jul 31, 2017, 9:30:18 AM
to codership
Just to let everyone know, we were able to wrap these operations in retry loops to get past the deadlocks, so we are OK for now.  I'm still curious whether anything changed in the past year or so that would affect the frequency of deadlocks, but that is a low priority.
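
For anyone who hits the same thing, the retry wrapper is roughly the sketch below.  It assumes the driver exposes a MySQL error number and that 1213 (ER_LOCK_DEADLOCK) is what comes back on these failures; the real code differs in details.

import time

DEADLOCK_ERRNO = 1213  # ER_LOCK_DEADLOCK; Galera certification failures surface this way too

def run_with_retry(conn, work, max_attempts=5):
    # "work" runs the statements and commits; on a deadlock we roll back and retry.
    for attempt in range(1, max_attempts + 1):
        try:
            return work(conn)
        except Exception as exc:
            if getattr(exc, "errno", None) != DEADLOCK_ERRNO or attempt == max_attempts:
                raise
            conn.rollback()
            time.sleep(0.1 * attempt)  # brief backoff before retrying

Used like run_with_retry(conn, lambda c: purge_logs_once(c, oldest_timestamp)) around the purge sketch from my first message.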

Thank you. 