DEADLOCKs between two or more scheduler workers


Erwn Ltmann

Oct 27, 2016, 5:21:34 AM
to web2py-users
Hi,

I get deadlocks after running more than one worker for a while, e.g. the log output from both workers:

10:29:07,903 [31505] Scheduler - ERROR - Error cleaning up
10:29:07,904 [31505] Scheduler - ERROR - (1213, u'Deadlock found when trying to get lock; try restarting transaction')
10:29:07,904 [31499] Scheduler - ERROR - Error retrieving status
10:29:07,904 [31499] Scheduler - ERROR - (1213, u'Deadlock found when trying to get lock; try restarting transaction')


A typical output on the database (MariaDB) console is, for example:

------------------------
LATEST DETECTED DEADLOCK
------------------------
 
...
LOCK WAIT 3 lock struct(s), heap size 1184, 2 row lock(s)
...
UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2016-10-27 10:10:22',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 3, "sleep": 5, "distribution": null, "total": 152}' WHERE (scheduler_worker.worker_name = 'aradev2.ad-00.ent-01.adgroup#31499')
...
DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2016-10-27 10:10:07') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2016-10-27 10:06:37') AND (scheduler_worker.status <> 'ACTIVE')))

It seems that two workers try to do the same thing at the same time on the same scheduler_worker records.

I read the code of the send_heartbeat function and saw that two or more workers can end up executing the same steps.

Anyway, with an extra db.commit() after the worker's status update inside the send_heartbeat function in scheduler.py I could reduce the occurrences of the 'Error cleaning up' case; the 'Error retrieving status' case remains.
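For reference, a minimal sketch of where that extra commit goes, paraphrasing the relevant part of send_heartbeat in web2py 2.14.6 (the exact surrounding code may differ by version):

# inside Scheduler.send_heartbeat (paraphrased from web2py 2.14.6)
sw = db.scheduler_worker
db(sw.worker_name == self.worker_name).update(
    status=ACTIVE, last_heartbeat=now, worker_stats=self.w_stats)
db.commit()  # extra commit: releases the row lock on this worker's record
             # before the dead-worker cleanup (UPDATE/DELETE) runs further down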

Here is the database output for this case:

------------------------
LATEST DETECTED DEADLOCK
------------------------
...
LOCK WAIT 3 lock struct(s), heap size 1184, 2 row lock(s)
...
UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2016-10-27 11:03:06',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 4, "sleep": 5, "distribution": null, "total": 60}' WHERE (scheduler_worker.worker_name = 'aradev2.ad-00.ent-01.adgroup#14665')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 915 page no 3 n bits 80 index `PRIMARY` of table `arm`.`scheduler_worker` trx table locks 1 total table locks 3  trx id 14202929 lock_mode X locks rec but not gap waiting lock hold time 0 wait time before grant 0
*** (2) TRANSACTION:
TRANSACTION 14202927, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
7 lock struct(s), heap size 1184, 47 row lock(s)
MySQL thread id 60199, OS thread handle 0x7f73df089b00, query id 13827564 aradev2.ad-00.ent-01.adgroup 10.220.43.234 armdba updating
DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2016-10-27 11:02:51') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2016-10-27 10:59:21') AND (scheduler_worker.status <> 'ACTIVE')))

The first worker wants to update its own status while the other worker tries to delete obsolete workers in parallel.

Any idea what I should do?

Thx, Erwn

my environment
~~~~~~~~~~~~~~
   web2py: 2.14.6
 database: 10.1.14-MariaDB

Pierre

Oct 27, 2016, 11:07:18 AM
to web2py-users
I have got deadlocks too, but PostgreSQL knows how to resolve them, so I don't need to worry about it.

take a look here:

https://www.postgresql.org/docs/9.1/static/explicit-locking.html

/*-------------------excerpt------------------------------------------------------------------------------*/

13.3.3. Deadlocks

The use of explicit locking can increase the likelihood of deadlocks, wherein two (or more) transactions each hold locks that the other wants. For example, if transaction 1 acquires an exclusive lock on table A and then tries to acquire an exclusive lock on table B, while transaction 2 has already exclusive-locked table B and now wants an exclusive lock on table A, then neither one can proceed. PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete. (Exactly which transaction will be aborted is difficult to predict and should not be relied upon.)


/*-----------------------------------------------------------------------------------------------------------*/


Niphlod

Oct 27, 2016, 3:56:43 PM
to web2py-users
The only thing you can do is either beef up the database instance (fewer deadlocks because queries execute faster) or lower the db pressure (fewer workers, a higher heartbeat).
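For example, the heartbeat is set when the Scheduler is instantiated in a model file; a minimal sketch (the file name and the value of 10 seconds are only placeholders, web2py's default heartbeat is 3 seconds):

# in a scheduler model file, e.g. models/scheduler.py (the name is just a convention)
from gluon.scheduler import Scheduler

# a higher heartbeat makes each worker touch scheduler_worker less often,
# which lowers contention on that table
scheduler = Scheduler(db, heartbeat=10)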

Erwn Ltmann

Oct 31, 2016, 9:04:24 AM
to web2py-users
Hi,

thank you for your reply.

@Pierre: MariaDB (in my case) handles deadlocks automatically too. Good to know I don't have to worry about that.

@Niphlod: I tried to beef up my database host, with no effect. Another approach is to avoid the situation in the first place. I did that with an extra condition in the send_heartbeat worker function:

# skip the UPDATE/DELETE entirely when there are no dead workers (the common case)
if len(db.executesql(dead_workers_name)) > 0:
    db(
        (st.assigned_worker_name.belongs(dead_workers_name)) &
        (st.status == RUNNING)
    ).update(assigned_worker_name='', status=QUEUED)
    dead_workers.delete()
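For context, dead_workers and dead_workers_name are already defined a few lines above in send_heartbeat, roughly like this (paraphrased from web2py 2.14.6; the expiration/departure arithmetic is elided):

# dead-worker cleanup in send_heartbeat (sketch)
sw, st = db.scheduler_worker, db.scheduler_task
dead_workers = db(
    ((sw.last_heartbeat < expiration) & (sw.status == ACTIVE)) |
    ((sw.last_heartbeat < departure) & (sw.status != ACTIVE))
)
dead_workers_name = dead_workers._select(sw.worker_name)  # raw SQL string, not executed yet

The executesql(dead_workers_name) call runs that select once; when it returns no rows, which is the common case with a small, stable worker pool, the UPDATE and DELETE are skipped entirely.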



Erwn

Niphlod

Oct 31, 2016, 10:02:42 AM
to web2py-users
Sorry, but it doesn't really make sense.
You're executing the same query twice (the call wrapped in len() and the actual .delete() call), which is the opposite of what you want when trying to relax pressure on the database.

Erwn Ltmann

Nov 1, 2016, 9:42:32 AM
to web2py-users
Hi Niphlod,

you are right: I do run an extra database SELECT to get the list of dead workers.

But usually I have, for example, four workers. They are static and rarely terminate. So in the common case I hit the database only once, to fetch the list of dead workers, and that list is almost always empty, meaning there is nothing left to do. The body of my condition runs so rarely that I can ignore it in the runtime estimate. The original code always issues both statements (the UPDATE and the DELETE) regardless of whether any worker is dead, so my change trades those two statements for a single SELECT in the common case.

Anyway, running the workers with my extra condition eliminated the deadlock cases. It works very well. I am happy :)

Thx a lot.
Erwn

Niphlod

Nov 2, 2016, 7:06:00 AM
to web2py-users
ahem .... until the modification "makes sense", neither it nor the discussion about it really helps anyone ^_^ .
The fact that you have 4 workers and a congestion problem suggests that your db is on the low side of the specs needed for a normal server.
These kinds of issues start to show with 40-50 workers and a disk on a USB drive.... not on a production server (even one built 10 years ago).