Advisory lock deadlock


David Rosenstrauch

Jun 7, 2017, 11:23:57 AM
to citus-users
I have a Spark streaming job that's writing to a Citus database. The
job performs writes to Citus using JDBC, prepared statements, batches,
and UPSERTs. (I.e., open a JDBC connection, create a prepared
statement, use the prep statement to perform a batch of UPSERTs using
addBatch(), then perform all of the writes using executeBatch().)
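In rough outline, the write path looks like this (a sketch only: the table name, columns, and `events` schema below are made up for illustration, not our real schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class CitusUpsertBatch {
    // Illustrative UPSERT; the real table and columns differ.
    static final String UPSERT_SQL =
        "INSERT INTO events (event_id, payload) VALUES (?, ?) "
      + "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload";

    // Queue one UPSERT per row, then send the whole batch in one call.
    static void writeBatch(Connection conn, long[] ids, String[] payloads)
            throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            for (int i = 0; i < ids.length; i++) {
                ps.setLong(1, ids[i]);
                ps.setString(2, payloads[i]);
                ps.addBatch();       // queue one UPSERT
            }
            ps.executeBatch();       // perform all of the queued writes
        }
    }

    public static void main(String[] args) {
        System.out.println(UPSERT_SQL);
    }
}
```

A real run would open a Connection via DriverManager and pass it to writeBatch; I've left that out since the connection details are environment-specific.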

However, we've started running into some deadlock issues, and I'm not
sure what's causing them. Our Spark job has been crashing with the
following errors:

2017-06-06 16:05:17.718 UTC [36661] dmx@dmx ERROR: deadlock detected
2017-06-06 16:05:17.718 UTC [36661] dmx@dmx DETAIL: Process 36661 waits
for ExclusiveLock on advisory lock [16649,0,102440,5]; blocked by
process 36662.
Process 36662 waits for ExclusiveLock on advisory lock
[16649,0,102439,5]; blocked by process 36661.

But I'm never creating any advisory locks myself, and I'm not sure where
they're coming from. Could Citus somehow be creating advisory locks
behind the scenes? If so, how might I go about finding - and fixing -
this issue?

Thanks,

DR

Andres Freund

Jun 7, 2017, 1:05:38 PM
to David Rosenstrauch, citus-users
Hi David,

A quick inline response:

On 2017-06-07 11:23:49 -0400, David Rosenstrauch wrote:
> 2017-06-06 16:05:17.718 UTC [36661] dmx@dmx ERROR: deadlock detected
> 2017-06-06 16:05:17.718 UTC [36661] dmx@dmx DETAIL: Process 36661 waits for
> ExclusiveLock on advisory lock [16649,0,102440,5]; blocked by process 36662.
> Process 36662 waits for ExclusiveLock on advisory lock
> [16649,0,102439,5]; blocked by process 36661.
>
> But I'm never creating any advisory locks myself, and I'm not sure where
> they're coming from. Could Citus somehow be creating advisory locks behind
> the scenes? If so, how might I go about finding - and fixing - this issue?

Citus uses advisory locks internally to implement locks on shards (and
other things). The last number in the lock identifier ('5') here
identifies it as a shard lock (1 and 2 would be plain advisory locks
acquirable from SQL). The second-to-last number is the id of the shard
the lock is acquired on.

What you're presumably seeing is that you're acquiring multiple locks
inside the same transaction in an order that's not consistent between
the different processes. It's a bit hard to give further advice with
the available information, but perhaps that already helps?
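To make the pattern concrete, here is a toy sketch (the key values are just the shard ids from your error message, reused as stand-ins). Sorting each batch by a stable key before issuing it is one common mitigation for ABBA deadlocks, not something Citus itself requires, and since rows map to shards by hashing the distribution column it reduces rather than eliminates inconsistent orderings:

```java
import java.util.Arrays;

public class LockOrdering {
    // Sorting a batch's keys gives every process the same acquisition
    // order, which removes the classic ABBA deadlock between two batches
    // that touch the same pair of shards in opposite order.
    static long[] sortKeys(long[] keys) {
        long[] sorted = keys.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        long[] batchA = {102440L, 102439L};  // process 36661's order
        long[] batchB = {102439L, 102440L};  // process 36662's order
        System.out.println(Arrays.toString(sortKeys(batchA)));  // [102439, 102440]
        System.out.println(Arrays.toString(sortKeys(batchB)));  // [102439, 102440]
        // Both batches now lock 102439 before 102440, so neither waits
        // on a lock the other already holds.
    }
}
```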

Regards,

Andres

David Rosenstrauch

Jun 7, 2017, 2:50:55 PM
to citus...@googlegroups.com
Hi Andres. That helps a bit. (Good to be able to pinpoint that Citus
is the source of these errors.) However, I'm not sure I'm closer to a
solution.

From the information in your message, and the contents of the error
message I saw, it sounds like what happened is that 2 different
processes were trying to acquire locks on the same 2 shards, and
deadlocked each other. But this raises a bunch of additional questions
for me:

* When does Citus lock a shard? I'm not clear how exactly I'm managing
to get a deadlock, as I'm not using transactions (I'm using JDBC
auto-commit). So it seems surprising that I would ever be locking
anything for any significant period of time, let alone having multiple
processes trying to lock the same things at the same time.

* Why does Citus lock an entire shard? Wouldn't locking at the row
level be less restrictive?

* Is there any way to turn off (or relax) this shard-level locking?

* When I received this error, I was also experiencing another issue
where I had "max_connections" set too low on my worker machines, and so
the master was getting errors where it was being blocked from connecting
to the workers. Any chance that might have been part of the cause of
this issue? (I.e., master acquires lock, tries to update worker, fails,
another process does the same and results in a deadlock.)

Thanks,

DR

Sai Srirampur

Jun 7, 2017, 8:14:58 PM
to David Rosenstrauch, citus-users
Hey David,

Thanks for getting back.

Replies inlined below.

On Wed, Jun 7, 2017 at 11:50 AM, David Rosenstrauch <dar...@darose.net> wrote:
> Hi Andres.  That helps a bit.  (Good to be able to pinpoint that Citus is the source of these errors.)  However, I'm not sure I'm closer to a solution.
>
> From the information in your message, and the contents of the error message I saw, it sounds like what happened is that 2 different processes were trying to acquire locks on the same 2 shards, and deadlocked each other.  But this raises a bunch of additional questions for me:
>
> * When does Citus lock a shard?  I'm not clear how exactly I'm managing to get a deadlock, as I'm not using transactions (I'm using JDBC auto-commit).  So it seems surprising that I would ever be locking anything for any significant period of time, let alone having multiple processes trying to lock the same things at the same time.
Citus locks shards during operations such as UPDATEs, shard moves, and shard repairs.


> * Why does Citus lock an entire shard?  Wouldn't locking at the row level be less restrictive?
>
> * Is there any way to turn off (or relax) this shard-level locking?
One example I see as relevant to your use case: when there is an UPDATE/UPSERT on a table, those operations are inherently not commutative, so for each operation we acquire locks on all of the shard replicas the DML is routed to, and release them only after the operation has completed on all of the replicas. In other words, we let one UPDATE/UPSERT finish before allowing the next to proceed. This is how consistency is maintained across the replicas, so row-level locking wouldn't suffice, and the shard-level locking can't be turned off or relaxed.

> * When I received this error, I was also experiencing another issue where I had "max_connections" set too low on my worker machines, and so the master was getting errors where it was being blocked from connecting to the workers.  Any chance that might have been part of the cause of this issue?  (I.e., master acquires lock, tries to update worker, fails, another process does the same and results in a deadlock.)

That could be a contributing factor. Could you share the following, so that we can give you an informed answer on why you are seeing this issue:

1) Enable query logging (https://www.postgresql.org/docs/current/static/runtime-config-logging.html) and share the output. This would help us see which SQL queries are causing the deadlock.
2) What is your replication factor set to? SHOW citus.shard_replication_factor; will output it.
3) Are there multiple JDBC connections trying to do batch UPSERTs (or other operations) concurrently?

Cheers,
Sai

> Thanks,
>
> DR

David Rosenstrauch

Jun 8, 2017, 10:19:13 AM
to citus...@googlegroups.com
Thanks for the response Sai - very helpful.

Yes, UPSERTs seem to be the issue. Since I last emailed, I read up in
your docs on the shard locks and the
citus.all_modifications_commutative setting. We're currently using a
replication factor of 1, which should make it safe for us to turn on
all_modifications_commutative. I'll test with that and see if it makes
the issue go away.
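Concretely, I'm planning to set it per JDBC session before the batch runs, something like this (a sketch only; the class and method names are mine, and connection handling is omitted):

```java
import java.sql.Connection;
import java.sql.Statement;

public class CommutativeWrites {
    // GUC name from the Citus docs; with replication factor 1 there are
    // no replicas to keep in lock-step, so relaxing the ordering is safe.
    static final String ENABLE_COMMUTATIVE =
        "SET citus.all_modifications_commutative TO on";

    // Run once on each session before issuing the batch UPSERTs.
    static void enable(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute(ENABLE_COMMUTATIVE);
        }
    }

    public static void main(String[] args) {
        System.out.println(ENABLE_COMMUTATIVE);
    }
}
```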

Thanks,

DR