deadlocks and what I did about them

Lex Linden

Jul 6, 2011, 5:00:24 PM
to clusto
We just ran into a pretty huge blocker: when a handful of processes all
tried to do similar things with clusto at the same time, they deadlocked.

Here's how it happened:

Each process is following the same set of steps:

1. allocate a new name from a SimpleNameManager
2. instantiate a driver using that name

This is much like what SimpleEntityNameManager does, except we had to
roll our own. See LindenServer in here if you're interested:

http://github.com/lindenlab/python-llclusto-linden

The deadlock happens because of how clusto versioning is handled. Each
transaction looks like this:

1. Insert a new row into clustoversioning (acquiring a DB lock only on
this row).
2. Increment the row in the counters table, acquiring a DB lock on it.
3. Insert an entry into the entities table, with version = (SELECT
COALESCE(MAX(clustoversioning.version), 1)).

The first process runs through step 2, so it holds the lock on the
counter row. Then the second process comes along and runs step 1,
inserting a new row into clustoversioning and locking it. The first
process then tries to run the subselect in step 3 and blocks on the row
lock that the second transaction holds in clustoversioning: the
subselect scans the entire primary key, so it gets stuck on that row.
Then the second transaction tries to access the counter, which the
first transaction still holds, and boom, deadlock.
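
As a rough illustration of the interleaving described above, here is a minimal Python sketch that models the two row locks as plain threading.Lock objects. Nothing here touches a database: the lock names, the simulate() helper and the timeout are all assumptions made for the example, and the timeout only exists so the sketch terminates instead of hanging the way the real transactions did.

```python
import threading

def simulate(timeout=0.2):
    """Model two transactions that take the same two locks in opposite order."""
    counter_lock = threading.Lock()      # stands in for the counters row lock
    version_row_lock = threading.Lock()  # stands in for the new clustoversioning row lock
    both_hold = threading.Barrier(2)     # both transactions hold their first lock
    both_tried = threading.Barrier(2)    # both have attempted their second lock
    got_second_lock = {}

    def transaction(name, first, second):
        with first:
            both_hold.wait()
            # Each side now wants the lock the other side is holding.
            acquired = second.acquire(timeout=timeout)
            got_second_lock[name] = acquired
            # Keep holding the first lock until both attempts have finished.
            both_tried.wait()
            if acquired:
                second.release()

    threads = [
        # First transaction: holds the counter, then needs the version row.
        threading.Thread(target=transaction,
                         args=("first", counter_lock, version_row_lock)),
        # Second transaction: holds its version row, then needs the counter.
        threading.Thread(target=transaction,
                         args=("second", version_row_lock, counter_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second_lock

if __name__ == "__main__":
    print(simulate())
```

Neither side can get its second lock while the other keeps its first, which is the circular wait that the database reports as a deadlock.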

To solve this, I store the version number created in step 1 and use it
throughout the transaction:

https://github.com/lexlinden/clusto/commit/50bfbe935b89a609550477008f0705152492681e
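
The idea of reusing the version from step 1 can be sketched roughly as follows. This is not the code from the commit above: it uses sqlite3 from the standard library and a heavily simplified, hypothetical two-table schema, and create_entity() is a made-up helper. It only shows the shape of the change, which is to remember the version row the transaction just inserted instead of asking the database for MAX(version) later.

```python
import sqlite3

# Simplified, hypothetical schema; the real clusto tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clustoversioning (
    version INTEGER PRIMARY KEY AUTOINCREMENT,
    description TEXT
);
CREATE TABLE entities (
    entity_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    version INTEGER
);
""")

def create_entity(conn, name):
    """Insert a version row, then reuse its id for the entity insert."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT INTO clustoversioning (description) VALUES (?)",
            ("create " + name,),
        )
        version = cur.lastrowid  # the version this transaction created
        # No subselect over clustoversioning here, so no scan of other
        # transactions' uncommitted rows is needed.
        conn.execute(
            "INSERT INTO entities (name, version) VALUES (?, ?)",
            (name, version),
        )
    return version
```

Because the entity insert no longer reads clustoversioning at all, it has no reason to wait on a row that another transaction inserted in its own step 1.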

It works and unit tests pass, but I'd love some input on whether this is
horrible/broken.

Ron

Jul 6, 2011, 8:23:55 PM
to clusto
There are some concurrency-related unit tests. It turns out to be
tricky because sqlite and postgres tend to handle things better than
mysql when concurrent transactions are involved, so there are some
hacks to make mysql play nice. If you can produce a unit test that
recreates the problem, that would be great. If you don't get to it,
I'll try to take a look at it soon.

In general I'd like to say that if it passes the unit tests it's
probably fine, but this is a particularly nuanced case and it might
pay off to be a little extra careful.

-Ron
> https://github.com/lexlinden/clusto/commit/50bfbe935b89a609550477008f...

Lex Linden

Jul 7, 2011, 1:55:43 PM
to clu...@googlegroups.com, Ron
On 07/06/2011 08:23 PM, Ron wrote:
> There are some concurrency-related unit tests. It turns out to be
> tricky because sqlite and postgres tend to handle things better than
> mysql when concurrent transactions are involved, so there are some
> hacks to make mysql play nice. If you can produce a unit test that
> recreates the problem, that would be great. If you don't get to it,
> I'll try to take a look at it soon.
>
> In general I'd like to say that if it passes the unit tests it's
> probably fine, but this is a particularly nuanced case and it might
> pay off to be a little extra careful.
>

After several hours of trying, I have yet to manage to create a unit
test that reproduces this. I did, however, observe the existing
concurrency test that you wrote failing with 6L != 7L, something I've
seen in the past. I think this might be caused by the same problem I
found that causes deadlocking.

Furthermore, I discovered that storing things in SESSION (as opposed to
SESSION()) is not thread safe; all threads share SESSION. Currently
clusto stores a fair number of things in SESSION (clusto_version,
clusto_description, flushed) and one thing in SESSION()
(TRANSACTIONCOUNTER). This means that clusto is not thread-safe even
though it's using ScopedSession.
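
The difference between SESSION and SESSION() can be illustrated with a small stand-in. This is a sketch only: FakeScopedSession is a hypothetical class built on threading.local, not SQLAlchemy's ScopedSession, and run() is a made-up helper. It mimics the one property that matters here, namely that calling the registry returns a per-thread object while the registry itself is a single object shared by every thread.

```python
import threading
from types import SimpleNamespace

class FakeScopedSession:
    """Hypothetical stand-in for a scoped session registry."""

    def __init__(self):
        self._local = threading.local()

    def __call__(self):
        # Return this thread's own session object, creating it on first use.
        if not hasattr(self._local, "session"):
            self._local.session = SimpleNamespace()
        return self._local.session

SESSION = FakeScopedSession()

def run():
    seen = {}
    both_written = threading.Barrier(2)

    def worker(version):
        SESSION.clusto_version = version    # attribute on the shared registry
        SESSION().clusto_version = version  # attribute on this thread's session
        both_written.wait()                 # wait until both threads have written
        seen[version] = (SESSION.clusto_version, SESSION().clusto_version)

    threads = [threading.Thread(target=worker, args=(v,)) for v in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    print(run())
```

Each thread reads back its own value from SESSION(), but both threads read the same value from SESSION, namely whichever write happened last.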

You can see my test in my deadlock branch (the latest two commits):

https://github.com/lexlinden/clusto/commits/deadlock

I'm using multiprocessing instead of threading to avoid the SESSION
issues. I'm getting some really weird, kind of scary results: mysql
doesn't seem to enforce any locking on the Counter used by the
SimpleNameManager, so both processes end up trying to create an entity
named 'foo1'. Here's the bad part: they both succeed, and the db ends
up with two entities with the same version and name:

mysql> select * from entities;
+-----------+-------------+-----------------+-------------------+---------+--------------------+
| entity_id | name        | type            | driver            | version | deleted_at_version |
+-----------+-------------+-----------------+-------------------+---------+--------------------+
|         1 | clustometa  | clustometa      | clustometa        |       1 |               NULL |
|         2 | namemanager | resourcemanager | simplenamemanager |       3 |               NULL |
|         3 | foo1        | generic         | entity            |       9 |               NULL |
|         4 | foo1        | generic         | entity            |       9 |               NULL |
+-----------+-------------+-----------------+-------------------+---------+--------------------+


I'm having a hard time understanding why mysql doesn't cause process B to
block when trying to increment the counter. I think the fact that both
entities were allowed to be created with the same version is possibly a
symptom of the underlying bug I'm dealing with. If I run this test
under my development branch with the 'active_version()' fix, I don't end
up with duplicate entities, but I still end up with both A and B trying
to create "foo1", which just ain't right.
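
One possible explanation, offered as a guess rather than a diagnosis: under InnoDB's default isolation level a plain SELECT is a non-locking consistent read, so if the counter is read first and written back afterwards, two transactions can both read the same value before either one writes. Whether clusto's Counter actually works that way is an assumption here. The sketch below uses sqlite3 with a hypothetical one-row counters table and writes the interleaving out by hand on a single connection, so it shows the ordering problem only, not real MySQL locking.

```python
import sqlite3

# Hypothetical one-row counter table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 0)")

def read_counter():
    row = conn.execute("SELECT value FROM counters WHERE name = 'foo'").fetchone()
    return row[0]

def write_counter(value):
    conn.execute("UPDATE counters SET value = ? WHERE name = 'foo'", (value,))

def racy_allocation():
    """Read-then-write: A and B both read before either one writes."""
    a_seen = read_counter()
    b_seen = read_counter()
    write_counter(a_seen + 1)
    write_counter(b_seen + 1)
    return "foo%d" % (a_seen + 1), "foo%d" % (b_seen + 1)

def atomic_allocation():
    """Increment inside the UPDATE itself, then read the result back."""
    names = []
    for _ in ("A", "B"):
        # In MySQL this UPDATE would take the row lock, so a second
        # transaction would wait here until the first one commits.
        conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'foo'")
        names.append("foo%d" % read_counter())
    return names
```

In MySQL the read-then-write pattern can also be made safe with SELECT ... FOR UPDATE, which takes the row lock at read time so the second reader blocks until the first transaction finishes.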

Ideas? I'm kind of going nuts here.
