Here's how it happened:
Each process follows the same two steps:
1. allocate a new name from a SimpleNameManager
2. instantiate a driver using that name
This is much like what SimpleEntityNameManager does, except we had to
roll our own. See LindenServer in here if you're interested:
http://github.com/lindenlab/python-llclusto-linden
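Roughly, each process is doing something like this (FakeNameManager and
provision are illustrative stand-ins, not the real SimpleNameManager /
llclusto API):

```python
class FakeNameManager:
    """Toy stand-in for SimpleNameManager: basename + counter -> names."""
    def __init__(self, basename):
        self.basename = basename
        self.counter = 0

    def allocate_name(self):
        # step 1: bump the counter and hand out the next name
        self.counter += 1
        return '%s%d' % (self.basename, self.counter)

def provision(name_manager, driver_cls):
    name = name_manager.allocate_name()  # step 1: allocate a name
    return driver_cls(name)              # step 2: instantiate a driver
```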
The deadlock happens because of how clusto versioning is handled. Each
transaction looks like this:
1. Insert a new row into clustoversioning (acquiring a DB lock only on
this row).
2. Increment the row in the counters table, acquiring a DB lock on it.
3. Insert an entry into the entities table, with version = (SELECT
COALESCE(MAX(clustoversioning.version), 1)).
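For illustration, here's roughly what that transaction shape looks like,
sketched against sqlite3 (the schema and column names are simplified
guesses; sqlite obviously won't reproduce InnoDB's row-lock behavior, it
just shows the data flow):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE clustoversioning (version INTEGER PRIMARY KEY, ts TEXT);
    CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER);
    CREATE TABLE entities (entity_id INTEGER PRIMARY KEY,
                           name TEXT, version INTEGER);
    INSERT INTO counters VALUES ('foo', 0);
""")

# Step 1: insert a new version row (InnoDB locks only this row)
db.execute("INSERT INTO clustoversioning (ts) VALUES (datetime('now'))")

# Step 2: increment the counter row (locks the counter row)
db.execute("UPDATE counters SET value = value + 1 WHERE name = 'foo'")

# Step 3: insert the entity, re-deriving the version via the subselect --
# under InnoDB this MAX() scan is what blocks on another transaction's
# uncommitted clustoversioning row
db.execute("""
    INSERT INTO entities (name, version)
    VALUES ('foo1',
            (SELECT COALESCE(MAX(version), 1) FROM clustoversioning))
""")
db.commit()
```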
The first process runs through to step 2, so it holds the lock on the
counter row. Then the second process comes along and runs step 1,
inserting a new row into clustoversioning and taking a row lock on it.
The first process then tries to run the step-3 subselect, which has to
scan the entire clustoversioning primary key, and blocks on the row lock
the second transaction holds on its new row. Then the second transaction
tries to increment the counter, which the first still holds locked, and
boom, deadlock.
To solve this, I store the version number created in step 1 and use it
throughout the transaction:
https://github.com/lexlinden/clusto/commit/50bfbe935b89a609550477008f0705152492681e
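The idea, in sketch form (simplified schema, not the actual code from
the commit):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE clustoversioning (version INTEGER PRIMARY KEY, ts TEXT);
    CREATE TABLE entities (entity_id INTEGER PRIMARY KEY,
                           name TEXT, version INTEGER);
""")

# Step 1: insert the version row and remember the version it generated...
cur = db.execute("INSERT INTO clustoversioning (ts) VALUES (datetime('now'))")
version = cur.lastrowid

# ...then use that saved value directly, so the entity insert never scans
# clustoversioning and never blocks on another transaction's uncommitted row
db.execute("INSERT INTO entities (name, version) VALUES (?, ?)",
           ('foo1', version))
db.commit()
```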
It works and unit tests pass, but I'd love some input on whether this is
horrible/broken.
After several hours of trying, I have yet to manage to create a unit
test that reproduces this. I did, however, observe the existing
concurrency test that you wrote failing with 6L != 7L, something I've
seen in the past. I think this might be caused by the same problem I
found that causes deadlocking.
Furthermore, I discovered that storing things on SESSION (the
scoped-session proxy itself, as opposed to SESSION(), the thread-local
session it returns) is not thread safe; attributes set on SESSION are
shared by all threads. Currently clusto stores a fair number of things
on SESSION (clusto_version, clusto_description, flushed) and one thing
in SESSION() (TRANSACTIONCOUNTER). This means that clusto is not
thread-safe even though it's using ScopedSession.
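A rough analogue of what's going on (FakeScopedSession is a toy stand-in
for SQLAlchemy's ScopedSession, just to show the shared-attribute vs
per-thread distinction):

```python
import threading

class FakeScopedSession:
    """Toy ScopedSession: calling it yields a per-thread session."""
    def __init__(self):
        self._local = threading.local()

    def __call__(self):
        # per-thread session object, like SESSION()
        if not hasattr(self._local, 'session'):
            self._local.session = {}
        return self._local.session

SESSION = FakeScopedSession()
SESSION.clusto_version = 'shared'  # attribute on the proxy: ALL threads see it

seen = []
def worker():
    seen.append(SESSION.clusto_version)  # shared across threads
    SESSION()['TRANSACTIONCOUNTER'] = 1  # private to this thread

t = threading.Thread(target=worker)
t.start(); t.join()
```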
You can see my test in my deadlock branch (the latest two commits):
https://github.com/lexlinden/clusto/commits/deadlock
I'm using multiprocessing instead of threading to avoid the SESSION
issues. I'm getting some really weird, kind of scary results: mysql
doesn't seem to enforce any locking on the Counter used by the
SimpleNameManager, so both processes end up trying to create an entity
named 'foo1'. Here's the bad part: they both succeed, and the db ends
up with two entities with the same version and name:
mysql> select * from entities;
+-----------+-------------+-----------------+-------------------+---------+--------------------+
| entity_id | name        | type            | driver            | version | deleted_at_version |
+-----------+-------------+-----------------+-------------------+---------+--------------------+
|         1 | clustometa  | clustometa      | clustometa        |       1 | NULL               |
|         2 | namemanager | resourcemanager | simplenamemanager |       3 | NULL               |
|         3 | foo1        | generic         | entity            |       9 | NULL               |
|         4 | foo1        | generic         | entity            |       9 | NULL               |
+-----------+-------------+-----------------+-------------------+---------+--------------------+
I'm having a hard time understanding why mysql doesn't cause thread B to
block when trying to increment the counter. I think the fact that both
entities were allowed to be created with the same version is possibly a
symptom of the underlying bug I'm dealing with. If I run this test
under my development branch with the 'active_version()' fix, I don't end
up with duplicate entities, but I still end up with both A and B trying
to create "foo1", which just ain't right.
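Here's a toy, deterministically interleaved version of what I think is
happening -- two allocators, A and B, each doing an unguarded
read-modify-write on the shared counter, with nothing (no row lock, no
SELECT ... FOR UPDATE) serializing them:

```python
counter = 0

a_read = counter                 # A reads 0
b_read = counter                 # B reads 0 -- nothing blocked it
a_name = 'foo%d' % (a_read + 1)  # A allocates 'foo1'
b_name = 'foo%d' % (b_read + 1)  # B also allocates 'foo1'
counter = a_read + 1             # A writes 1
counter = b_read + 1             # B writes 1 -- A's increment is lost
```

Both allocators hand out 'foo1' and one increment is lost, which matches
the duplicate rows above.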
Ideas? I'm kind of going nuts here.