select_for_update().get(...), what happens on DoesNotExist?


Carsten Fuchs

Dec 12, 2012, 1:06:02 PM
to django...@googlegroups.com
Dear Django group,

we try to understand and solve a concurrency problem, and would kindly
like to ask for your help.

We use Django 1.4, and a model like this:

class MonatsSalden(models.Model):
    id = models.AutoField(primary_key=True)
    key = models.ForeignKey(Mitarbeiter)
    jahr = models.SmallIntegerField()
    monat = models.SmallIntegerField()
    saldo = models.DecimalField(max_digits=10, decimal_places=2)

    class Meta:
        unique_together = ('key', 'jahr', 'monat')


This is our code (for clarity, reduced to the essentials):

try:
    CE = MonatsSalden.objects.get(jahr=Jahr, monat=Monat)

    if meets_requirements(CE):
        return CE
except MonatsSalden.DoesNotExist:
    CE = MonatsSalden(key=self, jahr=Jahr, monat=Monat)

### Long and expensive computation to set the other fields of CE...
### (after it, meets_requirements(CE) returns True)
### ...

CE.save()
return CE


The code essentially implements the query of a cache ("CE" == "cache
entry"): it returns the cached entry if it has not yet expired
("meets_requirements"), or else recomputes and updates the cache.

The problem is that the long and expensive computation indicated by
the ### comment lines can run long enough that another view request
enters the same code in the meantime.

As a result, CE.save() can be called twice. (If CE was created anew,
following the DoesNotExist exception, only the unique_together
constraint prevents duplicate rows from being inserted, raising a
database error exception instead.)
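The effect of that unique_together constraint can be sketched outside Django, with sqlite3 standing in for Oracle (table and column names here are only illustrative): both "processes" miss the cache, do the expensive computation, and the second INSERT then fails with an integrity error rather than silently creating a duplicate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE monatssalden (
                    key_id INTEGER,
                    jahr   INTEGER,
                    monat  INTEGER,
                    saldo  REAL,
                    UNIQUE (key_id, jahr, monat))""")  # mirrors unique_together

def save_cache_entry(key_id, jahr, monat, saldo):
    # Both "processes" saw DoesNotExist, did the long computation,
    # and now try to INSERT their freshly built cache entry.
    conn.execute("INSERT INTO monatssalden VALUES (?, ?, ?, ?)",
                 (key_id, jahr, monat, saldo))

save_cache_entry(1, 2012, 12, 40.0)        # first process: succeeds
try:
    save_cache_entry(1, 2012, 12, 40.0)    # second process: duplicate key
    second_insert_ok = True
except sqlite3.IntegrityError:
    second_insert_ok = False               # the constraint rejects the row

rows = conn.execute("SELECT COUNT(*) FROM monatssalden").fetchone()[0]
```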


To fix this, our plan was to first enable the TransactionMiddleware,
because, as we understand it, select_for_update() is not very useful
with the default autocommit behaviour: the auto-commit that follows
the end of the query immediately releases the locks. (Is this right??)

With the TransactionMiddleware, we would next add select_for_update():

CE = MonatsSalden.objects.select_for_update().get(jahr=Jahr, monat=Monat)

The intention is that if the code is re-entered in another process, it
has to wait until the first process has finished dealing with its CE.

It seems that things work well when the DoesNotExist exception does
*not* occur: CE is locked and eventually returned, the view request
soon completes elsewhere, the transaction is committed, and all is well.

But... how does it behave when DoesNotExist occurs?

Will the previously established lock cover the newly created CE object
even though it did not exist at the time of the select_for_update(), or not?

Many thanks in advance, any help is very much appreciated!
:-)

Best regards,
Carsten



--
Cafu - the open-source Game and Graphics Engine
for multiplayer, cross-platform, real-time 3D Action
Learn more at http://www.cafu.de

Chris Cogdon

Dec 12, 2012, 4:42:59 PM
to django...@googlegroups.com
The question is going to be very database specific. "select for update" works by putting row-level locks in the database. If the row does not exist, then it won't get a lock on it, and I know of no database that will keep the query around so that any newly created matching rows automatically get locked.

However, not all is lost here.

When you create new rows, those rows are _invisible_ to any other transaction until you do a commit. So, as long as you arrange things so that there's a final commit after you've done all the manipulations you want, you won't have those rows stomped on by any other transaction.
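The "invisible until commit" behaviour can be demonstrated with a small stand-in (sqlite3 here rather than Oracle, and made-up table names, but the visibility rule is the same): a second connection does not see the new row until the first connection commits.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file play the role of two processes.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE monatssalden (jahr INTEGER, monat INTEGER)")
writer.commit()

writer.execute("INSERT INTO monatssalden VALUES (2012, 12)")  # not committed yet
seen_before = reader.execute("SELECT COUNT(*) FROM monatssalden").fetchone()[0]

writer.commit()                                               # now it is visible
seen_after = reader.execute("SELECT COUNT(*) FROM monatssalden").fetchone()[0]
```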


Carsten Fuchs

Dec 13, 2012, 1:23:52 PM
to django...@googlegroups.com
Hi Chris,

thank you very much for your reply!

On 12.12.2012 22:42, Chris Cogdon wrote:
> The question is going to be very database specific. "select for update"
> works by putting in row-level locks in the database. If the row does not
> exist, then it won't get a lock on it, and I know of no databases that
> will maintain the query so that any newly created rows automatically get
> a lock on it.

Our database is Oracle;
with
<https://docs.djangoproject.com/en/1.4/ref/models/querysets/#select-for-update>
saying both "Returns a queryset that will lock rows ..." and "All
matched entries will be locked ..." I nourished the silent hope that
maybe it kept and matched the query rather than flagging individual
rows. ;-)

> However, not all is lost here.
>
> When you create new rows, those rows are _invisible_ to any other
> transaction until you do a commit. So, as long as you arrange things so
> that there's a final commit after you've done all the manipulations you
> want, you won't have those rows stomped on by any other transaction.

Ok, thanks, I was in fact not quite aware that the newly created rows
are invisible until the end of the transaction.

However, that still doesn't stop a second process, which enters the same
code while the first is still running, from doing the same computations
too and eventually committing a transaction that attempts to create the
same row again, does it?

Is it a sound idea to work around this problem by running a
select_for_update() query on *other* rows (a few that are known to
exist, or even all) of the same model, just to obtain a dummy lock until
the first process is done?

Or even better, lock the single row in another model that tells us that
meets_requirements() == False, i.e. our cache needs updating? It would
still only be a "dummy" lock, of course, so I'm unsure if this is
reasonable at all.

Or does some kind of best practice exist that suggests how such a
problem is normally solved?

Chris Cogdon

Dec 13, 2012, 1:39:58 PM
to django...@googlegroups.com


On Thursday, December 13, 2012 10:23:52 AM UTC-8, Carsten Fuchs wrote:
> Our database is Oracle; with
> <https://docs.djangoproject.com/en/1.4/ref/models/querysets/#select-for-update>
> saying both "Returns a queryset that will lock rows ..." and "All
> matched entries will be locked ..." I nourished the silent hope that
> maybe it kept and matched the query rather than flagging individual
> rows.  ;-)

No nourishment, but fortunately no calories either :)
 

>> However, not all is lost here.
>>
>> When you create new rows, those rows are _invisible_ to any other
>> transaction until you do a commit. So, as long as you arrange things so
>> that there's a final commit after you've done all the manipulations you
>> want, you won't have those rows stomped on by any other transaction.

> Ok, thanks, I was in fact not quite aware that newly created rows
> are invisible until the end of the transaction.
>
> However, that still doesn't stop a second process, which enters the same
> code while the first is still running, from doing the same computations
> too and eventually committing a transaction that attempts to create the
> same row again, does it?
>
> Is it a sound idea to work around this problem by running a
> select_for_update() query on *other* rows (a few that are known to
> exist, or even all) of the same model, just to obtain a dummy lock until
> the first process is done?
>
> Or even better, lock the single row in another model that tells us that
> meets_requirements() == False, i.e. our cache needs updating? It would
> still only be a "dummy" lock, of course, so I'm unsure whether this is
> reasonable at all.
>
> Or does some kind of best practice exist that suggests how such a
> problem is normally solved?

This will depend on the problem, but let me cite a simple example of how this is often solved.

Let's say your process is to read in rows matching a certain criterion and write a new row somewhere else. Since you can't guarantee that another process won't kick off during this time and try to do the same thing, here are the avoidance strategies:

1. As part of the process, the "input rows" are changed so that they no longer match the criterion. This means you're putting a "select for update" on those rows, which will stop a second process from taking the same row-level lock. The second process blocks until the first completes/commits; once unblocked, those rows no longer match the criterion and are no longer returned by the query.

This sounds most like what you're doing now.
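The blocking behaviour in option 1 can be imitated with sqlite3 (which only has database-level write locks, whereas SELECT ... FOR UPDATE on Oracle locks individual rows, so this is just a rough sketch): the second connection's attempt to take the lock fails while the first transaction is open, and succeeds once the first commits.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lock.db")
first = sqlite3.connect(path)
second = sqlite3.connect(path, timeout=0.1)  # give up quickly instead of blocking

first.execute("CREATE TABLE cache_entries (jahr INTEGER, monat INTEGER)")
first.commit()

first.execute("BEGIN IMMEDIATE")             # first process takes the write lock
try:
    second.execute("BEGIN IMMEDIATE")        # second process must wait...
    was_blocked = False
except sqlite3.OperationalError:             # ...and here simply times out
    was_blocked = True

first.commit()                               # first process finishes
second.execute("BEGIN IMMEDIATE")            # now the lock is free again
second.commit()
```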


2. The output rows have a unique constraint put on them, so that if the same input rows are processed, the INSERT will fail. In this case, (a) you'll need to make sure you trap the exception and handle it gracefully, AND (b) ensure Django isn't going to undo your work and use an UPDATE instead.

This is generally only feasible if you can ensure that there's a consistent mapping between output and input rows


3. Make the whole process "idempotent"... meaning that you can re-run it any number of times and it will have no further impact on your data model.

E.g., if your input is "all data in time range :00-:05" and your output is a summary, then all you're losing by re-running it is a touch of performance. This won't work, of course, if your process is "summarise and delete source data", in which case you'll need to lock the input rows AND possibly the output row, and also adjust your code so that if it thought it was creating an output row but the row suddenly exists, it can switch over to updating it instead.
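A minimal sketch of option 3, again with sqlite3 and made-up table names: the summary for a time bucket is recomputed from scratch and written with an upsert, so running the job twice leaves the data model unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events  (ts INTEGER, value INTEGER);
    CREATE TABLE summary (bucket INTEGER PRIMARY KEY, total INTEGER);
    INSERT INTO events VALUES (1, 10), (2, 5), (7, 3);
""")

def summarize(bucket):
    # Recompute the bucket's total from the source rows every time...
    total = conn.execute(
        "SELECT COALESCE(SUM(value), 0) FROM events WHERE ts / 5 = ?",
        (bucket,)).fetchone()[0]
    # ...and overwrite the summary row, so re-running changes nothing.
    conn.execute("INSERT OR REPLACE INTO summary VALUES (?, ?)", (bucket, total))
    conn.commit()

summarize(0)
summarize(0)  # idempotent: the second run has no further effect

result = conn.execute("SELECT bucket, total FROM summary").fetchall()
```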



I hope that gives you sufficient options.







Carsten Fuchs

Dec 15, 2012, 4:41:21 AM
to django...@googlegroups.com
Hi Chris,

On 2012-12-13 19:39, Chris Cogdon wrote:
> This will depend on the problem, but let me cite a simple example of how this is often
> solved.
> [...]
>
> I hope that gives you sufficient options.

Yes, it does; I guess I'll just have to experiment with locking the "input" rows a bit. :-)

Thank you very much for your help!