True, but this is not the way to approach a locking design, because it
doesn't describe who is involved in the lock and when, which really
drives the implementation design (the above bullet points are
implementation-related, not functional). It also doesn't describe
whether locks are intra-thread, inter-thread or inter-process and
different cases have different requirements. Adding calls to locking
primitives everywhere data might be changed isn't an appropriate
approach here.
Start from which things (and it should be a minimal set, due to the
management overhead) that might need to be guarded against simultaneous
access and identify the points in the pipeline where they are in
"critical sections". Then we can work out the scope and timeframe of any
synchronisation points.
This *does* have to be handled very much on a case-by-case basis because
locking overhead is significant. So you want locks to be as few as
possible and as fine-grained as possible. #5632 adds a few too many
synchronisation points -- get_or_create doesn't need one, for example,
since the database handles that already -- and it's only thread-safe,
not process-safe, which would seem to defeat a lot of the purpose of
syncrhonising there (which is why you try to push as much as possible to
the real choke point ... the database).
There's certainly some work we can do here, but you've skipped a couple
of steps, so we're going to miss things. Let's starts with a list of
which accesses must be synchronised, which ones might need to be
synchronised (on a per-app basis, so maybe we provide userspace with a
mutex structure) and which ones aren't really issues.
- Session updating is a "sometimes" (many apps will be happy to
say "only use a session from one computer at a time"). However,
if you need it, it needs to be multi-process, not just
multi-threaded which probably means using the database as a
synchronisation method (e.g. adding a version attribute to the
session structure).
- We do need to work out some kind of "lock for update"
equivalent at the Python level so that a user can reliably do
my_model.score += 1
my_model.save()
and have the right value saved. Holding a full lock for every
model instance in memory on the off-chance it might be saved and
the save might need to make incremental changes like this is a
bit heavy-weight, though (a lot of those cases are why manual
transaction support is available and we should be making sure
that works smoothly).
- Current trunk's save() method doesn't need synchronisation, as
far as I can see, but we might need it for multi-table
inheritance.
That's a start. You probably have other items in mind, so add to the
list.
You've called this post "threading improvements", but at least one of
the discussions you've pointed to, plus #5632, aren't just
threading-specific. They're about general multi-process access. So cast
your net a little wider here in the design.
Regards,
Malcolm
--
Why can't you be a non-conformist like everyone else?
http://www.pointy-stick.com/blog/
IMO, having synchronization in the app itself is just a dangerous,
performance-killing delusion :-) The second you deploy your second web
server your carefully crafted guarantees will just disappear.
What this means from a practical standpoint is that some things just can't
be done client-side and should be pushed down into the db.
FWIW, my experience is with scaling databases at Stamps.com (over a million
financial transactions a day using C++/MS SQLServer with no ORM) and at
ThisNext (Perl, MySQL, Class::DBI).
mrts wrote:
> Get or create
>
> Precondition: object with foo='bar' does not exist in the database
>
> 1. P_1: M.objects.get(foo='bar'): DoesNotExist, needs adding
> 2. P_2: M.objects.get(foo='bar'): DoesNotExist, needs adding
> 3. P_1: create m = M(foo='bar'), m.save()
> 4. P_2: create m = M(foo='bar'), m.save()
>
IMO, get_or_create() should be one database call that handles this in the
most appropriate way for the backend that is being used. The implementation
could (should?) differ based on the availability of transactions,
backend-specific SQL extensions, or other mechanisms (stored procedures,
etc.).
That said, Class::DBI for Perl[1] has a similar problematic implementation
and it has been popular for a long time now so maybe this isn't so urgent.
My assumption is that those that know they will be in a environment where
this is problematic (either high throughput or on a table where contention
is likely) simply won't use that method.
> Changes dependent on the previous value
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The classic increment problem.
>
> 1. P_1: m = M.objects.get(foo='bar')
> 2. P_2: m = M.objects.get(foo='bar')
> 3. P_1: m.score += 1, m.save()
> 4. P_2: m.score += 1, m.save()
>
This is really something that can only happen at the database level. With
the Django ORM, I think this is only possible by bypassing the ORM
altogether. Ideally it would be possible to mark the field as "read only"
or "temporary" so that the ORM would never attempt to save the value for
that field, but it would still be available for reading.
If this is a common enough need, perhaps new methods are needed to handle it
specifically (I'd propose a name, but I can't think of any that would be
non-controversial). They would go to the database immediately and return
the correctly updated value. I suspect that having the ability to mark
fields as temporary would be sufficient though -- then the developer could
issue the appropriate UPDATE and set any in memory copies to the new value
themselves.
>
> General CRUD behaviour
> ~~~~~~~~~~~~~~~~~~~~~~
>
> Delete without a test for previous existence is OK, but could fail
> without
> locking otherwise. Also, as above, it can't be assumed that deleting
> an object is assured to result in it's removal (P_1 and P_2 hold m,
> P_1
> deletes m, P_2 updates and saves m and m springs to life again).
>
I don't think this is correct. An update isn't going to insert a record.
Of course, a user could enter a new item that has the same values as the
old, but it would have a different pk and isn't solvable anyway.
>
> The solution
> ------------
>
> Locking primitives
> ~~~~~~~~~~~~~~~~~~
>
Ugh. I hope this route isn't taken. No amount of locking is going to
handle the multiple server case, unless that locking is in the DB or as a
separate network-based service. If the latter, then another point of
failure has been added to your server infrastructure, never a good thing.
Craig
On Sat, 2008-04-05 at 08:27 -0700, mrts wrote:
[...]
> Most operations in model level that have '.alters_data = True'
> have some concurrency issues,
This is not really the right language and it misleads things a little.
"Issues" implies there's a problem (by definition, an issue is an item
of controversy). First identify the requirements to work out if there's
a problem, rather than begging the question.
> Get or create
> ~~~~~~~~~~~~~
>
> This is where the problem started. Malcolm notes that "get_or_create
> doesn't need one, for example, since the database handles that
> already", but as get_or_create is not atomic in any way I doubt this.
Rather than using doubts, let's use the code as our guide...
After a call to get_or_create(), you are returned an object that is
known to be in the database and matches the filters you passed in. If no
object matching those constraints exists, it is created using the
default arguments passed in. This will always be true regardless of how
many threads or processes are simultaneously accessing the function.
However -- and this is important -- if the default arguments you pass in
do not constrain the item to be unique in the database, it is possible
that two such items are created via simultaneous access. Get_or_create()
does not say that it will work around that case. You have passed in
requirements for an object that are non-unique, so more than one
instance are created. If you only want a unique object, make sure that
the arguments you provide as the defaults argument specify it must be
unique at the database level. That's what I mean by the database
handling the problems. The database is the ultimate synchronisation
point here and we should use it as much as possible, since it's the best
cross-process thing we've got available: fine-grained, low-level (so
conflict detection and resolution happens at the most specific place
possible) and already in existence so we don't add extra overhead.
This is actually an area where I wish a little more progress had been
made on the database constraints project from last summer of code,
because it would help people push down things like unique_together to
the database if they needed this guarantee.
It's also not particularly hard to upgrade get_or_create() to be able to
guarantee uniqueness on insert and I think that would be nice to do.
That requires an addition we need to make to the model save() method
that we need anyway. We need the ability to differentiate at the save()
level between "must create a new object", "must update an existing
object" and "can update an existing object or else create a new one". I
intend to fix that particular problem once queryset-refactor (my main
Django time suck currently) is merged.
You're right that this should be documented better. I don't see the need
for Django providing inter-process locking, since it's unnecessary
complexity, in the scheme of things (and not always possible -- what if
your views are running on different machines?). The current lock-free
approach appears provably correct and seems useful, providing you have
the correct database constraints in place (note that this is using
recent subversion code which handles the insert/select race better for
when database constraints do prevent multiple inserts).
> Changes dependent on the previous value
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The classic increment problem.
>
> Precondition: m.score == x
>
> 1. P_1: m = M.objects.get(foo='bar')
> 2. P_2: m = M.objects.get(foo='bar')
> 3. P_1: m.score += 1, m.save()
> 4. P_2: m.score += 1, m.save()
>
> Expected postcondition: m.score == x + 2
> Actual postcondition: m.score == x + 1
>
> Malcolm points out that "Holding a full lock for every model instance
> in memory on the off-chance it might be saved and the save might need
> to make incremental changes like this is a bit heavy-weight though (a
> lot of those cases are why manual transaction support is available and
> we should be making sure that works smoothly)", so again this is up to
> a design decision: either document the behaviour and provide a
> userspace inter-process locking primitive OR use a locking mechanism
> by default (a low-overhead locking mechanism does not exist for this
> case probably).
This is the case where we can lift "SELECT FOR UPDATE..." transaction
control into the Python level. Again, it lets the database be the
serialisation point. In agreement about the need for a solution here,
not quite sure how to spell it at the Python API level yet.
> General CRUD behaviour
> ~~~~~~~~~~~~~~~~~~~~~~
>
> Apart from the 'Get or create' case above, Create is OK.
>
> Retrieve is OK, but it can't be assumed that the object has the same
> state is the database throughout its lifetime (as other processes may
> have changed the state). Users need to be aware of that.
Well, this is hardly revolutionary behaviour. It's true any time you
retrieve data from a database and use it in a program. If people are
expecting live updates, they've been using very rare languages
previously.
This is that annoying territory where we have to decide how much
documentation space to spend on people who are just learning to program
at the cost of drowning out the relevant stuff in amongst what are
essentially things that are pretty much always true in computer
programming. It's a hard problem with no perfect answer but I personally
favour keeping this kind of cruft out of the main docs (put them in a
"if you are just starting out with programming" doc by all means) since
beginners can always improve to become better. We're writing a framework
for professionals here; let's keep that userbase in mind and try to
convey information to them as efficiently as possible, too.
> Apart from 'Changes dependent on the previous value', Update is OK.
> Again, users need to be educated that it can't be assumed that m.foo =
> 'bar', m.save() actually means that field foo has value 'bar' in the
> database (state may be changed by other processes). As Malcolm
> pointed out though, multi-table inheritance would pose problems for
> save().
Maybe or maybe not. I haven't double-checked it yet. Have you? From
memory, there's a single transaction in place around the save method, so
it's quite possibly already safe. I'm personally massively uninterested
in adding extra stuff just to support people using MySQL + MyIsam. If
they want ACID properties, they should feel free to use a real database
storage engine.
> Delete without a test for previous existence is OK, but could fail
> without
> locking otherwise.
Why does this need locking? If the delete fails because the item doesn't
exist, so be it. Catch the error and move on. The database is in the
correct state at the end of the operation in either case.
> Also, as above, it can't be assumed that deleting
> an object is assured to result in it's removal (P_1 and P_2 hold m,
> P_1
> deletes m, P_2 updates and saves m and m springs to life again).
In the cases where this is important, the saves of any such objects
should ensure they use the (from the future) update-only (don't create a
new object) version of saving. I suspect the right answer here
ultimately is that, by default, instances created by data retrieved from
the database should be marked as such and they will automatically go
through the update-only path at save time, which will be faster and will
allow client code to correctly detect -- and then handle at will -- the
case of "somebody's deleted me when I wasn't looking."
> Generally, if a state change depends on previous state in the
> database, it is not safe, otherwise it is. And it can't be assumed
> that
> object state does not change in the database during it's lifetime in
> current process.
If you actually sit down and analyse a lot of the database operations
involved in a website (and Django is for websites), there aren't a lot
of these dependent cases. That's why website frameworks work very well
with Optimistic Locking models, since conflicts occur relatively
infrequently and you often don't mind even when they do: pick one of the
possible answers and go with it, isn't a bad solution a lot of times.
Yes, there are times when you do want a dependent update (the x = x + 1
case), but they aren't all that frequent (in the sense that you can
design around them). I agree that it's something people have to learn to
do, but if you're writing massively parallel systems, this hopefully
isn't your first few months on the job.
For example, take session updates. If two more or less simultaneous
session updates collide, there's nothing you can really do to resolve
them. Just pick one and go with it. It doesn't even matter which one,
but you could set things up to handle the latest created one if you
like. After all, it's *session* data (the domain of the problem is
significant): it's stuff that is only relevant to this session -- tied
to a user and a machine -- and having a collision means the user is
trying to do two things at once, possibly because one response was
taking too long. Okay, can't be helped... you're design is strong enough
that you're only storing session-relevant data there and critical,
long-term data in the proper place elsewhere. So just save one of the
results. At the implementation level, conflicts can be detected at
commit (save) time by including a version number in the session cookie
and making the update include the version number as well as the pk in
the update request (so updates of an old value will fail). Requires
"forced-update" saves to exist and needs adding the version to sessions
(and supporting it in session saves). But that's all solvable.
This is what I'm getting at when I'm encouraging approaching things on a
case-by-case. I don't think we need general locking primitives in Django
because requiring them is something that can be designed around and
usually makes the design stronger. Plus, getting locking primitives both
functionaly complete and correct is unbelievably hard (since you
shouldn't rule out the most efficient ways of running websites: multiple
processes and multiple machines). That's why I'm arguing in favour of
generally lock-free algorithms that use the database as the
synchronisation point
> The solution
> ------------
>
> Locking primitives
> ~~~~~~~~~~~~~~~~~~
>
> As Malcolm pointed out, locks can be intra-thread, inter-thread or
> inter-process and different cases have different requirements.
[...]
You've missed inter-machine (views on different machines being
distributed to via a reverse proxy) and this is symptomatic of the
slippery slope and why other solutions than locking are going to be more
appropriate.
The very strong argument in designing "shared nothing" styles of
architecture, particularly for web services is to enable proper
scalability without penalising smaller cases with the overhead and
without leaking abstractions into user code (proper locking management
will absolutely require client code to call it, which leaks the whole
locking and synchronisation structure into code that we should be trying
to insulate).
[...]
> Minimizing locking overhead
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> There is a nice optimisation for avoiding locking overhead in the
> get_or_create case, the Double-Checked Locking Pattern ("Modern C++
> Design" by Alexandrescu gives a nice treatment, but it's also
> available at
> http://en.wikipedia.org/wiki/Double_checked_locking_pattern ):
There's also quite a famous paper be Alexandrescu and Meyers showing
that the double-checked locking pattern doesn't actually work properly
and is unbelievably hard to get right the C++ level (and people
including the authors missed that for a number of years; I was at a
conference where Scott Meyers explained some of the background to
discovering this problem and realising "compilers are hard!"). Search
for "C++ and the perils of double-checked locking." Most of the problems
in that paper aren't applicable in Python in its current form because
Python 2.x doesn't make the bytecode-level optimisations that bite most
of the C++ implementations, but it does show just how hard this stuff is
to get right. The "conclusions and alternatives" section of that paper
sums up a similar argument to what I'm trying to make here: don't rely
on locking more than absolutely necessary. They don't go as far to say
"design around it", since it's a paper on locking and they'd like to
remain relevant, but designing lock-free algorithms (which is really the
ideal solution here) is a much bigger area of computer science these
days with distributed systems.
> if not exists(object):
> with lock(module):
> if not exists(object):
> create_and_save(object)
> return object
>
> There may be other tricks I'm not aware of.
The critical part of this pattern is how expensive the is_lock() call
might be. Locking, even when there's no contention, on distributed
systems -- processes or machines -- isn't free and is easy to starve for
resources if you don't correctly release it and easy to deadlock if you
don't do it in the right order (and adding deadlock detection to avoid
that isn't free either).
> Conclusion
> ----------
>
> It needs to be decided whether to guard operations that depend on
> previous state in the database with locking by default or provide
> locking primitives for userspace and how to implement inter-process
> locking.
Or to design around them so that locking isn't required in userspace
client code at all in all but the most exceptional cases.
There's a very real risk with allowing locking to creep up into client
code: the rest of the process becomes hostage to that code behaving
correctly. If you take the lock and don't release it, or you take
multiple locks in the wrong order, it's resource starvation time. And
it's not just the request/response path that matters here, so whilst
putting a lock reaper in the response path will be a good idea for any
locking stuff, it isn't a panacea. Any scripts (e.g. cronjobs) also have
to obey the rules and they can be run from anywhere. Locking primitives
are a dangerous thing to add to large code bases.
I firmly believe (which I think must be obvious from this post and my
last one) that if we can avoid introducing the necessity for this, it
will save everybody -- maintainers and third-party developers alike -- a
lot of trouble down the track. Locking done right means it works in all
cases, otherwise, as Craig pointed out, it's a delusion. Solving these
problems, though, doesn't necessarily require locking at a level above
the database, so we aren't obliged to walk down that road.
Best wishes,
Malcolm
--
I intend to live forever - so far so good.
http://www.pointy-stick.com/blog/
BTW, may be get_or_create should be rewritten as:
get()
try:
create()
except BackendSpecificUniqueConstraitFailure:
get()
?
> This is the case where we can lift "SELECT FOR UPDATE..." transaction
> control into the Python level. Again, it lets the database be the
> serialisation point. In agreement about the need for a solution here,
> not quite sure how to spell it at the Python API level yet.
Model.objects.get_for_update()
Model.objects.filter(...).for_update()
?
That's more or less how it is written, since r7289 (March 18). The only
small problem is the create step, which is a call to the model's save()
method. If you don't supply default data that forces the instance to be
unique, it could result in multiple occurrences in the database. That's
why it really wants to be model.save(create_only=True), or however we
end up spelling that at that point.
Regards,
Malcolm
--
Save the whales. Collect the whole set.
http://www.pointy-stick.com/blog/
Great! We've been running too long on a snapshot from December then :-)