The key is to use a single document to "synchronize" the transaction,
i.e. the commit or abort must be done with a single-document update. One
way to do this is to create a special collection for transaction
documents and then do transactions in these steps (i.e. using locks):
1. create the transaction document.
2. "lock" each object the transaction will use (see below, saved inside
the object).
3. commit the transaction by atomically updating the transaction document, provided it has not been aborted.
4. atomically update each object and remove its lock (only if still locked).
5. when all updates have been successful, remove the transaction
document and return success.
Each lock is written as a field in the locked object and consists of a reference to the transaction document and an expiration time. By always locking objects in a particular order we can avoid deadlocks.
When the transaction document is updated to "committed" (or earlier), the intended updates for each object are also written to the transaction document. That way we can complete the updates even if the client crashes.
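To make this concrete, here is a rough sketch of steps 1-5 in Python with pymongo. Everything about the format is invented for the example (the 'transactions' collection, the 'tx_lock', 'state' and 'updates' fields, the 30-second lock lifetime), error handling is reduced to aborting, and all important writes would of course also need an appropriate write concern (see further down):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient().test
transactions = db.transactions   # assumed collection for transaction documents

def abort(tx_id, updates):
    # Aborting is only possible while the transaction document is still pending.
    transactions.update_one({'_id': tx_id, 'state': 'pending'},
                            {'$set': {'state': 'aborted'}})
    for coll, obj_id, _ in updates:   # release our locks without applying anything
        db[coll].update_one({'_id': obj_id, 'tx_lock.tx': tx_id},
                            {'$unset': {'tx_lock': ''}})
    transactions.delete_one({'_id': tx_id, 'state': 'aborted'})

def run_transaction(updates):
    """updates: list of (collection_name, _id, update_spec) tuples."""
    # Step 1: create the transaction document, including the intended updates,
    # so another client can finish the job if we crash after committing.
    tx_id = transactions.insert_one({
        'state': 'pending',
        'updates': [{'c': c, 'o': o, 'u': u} for c, o, u in updates],
    }).inserted_id

    # Step 2: lock each object (in a fixed order, see the deadlock note below)
    # by writing a lock field into it, but only if it is not already locked.
    expires = datetime.utcnow() + timedelta(seconds=30)
    for coll, obj_id, _ in sorted(updates, key=lambda t: (t[0], str(t[1]))):
        res = db[coll].update_one(
            {'_id': obj_id, 'tx_lock': {'$exists': False}},
            {'$set': {'tx_lock': {'tx': tx_id, 'expires': expires}}})
        if res.modified_count != 1:   # already locked (or missing): give up
            abort(tx_id, updates)
            return False

    # Step 3: commit atomically, but only if nobody has aborted us meanwhile.
    res = transactions.update_one({'_id': tx_id, 'state': 'pending'},
                                  {'$set': {'state': 'committed'}})
    if res.modified_count != 1:
        abort(tx_id, updates)
        return False

    # Step 4: apply each queued update and drop the lock, if still locked by us.
    for coll, obj_id, update in updates:
        update = dict(update)
        update.setdefault('$unset', {})['tx_lock'] = ''
        db[coll].update_one({'_id': obj_id, 'tx_lock.tx': tx_id}, update)

    # Step 5: all updates applied, so the transaction document can go away.
    transactions.delete_one({'_id': tx_id})
    return True
```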
If a new transaction finds an expired lock it can look up the
transaction document and try to abort it (only possible if not
committed). If the transaction is committed, then it can simply update
the objects and release the locks. After an abort it instead releases the locks without applying the updates. If it finds a non-expired lock it will have to wait a few milliseconds and retry.
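Roughly like this, reusing the made-up field names from the sketch above; this is what a client could do when it runs into somebody else's lock:

```python
import time
from datetime import datetime

def on_foreign_lock(db, transactions, coll_name, obj):
    """Called when obj (read from db[coll_name]) carries someone else's lock."""
    lock = obj['tx_lock']
    if lock['expires'] > datetime.utcnow():
        time.sleep(0.01)          # lock still valid: wait a little, then retry
        return

    # The lock has expired. Try to abort its transaction; this only succeeds
    # while the transaction document is still pending.
    transactions.update_one({'_id': lock['tx'], 'state': 'pending'},
                            {'$set': {'state': 'aborted'}})
    tx = transactions.find_one({'_id': lock['tx']})

    if tx is None:
        # Transaction document is already gone: just drop the stale lock.
        db[coll_name].update_one({'_id': obj['_id'], 'tx_lock.tx': lock['tx']},
                                 {'$unset': {'tx_lock': ''}})
        return

    for u in tx['updates']:
        if tx['state'] == 'committed':
            # Roll forward: apply the queued update and release the lock.
            spec = dict(u['u'])
            spec.setdefault('$unset', {})['tx_lock'] = ''
        else:
            # Aborted: release the lock without applying anything.
            spec = {'$unset': {'tx_lock': ''}}
        db[u['c']].update_one({'_id': u['o'], 'tx_lock.tx': tx['_id']}, spec)
    transactions.delete_one({'_id': tx['_id']})
```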
One could get rid of the transaction document by putting this data in the "first" of the objects involved instead (similar to the Google paper mentioned elsewhere in this thread). That requires fewer updates, but if another client overwrites that object then the whole transaction is corrupted. With a separate transaction document, an overwritten document just means that we lose the updates to that one document. It is still important to note that updates outside of transactions simply ignore locks and can overwrite them.
Note that the objects can be locked at the end of the transaction (as
part of committing it), but if you need to make sure the objects are not
changed by another transaction during the whole transaction, then you can lock them at the beginning of the transaction instead...
There are a lot of small things that you need to think about when
implementing something like this (e.g. getLastError with w=? and
wtimeout for important writes and how to safely delete an object in a
transaction), but I'll mostly ignore that here. You should definitely
check that something has been written to a majority of machines in a
replica set before you consider it successfully written...
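I'm describing this in terms of getLastError(w=..., wtimeout=...); with a current driver the same idea is expressed as a write concern on the collection handle. A sketch (the 5-second timeout is arbitrary), and really how the collection handles in the earlier sketches would need to be created:

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

db = MongoClient().test
# Every "important" write goes through handles that demand acknowledgement
# from a majority of the replica set members, with a timeout.
majority = WriteConcern(w='majority', wtimeout=5000)
transactions = db.get_collection('transactions', write_concern=majority)

def commit(tx_id):
    try:
        res = transactions.update_one({'_id': tx_id, 'state': 'pending'},
                                      {'$set': {'state': 'committed'}})
        return res.modified_count == 1
    except WTimeoutError:
        # Could not confirm the write on a majority within the timeout:
        # give up on the transaction instead of assuming it committed.
        return False
```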
Do note that this requires quite a few more writes than an ordinary
update and the locks will be held for quite some time (especially if you
lock everything at the start of the transaction and/or have large
transactions). When a client crashes it will also take a while until the
locks expire. Nevertheless, it should be fast enough for small transactions that are run infrequently. Transactions that involve mutually
exclusive sets of objects do not affect each other at all either, so it
scales well when you get more objects (if each transaction still
involves only a few of them). It does not scale at all if a single
object is involved in many transactions.
I believe you can create a more specialized solution (still using a
single document to synchronize your updates) that is more efficient in most cases. GridFS could, for example, reference the data objects from the metadata object and, on updates, create new data objects first, then update the metadata object, and finally delete the old objects. Then a reader would not
see half-updated files.
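A rough sketch of that idea, where the data objects are immutable and the metadata object is the single document that decides which of them are current. The 'files' and 'chunks' collections and the 'chunk_ids' field are made-up names for the example, not GridFS's actual schema:

```python
from pymongo import MongoClient

db = MongoClient().test

def replace_file_contents(file_id, new_chunks):
    old = db.files.find_one({'_id': file_id})
    if old is None:
        return
    # 1. Write the new data objects first; no reader can see them yet,
    #    because nothing references them.
    new_ids = [db.chunks.insert_one({'data': c}).inserted_id for c in new_chunks]
    # 2. Atomically switch the metadata document over to the new data objects.
    db.files.update_one({'_id': file_id}, {'$set': {'chunk_ids': new_ids}})
    # 3. Only now delete the old data objects. A reader sees either the complete
    #    old version or the complete new one (in practice you might delay this
    #    step so that readers already holding the old chunk list can finish).
    for chunk_id in old.get('chunk_ids', []):
        db.chunks.delete_one({'_id': chunk_id})
```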
I also have a use case where I (not frequently) need to update several
objects in a safe way. I think I will solve it with a special solution
for my particular problem.
Just my thoughts on the subject. I wrote this very quickly, so I may have missed some things. Perhaps it would be useful to create something similar to GridFS that implements the kind of transactions described above?
/ Hampus Wessman
First a few more notes on how it would work:
1. The client (library) would simply queue all updates until we get to
step 3 below. Reads would be done as usual. This means that you would
not read your own writes inside a transaction (but then you probably
know what you just wrote anyway, so shouldn't be a big problem). All
updates also need to affect one specific object and we need to know its
primary key (so we can queue it and proceed as intended later).
2. You can either run the whole transaction first (queuing all the
updates locally) and then on commit execute steps 1-5, or you can execute steps 1 and 2 first (similar to explicit locking in many SQL databases), then do reads and writes (the writes being queued), and finally execute steps 3-5 on commit. I hinted at this, but I wasn't
very clear. This is easily achieved if we include all affected objects
in the locking at the beginning (otherwise it gets more complicated with
regard to deadlocks). This achieves different levels of isolation (see
below, note 9).
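As a sketch, such a client library could look something like this (an entirely hypothetical API, reusing db and run_transaction from the first sketch; only the run-everything-on-commit mode is shown, the explicit-locking mode would simply take its locks up front):

```python
class ClientTransaction:
    """Hypothetical client-side API for notes 1-2: writes are only queued."""

    def __init__(self, db):
        self.db = db
        self.queued = []                 # (collection, _id, update_spec) tuples

    def update(self, collection, _id, update_spec):
        # Nothing is written here, so you will not read your own writes.
        self.queued.append((collection, _id, update_spec))

    def commit(self):
        # Runs steps 1-5 from the first sketch (run_transaction defined there).
        return run_transaction(self.queued)

# Usage: queue two updates, then commit them atomically (or not at all).
tx = ClientTransaction(db)
tx.update('accounts', 1, {'$inc': {'balance': -100}})
tx.update('accounts', 2, {'$inc': {'balance': 100}})
ok = tx.commit()
```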
3. It is fairly obvious that the system would work just as well in a sharded environment as in an unsharded one (once you fill out the missing details a bit...), since it only relies on atomic updates of single documents anyway.
4. The system must work reliably with replica sets. For this to work we
need to write everything (that is important) to a majority of the
replica set members. We can check this with getLastError(w=m) where m is
a majority of the replica set members (on all shards). We will want to
have a timeout here, though, and when it times out we simply give up and
abandon the transaction. What is interesting is what to do when we e.g.
find a transaction that is committed, but not yet "cleaned up". We
cannot assume that this has been replicated to more than one machine
just because we read it (i.e. it may not be committed safely after all).
We do know (hopefully) that if it has been replicated to a majority of
the machines then we will always read that value (unless overwritten),
on the other hand. What we do here is that we try to write what we just
read again (i.e. without really changing anything) and then if
getLastError(w=m) succeeds we know that it is safely stored! Note that
in this situation we could NOT have tried to abort it, however, as the
commit MAY have been replicated to a majority and then we would
overwrite the commit (not good!). If it was neither committed nor aborted yet, then we could try to write whatever we wanted. This is actually
similar to Paxos, for those who are familiar with it (i.e. we may
theoretically overwrite something written to a minority, but never
something written to a majority as we will always read that something
and then write it again instead). That writes replicated to a majority
of replica set members are "safe" should be guaranteed by MongoDB (it
does so by using a consensus algorithm when it selects a new primary;
see http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts).
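Written out, the check could look roughly like this. The collection handle is assumed to use w='majority' as in the write-concern sketch above, and the extra timestamp field is my own addition just to force a real write (one that has to replicate) even though the state itself is unchanged:

```python
from datetime import datetime
from pymongo.errors import WTimeoutError

def confirm_state(transactions, tx_id):
    """Return the transaction state only once it is confirmed on a majority."""
    tx = transactions.find_one({'_id': tx_id})
    if tx is None:
        return None
    try:
        # Write back the state we just read, without changing it. We must NOT
        # write a different state here: what we read may already be replicated
        # to a majority, in which case overwriting it would be wrong.
        transactions.update_one(
            {'_id': tx_id, 'state': tx['state']},
            {'$set': {'state': tx['state'], 'confirmed_at': datetime.utcnow()}})
        return tx['state']   # acknowledged by a majority: safe to act on
    except WTimeoutError:
        return None          # not confirmed; don't roll forward based on it
```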
5. Right now the assumption in note 4 above is not entirely correct when
replica sets are used together with sharding, due to
http://jira.mongodb.org/browse/SERVER-2119. This will be fixed of
course, but still good to know about. The issue is that data could be
migrated to another shard (and only written to one server there) and
then lost in a failover in the new shard, even if it was once written to
a majority of nodes in the original shard.
6. The system I outlined uses a kind of (strict) two-phase locking, but
without special read-locks. Locking read objects is optional, and they are locked with exclusive locks when you choose to lock them (see
above). It's still basically the same, though. It should therefore work
correctly (if implemented correctly). See
http://en.wikipedia.org/wiki/Two-phase_locking.
7. Deadlocks are avoided if one only acquires locks in a predefined order (that all clients agree on). I said that earlier, but it's worth pointing
out again. That way it is impossible to get circular waits (see
http://en.wikipedia.org/wiki/Deadlock). One could e.g. order all objects
based on database name, collection name and primary key.
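A tiny sketch of such an ordering; any deterministic total order works, as long as every client uses exactly the same one:

```python
def lock_order(objects):
    """objects: iterable of (database, collection, primary_key) triples."""
    # Sort by database name, collection name and (stringified) primary key,
    # so every client tries to take contended locks in the same order.
    return sorted(objects, key=lambda o: (o[0], o[1], str(o[2])))

# Both clients end up locking the documents in the same order, so a
# circular wait between them is impossible.
client_a = lock_order([('shop', 'accounts', 42), ('shop', 'orders', 7)])
client_b = lock_order([('shop', 'orders', 7), ('shop', 'accounts', 42)])
assert client_a == client_b
```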
8. This would probably be more efficient if implemented inside MongoDB
(or similarly), but I think it's cool that it could be implemented as a
library (or perhaps a specification + several libraries). One
inefficiency I see in doing it that way is that a client that finds a locked object must poll the lock until it succeeds in acquiring it... No problem with low lock contention, though. One challenge would be to make sure that all client libraries (different languages and versions and so
on) followed the same "rules" and worked together without compromising
correctness.
9. A note on transaction isolation. Have a look at e.g.
http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html.
With the system described here, ordinary reads (outside first locking
objects) would be "read committed". They would never see uncommitted data (from a transaction), but they could see data from transactions committed during the read. This means that if a transaction
updates 5 objects and you read those 5 objects concurrently, you could
get some with and some without the update (this could not happen in
PostgreSQL with "read committed"). If you lock the objects first, then
you get "repeatable reads". No other transactions will update the
objects concurrently. You only know that this was the case if you
finally manage to commit the transaction, though (the transaction may
have been aborted by another client and the locks released otherwise).
Also note that ordinary writes outside of a transaction could also
delete and/or violate the locks, so you must make sure that there are no conflicting writes that don't use transactions...
10. The locks can be written as a special field inside the locked
objects. This is nice, because you don't need to change the layout of
your data to use it (and you don't need to use it for everything).
Ordinary reads will work just as before (as long as you ignore that
special field). One could also wrap the actual data inside a document
that adds transaction data on top of it (e.g. embedding the user data
under a 'data' field). If you also had a library or a proxy that all
requests went through, then you could easily rewrite requests so that
the lock info was never overwritten and never returned to the user... I
kind of like the less intrusive alternative above, though.
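To make the two layouts concrete, this is roughly what a locked document would look like in each case (field names, as before, are only examples):

```python
from datetime import datetime, timedelta
from bson import ObjectId

tx_id = ObjectId()                                 # the owning transaction document
expires = datetime.utcnow() + timedelta(seconds=30)

# (a) Less intrusive: the lock lives in one reserved top-level field and the
#     rest of the document is the user's data, unchanged. Readers that ignore
#     'tx_lock' work exactly as before.
order_plain = {
    '_id': 42,
    'status': 'paid',
    'total': 99.5,
    'tx_lock': {'tx': tx_id, 'expires': expires},  # only present while locked
}

# (b) Wrapped: the user's data sits under a 'data' field with the transaction
#     bookkeeping next to it; a library or proxy strips the wrapper before
#     returning documents to the application.
order_wrapped = {
    '_id': 42,
    'data': {'status': 'paid', 'total': 99.5},
    'tx_lock': {'tx': tx_id, 'expires': expires},
}
```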
Comparison to MVCC:
11. This system would only use locking instead of using some kind of
MVCC to handle concurrent reads. The Google paper linked to earlier used
MVCC for consistent reads (and locking for committing transactions).
12. The main advantage of MVCC is that you can do consistent reads
without locking. You can do the same with this system, but you would need to lock the objects first if you need the read to be consistent across objects. If you do an ordinary read in this system you risk
getting inconsistent versions of objects (each being consistent with
itself, though). MVCC really helps here, if you need to do a lot of
consistent reads like this. I don't think most people need to do that.
13. The disadvantage of MVCC (in this context at least) is that it gets
complicated. You need to synchronize version/transaction numbers between
objects (probably globally using some kind of coordinator) and you need
to clean up old versions when you think they aren't needed anymore and
so on.
Now that got a bit longer than originally intended, but I think it all got a little clearer at least... Hopefully someone will find it interesting. I might try to create a real library like this some day (won't have time for a while, though). Anyone who would be interested in that? In that case, what about note 10 above? Finally, are there any better ways all this could be done?
Hampus Wessman
My point mostly was that I think it would be possible to create a
client-side library for making some kind of atomic multi-object updates
with MongoDB (without any additional built-in support). It wouldn't necessarily offer very high performance, but it should definitely be possible, and I think it's an interesting idea. That said, I think it would be
better in most cases to just implement an ad hoc solution using MongoDB
directly (which should be equally possible).
Not sure how XMPP would help, but doesn't MongoDB already push the
locking (or whatever you choose to do) into the (client) application?
Pushing it into MongoDB (if you meant it that way) would probably be
more efficient (not saying it would be a good idea).
/ Hampus
On 2010-12-10 20:39, Andrew Rondeau wrote:
> Wow, that was long.
>
> Why not just use some form of an XMPP server and push the locking
> logic into the application?
>
> On Dec 9, 6:51 am, Hampus Wessman<hampus.wess...@gmail.com> wrote:
>> I should add a few more notes on the system I outlined in my last e-mail
>> (see below). First note that this is just an example of one way this
>> could be done (i.e. adding transactions on top of MongoDB). May still be
>> interesting for some, though. I sort of like it. Feel free to comment.