The key is to use a single document to "synchronize" the transaction,
i.e. the commit or abort must be done with a single-document update. One
way to do this is to create a special collection for transaction
documents and then do transactions in these steps (i.e. using locks):
1. create the transaction document.
2. "lock" each object the transaction will use (see below, saved inside
the object).
3. commit the transaction by atomically updating the transaction document, provided it has not been aborted.
4. atomically update each object and remove its lock (only if still locked).
5. when all updates have been successful, remove the transaction
document and return success.
Each lock is written as a field in the locked object and consists of a reference to the transaction document and an expiration time. By always locking objects in a particular order we can avoid deadlocks.
When the transaction document is updated to "committed" (or earlier), the intended updates for each object are also written to the transaction document. That way we can complete the updates even if the client crashes.
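To make this concrete, here is a rough sketch of steps 1-5 in Python with pymongo. Everything about the format is invented for the example (the 'transactions' collection, the 'tx_lock', 'state' and 'updates' fields, the 30-second lock lifetime), error handling is reduced to aborting, and all important writes would of course also need an appropriate write concern (see further down):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient().test
transactions = db.transactions   # assumed collection for transaction documents

def abort(tx_id, updates):
    # Aborting is only possible while the transaction document is still pending.
    transactions.update_one({'_id': tx_id, 'state': 'pending'},
                            {'$set': {'state': 'aborted'}})
    for coll, obj_id, _ in updates:   # release our locks without applying anything
        db[coll].update_one({'_id': obj_id, 'tx_lock.tx': tx_id},
                            {'$unset': {'tx_lock': ''}})
    transactions.delete_one({'_id': tx_id, 'state': 'aborted'})

def run_transaction(updates):
    """updates: list of (collection_name, _id, update_spec) tuples."""
    # Step 1: create the transaction document, including the intended updates,
    # so another client can finish the job if we crash after committing.
    tx_id = transactions.insert_one({
        'state': 'pending',
        'updates': [{'c': c, 'o': o, 'u': u} for c, o, u in updates],
    }).inserted_id

    # Step 2: lock each object (in a fixed order, see the deadlock note below)
    # by writing a lock field into it, but only if it is not already locked.
    expires = datetime.utcnow() + timedelta(seconds=30)
    for coll, obj_id, _ in sorted(updates, key=lambda t: (t[0], str(t[1]))):
        res = db[coll].update_one(
            {'_id': obj_id, 'tx_lock': {'$exists': False}},
            {'$set': {'tx_lock': {'tx': tx_id, 'expires': expires}}})
        if res.modified_count != 1:   # already locked (or missing): give up
            abort(tx_id, updates)
            return False

    # Step 3: commit atomically, but only if nobody has aborted us meanwhile.
    res = transactions.update_one({'_id': tx_id, 'state': 'pending'},
                                  {'$set': {'state': 'committed'}})
    if res.modified_count != 1:
        abort(tx_id, updates)
        return False

    # Step 4: apply each queued update and drop the lock, if still locked by us.
    for coll, obj_id, update in updates:
        update = dict(update)
        update.setdefault('$unset', {})['tx_lock'] = ''
        db[coll].update_one({'_id': obj_id, 'tx_lock.tx': tx_id}, update)

    # Step 5: all updates applied, so the transaction document can go away.
    transactions.delete_one({'_id': tx_id})
    return True
```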
If a new transaction finds an expired lock it can look up the
transaction document and try to abort it (only possible if not
committed). If the transaction is committed, then it can simply update
the objects and release the locks. After an abort it instead releases the locks without applying the updates. If it finds a non-expired lock it will have to wait a few milliseconds and retry.
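Roughly like this, reusing the made-up field names from the sketch above; this is what a client could do when it runs into somebody else's lock:

```python
import time
from datetime import datetime

def on_foreign_lock(db, transactions, coll_name, obj):
    """Called when obj (read from db[coll_name]) carries someone else's lock."""
    lock = obj['tx_lock']
    if lock['expires'] > datetime.utcnow():
        time.sleep(0.01)          # lock still valid: wait a little, then retry
        return

    # The lock has expired. Try to abort its transaction; this only succeeds
    # while the transaction document is still pending.
    transactions.update_one({'_id': lock['tx'], 'state': 'pending'},
                            {'$set': {'state': 'aborted'}})
    tx = transactions.find_one({'_id': lock['tx']})

    if tx is None:
        # Transaction document is already gone: just drop the stale lock.
        db[coll_name].update_one({'_id': obj['_id'], 'tx_lock.tx': lock['tx']},
                                 {'$unset': {'tx_lock': ''}})
        return

    for u in tx['updates']:
        if tx['state'] == 'committed':
            # Roll forward: apply the queued update and release the lock.
            spec = dict(u['u'])
            spec.setdefault('$unset', {})['tx_lock'] = ''
        else:
            # Aborted: release the lock without applying anything.
            spec = {'$unset': {'tx_lock': ''}}
        db[u['c']].update_one({'_id': u['o'], 'tx_lock.tx': tx['_id']}, spec)
    transactions.delete_one({'_id': tx['_id']})
```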
One could get rid of the transaction document by putting this data in the "first" of the objects involved instead (similar to the Google paper mentioned elsewhere in this thread). That requires fewer updates, but if another client overwrites that object then the whole transaction is corrupted. With a separate transaction document, an overwritten document just means that we lose the updates to that one document. It is still important to note that updates outside of transactions simply ignore locks and can overwrite them.
Note that the objects can be locked at the end of the transaction (as
part of committing it), but if you need to make sure the objects are not
changed by another transaction during the whole transaction, then you can lock them at the beginning of the transaction instead...
There are a lot of small things that you need to think about when
implementing something like this (e.g. getLastError with w=? and
wtimeout for important writes and how to safely delete an object in a
transaction), but I'll mostly ignore that here. You should definitely
check that something has been written to a majority of machines in a
replica set before you consider it successfully written...
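I'm describing this in terms of getLastError(w=..., wtimeout=...); with a current driver the same idea is expressed as a write concern on the collection handle. A sketch (the 5-second timeout is arbitrary), and really how the collection handles in the earlier sketches would need to be created:

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

db = MongoClient().test
# Every "important" write goes through handles that demand acknowledgement
# from a majority of the replica set members, with a timeout.
majority = WriteConcern(w='majority', wtimeout=5000)
transactions = db.get_collection('transactions', write_concern=majority)

def commit(tx_id):
    try:
        res = transactions.update_one({'_id': tx_id, 'state': 'pending'},
                                      {'$set': {'state': 'committed'}})
        return res.modified_count == 1
    except WTimeoutError:
        # Could not confirm the write on a majority within the timeout:
        # give up on the transaction instead of assuming it committed.
        return False
```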
Do note that this requires quite a few more writes than an ordinary
update and the locks will be held for quite some time (especially if you
lock everything at the start of the transaction and/or have large
transactions). When a client crashes it will also take a while until the
locks expire. Nevertheless, it should be fast enough for small transactions that are run infrequently. Transactions that involve mutually
exclusive sets of objects do not affect each other at all either, so it
scales well when you get more objects (if each transaction still
involves only a few of them). It does not scale at all if a single
object is involved in many transactions.
I believe you can create a more specialized solution (still using a
single document to synchronize your updates) that is more efficient in most cases. GridFS could, for example, reference the data objects from the metadata object and, on updates, create new data objects first, then update the metadata object, and finally delete the old objects. Then a reader would not
see half-updated files.
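A rough sketch of that idea, where the data objects are immutable and the metadata object is the single document that decides which of them are current. The 'files' and 'chunks' collections and the 'chunk_ids' field are made-up names for the example, not GridFS's actual schema:

```python
from pymongo import MongoClient

db = MongoClient().test

def replace_file_contents(file_id, new_chunks):
    old = db.files.find_one({'_id': file_id})
    if old is None:
        return
    # 1. Write the new data objects first; no reader can see them yet,
    #    because nothing references them.
    new_ids = [db.chunks.insert_one({'data': c}).inserted_id for c in new_chunks]
    # 2. Atomically switch the metadata document over to the new data objects.
    db.files.update_one({'_id': file_id}, {'$set': {'chunk_ids': new_ids}})
    # 3. Only now delete the old data objects. A reader sees either the complete
    #    old version or the complete new one (in practice you might delay this
    #    step so that readers already holding the old chunk list can finish).
    for chunk_id in old.get('chunk_ids', []):
        db.chunks.delete_one({'_id': chunk_id})
```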
I also have a use case where I (not frequently) need to update several
objects in a safe way. I think I will solve it with a special solution
for my particular problem.
Just my thoughts on the subject. I wrote this very quickly, so I may have missed some things. Perhaps it would be useful to create something similar to GridFS that implements the kind of transactions described above?
/ Hampus Wessman
First a few more notes on how it would work:
1. The client (library) would simply queue all updates until we get to
step 3 below. Reads would be done as usual. This means that you would
not read your own writes inside a transaction (but then you probably
know what you just wrote anyway, so shouldn't be a big problem). All
updates also need to affect one specific object and we need to know its
primary key (so we can queue it and proceed as intended later).
2. You can either run the whole transaction first (queuing all the
updates locally) and then on commit execute steps 1-5, or you can execute steps 1 and 2 first (similar to explicit locking in many SQL databases), then do reads and writes (the writes being queued), and finally execute steps 3-5 on commit. I hinted at this, but I wasn't
very clear. This is easily achieved if we include all affected objects
in the locking at the beginning (otherwise it gets more complicated with
regard to deadlocks). This achieves different levels of isolation (see
below, note 9).
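As a sketch, such a client library could look something like this (an entirely hypothetical API, reusing db and run_transaction from the first sketch; only the run-everything-on-commit mode is shown, the explicit-locking mode would simply take its locks up front):

```python
class ClientTransaction:
    """Hypothetical client-side API for notes 1-2: writes are only queued."""

    def __init__(self, db):
        self.db = db
        self.queued = []                 # (collection, _id, update_spec) tuples

    def update(self, collection, _id, update_spec):
        # Nothing is written here, so you will not read your own writes.
        self.queued.append((collection, _id, update_spec))

    def commit(self):
        # Runs steps 1-5 from the first sketch (run_transaction defined there).
        return run_transaction(self.queued)

# Usage: queue two updates, then commit them atomically (or not at all).
tx = ClientTransaction(db)
tx.update('accounts', 1, {'$inc': {'balance': -100}})
tx.update('accounts', 2, {'$inc': {'balance': 100}})
ok = tx.commit()
```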
3. It is fairly obvious that the system would work just as well in a sharded environment as in an unsharded one (once you fill out the missing details a bit...), since it only relies on atomic updates of single documents anyway.
4. The system must work reliably with replica sets. For this to work we
need to write everything (that is important) to a majority of the
replica set members. We can check this with getLastError(w=m) where m is
a majority of the replica set members (on all shards). We will want to
have a timeout here, though, and when it times out we simply give up and
abandon the transaction. What is interesting is what to do when we e.g.
find a transaction that is committed, but not yet "cleaned up". We
cannot assume that this has been replicated to more than one machine
just because we read it (i.e. it may not be committed safely after all).
We do know (hopefully) that if it has been replicated to a majority of
the machines then we will always read that value (unless overwritten),
on the other hand. What we do here is that we try to write what we just
read again (i.e. without really changing anything) and then if
getLastError(w=m) succeeds we know that it is safely stored! Note that
in this situation we could NOT have tried to abort it, however, as the
commit MAY have been replicated to a majority and then we would
overwrite the commit (not good!). If it was neither committed nor aborted yet, then we could try to write whatever we wanted. This is actually
similar to Paxos, for those who are familiar with it (i.e. we may
theoretically overwrite something written to a minority, but never
something written to a majority as we will always read that something
and then write it again instead). That writes replicated to a majority
of replica set members are "safe" should be guaranteed by MongoDB (it
does so by using a consensus algorithm when it selects a new primary;
see http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts).
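Written out, the check could look roughly like this. The collection handle is assumed to use w='majority' as in the write-concern sketch above, and the extra timestamp field is my own addition just to force a real write (one that has to replicate) even though the state itself is unchanged:

```python
from datetime import datetime
from pymongo.errors import WTimeoutError

def confirm_state(transactions, tx_id):
    """Return the transaction state only once it is confirmed on a majority."""
    tx = transactions.find_one({'_id': tx_id})
    if tx is None:
        return None
    try:
        # Write back the state we just read, without changing it. We must NOT
        # write a different state here: what we read may already be replicated
        # to a majority, in which case overwriting it would be wrong.
        transactions.update_one(
            {'_id': tx_id, 'state': tx['state']},
            {'$set': {'state': tx['state'], 'confirmed_at': datetime.utcnow()}})
        return tx['state']   # acknowledged by a majority: safe to act on
    except WTimeoutError:
        return None          # not confirmed; don't roll forward based on it
```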
5. Right now the assumption in note 4 above is not entirely correct when
replica sets are used together with sharding, due to
http://jira.mongodb.org/browse/SERVER-2119. This will be fixed of
course, but still good to know about. The issue is that data could be
migrated to another shard (and only written to one server there) and
then lost in a failover in the new shard, even if it was once written to
a majority of nodes in the original shard.
6. The system I outlined uses a kind of (strict) two-phase locking, but
without special read-locks. Locking read objects is optional, and they are locked with exclusive locks when you choose to lock them (see
above). It's still basically the same, though. It should therefore work
correctly (if implemented correctly). See
http://en.wikipedia.org/wiki/Two-phase_locking.
7. Deadlocks are avoided if one only acquires locks in a predefined order (that all clients agree on). I said that earlier, but it's worth pointing
out again. That way it is impossible to get circular waits (see
http://en.wikipedia.org/wiki/Deadlock). One could e.g. order all objects
based on database name, collection name and primary key.
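A tiny sketch of such an ordering; any deterministic total order works, as long as every client uses exactly the same one:

```python
def lock_order(objects):
    """objects: iterable of (database, collection, primary_key) triples."""
    # Sort by database name, collection name and (stringified) primary key,
    # so every client tries to take contended locks in the same order.
    return sorted(objects, key=lambda o: (o[0], o[1], str(o[2])))

# Both clients end up locking the documents in the same order, so a
# circular wait between them is impossible.
client_a = lock_order([('shop', 'accounts', 42), ('shop', 'orders', 7)])
client_b = lock_order([('shop', 'orders', 7), ('shop', 'accounts', 42)])
assert client_a == client_b
```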
8. This would probably be more efficient if implemented inside MongoDB
(or similarly), but I think it's cool that it could be implemented as a
library (or perhaps a specification + several libraries). One
inefficiency I see in doing it that way is that a client that finds a locked object must poll the lock until it succeeds in acquiring it... No problem with low lock contention, though. One challenge would be to make sure that all client libraries (different languages and versions and so
on) followed the same "rules" and worked together without compromising
correctness.
9. A note on transaction isolation. Have a look at e.g.
http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html.
With the system described here, ordinary reads (outside first locking
objects) would be "read committed". They would never see uncommitted data (from a transaction), but they could see data from transactions committed during the read. This means that if a transaction
updates 5 objects and you read those 5 objects concurrently, you could
get some with and some without the update (this could not happen in
PostgreSQL with "read committed"). If you lock the objects first, then
you get "repeatable reads". No other transactions will update the
objects concurrently. You only know that this was the case if you
finally manage to commit the transaction, though (the transaction may
have been aborted by another client and the locks released otherwise).
Also note that ordinary writes outside of a transaction could also
delete and/or violate the locks, so you must make sure that there are no conflicting writes that don't use transactions...
10. The locks can be written as a special field inside the locked
objects. This is nice, because you don't need to change the layout of
your data to use it (and you don't need to use it for everything).
Ordinary reads will work just as before (as long as you ignore that
special field). One could also wrap the actual data inside a document
that adds transaction data on top of it (e.g. embedding the user data
under a 'data' field). If you also had a library or a proxy that all
requests went through, then you could easily rewrite requests so that
the lock info was never overwritten and never returned to the user... I
kind of like the less intrusive alternative above, though.
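To make the two layouts concrete, this is roughly what a locked document would look like in each case (field names, as before, are only examples):

```python
from datetime import datetime, timedelta
from bson import ObjectId

tx_id = ObjectId()                                 # the owning transaction document
expires = datetime.utcnow() + timedelta(seconds=30)

# (a) Less intrusive: the lock lives in one reserved top-level field and the
#     rest of the document is the user's data, unchanged. Readers that ignore
#     'tx_lock' work exactly as before.
order_plain = {
    '_id': 42,
    'status': 'paid',
    'total': 99.5,
    'tx_lock': {'tx': tx_id, 'expires': expires},  # only present while locked
}

# (b) Wrapped: the user's data sits under a 'data' field with the transaction
#     bookkeeping next to it; a library or proxy strips the wrapper before
#     returning documents to the application.
order_wrapped = {
    '_id': 42,
    'data': {'status': 'paid', 'total': 99.5},
    'tx_lock': {'tx': tx_id, 'expires': expires},
}
```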
Comparison to MVCC:
11. This system would only use locking instead of using some kind of
MVCC to handle concurrent reads. The Google paper linked to earlier used
MVCC for consistent reads (and locking for committing transactions).
12. The main advantage of MVCC is that you can do consistent reads
without locking. You can do the same with this system, but you would need to lock the objects first if you need the read to be consistent across objects. If you do an ordinary read in this system you risk
getting inconsistent versions of objects (each being consistent with
itself, though). MVCC really helps here, if you need to do a lot of
consistent reads like this. I don't think most people need to do that.
13. The disadvantage of MVCC (in this context at least) is that it gets
complicated. You need to synchronize version/transaction numbers between
objects (probably globally using some kind of coordinator) and you need
to clean up old versions when you think they aren't needed anymore and
so on.
Now that got a bit longer than originally intended, but I think it all got a little clearer at least... Hopefully someone will find it interesting. I might try to create a real library like this some day (won't have time for a while, though). Anyone who would be interested in that? In that case, what about note 10 above? Finally, are there any better ways all this could be done?
Hampus Wessman
My point mostly was that I think it would be possible to create a
client-side library for making some kind of atomic multi-object updates
with MongoDB (without any additional built-in support). It wouldn't necessarily offer very high performance, but it should definitely be possible, and I think it's an interesting idea. That said, I think it would be
better in most cases to just implement an ad hoc solution using MongoDB
directly (which should be equally possible).
Not sure how XMPP would help, but doesn't MongoDB already push the
locking (or whatever you choose to do) into the (client) application?
Pushing it into MongoDB (if you meant it that way) would probably be
more efficient (not saying it would be a good idea).
/ Hampus
On 2010-12-10 20:39, Andrew Rondeau wrote:
> Wow, that was long.
>
> Why not just use some form of an XMPP server and push the locking
> logic into the application?
>
> On Dec 9, 6:51 am, Hampus Wessman<hampus.wess...@gmail.com> wrote:
>> I should add a few more notes on the system I outlined in my last e-mail
>> (see below). First note that this is just an example of one way this
>> could be done (i.e. adding transactions on top of MongoDB). May still be
>> interesting for some, though. I sort of like it. Feel free to comment.