Concurrency Control in the datastore

DXD

unread,

Nov 6, 2008, 7:38:14 PM11/6/08

to Google App Engine

Hi,

I'm a new GAE developer, and now particularly interested in
investigating the difference in concurrency control between the
datastore and a traditional RDBMS. I looked at Ryan Barrett's
presentation about the datastore in the Google IO sessions (http://
sites.google.com/site/io/under-the-covers-of-the-google-app-engine-
datastore), and got some brief understanding of it. To me, doesn't
look like Google implemented the Timestamp-based scheduling method
(although there does exist a timestamp attached with each row). The
method I'm talking about is the "traditional" timestamp method
mentioned in section 9.8.1 of the book "Database System
Implementation" (http://infolab.stanford.edu/~ullman/dbsi.html) where
each row should have the "read time" and the "write time", and based
on the comparison between these values and the timestamp of the
transaction, the scheduler will decide how to handle the read/write
request from the transaction. Is my statement true? If so, could
anyone pls point me to some (as detailed as possible) references/
papers/technical documentations/presentations about how/what method
Google actually used to implement concurrency control? (did they
follow some standard method or invent their own?). I need this asap,
so greatly appreciate if someone could give me some help.

Thanks,
David.

Alexander Kojevnikov

unread,

Nov 6, 2008, 7:46:52 PM11/6/08

to Google App Engine

http://code.google.com/appengine/articles/transaction_isolation.html

DXD

unread,

Nov 13, 2008, 1:16:23 AM11/13/08

to Google App Engine

Thanks Alexander.

So now this is my guess about the rules of the datastore's timestamp-
based scheduler. If someone could, pls correct where I'm wrong in here
(basically I consolidated the theoretical timestamp-based scheduler
with what I understand from Ryan's presentation):

Each transaction T is assigned a timestamp TS(T) at startup. Each root
entity R is given a read timestamp RTS(R) and a committed timestamp
CTS(R) (this is the timestamp mentioned in Ryan's presentation).

1. Transaction T wants to read some entity X whose root is R (X and R
may or may not be the same).
a). If TS(T) > CTS(R): T goes ahead and performs the read
(retrieved data from X is respective to CTS(R)). If TS(T) > RTS(R),
set RTS(R) := TS(T); otherwise leave RTS(R) intact.
b). If TS(T) < CTS(R): invalid read. Rollback T (abort T and
restart it with a new, larger timestamp).

2. T wants to write X.
a). If TS(T) >= RTS(R) and TS(T) > CTS(R): T performs the write to
X including writing the journal in R and applying that journal to X,
excluding setting the new value for CTS(R).
b). If TS(T) >= RTS(R), but TS(T) < CTS(R): the Thomas write rule
applies here. The write by T is ignored. T is allowed to proceed
without making change to R and X.
c). if TS(T) < RTS(R): invalid write. T must be rolled back.

3. T completes all its actions and now wants to commit. If CTS(R) at
this point is still the same as when T started, it is set to TS(T); T
effectively gets committed. If CTS(R) has changed, T must be rolled
back.

Note that in rolling back:
- If T has changed RTS(R), the old value must be restored.
- T may be retried at most n number of times (currently n is set to
4?).

I'd appreciate if someone could give comments asap on what I outlined
above.

Thanks a lot,
David.

ryan

unread,

Nov 17, 2008, 12:38:38 PM11/17/08

to Google App Engine

hi david! you're in roughly the same neighborhood, but there are a
number of key differences between what you've outline and what the
datastore actually does. here are a few:

- each entry in an entity group's tx log has a single timestamp. there
aren't separate separate read and last committed timestamps.
- we only fail at commit time, not earlier. specifically, we never
consider a read "invalid" just because a write has happened since the
tx started. we just read the data as of when the tx started, ie before
the new write. if a write occurs within the tx, we'll fail on commit.
if the tx is read-only, then there's no need to fail.
- when starting a tx, its timestamp is always generated based on the
last committed timestamp of the entity group. we can't just use the
time on the local server, since clocks aren't perfectly synchronized
across servers.
- note that txes may not be retried purely within the datastore, since
writes may depend on the values of earlier reads. if the entities that
were read have changed, we must re-run your application logic to
recalculate the derived writes. that's why run_in_transaction() takes
a lambda, so that we can re-run it if necessary.

DXD

unread,

Dec 3, 2008, 7:44:09 PM12/3/08

to Google App Engine

Hi Ryan,

So actually you guys use the multiversion timestamp control, and the
rules of the scheduler are in fact more straightforward than what I
outlined. I believe they are as follows (pls correct me if I'm wrong
somewhere):

When a transaction T starts, it is assigned a timestamp TS(T)
generated based on last committed timestamp CTS(R) of the entity group
it targets (or more precisely, of the root entity R). It also
remembers CTS(R).

1. T wants to write entity X in the group: it just goes ahead and
performs the write, including writing the journal in R and applying

that journal to X, excluding setting the new value for CTS(R).

2. T wants to read entity X: if T has not performed any write to X, it
will retrieve data from X as of the original CTS(R) that it remembers
from the beginning. If it has modified X, it will retrieve data from X
as of TS(T) (I suppose T must see the change it made to X; note that
at this point TS(T) has not become the last committed timestamp yet).

3. T completes all its actions and now wants to commit. It compares
the original CTS(R) that it remembers from the beginning with the
current value at this point; if they are still the same, TS(T) is set
as the new value of CTS(R) and T effectively gets committed;
otherwise, T must be rolled back. Note that if T is a read-only
transaction, this step 3 essentially gets omitted (and T always
succeeds).

If what I outline above is correct, I guess there's something I'm not
totally comfortable with. In a general timestamp-based scheduler, the
timestamp order of transactions is also the serial order in which they
must appear to execute. So if two transactions T1, T2 start in this
order, and T1 writes an entity X whereas T2 reads that same entity,
then I believe the theory dictates that T2 must read what T1 writes.

However, in the datastore, that's not always the case. For ex: we have
T1 and T2 running concurrently, T1 is a write transaction whereas T2
is a read-only one; T2 starts when T1 is ongoing (but not yet
committed) -> the last committed timestamp, and thus the data of
entity X that T2 sees, is as of when T1 starts. T1 makes changes to X
and then commits at some point later. But these changes are not seen
by T2. So T2 does not read what T1 writes, which is contradictory to
the theory. I understand this discrepancy exists because you use
committed time instead of write time, and you omit read time. But what
I'm wondering is why this behavior is reasonable, why it makes sense.
Could you pls provide some explanations? (if these explanations take
sometime, could you pls verify the scheduler's rules I outlined above
first?)

Thanks a lot,
David.

ryan

unread,

Dec 4, 2008, 2:16:56 PM12/4/08

to Google App Engine

hey david! wow, comprehensive. you're right: modulo some minor
implementation details, this is basically our optimistic concurrency
algorithm.

your concern stems from the fact that the datastore doesn't use locks,
so txes aren't really ordered like they are in lock-based systems.

in a lock-based system, regardless of whether the locks or simple or
differentiate btw reads and write, T2 wouldn't start until T1 has
committed. given that, yes, T2 would see T1's write.

in our datastore, however, you're right, T2 can run concurrently with
T1. T2 issues its reads before T1 commits, so to preserve isolation,
T2 isn't allowed to see T1's in progress tx. if was, it might see
inconsistent data.

to make this concrete, let's say T1 is transferring $1 from account A
to B, and T2 is reading the balances of A and B. if T2 could see T1's
in-progress writes, this could happen:

- T1 adds $10 to B
- T2 reads A and B
- T1 deducts $10 from A

if T2 sees B's balance after T1 adds $10 to it, T2 will see that B has
$10 extra, but A still has the same amount. (it doesn't matter if T1
deducts from A first. T2 would see a similar inconsistency.)

DXD

unread,

Dec 11, 2008, 2:46:21 PM12/11/08

to Google App Engine

Wonderful! Great design!!! Thanks Ryan.

David.

yejun

unread,

Dec 11, 2008, 4:18:09 PM12/11/08

to Google App Engine

Basically, I think during concurrent tractions, the first one commit
will win, the rest all fail. Non transaction operation will always
success. Write is always transactioned.
Am I right?

ryan

unread,

Dec 11, 2008, 5:08:44 PM12/11/08

to Google App Engine

On Dec 11, 1:18 pm, yejun <yej...@gmail.com> wrote:
> Basically, I think during concurrent tractions, the first one commit
> will win, the rest all fail. Non transaction operation will always
> success. Write is always transactioned.

correct. the python API retries transactions when they collide,
though, and the datastore itself retries writes when they collide, so
your app will generally only see collisions (in the form of
TransactionFailedError exceptions) during periods of very high
contention when all retries fail.

yejun

unread,

Dec 11, 2008, 5:49:19 PM12/11/08

to Google App Engine

Actually I still have a question.
What actually happened, if a new entity created in a transaction. What
if 2 concurrent transaction create entity with same key path.

ryan

unread,

Dec 12, 2008, 3:07:41 PM12/12/08

to Google App Engine

if you create an entity with the same key as an existing entity, e.g.
you do Foo(key_name='bar').put() in one request, then you do the same
thing in another request, the second put() will overwrite the first.

if this happens in two concurrent transactions and they collide, one
of the transactions will retry and do its put() again, overwriting the
first one.

if the put()s are outside transactions, it's much less likely that
they'll collide. it's still possible that the datastore could attempt
to perform those writes at the same time, but it's much less likely.

Sharp-Developer.Net

unread,

Dec 12, 2008, 4:21:36 PM12/12/08

to Google App Engine

And there is a get_or_insert() method.

I guess it's waiting for active transactions to be completed and could
be used for synchronization.
--
Alex
http://sharp-developer.net/

yejun

unread,

Dec 12, 2008, 5:11:42 PM12/12/08

to Google App Engine

get_or_insert is very annoying to use. First it doesn't tell you
whether it is a get or insert at the end. 2nd it cannot be
encapsulated in another transaction.

On Dec 12, 4:21 pm, "Sharp-Developer.Net"

<alexander.trakhime...@gmail.com> wrote:
> And there is a get_or_insert() method.
>
> I guess it's waiting for active transactions to be completed and could
> be used for synchronization.
> --

> Alexhttp://sharp-developer.net/

DXD

unread,

Dec 18, 2008, 2:46:01 PM12/18/08

to Google App Engine

Ryan,

I have an additional little question about the read action and
rollback.

Suppose we have a group whose root is R with 2 entities X and Y. Now
tx T1 updates X and successfully commits. So CTS(R) becomes TS(T1),
but Y doesn't have an entry for TS(T1) (because it didn't get updated
by T1). Then tx T2 comes in and queries Y according to the current CTS
(R) which is TS(T1). So what is retrieved from Y must be as of its
lastest timestamp before TS(T1), right? And generally speaking, to be
precise, the read rule I mentioned in an earlier msg should be revised
as this: if T has not performed any write to X, it will retrieve data
from X as of the original CTS(R) that it remembers from the beginning,
or as of the latest timestamp entry before CTS(R) if CTS(R) entry
doesn't exist in X.

Is it correct? (is this one of the minor implementation details that
you mentioned earlier?)

And if this is correct, I believe in rolling back T, all write entries
that T has performed and been applied to all entities must be deleted,
right? Otherwise, the above rule wouldn't work properly. Also, are
there any other things that must be done during rollback?

Thanks,
David.

ryan

unread,

Dec 19, 2008, 12:15:54 PM12/19/08

to Google App Engine

thanks for the clarification!

On Dec 18, 11:46 am, DXD <DungD...@gmail.com> wrote:
>
> Y doesn't have an entry for TS(T1) (because it didn't get updated
> by T1). Then tx T2 comes in and queries Y according to the current CTS
> (R) which is TS(T1). So what is retrieved from Y must be as of its
> lastest timestamp before TS(T1), right?

right.

> to be precise, the read rule I mentioned in an earlier msg should be revised
> as this: if T has not performed any write to X, it will retrieve data
> from X as of the original CTS(R) that it remembers from the beginning,
> or as of the latest timestamp entry before CTS(R) if CTS(R) entry
> doesn't exist in X.

correct.

> And if this is correct, I believe in rolling back T, all write entries
> that T has performed and been applied to all entities must be deleted,
> right? Otherwise, the above rule wouldn't work properly. Also, are
> there any other things that must be done during rollback?

actually, during a tx, writes are persisted only to the entity group's
journal. the tx isn't actually applied, ie written to the actual
entities, until it's committed. that's important because of the
standard distributed system properties: the datastore server or the
python runtime running the app could die at any time, networking could
fail, etc., so we don't want to start applying to the entities
themselves until we know we've journalled the full tx and we can roll
it forward later if we die during application.

that means that rollback is effectively a noop, which is nice for a
few reasons, e.g. we can more aggressively garbage collect old
versions of entities and old parts of entity group journals.

Reply all

Reply to author

Forward