Datastore Geography.

javaDinosaur

Apr 21, 2009, 3:59:22 PM
to Google App Engine
The preview release of Java for App Engine has gotten me interested in
GAE once more, but an unanswered question from the early 2008 beta days
is still causing me anxiety.

I ask this question based on my working assumption that GAE spins up an
app's web processes at the Google data center closest to the client
browser.

If moderate usage were to cause an app to have two active web
processes, one based in Europe and the other on the West Coast of
America, do both sites reference one authoritative Datastore location?

Or do Datastore API calls reference the nearest local Datastore copy?

If the latter, how would concurrent updates to the same entity from
European and Californian client browsers be resolved?

Barry Hunter

Apr 21, 2009, 4:28:53 PM
to google-a...@googlegroups.com
I can't find the reference now, but the Datastore (and even memcache, I
believe) works as a single global 'entity'; there is only one.

Transactions are global too, and the datastore is consistent.

Basically, you as the application designer don't have to consider
whether the app runs on a single server or on 10,000 separate servers.
The interface to the datastore is consistent.
--
Barry

- www.nearby.org.uk - www.geograph.org.uk -

javaDinosaur

Apr 21, 2009, 4:56:49 PM
to Google App Engine
> I can't find the reference now, but the Datastore (and even memcache, I
> believe) works as a single global 'entity'; there is only one.

I agree that, from an API usage perspective, the Datastore is a single
global resource, but typical single-entity Datastore fetch times are
less than intercontinental IP packet travel times, which indicates that
an App's web process and its associated data storage are geographically
close.
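
To put a rough number on that comparison, here is a back-of-the-envelope
sketch. The figures are assumptions, not measurements from this thread:
light in optical fibre travels at roughly 200,000 km/s, and the
great-circle distance from London to San Francisco is roughly 8,600 km.
Real routes are longer and add switching delay, so this is only a lower
bound.

```python
# Rough lower bound on round-trip time between two sites over optical fibre.
# Assumed figures (not from this thread): light in fibre travels at about
# 200,000 km/s; London to San Francisco is about 8,600 km great-circle.

FIBRE_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Return the minimum round-trip time in milliseconds for a fibre path."""
    one_way_s = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    # London <-> San Francisco, straight-line best case.
    print(round(min_round_trip_ms(8_600)))
```

Under those assumptions the floor is on the order of 86 ms per round
trip, so a fetch that completes well below that cannot have crossed
between the two continents.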

Maybe my assumption that GAE would spin up multiple web processes for
a single app, dispersed around the globe, is incorrect. Perhaps each GAE
app has a semi-permanent geographic affinity for its web and data tiers,
which would eliminate the update uncertainty I alluded to in my original
post.

T.J. Crowder

Apr 21, 2009, 6:58:02 PM
to Google App Engine
Hi,

> ...typical single entity Datastore fetch times are
> less than intercontinental IP packet travel times...
> Maybe my assumption that GAE would spin up multiple web processes for
> a single app dispersed around the globe is incorrect...

The first doesn't necessarily imply the second. Remember that
Datastore is predicated on the assumption that reads will outnumber
writes by a huge margin. Google makes a big point about writes being
expensive. I also seem to recall reading somewhere here that although
a transaction is atomic in the sense that it will, as a whole, happen
or not happen, it is NOT guaranteed that a read immediately following
the transaction will see the updated data. That sounds a lot like a
replication delay to me.

So I could be talking completely through my hat, but I'd bet that
somewhere there's an authoritative copy of your data (and that the
authoritative copy can move around), but that reads are usually coming
from local slaved copies, not from the authoritative copy. And taking
it a step further, I bet that works at *least* at the entity level,
not the application level, so the authoritative copies of your
Thingies may be in a completely different place than the authoritative
copy of your Whatsits. (Otherwise, I don't think we'd have this
entity group transaction business. If the whole authoritative copy
were in one place, we'd be able to have much more flexible
transactions in terms of what entities we could enroll in them.)

I'd also suspect you won't get a definitive answer for how things are
laid out under-the-covers from Google, because they need to keep their
implementation options open.

FWIW (which may be very little indeed),
--
T.J. Crowder
tj / crowder software / com
Independent Software Engineer, consulting services available

javaDinosaur

Apr 21, 2009, 8:43:38 PM
to Google App Engine
On Apr 21, 11:58 pm, "T.J. Crowder" <t...@crowdersoftware.com> wrote:

> I'd also suspect you won't get a definitive answer for how things are
> laid out under-the-covers from Google, because they need to keep their
> implementation options open.

Yes, a good point, and I would not expect Google to provide lots of
internal details, but it would be good to get confirmation that:

1. Geographically dispersed users could not complete concurrent
transactions on different copies of the same entity root ID and then
experience a mysterious loss of one of the transactions post commit.

2. Some idea as to whether intercontinental replication within the
Datastore might lead to geographically dispersed users seeing
different versions of the same committed data for a few seconds.

I could cope with the latter because I would view it as a laggy
version of snapshot isolation that we have experienced with
conventional RDBs in recent years.

ryan

Apr 21, 2009, 10:52:21 PM
to Google App Engine
good thread! i can address this in at least some detail now. more
importantly, we expect to gradually provide more and more information
about these kinds of questions. in particular, expect to hear more at
google i/o in may. (i'm giving an entire talk on this very topic. :P)

http://code.google.com/events/io/

at a high level, all data stored with your App Engine application is
replicated across multiple disks and geographical locations. your
application is also hosted in multiple geographical locations, but
usually only served from a single location at any given time. In the
future we'd like to give developers more control over this.

as for the datastore, and all other current stored data APIs like
memcache, there is a single, global view of data. we go to great
lengths to ensure that these APIs are strongly consistent.

in other words, once you've written data with e.g. a datastore put()
or delete(), it is immediately visible to all requests, for all users,
as soon as that call completes successfully. that includes the same
request that made the call, as well as other requests, regardless of
geographic location. similarly, concurrent writes or transactions will
*never* unexpectedly overwrite or collide with each other, regardless
of where the user is located or where the request is served from.

this is what we mean when we say in the docs that the datastore is
strongly consistent:

http://code.google.com/appengine/docs/python/datastore/overview.html
http://code.google.com/appengine/docs/whatisgoogleappengine.html

we view this as a significant differentiating factor vs. similar
systems, like amazon simpledb, that are eventually consistent.
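
To illustrate the kind of guarantee described above (a write is visible
as soon as the call returns, and a concurrent writer never silently
overwrites another), here is a toy in-memory model. It is only a sketch
under stated assumptions: it uses a version check on commit with retry,
and the names ToyStore, ConflictError and increment are invented for this
example. It is not the App Engine API and says nothing about how the real
datastore is implemented.

```python
import threading


class ConflictError(Exception):
    """Raised when an entity changed between a read and the commit."""


class ToyStore:
    """Hypothetical single authoritative store; each entity has a version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        # Reads always see the latest committed (version, value) pair.
        with self._lock:
            return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        # The check and the write happen atomically under one lock, so a
        # writer holding a stale version fails instead of overwriting.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise ConflictError(key)
            self._data[key] = (current_version + 1, new_value)


def increment(store, key, retries=1000):
    """Read-modify-write a counter, retrying whenever a commit conflicts."""
    for _ in range(retries):
        version, value = store.get(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return
        except ConflictError:
            continue  # someone else committed first; re-read and try again
    raise RuntimeError("too much contention on %r" % (key,))
```

In this model, two callers that both read version 0 cannot both commit:
the second gets a ConflictError and has to re-read, so neither update is
lost no matter where the callers are located.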

> I'd also suspect you won't get a definitive answer for how things are
> laid out under-the-covers from Google, because they need to keep their
> implementation options open.

good point! we do expect to keep our implementation options open,
largely by describing the behavior you can expect from the serving
stack and the various APIs, and only guaranteeing that that behavior
won't change.

having said that, we're actually working hard to write articles, give
talks, and otherwise describe in detail how things work under the
covers. if you're interested in the datastore, for example, take a
look at the talk i gave at last year's i/o:

http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore

javaDinosaur

Apr 22, 2009, 5:49:31 AM
to Google App Engine
Thank you, Ryan. Your post provided exactly what I needed to hear about
Datastore API behaviour. I will be tuned in for your Google I/O
presentation.

In some respects I am relieved to read that an App is usually only
served from one location, because otherwise I would have remained
puzzled as to how Google overcame the speed of light to achieve timely,
strong Datastore consistency.

Your post has triggered the following question, and I would be grateful
if your Google I/O talk covered it.

Last year I noticed that GAE hosted my trial App within 30ms of my UK
location, i.e. somewhere in Europe. Such automatic hosting geo-affinity
is highly impressive, but what if an App's admin is UK-based but the
expected user community is centered on the USA? Will the App's hosted
location drift towards the bulk of users over time, or does the admin's
location pin down the location?

KARTHIKEYAN

Apr 22, 2009, 5:56:25 AM
to google-a...@googlegroups.com

ryan

Apr 22, 2009, 10:36:51 AM
to Google App Engine
On Apr 22, 2:49 am, javaDinosaur <jonathan...@hotmail.co.uk> wrote:
>
> Last year I noticed that GAE hosted my trial App within 30ms of my UK
> location i.e. somewhere in Europe. Such automatic hosting geo affinity
> is highly impressive but what if an App's admin is UK based but the

you might be reading more into that number than it warrants. first,
was that measurement on static files? or dynamic requests? we're more
opportunistic about serving static files from multiple geographic
locations and out of edge caches since, naturally, they're static
files.

second, users often see low latency because requests are picked up by
a google frontend near the user's geographic location, then travel to
and from the location serving the app over google's network, which is
usually higher quality and less loaded than the commodity internet.

Panos

Apr 22, 2009, 1:38:18 PM
to Google App Engine
One aspect that neither Ryan nor anybody else in this thread touched
upon is the entity hierarchy relationship. I understand that
transactions that involve multiple records are only honored among
entities that have the same GAE ancestry. This, along with hints in the
documentation and presentations, leads me to believe that records with
the same ancestry are kept as close together as possible (same machine,
cluster, data center). Again, my understanding is that ancestry is
orthogonal to entity kind, so some Foos and some Bars might be in one
location because all of them have the same parent entity, while other
Foos and Bars might migrate somewhere else.

Anyway, I would appreciate it if Ryan or somebody else who is familiar
with the implementation could elaborate further on whether today's
implementation uses the ancestry hints when it decides how to
partition the application data.

ryan

Apr 22, 2009, 5:28:59 PM
to Google App Engine
On Apr 22, 10:38 am, Panos <PA...@ACM.ORG> wrote:
>
> Anyway, I would appreciate it if Ryan or somebody else who is familiar
> with the implementation could elaborate further on whether today's
> implementation uses the ancestry hints when it decides how to
> partition the application data.

it does! details in http://snarfed.org/space/datastore_talk.html ,
particularly slides 13-17, which discuss how paths are translated to
bigtable row names.
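
As a rough illustration of why encoding the ancestor path into the row
name matters, here is a minimal sketch. The row format, the app id
"myapp" and the helpers row_name and same_group are all hypothetical,
made up for this example; the slides describe the actual encoding. The
point is only that when the full path from the root entity is a prefix
of the row name, a sorted table keeps everything under one root in
adjacent rows, whatever the entity kind.

```python
def row_name(app_id, path):
    """Encode a key path, root first, as one sortable string.

    `path` is a list of (kind, name) pairs. The "/kind:name" format is a
    hypothetical stand-in, not the datastore's real encoding.
    """
    return app_id + "".join("/%s:%s" % (kind, name) for kind, name in path)


def same_group(rows, root_row):
    """Return the rows that belong to the group rooted at root_row."""
    return [r for r in rows if r == root_row or r.startswith(root_row + "/")]


if __name__ == "__main__":
    rows = sorted([
        row_name("myapp", [("Account", "alice"), ("Foo", "1")]),
        row_name("myapp", [("Account", "bob"), ("Bar", "7")]),
        row_name("myapp", [("Account", "alice"), ("Bar", "2")]),
        row_name("myapp", [("Account", "bob")]),
        row_name("myapp", [("Account", "alice")]),
    ])
    # Foos and Bars under the same root sort next to each other.
    for r in rows:
        print(r)
```

In this toy, the Foo and the Bar under Account:alice end up in adjacent
rows, while the Bar under Account:bob sorts elsewhere, which matches the
idea that ancestry, not kind, decides which entities sit together.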