Compilation of unanswered datastore questions


Bill

Sep 11, 2008, 3:43:36 PM
to Google App Engine
There are some unanswered questions about the datastore that I still
have and that occasionally crop up on the message board. This thread
is an attempt to brain dump and outline areas where we could use more
info. If you've run into questions about the datastore and don't see
them on this list, please add to the thread. Please don't put
questions on this thread that are easily answered by the docs.
Instead we want questions that only Google can answer, or that we, as
a community, can step up and answer ourselves (like benchmarking).

Here's my initial brain dump:

1) Will timeout issues on put/transactions be removed when we go pay-
as-you-go or should we develop production apps with these limits in
mind? Exact # of puts or transactions you can reasonably expect to
work within one request before quota issue.

2) Best practices for robustly handling datastore exceptions due to
App Engine maintenance/issues with least impact on users, e.g.
DeadlineExceededError (increased latency) and CapabilityDisabledError
(maintenance shut down of certain datastore functions).
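Pending official guidance, one common defensive pattern is to wrap datastore calls in a small retry loop with backoff. A minimal sketch in plain Python — the `Timeout` class, `retry_on_timeout`, and `flaky_put` below are hypothetical stand-ins for illustration, not the SDK's actual names:

```python
import time

# Stand-in for the datastore's transient timeout exception; in a real
# app you would catch the SDK's own exception class instead.
class Timeout(Exception):
    pass

def retry_on_timeout(op, attempts=3, backoff=0.1):
    """Retry a datastore operation a few times before giving up."""
    for attempt in range(attempts):
        try:
            return op()
        except Timeout:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# A fake put() that fails twice, then succeeds, to exercise the wrapper.
calls = {'n': 0}
def flaky_put():
    calls['n'] += 1
    if calls['n'] < 3:
        raise Timeout()
    return 'stored'
```

The same wrapper shape works for a "feature disabled for maintenance" error, except that retrying is pointless there; you'd catch it and degrade to read-only behavior instead.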

3) Best practices for (de)normalization and entity sizes. A gut
reaction some developers might take when approaching datastore is to
denormalize and put stuff in fewer tables. What are the costs of
keeping many small entities and using reference properties instead?
For example, in a many-to-many relationship, we could have 3 Kinds: A,
B, and join(A,B). This is just like a traditional relational DB with
a join model. What are the costs of traversing implicit collection
sets defined by the reference properties in the join Kind? If you
have a limited relationship between two entities, when does using a
ListProperty (of keys, for example) make sense, especially in light of
the cap on indexed properties per entity?

4) Benchmarks! I've been meaning to run tests on costs for different
datastore operations:
- Direct get using key or id
- Direct get using list of key/id
- Fetches using filters
- Iterative get from a query
- How the above 3 (direct w/ key, bulk fetch, iterative get) scale
with request size.
- Delete/Put
- The big hit using transactions

Is it a big win to come up with a good key naming scheme or does that
bite you in other ways?

bowman...@gmail.com

Sep 11, 2008, 4:19:34 PM
to Google App Engine
Benchmark-wise, one thing I noticed recently was this.

Working on an Eee PC using the development SDK, I had an operation
that did 122 puts, which took over a minute to complete. I modified it
to store a JSON-formatted list, which required only 1 put, and the
difference was incredible: about 2 seconds for the entire operation to
complete.
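For what it's worth, that speedup is consistent with the cost being dominated by per-call overhead rather than data volume. A toy sketch counting calls instead of timing them — `FakeDatastore` is a made-up stand-in, not the SDK:

```python
import json

# Toy model of per-call overhead: each put() is one round trip, so N
# separate puts cost N round trips while one put of a serialized blob
# costs exactly one.
class FakeDatastore:
    def __init__(self):
        self.round_trips = 0
        self.rows = []
    def put(self, value):
        self.round_trips += 1
        self.rows.append(value)

records = [{'id': i} for i in range(122)]

one_by_one = FakeDatastore()
for r in records:
    one_by_one.put(r)             # 122 round trips

batched = FakeDatastore()
batched.put(json.dumps(records))  # 1 round trip for the whole list
```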

I do not know how this compares to the live version of appengine, but
on the development sdk it was huge.

Bill

Sep 11, 2008, 4:27:36 PM
to Google App Engine
> I do not know how this compares to the live version of appengine, but
> on the development sdk it was huge.

It would probably be magnified on the local SDK which seems to use an
inefficient storage scheme. If you load up your local datastore with
data, you'll notice your puts get longer and longer. Obviously, the
live datastore doesn't have this characteristic. And as bad as puts()
are, local transactions are killer. Try doing your 122 puts under
transactions :)

Enough is enough. I've decided to make and release via open-source a
benchmark app. Will start working on it after a late lunch.

johnP

Sep 11, 2008, 11:00:10 PM
to Google App Engine

Great questions. Answers are greatly requested and anticipated.

ryan

Sep 13, 2008, 3:01:08 PM
to Google App Engine
great questions! i can tackle at least some of these.

as a disclaimer, everything i say here refers to the production
environment. the SDK behaves differently, and as discussed in a few
other threads, is inefficient in a number of ways.

On Sep 11, 12:43 pm, Bill <billk...@gmail.com> wrote:
> 1) Will timeout issues on put/transactions be removed when we go pay-
> as-you-go or should we develop production apps with these limits in
> mind?

datastore timeouts and request deadlines will still exist after we've
launched billing, so yes, you'll want to develop with them in mind.

> Exact # of puts or transactions you can reasonably expect to
> work within one request before quota issue.

exact numbers will always depend on the size and shape of your data.
having said that, you should be able to put or delete a large number
of entities, e.g. in the hundreds or more, if you pass multiple
entities or keys in a single put() or delete() call:

http://code.google.com/appengine/docs/datastore/functions.html

you may also be able to write or delete more entities if the ratio of
entities to entity groups in the put() or delete() call is high.
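As a sketch of that batching advice: a helper that splits a large entity list into multi-entity calls. Here `put_fn` stands in for the SDK's list-accepting put(), and the batch size is illustrative, not an official limit:

```python
def chunks(items, size):
    """Split items into lists of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_put(entities, put_fn, batch_size=100):
    """Write entities in batches; put_fn stands in for a multi-entity
    put call. Returns the number of datastore calls made."""
    calls = 0
    for batch in chunks(entities, batch_size):
        put_fn(batch)   # one datastore call per batch, not per entity
        calls += 1
    return calls
```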

> 3) Best practices for (de)normalization and entity sizes. A gut
> reaction some developers might take when approaching datastore is to
> denormalize and put stuff in fewer tables. What are the costs of
> keeping many small entities and using reference properties instead?

storing and querying on reference properties doesn't cost any more
than storing and querying on non-reference properties. the one
reference property feature that incurs extra cost is the automatic
dereferencing:

http://code.google.com/appengine/docs/datastore/typesandpropertyclasses.html#ReferenceProperty

> For example, in a many-to-many relationship, we could have 3 Kinds: A,
> B, and join(A,B). This is just like a traditional relational DB with
> a join model. What are the costs of traversing implicit collection
> sets defined by the reference properties in the join Kind? If you
> have a limited relationship between two entities, when does using a
> ListProperty (of keys, for example) make sense, especially in light of
> the cap on indexed properties per entity?

you almost always want to model one-to-many relationships with
reference properties. similarly, you almost always want to model many-
to-many relationships with a list reference property, ie
ListProperty(db.Key). with these, "related to X" queries won't cost
any more than any other query. using a "join" kind, on the other hand,
incurs additional fetches for each of the result entities on top of
the join kind query. the main use case for join kinds is when you
want to impose additional criteria on the join at runtime.
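A toy illustration of the list-of-keys pattern, with dicts standing in for entities: a "related to X" lookup is a single membership filter, which is the same shape as filtering on a ListProperty(db.Key) in the real API. All data here is made up:

```python
# Each person stores the keys of their books directly, so "who is
# related to book X" needs no join kind -- just a membership test,
# mirroring a single equality filter on the list property.
people = [
    {'name': 'alice', 'books': ['book1', 'book3']},
    {'name': 'bob',   'books': ['book2']},
    {'name': 'carol', 'books': ['book1']},
]

def related_to(book_key):
    """Return names of people whose key list contains book_key."""
    return [p['name'] for p in people if book_key in p['books']]
```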

rafe kaplan's google i/o talk describes these techniques in detail:

http://sites.google.com/site/io/working-with-google-app-engine-models

> 4) Benchmarks! I've been meaning to run tests on costs for different
> datastore operations:
> - Direct get using key or id
> - Direct get using list of key/id
> - Fetches using filters
> - Iterative get from a query
> - How the above 3 (direct w/ key, bulk fetch, iterative get) scale
> with request size.
> - Delete/Put
> - The big hit using transactions

like always, these will depend noticeably on the size and shape of
your data. i can give a few rules of thumb, though.

direct gets by key will usually be the fastest operation. single-
property queries, ie queries with a single filter or sort order,
should generally be fast. queries that use a user-defined index should
generally be fast.

queries with equals filters on multiple properties that use the built-
in indexes incur extra overhead, which is roughly a fixed cost per
query result. the overhead will depend on (you can probably guess
what's next) the size and shape of your data. if these queries aren't
as fast as you'd like in your app, adding dedicated index(es) will
speed them up.
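For example, a dedicated composite index for a query with two equality filters might look like this in index.yaml (the kind and property names here are hypothetical):

```yaml
# Speeds up e.g. a query filtering on author == a AND year == y.
- kind: Article
  properties:
  - name: author
  - name: year
```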

finally, transactions shouldn't add a prohibitive amount of overhead.
in many cases, doing a number of writes in a transaction can actually
be (a little) faster than doing them outside of a transaction. are you
seeing a noticeable slowdown with transactions, compared to without?

> Is it a big win to come up with a good key naming scheme or does that
> bite you in other ways?

you mean, providing a key_name instead of having the datastore
allocate an id?

http://code.google.com/appengine/docs/datastore/keysandentitygroups.html#Kinds_Names_and_IDs

performance should be the same with key_name vs. id. the main
difference is that key_name allows a (limited) form of querying
without actually querying. for example, say you're writing a wiki, and
you put the page name in key_name. when a request for a page comes in,
you can construct a key with that key_name in memory and get() it
directly, as opposed to querying with an equals filter.
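A toy contrast of the two lookups, with a dict standing in for the datastore (in the real API this would be a get() on a key constructed from the key_name versus a query with an equality filter; everything below is stand-in code):

```python
# Keyed lookup is a direct dictionary hit; the filter version has to
# scan entities to find the same answer.
pages_by_key = {}   # key_name -> page dict

def save_page(name, body):
    pages_by_key[name] = {'name': name, 'body': body}

def get_by_key_name(name):
    return pages_by_key.get(name)        # direct lookup, no query

def get_by_filter(name):
    # equivalent result, but expressed as a scan with an equals filter
    for page in pages_by_key.values():
        if page['name'] == name:
            return page
```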

Bill

Sep 16, 2008, 5:05:41 AM
to Google App Engine
Thanks for those answers, Ryan.

> you almost always want to model one-to-many relationships with
> reference properties. similarly, you almost always want to model many-
> to-many relationships with a list reference property, ie
> ListProperty(db.Key). with these, "related to X" queries won't cost
> any more than any other query. using a "join" kind, on the other hand,
> incurs additional fetches for each of the result entities on top of
> the join kind query. the main use case for join kinds is when you
> want to impose additional criteria on the join at runtime.

OK. It makes sense to promote simple joins to full models when you're
storing data about the relationship, which I guess you're saying is
additional criteria. But in all other cases we go with
ListProperty(db.Key) on the side of the relationship that'll have the
smaller list. Except when you expect big lists on both sides of the
relationship.

For example, let's say you have two kinds: Person and Book. A worst-
case (?) Person like Umberto Eco has a personal library of around
30,000 books, while a book can be in millions of homes. If we put a
list of keys into Person, it'd look like:

class Person(db.Model):
    name = db.StringProperty()
    books = db.ListProperty(db.Key)
    ...

If we had an index.yaml like this:

- kind: Person
  properties:
  - name: name
  - name: books

In this case, you have to go to a separate join Kind because the total
number of indexed property values for a single entity is capped at
5000, and our worst-case Person exceeds the cap without considering
any other property besides name. Right? So we go to:

class Person(db.Model):
    name = db.StringProperty()

    @property
    def books(self):
        return [o.book for o in self.has_books]

class Book(db.Model):
    title = db.StringProperty()

    @property
    def owners(self):
        return [o.person for o in self.owned_by]

class Ownership(db.Model):
    person = db.ReferenceProperty(Person, collection_name='has_books')
    book = db.ReferenceProperty(Book, collection_name='owned_by')

So to go from book to person or vice versa, we now incur a query for
the ownerships plus additional fetches to get the entity referenced by
each ownership. Since there's a speedup for doing big fetches instead
of lots of little ones, would it be better to replace the iterative
lookup with a get(keys), where the keys are drawn from a fetch on the
collection set query? (And you'd have to page through the fetches,
since each fetch returns at most 1,000 ownerships.)
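That two-step traversal — one join-kind query to collect foreign keys, then one batched get for all the referenced entities — can be sketched with stand-in data (all names and the `get_multi` callable below are made up for illustration):

```python
# Toy datastore: a key -> entity table plus join-kind rows.
books = {'b1': "The Name of the Rose", 'b2': "Baudolino"}
ownerships = [
    {'person': 'eco', 'book': 'b1'},
    {'person': 'eco', 'book': 'b2'},
    {'person': 'bob', 'book': 'b1'},
]

def books_for(person, get_multi):
    # step 1: query the join kind for this person's ownership rows
    keys = [o['book'] for o in ownerships if o['person'] == person]
    # step 2: one batched get for all keys, instead of one get per row
    return get_multi(keys)

def fake_get_multi(keys):
    """Stands in for a single multi-key get call."""
    return [books[k] for k in keys]
```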

> finally, transactions shouldn't add a prohibitive amount of overhead.
> in many cases, doing a number of writes in a transaction can actually
> be (a little) faster than doing them outside of a transaction. are you
> seeing a noticeable slowdown with transactions, compared to without?

There is a large slowdown on the local dev server. I'll have some
numbers in the future, but when I was informally testing counter shard
creation under transactions on my MacBook Pro, the local performance
was something like 100x slower when doing puts under transactions.
The performance in the cloud, however, was great (as you mention), so
I shouldn't let the local performance limit my use of transactions. I
guess the only issue with transactions is not raw performance but
making sure you're not banging on the same entity groups.
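The usual way to avoid banging on one entity group for counters is the sharded-counter recipe: spread increments across several shard entities and sum them on read. A toy version with a plain list standing in for the shard entities (shard count is illustrative):

```python
import random

# Each increment picks a random shard, so no single entity (group)
# takes every write; reading the total sums all shards.
NUM_SHARDS = 5
shards = [0] * NUM_SHARDS

def increment():
    i = random.randrange(NUM_SHARDS)  # contention spread across shards
    shards[i] += 1

def count():
    return sum(shards)

for _ in range(100):
    increment()
```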

ryan

Sep 23, 2008, 5:27:49 AM
to Google App Engine
On Sep 16, 2:05 am, Bill <billk...@gmail.com> wrote:
> But in all other cases we go with
> ListProperty(db.Key) on the side of the relationship that'll have the
> smaller list. Except when you expect big lists on both sides of the
> relationship.
...
> In this case, you have to go to a separate join Kind because the total
> number of indexed property values for a single entity is capped at
> 5000, and our worst-case Person exceeds the cap without considering
> any other property besides name. Right? So we go to:

right. unfortunate, but true. this is one of the main reasons that
implementing pub/sub and activity streams on the app engine datastore
aren't as easy as they could be. we hope to improve on this
eventually.

> > finally, transactions shouldn't add a prohibitive amount of overhead.
> > in many cases, doing a number of writes in a transaction can actually
> > be (a little) faster than doing them outside of a transaction. are you
> > seeing a noticeable slowdown with transactions, compared to without?
>
> There is a large slowdown on the local dev server. I'll have some

true. after i replied, i realized you were probably talking about the
dev server.

i might actually be able to remove that performance hit. i'll take a
look tomorrow.

Bill

Sep 30, 2008, 8:24:58 PM
to Google App Engine
Here's another question:
What's the best practice for handling timeout (and other) db errors?

This is partly prompted by ryan's recent comment that "regardless of
how your data is shaped, you'll at least see a small timeout
rate. .01% is in the ballpark."
[http://groups.google.com/group/google-appengine/browse_thread/thread/dd90847d0e7c1ceb#]

I admit to not checking for timeouts on even a single small put. It
means that there should be some thought behind all uses of the
datastore with respect to db.Error. I looked at the rietveld source
code for some error
some general db.Error catches but no Timeout or CapabilityDisabled
exception handling. (By the way, a couple of the last exceptions in
datastore_errors.py aren't documented at
http://code.google.com/appengine/docs/datastore/exceptions.html.
Should documentation issues be logged in the SDK area?)

If you receive a timeout, do you need to launch another request from
the client because your time is already used up for the current
request? I assume that many errors should re-initiate a request (or
ask the user to resubmit) because of the time quota on each request.
Is this correct?