get_by_key_name vs fetch performance


Waldemar Kornewald

Feb 10, 2010, 11:47:44 AM
to App Engine, Ryan Barrett
Hi,
were there any optimizations to the datastore lately? We did a few
Model.get_by_key_name vs Query.fetch() benchmarks (code is attached)
and it looks like the difference is minimal for individual
gets/fetches and practically non-existent for batch-gets vs
batch-fetch for the same entities.

Here we do 1000 individual get()s:
http://kornewald.appspot.com/get

Here we do 1000 individual fetch()es for the same entities:
http://kornewald.appspot.com/fetch

Here we do four batch-get()s of 250 entities each:
http://kornewald.appspot.com/batchget

Here we do four batch-fetch()es for 250 entities each:
http://kornewald.appspot.com/batchfetch

The number returned is the time needed for retrieving the entities, so
the first two basically show the time per single get()/fetch().
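
In essence, the measurement loop is just the following plain-Python sketch (the model and query names in the comments are placeholders, not the code from the attached zip):

```python
import time

def benchmark(operation, n=1000):
    # Time n invocations of `operation`; return average seconds per call.
    start = time.time()
    for _ in range(n):
        operation()
    return (time.time() - start) / n

# On App Engine the operations would be the datastore calls, e.g.:
#   benchmark(lambda: Greeting.get_by_key_name("some-key"))
#   benchmark(lambda: Greeting.all().filter("name =", "some-key").fetch(1))
```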

Is there anything wrong with the benchmark code?

Our previous benchmarks showed a much more significant difference (3x
slower fetch()). Now it's merely a 30% difference and the few
milliseconds can hardly be noticed by the end-user.

Can we stop designing models like crazy around key names? In most
cases there is hardly any benefit to justify the added complexity and
inconvenience (e.g., not being able to change the key name afterwards).

It looks like the only case where batch-get()s are useful is when you
can't formulate a single fetch() for the same kind of query.

Bye,
Waldemar

guestbook.zip

Eli Jones

Feb 10, 2010, 12:11:02 PM
to google-a...@googlegroups.com
The main difference I find with .get_by_key_name() is the CPU overhead, not the time it takes. So you should also be benchmarking API CPU time.




Waldemar Kornewald

Feb 10, 2010, 12:49:14 PM
to Google App Engine
On Feb 10, 6:11 pm, Eli Jones <eli.jo...@gmail.com> wrote:
> The main difference I find in .get_by_key_name() is the CPU overhead.. not
> the time it takes.  So .. you should also be benchmarking API CPU time

Yes, here the individual fetch() calls indeed cost more than 2x as
much, but really I'd rather pay slightly more (it won't be 2x as much,
anyway, when taking all queries and other cost factors into account)
and not care about the productivity overhead that keys add. In
benchmarks I did a pretty long time ago (when App Engine was very
young) a single get() took around 17ms and a fetch() for a single
entity could take 35-90ms, depending on how lucky you were. This was
significant enough to be noticeable on some pages which loaded a lot
of individual entities, so we accepted that, for example, usernames
can't be easily modified afterwards. The new numbers totally change
the game.

Hopefully, someday we'll also not have to care about startup times,
anymore (even if we have to pay a few $/month for pre-warmed instances
or give up the free quota for warm instances).

Bye,
Waldemar

Eli Jones

Feb 10, 2010, 3:02:25 PM
to google-a...@googlegroups.com
I did my own tests on api_cpu cost for meModel.get_by_key_name() versus db.GqlQuery("Select * From meModel Where value=:1",meValue).fetch(1)

It costs a fixed 2.6 times as much to do the GqlQuery (it seems Google has pegged the api_cpu cost of these queries to set amounts: 10 for .get_by_key_name() and 26 for .fetch(1)). And if you are beyond the free quota and paying for CPU, then the .fetch(1) will indeed cost 2.6 times as much money as the .get_by_key_name(mykey).

I guess if you aren't using a lot of api_cpu, then it won't matter.

It also seems that .get_by_key_name() can be twice as fast at times, but it isn't consistent. And really, the time taken is fairly equivalent (as you have noted).

The primary benefit I can see in using .get_by_key_name() is when setting an entity's key_name to a composite of two of its values (that will remain fixed). This lets you forgo setting up a composite index, which would slow down puts and also take up storage space.

I use a model like so:

class myModel(db.Model):
    ID = db.IntegerProperty(required=True)
    step = db.IntegerProperty(required=True)
    quote = db.FloatProperty(required=True,indexed=False)

and I put the entities like so:

meEntity = myModel(key_name=str(myID) + "_" + str(mystep),
                   ID=myID, step=mystep, quote=myQuote)
db.put(meEntity)

So, if I want the entity where ID = 1 and step = 3402, I just .get_by_key_name("1_3402") instead of needing an index defined on (ID, step).

If I want a bunch of them in a range, I precompute the keys into a list and do .get_by_key_name() on that list.
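
The key-name scheme above can be captured in two small pure-Python helpers (the helper names are mine, not part of the App Engine API):

```python
def make_key_name(entity_id, step):
    # Composite key_name scheme from above: "<ID>_<step>".
    return str(entity_id) + "_" + str(step)

def key_names_for_range(entity_id, first_step, last_step):
    # Precompute key names for a contiguous range of steps.
    return [make_key_name(entity_id, s)
            for s in range(first_step, last_step + 1)]

# On App Engine these feed straight into a batch get, e.g.:
#   entities = myModel.get_by_key_name(key_names_for_range(1, 3400, 3404))
```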

Anyway, everyone should do whatever balances performance with development time in the way that best suits them.




ryan

Feb 10, 2010, 5:53:32 PM
to Google App Engine
hi all! great discussion. thanks for the original post and
measurements, waldemar! in short, you're right, the 1.3.1 datastore
backend in production includes a number of improvements to both query
performance and fault tolerance.

for query performance, we turned on a new code path that parallelizes
internal operations and bigtable scans and lookups more aggressively,
which is likely the reason for the improvements of query fetches vs.
gets that you saw.

for fault tolerance, we're now doing more retries in the backend
automatically, usually up to the full 30s request deadline for most
calls - basically everything except transaction commits, which retry
client side instead of in the backend. (if you're using python, you
might now want to try db.run_in_transaction_custom_retries() with a
high number of retries, e.g. 10, instead of just
db.run_in_transaction(). similar java support should be coming soon.)
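
the retry behavior ryan describes can be sketched in plain python (this generic wrapper only illustrates the idea; it is not the App Engine implementation, and TransientError stands in for a datastore commit collision):

```python
class TransientError(Exception):
    # Stand-in for a datastore transaction collision.
    pass

def run_with_retries(func, retries=10):
    # Retry `func` up to `retries` extra times on TransientError,
    # analogous to db.run_in_transaction_custom_retries(10, func)
    # retrying a transaction function client side.
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise
```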

we'll mention more detail in the official release notes and blog post,
but based on a day or so of results so far, we're already seeing a
substantial drop in error rate, mostly due to reduced timeouts, across
the board. we're also seeing that error rate is much less spiky, which
is always good.



Waldemar Kornewald

Feb 11, 2010, 2:31:48 PM
to Google App Engine
Thanks for the explanation, Ryan. Nice work, indeed.

Will every query that works directly on a (composite) datastore index
be almost as fast as db.get()?

Why don't you also increase the number of retries for
run_in_transaction?

Bye,
Waldemar

ryan

Feb 12, 2010, 11:06:20 AM
to Google App Engine
On Feb 11, 11:31 am, Waldemar Kornewald <wkornew...@gmail.com> wrote:
>
> Will every query that works directly on a (composite) datastore index
> be almost as fast as db.get()?

as a general rule, no. queries that scan indices will always have to
do at least two serial disk seeks, one to read the index and one to
look up the entities the index rows point to. gets only need a single
disk seek, since they have the entities' primary key.

having said that, one vs two disk seeks isn't always the dominating
factor for latency. instead, python protocol buffer decoding might
dominate, or the bigtable tablet server RPCs might, e.g. if you're
fetching many entities from many different tablet servers. (we'll
issue a lot of RPCs in parallel on your behalf, but not arbitrarily
many.)

note that this only applies to queries that use an index. kindless
ancestor queries and queries on __key__, for example, scan the
entities table directly, so they'll only need a single disk seek.

> Why don't you also increase the number of retries for
> run_in_transaction?

agreed, we probably will, hopefully as soon as 1.3.2.
