Proposal: __index__ support to make certain datastore/query operations more efficient

10 views
Skip to first unread message

tav

unread,
Aug 6, 2009, 12:05:08 AM8/6/09
to google-a...@googlegroups.com
As a developer we have various limitations that we have to work with
on App Engine. And if you are performance conscious, then you have
even more ;p

Two limits that one has to often work around are the 1MB Datastore
calls and ListProperty size limits.

A common pattern that I've used to overcome these is to use "related
entities". I use these related entities to store properties that I
want to search on -- separate from the entities that I want to
retrieve in order to display the results back to the user.

I was pleased to see App Engine developer Brett Slatkin officially
encouraging the use of this pattern in his I/O talk:

* http://code.google.com/events/io/sessions/BuildingScalableComplexApps.html

He calls them "relation index entities" and explains it better than I
can, so here be a slightly adapted excerpt from pages 23-25 of his
presentation PDF:

----------------

Problem: Scalably "delivering" a Twitter-esque post/message

Solution: Split the message into 2 entities:
* Message model contains the info we care about
* MessageIndex has only relationships for querying

>>> class Message(db.Model):
... sender = db.StringProperty()
... body = db.TextProperty()

>>> class MessageIndex(db.Model):
... receivers = db.StringListProperty()

When writing, put entities in the same entity group for transactions:
* That is, make the Message entity be the parent for the MessageIndex entity
* You can of course have multiple MessageIndex entities per Message to
scale up...

And for queries:
* Do a key-only query to fetch the MessageIndexes
* Transform returned keys to retrieve parent entity
* Fetch Message entities in batch

>>> indexes = db.GqlQuery(
... "SELECT __key__ FROM MessageIndex "
... "WHERE receivers = :1", me)

>>> keys = [k.parent() for k in indexes]
>>> messages = db.get(keys)

----------------

Now, if you start using this pattern heavily, you'll realise that it
should be extremely trivial for the App Engine devs to turn the 3
stepped process of:

* Do a key-only query to fetch the MessageIndexes
* Transform returned keys to retrieve parent entity
* Fetch Message entities in batch

Into just a single step:

* Do a query on MessageIndexes which will actually return the related
Messages entities

That is, the query would be done on the MessageIndexes entities, but
instead of returning those entities, the related Message entities
would be returned.

Saving both us and App Engine an additional Datastore request and key
computation!! Given the 30 seconds limit, this also means that you
could then do twice the amount of querying in the same time! So it
helps even more!

So how would this work?

Well, let's say that we had the following Message:

>>> msg = Message(send='tav', body='Hello World')
>>> msg_key = msg.key()

And the following related MessageIndex entities:

>>> rcv1 = MessageIndex(parent=msg_key, receivers=['alice', 'bob'], __index__=msg_key)

>>> rcv2 = MessageIndex(parent=msg_key, receivers=['ryan'], __index__=msg_key)

The presence of the newly proposed __index__ property would alter the
behaviour of how the Datastore indexes the rcv1/rcv2 entities. Instead
of doing the current:

rcv1:receivers:alice <rcv1_key>
rcv1:receivers:bob <rcv1_key>
rcv2:receivers:ryan <rcv2_key>

It would look like:

rcv1:receivers:alice <msg_key>
rcv1:receivers:bob <msg_key>
rcv2:receivers:ryan <msg_key>

[Note for those not familiar with the App Engine Datastore -- the App
Engine guys don't store the full entity in all of the index tables,
instead the <entity_key> is stored and used to load up the complete
entity with all of its properties in order to respond to the completed
query once the relevant entities have been identified.]

That is, with __index__, the Datastore would sort of behave as if it
was actually indexing additional properties of the Message entity!! So
when a query comes in, it can do a query as if for a MessageIndex, but
then return the Message entity pointed to by the key!!

Now, sadly, I fear I've explained all this quite poorly, but I do
believe that it's a brilliantly simple/powerful idea.

All it needs is for an optional __index__ property on entities to be
treated specially and for it to be a Key if present (not a
ReferenceProperty) -- in which case, the key referred to in the
__index__ property is used to "index" the entity in question rather
than the entity's own key().


Anyways, hope I've made some sense and provided something of value --
let me know if I can provide any further clarifications. Thanks!

--
love, tav

plex:espians/tav | t...@espians.com | +44 (0) 7809 569 369
http://tav.espians.com | http://twitter.com/tav | skype:tavespian

Julian Namaro

unread,
Aug 6, 2009, 7:45:32 AM8/6/09
to Google App Engine
That's an interesting idea, for that particular use case.

If I reformulate the "relation index entities" query as I understand
it:
1. [datastore]Traverse the index and return entity keys
2. [app engine]Transform returned keys to retrieve parent entity
3. [datastore] Batch get of entities

Your idea:
1. [datastore]Traverse the index and to get entity keys
3. [datastore] Batch get of entities

So you would save step 2, a roundtrip app engine - datastore, not sure
that's substantial though.

Julian




On Aug 6, 1:05 pm, tav <t...@espians.com> wrote:
> As a developer we have various limitations that we have to work with
> on App Engine. And if you are performance conscious, then you have
> even more ;p
>
> Two limits that one has to often work around are the 1MB Datastore
> calls and ListProperty size limits.
>
> A common pattern that I've used to overcome these is to use "related
> entities". I use these related entities to store properties that I
> want to search on -- separate from the entities that I want to
> retrieve in order to display the results back to the user.
>
> I was pleased to see App Engine developer Brett Slatkin officially
> encouraging the use of this pattern in his I/O talk:
>
> *http://code.google.com/events/io/sessions/BuildingScalableComplexApps...
> plex:espians/tav | t...@espians.com | +44 (0) 7809 569 369http://tav.espians.com|http://twitter.com/tav| skype:tavespian

tav

unread,
Aug 7, 2009, 11:16:25 AM8/7/09
to google-a...@googlegroups.com
Hey Julian,

> That's an interesting idea, for that particular use case.
>
> If I reformulate the "relation index entities" query as I understand
> it:
> 1. [datastore]Traverse the index and return entity keys
> 2. [app engine]Transform returned keys to retrieve parent entity
> 3. [datastore] Batch get of entities
>
> Your idea:
> 1. [datastore]Traverse the index and to get entity keys
> 3. [datastore] Batch get of entities
>
> So you would save step 2, a roundtrip app engine - datastore, not sure
> that's substantial though.

Nice reformulation! However, my proposal would be just a single step:

1. [datastore] Traverse the index and return entities

The only additional cost should be an additional if/else clause when
*writing* entities -- the datastroe would have to look to see if the
entity has an __index__ property. Quite a minimal cost given the
savings otherwise!

Why should it just be a single step? Because of the way that the
datastore works (at least according to the presentation that Ryan gave
last year):

* http://snarfed.org/space/datastore_talk.html

According to it (slide 21):

* Each row in an index table includes the index data and entity key

With my proposal, when an entity is indexed, the key defined in its
__index__ property will be used in place of its own key when the
indexes are constructed...

Of course these special "related entities" would still be accessible
by using their key names...

Hope that makes sense.

Andy Freeman

unread,
Aug 7, 2009, 1:15:36 PM8/7/09
to Google App Engine
> With my proposal, when an entity is indexed, the key defined in its
> __index__ property will be used in place of its own key when the
> indexes are constructed...

Cute, but there's no way to guarantee that there's an object with that
key.

Also, redirecting queries this way means that there's no way to get
the key (or entity) via a query so the entity can be updated, deleted,
etc.

On Aug 7, 8:16 am, tav <t...@espians.com> wrote:
> Hey Julian,
>
> > That's an interesting idea, for that particular use case.
>
> > If I reformulate the "relation index entities" query as I understand
> > it:
> > 1. [datastore]Traverse the index and return entity keys
> > 2. [app engine]Transform returned keys to retrieve parent entity
> > 3. [datastore] Batch get of entities
>
> > Your idea:
> > 1. [datastore]Traverse the index and to get entity keys
> > 3. [datastore] Batch get of entities
>
> > So you would save step 2, a roundtrip app engine - datastore, not sure
> > that's substantial though.
>
> Nice reformulation! However, my proposal would be just a single step:
>
> 1. [datastore] Traverse the index and return entities
>
> The only additional cost should be an additional if/else clause when
> *writing* entities -- the datastroe would have to look to see if the
> entity has an __index__ property. Quite a minimal cost given the
> savings otherwise!
>
> Why should it just be a single step? Because of the way that the
> datastore works (at least according to the presentation that Ryan gave
> last year):
>
> *http://snarfed.org/space/datastore_talk.html
>
> According to it (slide 21):
>
> * Each row in an index table includes the index data and entity key
>
> With my proposal, when an entity is indexed, the key defined in its
> __index__ property will be used in place of its own key when the
> indexes are constructed...
>
> Of course these special "related entities" would still be accessible
> by using their key names...
>
> Hope that makes sense.
>
> --
> love, tav
>
> plex:espians/tav | t...@espians.com | +44 (0) 7809 569 369http://tav.espians.com|http://twitter.com/tav| skype:tavespian

tav

unread,
Aug 7, 2009, 1:42:22 PM8/7/09
to google-a...@googlegroups.com
Hey Andy,

> Cute, but there's no way to guarantee that there's an object with that
> key.

Very good point!

Would require an additional check in the if/else clause on write of
the related entity which should raise an error if the entity it points
to doesn't exist...

And if all entities that were "__index__'ed" were kept in a separate
table -- much like the Kind table, then perhaps that could be checked
when an entity is about to be deleted and if a referrant related
entity is still around, an Exception could be raised?

Thoughts?

> Also, redirecting queries this way means that there's no way to get
> the key (or entity) via a query so the entity can be updated, deleted,
> etc.

Sure. But one should always be able to get the entity directly using
it's key name -- i tend to have predictable key_names for my related
entities -- and perhaps even use the transaction descendant queries
that landed yesterday in 1.2.4?

Andy Freeman

unread,
Aug 8, 2009, 3:37:22 PM8/8/09
to Google App Engine
> Would require an additional check in the if/else clause on write of
> the related entity which should raise an error if the entity it points
> to doesn't exist...

That's not enough - the __index__ entity could be deleted afterwards.

The relevant check in deletes is very expensive (it's a query) and
it's unclear what one can do when it occurs because you can't query
for entities with __index__ values. (What Kind table?)

Requiring explicit key_name 's and/or unnecessary transaction groups
isn't really a solution.

I think that a better approach would be a different kind of query, one
that returned the entity referenced by the reference property
specified in the query. Note that this would allow a given entity to
have the equivalent of multiple __index__ properties.

This doesn't eliminate the problem with referenced entities that don't
exist.

One possible solution is a sentinel value. None is a not-horrible
choice. (For db.Query.fetch(), None is unambiguous - there would be
None's in the list if the query succeeded but the entity fetch
failed. For db.Query.get(), None would ambiguous. One possibility is
to throw an exception of the query succeeded but the entity fetch
failed.)
> plex:espians/tav | t...@espians.com | +44 (0) 7809 569 369http://tav.espians.com|http://twitter.com/tav| skype:tavespian

ryan

unread,
Aug 17, 2009, 6:44:25 PM8/17/09
to Google App Engine
interesting! thanks for the writeup, tav, and for the discussion,
julian and andy. this is definitely a worthwhile use case.

short answer: without getting too deep into the different use cases, i
agree with andy. this is a great feature, and would almost certainly
be useful in a library or the api itself. however, it can be done in
userland, right now, at roughly the same cost and performance as we
could do it in the backend.

longer answer: i often distinguish between features that can be done
effectively in userland, or in the api libraries, and things that
inherently need to be done in the datastore backend. in general, we
find that the backend provides enough low-level building blocks that
many features can be implemented in userland with nearly the same
efficiency and performance. for example, the IN and != filters and GQL
query parsing are both currently done in userland, mainly for this
reason.

a datastore operation's cost is determined primarily by the total
number of disk seeks it needs, and its performance is determined
mainly by the number of *serial* disk seeks it needs. for example, a
query that fetches ten entities costs roughly 11 disk seeks, one for
the index scan and one for each entity lookup. however, the same query
incurs only two disk seeks' worth of latency, since all of the entity
lookups are done in parallel.

naturally, cpu, bandwidth, datastore backend RPCs, and tablet server
RPCs also count, but they're generally orders of magnitude less costly
and faster than disk seeks. as always, though the specifics depend on
the size and shape of your data.

in this case, we'd implement this feature in the backend by doing a
keys-only query, generating a list of parent keys from the result
keys, uniquifying it, then doing a batch get on those parent keys.

we currently expose both keys only queries and batch gets, so the
notable extra overhead of doing this in userland is just the single
extra RPC (and associated bandwidth) to the datastore backend, and the
CPU overhead of decoding the intermediate result keys and encoding the
batch get request. as usual, those will generally be swamped by the
cost and latency of the disk seeks for the query and batch get. given
that, the difference between userland and the datastore backend should
be negligible.

the one benefit of moving a feature into the backend is unifying the
code complexity. if a userland feature is used in both languages, we/
you have to maintain parallel implementations in both languages. if
it's in the backend, we only have to maintain it in one place.

we're cautious about this, since we put a high priority on the
robustness of all of our backends, including the datastore. every
extra line of code we add to them incurs a maintennance cost and
increases the risk of bugs and instability. that makes the bar for
including a feature in a backend fairly high. that's the main reason
behind our philosophy of exposing low-level, orthogonal building
blocks in the datastore backend, so that you can hopefully do
interesting higher-level features like this in userland, without
depending on us.

regardless, we're all for useful, general purpose features like this.
i'd love to see it implemented and published as an open source
library!
Reply all
Reply to author
Forward
0 new messages