Two limits that one has to often work around are the 1MB Datastore
calls and ListProperty size limits.
A common pattern that I've used to overcome these is to use "related
entities". I use these related entities to store properties that I
want to search on -- separate from the entities that I want to
retrieve in order to display the results back to the user.
I was pleased to see App Engine developer Brett Slatkin officially
encouraging the use of this pattern in his I/O talk:
* http://code.google.com/events/io/sessions/BuildingScalableComplexApps.html
He calls them "relation index entities" and explains it better than I
can, so here be a slightly adapted excerpt from pages 23-25 of his
presentation PDF:
----------------
Problem: Scalably "delivering" a Twitter-esque post/message
Solution: Split the message into 2 entities:
* Message model contains the info we care about
* MessageIndex has only relationships for querying
>>> class Message(db.Model):
... sender = db.StringProperty()
... body = db.TextProperty()
>>> class MessageIndex(db.Model):
... receivers = db.StringListProperty()
When writing, put entities in the same entity group for transactions:
* That is, make the Message entity be the parent for the MessageIndex entity
* You can of course have multiple MessageIndex entities per Message to
scale up...
And for queries:
* Do a key-only query to fetch the MessageIndexes
* Transform returned keys to retrieve parent entity
* Fetch Message entities in batch
>>> indexes = db.GqlQuery(
... "SELECT __key__ FROM MessageIndex "
... "WHERE receivers = :1", me)
>>> keys = [k.parent() for k in indexes]
>>> messages = db.get(keys)
----------------
Now, if you start using this pattern heavily, you'll realise that it
should be extremely trivial for the App Engine devs to turn the 3
stepped process of:
* Do a key-only query to fetch the MessageIndexes
* Transform returned keys to retrieve parent entity
* Fetch Message entities in batch
Into just a single step:
* Do a query on MessageIndexes which will actually return the related
Messages entities
That is, the query would be done on the MessageIndexes entities, but
instead of returning those entities, the related Message entities
would be returned.
Saving both us and App Engine an additional Datastore request and key
computation!! Given the 30 seconds limit, this also means that you
could then do twice the amount of querying in the same time! So it
helps even more!
So how would this work?
Well, let's say that we had the following Message:
>>> msg = Message(send='tav', body='Hello World')
>>> msg_key = msg.key()
And the following related MessageIndex entities:
>>> rcv1 = MessageIndex(parent=msg_key, receivers=['alice', 'bob'], __index__=msg_key)
>>> rcv2 = MessageIndex(parent=msg_key, receivers=['ryan'], __index__=msg_key)
The presence of the newly proposed __index__ property would alter the
behaviour of how the Datastore indexes the rcv1/rcv2 entities. Instead
of doing the current:
rcv1:receivers:alice <rcv1_key>
rcv1:receivers:bob <rcv1_key>
rcv2:receivers:ryan <rcv2_key>
It would look like:
rcv1:receivers:alice <msg_key>
rcv1:receivers:bob <msg_key>
rcv2:receivers:ryan <msg_key>
[Note for those not familiar with the App Engine Datastore -- the App
Engine guys don't store the full entity in all of the index tables,
instead the <entity_key> is stored and used to load up the complete
entity with all of its properties in order to respond to the completed
query once the relevant entities have been identified.]
That is, with __index__, the Datastore would sort of behave as if it
was actually indexing additional properties of the Message entity!! So
when a query comes in, it can do a query as if for a MessageIndex, but
then return the Message entity pointed to by the key!!
Now, sadly, I fear I've explained all this quite poorly, but I do
believe that it's a brilliantly simple/powerful idea.
All it needs is for an optional __index__ property on entities to be
treated specially and for it to be a Key if present (not a
ReferenceProperty) -- in which case, the key referred to in the
__index__ property is used to "index" the entity in question rather
than the entity's own key().
Anyways, hope I've made some sense and provided something of value --
let me know if I can provide any further clarifications. Thanks!
--
love, tav
plex:espians/tav | t...@espians.com | +44 (0) 7809 569 369
http://tav.espians.com | http://twitter.com/tav | skype:tavespian
> That's an interesting idea, for that particular use case.
>
> If I reformulate the "relation index entities" query as I understand
> it:
> 1. [datastore]Traverse the index and return entity keys
> 2. [app engine]Transform returned keys to retrieve parent entity
> 3. [datastore] Batch get of entities
>
> Your idea:
> 1. [datastore]Traverse the index and to get entity keys
> 3. [datastore] Batch get of entities
>
> So you would save step 2, a roundtrip app engine - datastore, not sure
> that's substantial though.
Nice reformulation! However, my proposal would be just a single step:
1. [datastore] Traverse the index and return entities
The only additional cost should be an additional if/else clause when
*writing* entities -- the datastroe would have to look to see if the
entity has an __index__ property. Quite a minimal cost given the
savings otherwise!
Why should it just be a single step? Because of the way that the
datastore works (at least according to the presentation that Ryan gave
last year):
* http://snarfed.org/space/datastore_talk.html
According to it (slide 21):
* Each row in an index table includes the index data and entity key
With my proposal, when an entity is indexed, the key defined in its
__index__ property will be used in place of its own key when the
indexes are constructed...
Of course these special "related entities" would still be accessible
by using their key names...
Hope that makes sense.
> Cute, but there's no way to guarantee that there's an object with that
> key.
Very good point!
Would require an additional check in the if/else clause on write of
the related entity which should raise an error if the entity it points
to doesn't exist...
And if all entities that were "__index__'ed" were kept in a separate
table -- much like the Kind table, then perhaps that could be checked
when an entity is about to be deleted and if a referrant related
entity is still around, an Exception could be raised?
Thoughts?
> Also, redirecting queries this way means that there's no way to get
> the key (or entity) via a query so the entity can be updated, deleted,
> etc.
Sure. But one should always be able to get the entity directly using
it's key name -- i tend to have predictable key_names for my related
entities -- and perhaps even use the transaction descendant queries
that landed yesterday in 1.2.4?