Caching a list on high replication datastore

Andrew Richardson

Feb 22, 2012, 5:25:56 PM

to google-a...@googlegroups.com

I'm considering switching to the high replication datastore, and I want to make sure I understand what needs to be cached in order to deal with eventual consistency. I know this question has been asked many times...just want to make sure I'm understanding after reading some other answers.

I have a page that displays a list of projects which a user is part of. From this page, they can rename a project, leave a project, or create a new project. If they perform any of these actions, the page is refreshed, and obviously should reflect the change.

This list is already being cached, so if user #1 is logged in, memcache may have "user-1-projects" stored. On the master-slave datastore, I can simply delete this value if they rename/leave/create a project, and the next page view will cause it to be rebuilt from a datastore query. But as I understand it, on the HRD I will have to modify the cached value in place rather than deleting it, i.e., I retrieve the "user-1-projects" list from the cache, rename/create/delete the relevant item, then write the updated "user-1-projects" entry back to the cache. If I just delete it as I'm currently doing, the subsequent query may return stale results.
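For illustration, the in-place update described above might look roughly like the sketch below. It is only a sketch: a plain dict stands in for memcache, projects are assumed to be dicts with "id" and "name" fields, and the helper names are made up for this example.

```python
def projects_key(user_id):
    # Cache key in the same shape as the "user-1-projects" example.
    return "user-%d-projects" % user_id


def update_cached_projects(cache, user_id, project_id, name=None, remove=False):
    """Patch one project in the cached list instead of deleting the entry.

    `cache` is a dict standing in for memcache (an assumption of this sketch).
    Returns False when nothing is cached, so there is nothing to patch.
    """
    key = projects_key(user_id)
    projects = cache.get(key)
    if projects is None:
        return False

    updated = []
    found = False
    for project in projects:
        if project["id"] != project_id:
            updated.append(project)
        elif not remove:
            # Rename: keep the project's position, swap in the new name.
            updated.append({"id": project_id, "name": name})
            found = True
    if not remove and not found:
        # Create: the project was not in the cached list yet.
        updated.append({"id": project_id, "name": name})

    cache[key] = updated
    return True
```

With a real shared cache, two requests could race on this read-modify-write, so that would need some thought as well.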

Is this correct?

Joshua Smith

Feb 22, 2012, 5:44:54 PM

to google-a...@googlegroups.com

It depends how you get the list of projects.

Queries are eventually consistent, but fetches by key are consistent right away.

So if the list for a user is stored as a list of IDs in the user's record, you can fetch that, then fetch the items listed, and all will be consistent always.

But if you are querying for projects that happen to have a user listed (WHERE user = :1), then you need to get clever.

Delete and rename are actually pretty easy to handle if you can afford another round trip to the database. Instead of querying for the records completely, you query for the keys, and then fetch those records. Here's the code I use:

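    from google.appengine.ext import db

    class HRModel(db.Model):
        @classmethod
        def gql_with_get(cls, query_string, *args, **kwds):
            # Keys-only query (eventually consistent), then a get by key
            # (consistent). Deleted entities come back as None, so filter them out.
            keys_query = db.GqlQuery('SELECT __key__ FROM %s %s' % (cls.kind(), query_string), *args, **kwds)
            return filter(None, db.get(keys_query))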

It works just like Model.gql() but it does the two-step. The db.get will get the consistent data. The filter is needed to handle deletes.

This will not detect new records appearing, however. There is no pretty way to deal with that. You have to somehow let the process doing the query know that if it doesn't see a certain record, it should retry.
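One way to read "it should retry" is sketched below. This is not part of the code above: `run_query` is a hypothetical callable that returns a list of dicts with an "id" field, and the attempt count and delay are arbitrary placeholders.

```python
import time


def query_until_seen(run_query, expected_id, attempts=3, delay=0.5, sleep=time.sleep):
    """Re-run run_query() until expected_id shows up or attempts run out.

    Returns the last results either way; the caller decides what to do if
    the record still has not appeared.
    """
    results = run_query()
    for _ in range(attempts - 1):
        if any(r["id"] == expected_id for r in results):
            break
        sleep(delay)  # give the index a moment to catch up
        results = run_query()
    return results
```

The process doing the query still has to learn `expected_id` from somewhere, for example a value passed along with the redirect after the create.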

In practice, I work around this through a trick in most cases. When you create a new project, the user probably needs to fill some stuff in. So create the record right away, and put in enough info so it will appear in your query results. Then have the user edit the existing record to specify the rest of the data, and put it back to the datastore.

By the time the user has done their part, eventual consistency will have occurred, and Bob's your uncle.

-Joshua

Robert Kluin

Feb 23, 2012, 1:27:47 AM

to google-a...@googlegroups.com

For the case of adding, you can also track what was just added, then
ensure that it is included in the results. So you'll run your normal
query (without the additional lookup step), then just ensure the new
item is in the list. The same can actually work for updates. If you
update one entity, ensure that the next page view updates or replaces
the version of that entity from the query results.
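A rough sketch of that merge step, assuming entities are dicts with an "id" field and that the recent changes are kept somewhere per user (memcache, say) as a mapping from id to entity. Using None to mark a recent delete is an addition of this sketch, not something described above.

```python
def merge_recent_changes(query_results, recent):
    """Overlay recently written entities on possibly stale query results.

    `recent` maps entity id -> entity dict for creates and updates, or
    None for a delete (the None convention is an assumption of this sketch).
    """
    merged = []
    seen = set()
    for entity in query_results:
        entity_id = entity["id"]
        seen.add(entity_id)
        if entity_id in recent:
            # Prefer the tracked version; drop the entity if it was deleted.
            if recent[entity_id] is not None:
                merged.append(recent[entity_id])
        else:
            merged.append(entity)
    # Add anything just created that the query has not caught up with yet.
    for entity_id, entity in recent.items():
        if entity_id not in seen and entity is not None:
            merged.append(entity)
    return merged
```

The tracked changes only need to live for as long as the query might lag, so a short expiry on them would probably be enough.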

In practice, this type of method may be cheaper since you're just
doing one query and not a query followed by a batch get (which will
potentially be many RPCs). Of course, if you want everything to be
consistent then Josh's method is probably about the best you can do.
You could probably improve it a bit by caching the entities in
memcache too; in other words, check memcache, then only go to the
datastore for keys not in memcache.
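Roughly, the memcache-first lookup could be sketched like this. A dict stands in for memcache, and `fetch_from_datastore` is a hypothetical callable that takes a list of keys and returns the entities in the same order, with None for ones that no longer exist (the way a batch get behaves).

```python
def get_with_cache(keys, cache, fetch_from_datastore):
    """Return entities for `keys` in order, reading the cache first.

    Only keys missing from the cache are passed to fetch_from_datastore.
    Keys whose entity no longer exists are skipped in the result.
    """
    found = {k: cache[k] for k in keys if k in cache}
    missing = [k for k in keys if k not in found]
    if missing:
        for key, entity in zip(missing, fetch_from_datastore(missing)):
            if entity is not None:
                cache[key] = entity  # remember it for the next request
                found[key] = entity
    return [found[k] for k in keys if k in found]
```

For this to stay consistent, whatever writes an entity would also have to update or drop its cached copy.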

Robert

Andrew Richardson

Feb 23, 2012, 1:48:57 AM

to google-a...@googlegroups.com

Thanks for the replies!

Josh's solutions are interesting. I agree with Robert that the extra datastore hits aren't my favorite, but the approach is appealing for simplicity's sake. And these operations are relatively rare (i.e., users will not be changing projects very often), so it could be acceptable. Tracking the recent changes is obviously much cheaper, but requires more coding to update and merge in the results from memcache. Sounds like it comes down to a trade-off between simplicity and datastore cost, which isn't too surprising.

At any rate, you confirmed what I thought the behavior would be, and that I do need to come up with a more robust caching solution before switching to the HRD. That's what I needed to know!
