Avoiding races when using memcache to back datastore queries


Joel Holveck

Mar 17, 2016, 1:00:16 PM
to Google App Engine
I'm working on part of an app that can receive requests for the same query several times in rapid succession, and the results are almost always the same within a short timespan.  Rather than running a datastore query each time, I'd like to cache the results in memcache.

I first thought I'd model my mechanism on how ndb uses memcache.  If I'm reading the docs right at https://cloud.google.com/appengine/docs/python/ndb/cache#memcache , ndb will delete a memcache key prior to a .put() and expect the next .get() to populate the memcache.  But it sounds like this could cause a race if instance 1 is performing a put at the same time as instance 2 is performing a get, and that stale data could be cached.

  1. Instance 1 determines it needs to put changes to an object into the datastore.
  2. Instance 2 determines it needs to get the same object from datastore.
  3. Instance 1 deletes the memcache key.
  4. Instance 2 queries the memcache, and finds it empty.
  5. Instance 2 loads the old version from the datastore.
  6. Instance 1 puts the new version in the datastore.
  7. Instance 2 puts the old version's data into memcache.
  8. Much later, Instance 3 queries the memcache for the object, and gets the old version.

This sequence only uses methods that are supposed to have strong consistency guarantees, but the way they're composed means that ndb's "fetch by key" would not be strongly consistent.
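
To make the race concrete, here's roughly the naive delete-then-put pattern as I understood the docs to describe it (the names and key scheme are hypothetical, and this is not ndb's actual code):

    from google.appengine.api import memcache
    from google.appengine.ext import ndb

    def cached_get(key):
        """Fetch an entity by ndb.Key, checking memcache first."""
        cache_key = key.urlsafe()
        entity = memcache.get(cache_key)
        if entity is not None:
            return entity                      # cache hit
        entity = key.get()                     # step 5: datastore read
        memcache.set(cache_key, entity)        # step 7: may cache a stale entity
        return entity

    def update(entity):
        memcache.delete(entity.key.urlsafe())  # step 3: delete before put
        entity.put()                           # step 6: datastore write

    # e.g. cached_get(ndb.Key('Thing', 123)), where Thing is a made-up kind

Nothing stops another instance's cached_get() from running between the delete and the put, which is exactly steps 3-7 above.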

I suspect the docs are simplifying the actual consistency protocol used by ndb, but this seems like a problem that needs to be solved pretty frequently.  Anybody have insights to share?

As I mentioned, my ultimate goal is to build a mechanism to cache a particular query in memcache, but I'm looking at just a get_by_id operation's caching to get ideas of how to do this well.

Thanks,
Piquan

Joel Holveck

Mar 17, 2016, 1:27:46 PM
to Google App Engine
By the way, feel free to say this should be on Stack Overflow if that's more appropriate.  I still don't have a good feel for what should be posted here vs. elsewhere.

Nick (Cloud Platform Support)

Mar 18, 2016, 4:56:05 PM
to Google App Engine
Hey Joel,

Your question falls more on the "general discussion" side of the sometimes-ambiguous boundary: it might be too in-depth for Stack Overflow, with too many possible answers, and yet it's also the sort of specific issue that tends to get redirected there. In the spirit of helping you out regardless, I'll be happy to offer some advice here.

Cache invalidation is traditionally held up as a class of problem in computing which is devilishly hard to get right. The cardinal sin is to have the system perceive stale data as fresh, although it's acceptable to fetch stale data as long as it's identifiable as such and the path to fresh data is relatively cheap.

The race condition you describe is definitely something to avoid, and I see one possible solution:

The deletion of the memcache key by instance 1 is clearly meant to prepare the way for a put() to the datastore of the new entity. We should therefore make the composition of these two actions atomic, so that instance 2 will fail to fetch from memcache and thereafter find the new value in the datastore. This can be accomplished as follows:

Rather than deleting the memcache key before the put, a timestamp can be put in its place (or somehow associated with it), to be deleted only once the datastore entity is safely put. This means that instance 2, attempting to access the memcache key, would notice that it's currently being updated and back off with extremely small incremental sleeps until it finds the new version available in the datastore.
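
As a rough sketch of what I mean (the names are illustrative, and error handling and loop bounds are left out):

    import time

    from google.appengine.api import memcache

    def write(cache_key, entity):
        # Mark the entry as in-flight with a timestamp rather than deleting it.
        memcache.set(cache_key, {'updating_since': time.time()})
        entity.put()                # the datastore write
        memcache.delete(cache_key)  # clear the marker once the put is durable

    def read(cache_key, key):
        value = memcache.get(cache_key)
        while isinstance(value, dict) and 'updating_since' in value:
            time.sleep(0.005)       # extremely small incremental sleep
            value = memcache.get(cache_key)
        if value is not None:
            return value            # ordinary cache hit
        return key.get()            # marker cleared or cache miss: go to the datastore

A production version would want to bound the retry loop and decide what a very old timestamp means (e.g. a writer that died mid-update).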

Could you go into a little more detail about your system specifically? I feel this will help spur more concrete discussion of the trade-offs you can make in your specific situation.

Regards,

Nick
Cloud Platform Community Support

Greg Jones

Mar 18, 2016, 6:28:35 PM
to Google App Engine
It probably doesn't surprise you that how NDB handles caching is more complicated (and more sensible) than the docs summarise. Specifically, to avoid the problem you highlight, NDB puts a sentinel '_LOCKED' value into memcache at certain points, to signal to other readers that an update is taking place and that they should not put anything back into the cache. With that in place, the sequence is more like:
  1. Instance 1 determines it needs to put changes to an object into the datastore.
  2. Instance 2 determines it needs to get the same object from the datastore.
  3. Instance 1 sets the memcache value to _LOCKED.
  4. Instance 2 queries the memcache, and finds _LOCKED.
  5. Instance 2 loads the old version from the datastore.
  6. Instance 1 puts the new version in the datastore.
  7. Because it found _LOCKED in memcache, Instance 2 doesn't set the value from the datastore back in memcache.
  8. Instance 1 deletes the memcache value.

NDB also makes use of the 'CAS' facility of the memcache service [1][2], which prevents certain re-orderings of this sequence, or failures at the memcache steps, from causing problems.
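
As a condensed, illustrative sketch of the reader's side (simplified from ndb's actual internals, with made-up names, using the App Engine memcache CAS API):

    from google.appengine.api import memcache

    _LOCKED = '_LOCKED'  # sentinel: "a write is in progress"

    def cached_read(cache_key, key):
        client = memcache.Client()        # a Client is needed for gets()/cas()
        value = client.gets(cache_key)    # gets() also records a CAS ID
        if value == _LOCKED:
            # A writer is mid-update: serve from the datastore and do NOT
            # put the possibly-stale result back into memcache.
            return key.get()
        if value is not None:
            return value                  # ordinary cache hit
        # Miss: claim the slot ourselves, then fill it conditionally.
        client.add(cache_key, _LOCKED)    # add() won't clobber an existing lock
        client.gets(cache_key)            # pick up a CAS ID for the slot
        entity = key.get()                # datastore read
        # cas() fails if the key was deleted, evicted, or rewritten since
        # gets(), so a concurrent writer's delete invalidates this put-back.
        client.cas(cache_key, entity)
        return entity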

Joel Holveck

Mar 18, 2016, 11:15:38 PM
to Google App Engine
Thanks for the help!

It looks like if memcache evicts the _LOCKED sentinel between steps 3 and 4 of Greg's sequence (for example, if a shard is reset during a maintenance or migration event), this algorithm could still cache stale data.  Greg, you mentioned that it uses CAS to prevent problems with memcache failures; I'm trying to think about how the instance could prevent this with judicious use of CAS, but nothing springs to mind.

To Nick's question about the specific circumstances I'm looking at, I'll try to put it in a nutshell.

I'm writing a user notification service.  The idea is that a computer (like a workstation or an Arduino) can post a message that will be transmitted to the user, even if they're on a different computer or mobile device.  The basic functionality is just pushing a notification through GCM, but there are extra features: for instance, a notification can be staged over time so that it's sent to less intrusive clients (like a notification center pop-up) before more intrusive ones (like a noisy phone push).  Also, when a notification is acknowledged on one client, it's cleared from the others, so you don't pick up your phone and see a lot of messages that you've already dealt with.  (There are a few other features that aren't germane to this discussion.)

On the web UI, there's a list of recent notifications, updated in realtime.  This is surprisingly tricky to get right, and I'm still working on the design.  It first loads the notification list from the server and displays it.  Then, it establishes a channel (using the Channel API), and reloads the notification list.  It does this little dance so that it can give the user a quick initial display (since establishing the channel is slow, in page load terms), but still be guaranteed not to drop messages between the initial load and when the channel is established.

Since the reload is almost always the same as the initial load, I'm experimenting with caching the notification list in memcache.  While I was designing the cache flow, I was having a hard time coming up with a correct algorithm.  I figured I'd start by considering how ndb caches objects, and realized that the documented algorithm admits the race condition I described.

As a separate issue, I'm still trying to decide if I can use a non-ancestor query and still have a live list of notifications, subject only to indexing and propagation delays.

Greg Jones

Mar 19, 2016, 7:13:58 AM
to Google App Engine
If the reader gets None instead (because the sentinel was evicted), then it sets _LOCKED itself, and then uses the CAS ID from that when it attempts to put the from-datastore value back into memcache.  Either it will put before the writer deletes (the delete isn't conditional), or the writer will delete first (or the value will get evicted), which will invalidate the CAS ID.
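
In the same illustrative terms as my earlier reader sketch, the writer's side is just:

    from google.appengine.api import memcache

    _LOCKED = '_LOCKED'

    def cached_write(cache_key, entity):
        memcache.set(cache_key, _LOCKED)  # tell readers not to repopulate
        entity.put()                      # the durable datastore write
        # The delete is unconditional; it also invalidates any reader's CAS ID.
        memcache.delete(cache_key)

Because the delete always runs (and an eviction has the same effect), a reader's conditional put-back of a stale entity can't outlive the write.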
