Thanks for the help!
It looks as though, if memcache evicts the _LOCKED sentinel between steps 3 and 4 of Greg's sequence (for example, when a maintenance or migration event resets the shard), this algorithm could still cache stale data. Greg, you mentioned that it uses CAS to prevent problems with memcache failures; I've been trying to work out how the instance could guard against this with judicious use of CAS, but nothing springs to mind.
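For anyone following along without the memcache docs open, here is roughly the compare-and-set behavior we're discussing, as a toy in-memory model (this is not App Engine's actual client, and all the names are mine). Note that in this toy model a CAS attempted after an eviction simply fails, which is the fail-safe case; the open question above is whether every stale-write path is covered that way.

```python
class ToyCache:
    """In-memory stand-in for a memcache shard with gets()/cas() semantics."""

    def __init__(self):
        self.data = {}  # key -> (value, version)

    def gets(self, key):
        # Returns (value, version); the version acts as the CAS token.
        return self.data.get(key)

    def cas(self, key, value, seen_version):
        # The write succeeds only if nothing changed the key since gets().
        item = self.data.get(key)
        if item is None or item[1] != seen_version:
            return False  # key evicted or rewritten: write refused
        self.data[key] = (value, item[1] + 1)
        return True

    def set(self, key, value):
        old = self.data.get(key)
        self.data[key] = (value, old[1] + 1 if old else 0)

cache = ToyCache()
cache.set("k", "_LOCKED")
_, token = cache.gets("k")
del cache.data["k"]  # simulate memcache evicting the sentinel
assert cache.cas("k", "value", token) is False  # CAS fails safe here
```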
To Nick's question about the specific circumstances I'm looking at, I'll try to put it in a nutshell.
I'm writing a user notification service. The idea is that a computer (like a workstation or an Arduino) can post a message that will be delivered to the user, even if they're on a different computer or mobile device. The basic functionality is just pushing a notification through GCM, but there are extra features: for instance, delivery can be staged over time, so a notification goes to less intrusive clients (like a notification center pop-up) before more intrusive ones (like a noisy phone push). Also, when a notification is acknowledged on one client it's cleared from the others, so you don't pick up your phone and see a pile of messages you've already dealt with. (There are a few other features that aren't germane to this discussion.)
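To make the staging idea concrete, here's a very rough sketch in Python; the client names, ranks, and delay values are all invented for illustration, not part of the actual service:

```python
# Each client type gets an intrusiveness rank; noisier clients only fire
# after a delay (and, in the real service, only if the notification is
# still unacknowledged by then).
INTRUSIVENESS = {
    "notification_center": 0,  # quiet pop-up, delivered immediately
    "browser_badge": 1,
    "phone_push": 2,           # noisy, delivered last
}

def delivery_schedule(clients, stage_delay=30):
    """Return (client, seconds_to_wait) pairs, least intrusive first."""
    ranked = sorted(clients, key=lambda c: INTRUSIVENESS[c])
    return [(c, INTRUSIVENESS[c] * stage_delay) for c in ranked]

print(delivery_schedule(["phone_push", "notification_center"]))
# [('notification_center', 0), ('phone_push', 60)]
```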
On the web UI, there's a list of recent notifications, updated in realtime. This is surprisingly tricky to get right, and I'm still working on the design. It first loads the notification list from the server and displays it. Then, it establishes a channel (using the Channel API), and reloads the notification list. It does this little dance so that it can give the user a quick initial display (since establishing the channel is slow, in page load terms), but still be guaranteed not to drop messages between the initial load and when the channel is established.
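The load/establish/reload dance above can be sketched as a toy simulation (the server and channel here are fakes of my own; the only point is the ordering guarantee):

```python
class FakeServer:
    """Stand-in for the real backend; no App Engine APIs involved."""

    def __init__(self):
        self.notifications = ["welcome"]

    def list_notifications(self):
        return list(self.notifications)

    def open_channel(self):
        # Channel setup is slow in page-load terms; a notification can
        # easily arrive while it is being established.
        self.notifications.append("arrived-during-setup")
        def channel():
            return []  # delivers only messages sent after this point
        return channel

def page_load(server):
    shown = server.list_notifications()  # 1. quick initial display
    channel = server.open_channel()      # 2. slow: establish the channel
    shown = server.list_notifications()  # 3. reload to close the gap
    return shown  # anything arriving later comes in over the channel
```

The reload at step 3 is what guarantees that a message posted during channel setup is never dropped: it's either in the reloaded list or delivered over the now-open channel.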
Since the reload is almost always identical to the initial load, I'm experimenting with caching the notification list in memcache. While designing the cache flow, I had a hard time coming up with a correct algorithm. I figured I'd start by considering how ndb caches objects, and realized that the documented algorithm admits the race condition I described.
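For concreteness, the naive version of the cache flow I mean looks something like this (dict-backed stand-ins for memcache and the datastore; the comment marks where the race window opens, which is what the sentinel/CAS dance is meant to close):

```python
CACHE = {}      # stand-in for memcache
DATASTORE = {}  # stand-in for the datastore

def get_notifications(user_id):
    key = "notifs:%s" % user_id
    if key in CACHE:
        return CACHE[key]
    items = list(DATASTORE.get(user_id, []))  # authoritative read
    # If a writer updates the datastore and invalidates the cache right
    # here, this unconditional fill re-caches the stale list.
    CACHE[key] = items
    return items

def add_notification(user_id, note):
    DATASTORE.setdefault(user_id, []).append(note)
    CACHE.pop("notifs:%s" % user_id, None)  # invalidate on write
```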
On a separate note, I'm still trying to decide whether I can use a non-ancestor query and still have a live list of notifications, subject only to indexing and propagation delays, but that's a concern for another time.