Can a memcache set failure corrupt data?


crizCraig

Dec 1, 2011, 9:17:57 PM
to appengine-ndb-discuss
In seeking to deal with the silent failures of memcache sets, I looked
to NDB for some guidance. I found the following code in
http://www.google.com/codesearch#i5-Ui7JIc4w/ndb/context.py&pv=1

     failures = memcache.set_multi(mapping)
     if failures:
       badkeys = []
       for failure in failures:
         badkeys.append(mapping[failure].key)
       logging.info('memcache failed to set %d out of %d keys: %s',
                    len(failures), len(mapping), badkeys)

While this at least logs failures, it doesn't seem to avoid the
following problem:
- A counter is successfully written to the memcache and datastore.
- That counter is then incremented successfully in the datastore but
silently fails to write to memcache.
- A subsequent increment reads the stale counter value from memcache
and writes the datastore with the wrong value.
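
To make the scenario concrete, here is a minimal sketch using plain dicts as stand-ins for memcache and the datastore (all names here are mine, purely illustrative):

```python
cache = {}      # stand-in for memcache
datastore = {}  # stand-in for the datastore

def set_counter(key, value, cache_write_fails=False):
    datastore[key] = value
    if not cache_write_fails:
        cache[key] = value  # a silent set failure skips this line

def increment(key):
    # A naive caching layer reads through the cache first.
    current = cache.get(key, datastore.get(key, 0))
    new_value = current + 1
    datastore[key] = new_value
    cache[key] = new_value

set_counter('hits', 1)                           # cache: 1, datastore: 1
set_counter('hits', 2, cache_write_fails=True)   # datastore: 2, cache stays 1
increment('hits')                                # reads stale 1, writes 2
print(datastore['hits'])  # 2, not the expected 3 -- a lost update
```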

Also, here's the thread in the vanilla App Engine Group I started in
reference to the same problem:
http://groups.google.com/group/google-appengine/browse_thread/thread/1f883584b6bc8afc/a54bed6aa7fd9ea6?lnk=gst&q=fail+silently#a54bed6aa7fd9ea6

Thanks in advance for your response.

Guido van Rossum

Dec 1, 2011, 10:13:09 PM
to appengine-...@googlegroups.com, Alfred Fuller
On Thu, Dec 1, 2011 at 18:17, crizCraig <craig....@gmail.com> wrote:
> In seeking to deal with the silent failures of memcache sets, I looked
> to NDB for some guidance. I found the following code in
> http://www.google.com/codesearch#i5-Ui7JIc4w/ndb/context.py&pv=1
>
>      failures = memcache.set_multi(mapping)
>      if failures:
>        badkeys = []
>        for failure in failures:
>          badkeys.append(mapping[failure].key)
>        logging.info('memcache failed to set %d out of %d keys: %s',
>                     len(failures), len(mapping), badkeys)
>
> While this at least logs failures, it doesn't seem to avoid the
> following problem:
> - A counter is successfully written to the memcache and datastore.
> - That counter is then incremented successfully in the datastore but
> silently fails to write to memcache.
> - A subsequent increment reads the stale counter value from memcache
> and writes the datastore with the wrong value.

This is a problem with all approaches to caching -- the cache can be
stale. The only 100% foolproof solution would be to offer transactions
across the datastore and memcache, but that's not an easy thing to do,
and certainly doesn't exist (nor are there plans to add it, AFAIK). And
even if it did exist, it would come with a performance penalty.

NDB tries to deal with this as best as it can, by doing something like
the following when you put() a key:

- overwrite the memcache key with a special LOCK value
- write to datastore
- delete the memcache key altogether

(For details read the code around
http://code.google.com/p/appengine-ndb-experiment/source/browse/ndb/context.py#645
.)

The value gets written back to memcache only later, when get()
successfully reads the value from the datastore. (See
http://code.google.com/p/appengine-ndb-experiment/source/browse/ndb/context.py#623
.)
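
Sketched end to end, the idea looks roughly like this. The dict-backed FakeCache, the _LOCK sentinel, and the function names are my stand-ins, not NDB's actual code; see the linked source for the real implementation:

```python
class FakeCache:
    """Dict-backed stand-in for a memcache client (illustrative only)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value, time=0):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

_LOCK = '__locked__'  # hypothetical sentinel; NDB uses its own marker value

def cached_put(cache, datastore, key, entity):
    cache.set(key, _LOCK, time=32)   # 1. lock the entry so racing readers
                                     #    can't repopulate it with stale data
    datastore[key] = entity          # 2. write the authoritative copy
    cache.delete(key)                # 3. drop the entry; a later get() refills it

def cached_get(cache, datastore, key):
    value = cache.get(key)
    if value is None or value == _LOCK:
        value = datastore[key]        # read the authoritative copy
        if cache.get(key) != _LOCK:   # write back only if no writer holds the lock
            cache.set(key, value)
    return value
```

The point of the LOCK sentinel is that the cache is only ever repopulated from a successful datastore read, never from a value that might be mid-write.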

There are still some scenarios where this could go wrong, but they are
pretty rare and would involve multiple writers with extreme delays
thrown into their execution. (A slight improvement is proposed in
http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=84.)

Also note that when App Engine's memcache server fails, it normally
clears all its state; and after a write operation succeeds, it will
never return old data at a later time. Theoretically it would be
possible for a network partition to temporarily cause memcache to be
inaccessible while it maintains its state, but I have never heard of
an incident where this actually happened. Also, upon network failure,
memcache operations actually raise exceptions.

Finally, your description makes it sound like you are using the incr
operation. NDB does not use incr in its caching of datastore
operations (though memcache offers it as a stand-alone operation).

[CC'ed Alfred in case he has better information.]

--
--Guido van Rossum (python.org/~guido)
