I originally thought I was seeing an occurrence of issue 192, but was
advised that it probably is not (comment #22).
http://code.google.com/p/spymemcached/issues/detail?id=192
Here is a snippet of the relevant spymemcached stack traces from our
logs (from version 2.6.1; we've since upgraded to 2.8.0, but I don't
have a current stack trace handy):
java.lang.IllegalStateException: Timed out waiting to add net.spy.memcached.protocol.binary.GetOperationImpl@7cb95c23(max wait=10000ms)
    at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:292)
    at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:611)
    at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:591)
    at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:279)
    at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:799)
    at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:919)
...
We haven't been able to reproduce this condition outside of live
traffic, so we only have stack traces and other forensics to piece
together what is happening. Basically, at some point we get an
avalanche of these timeout messages, followed by a recovery. Based on
observation of the server cluster, it does not appear to be caused by
a flapping server. For reference, I have also adjusted the opTimeout
setting without much success (I tried both very high and very low
values and settled somewhere in the middle, at 2500 ms).
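
For what it's worth, here is how I currently read the exception. This
is my assumption, not something I have confirmed against the
spymemcached source: the 10000 ms in the message does not match our
2500 ms opTimeout, which suggests the wait is for space in a bounded
per-node queue rather than for the operation itself. The sketch below
uses only the JDK; the class name, method name, and queue sizes are
hypothetical and chosen purely to reproduce the shape of the message.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is NOT spymemcached source code.
class QueueTimeoutSketch {

    // Try to enqueue an operation, waiting at most maxWaitMs for free space.
    // If the queue stays full for the whole wait, fail with an
    // IllegalStateException shaped like the one in the stack trace.
    static void addOp(BlockingQueue<String> inputQueue, String op, long maxWaitMs) {
        try {
            if (!inputQueue.offer(op, maxWaitMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "Timed out waiting to add " + op + "(max wait=" + maxWaitMs + "ms)");
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag before surfacing the failure.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to add " + op, e);
        }
    }

    public static void main(String[] args) {
        // A queue of capacity 2 that nothing drains, standing in for a node
        // that is not keeping up with incoming gets.
        BlockingQueue<String> inputQueue = new ArrayBlockingQueue<>(2);
        addOp(inputQueue, "get-1", 100);
        addOp(inputQueue, "get-2", 100);
        try {
            addOp(inputQueue, "get-3", 100); // queue is full, so this times out
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that reading is right, every caller that arrives while the queue is
full would block for the full wait and then fail, which would look
like the avalanche we see, and changing opTimeout alone would not
affect it.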
Anyway, just looking for some pointers or general advice on how to
tune against conditions where we seem to be overloaded with async
gets.
Thanks in advance!!
--Josh