Issues with OperationTimeoutException

1,752 views
Skip to first unread message

Michael Leiseca

unread,
Jul 19, 2013, 5:13:59 PM7/19/13
to spymem...@googlegroups.com
Hi,

I have been trying to debug issues we are having with spymemcached (2.8.12) involving timeouts. We have 9 tomcat instances using 7 memcached instances and seeming randomly get timeout exceptions. Our peak usage is around 300 operations/second, but are seeing exceptions well below that level of usage

I have spent the day trying out different configuration options, even extending the timeout to 1 second, but am still seeing regular timeouts. The memcached server seems to be doing almost nothing, so I'm guessing that the problem is inside of spymemcached. I know that there have been some conversations about this issue in this group, but couldn't tell which ones are still current and if there are configuration options that are recommended for high throughput.

thanks!

Michael


Here are selected bits of the stack trace:

net.spy.memcached.OperationTimeoutException: Timeout waiting for value
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1036)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1051)
...
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: /...
        at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:160)
        at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1026)

Boris Partensky

unread,
Jul 21, 2013, 1:23:53 PM7/21/13
to spymem...@googlegroups.com
I'd first look into your GC behavior. Turn up gc logging of your tomcat jvm and see if the timings of the timeout spikes correspond to reported stw phases of the GC. Assuming that you use CMS gc, adding the following settings to your JVM args may provide enough details - "-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:/path/to/your/gc.log"



--
You received this message because you are subscribed to the Google Groups "spymemcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spymemcached...@googlegroups.com.
To post to this group, send email to spymem...@googlegroups.com.
Visit this group at http://groups.google.com/group/spymemcached.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Michael Leiseca

unread,
Jul 22, 2013, 2:34:09 PM7/22/13
to spymem...@googlegroups.com
It doesn't look like GC is the problem. I turned on logging and everything looks fine.

I have upgraded to 2.9.1 and am still having the same problem. It seems like it happens in bunches (maybe 5-10 timeouts) after several minutes of high volume traffic.

I have been poking through the debug level logs. NIO is new to me so I'm not sure how useful my trying to debug this is, but here's what I see when a request times out:

2013-07-22 11:12:03,535 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.MemcachedConnection - Selecting with delay of 0ms
2013-07-22 11:12:03,535 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.MemcachedConnection - Selected 1, selected 1 keys
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.MemcachedConnection - Handling IO for:  sun.nio.ch.SelectionKeyImpl@1b5970e6 (r=false, w=true, c=false, op={QA sa=/10.1.240.15:11211, #Rops=0, #Wops=1, #iq=0, topRop=null, topWop=Cmd: 0 Opaque: 1940 Key: MYKEY, toWrite=0, interested=4})
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.GetOperationImpl - Transitioned state from WRITE_QUEUED to WRITING
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl - After copying stuff from Cmd: 0 Opaque: 1940 Key: MYKEY: java.nio.DirectByteBuffer[pos=68 lim=16384 cap=16384]
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.GetOperationImpl - Transitioned state from WRITING to READING
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl - Finished writing Cmd: 0 Opaque: 1940 Key: MYKEY
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl - Wrote 68 bytes
2013-07-22 11:12:03,536 [Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211 /10.1.240.15:11211}] DEBUG net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl - Setting interested opts to 1


Given the logs for successful requests, I would expect to see something like the below (but I never do for this get request)

.... Handling IO for:  sun.nio.ch.SelectionKeyImpl@1b5970e6 (r=true, w=false, c=false, op={QA sa=/10.1.240.15:11211, #Rops=0, #Wops=1, #iq=0, topRop=null, topWop=Cmd: 0 Opaque: 1940 Key: MYKEY, toWrite=0, interested=4})
... net.spy.memcached.protocol.binary.GetOperationImpl - Reading ## header bytes

This is the error when the request eventually times out:

net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 500 ms. Node status: Connection Status { localhost/127.0.0.1:11211 active: true, authed: true, last read: 3,495 ms ago /10.1.240.15:11211 active: true, authed: true, last read: 992 ms ago }

Serkan Camurcuoglu

unread,
Aug 1, 2013, 10:40:52 AM8/1/13
to spymem...@googlegroups.com
Hi Michael, any update on this? We're also experiencing a similar problem.

Thanks,

Boris Partensky

unread,
Aug 2, 2013, 6:12:46 PM8/2/13
to spymem...@googlegroups.com
(Just realized that I typed a response awhile ago and failed to send, sorry).

Some questions (always helps to put as many details as you can) - What jre version? Do you use CMS? How long are ParNew pauses? What kind of icms_dc percentages are you seeing? Are there any Full GCs? If you are 100% sure that your GC is tuned correctly, then you can just create a pool of spy clients and round robin requests. We do that for some apps where the load onto memcache is very high.

Reply all
Reply to author
Forward
0 new messages