Connectivity issues under medium load

300 views
Skip to first unread message

Alexey N. Zlobin

unread,
May 11, 2011, 8:39:34 AM5/11/11
to spymemcached
Hi,

I'm using spymemcached in my servlet-based application. During load
testing I've met a connectivity problem. Regularly my application
fails with the following reports from spymemcached:

2011-05-11 08:45:44.335 INFO net.spy.memcached.MemcachedConnection:
Reconnecting due to exception on {QA sa=/10.0.0.10:11211, #Rops=3,
#Wops=0, #iq=0, topRop=net.spy.memcached.protocol.ascii.Optim
izedGetImpl@5c058b7f, topWop=null, toWrite=0, interested=1}
java.io.IOException: Disconnected unexpected, will reconnect.
at
net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:
392)
at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:
333)
at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:
219)
at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:
1591)
2011-05-11 08:45:44.335 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=/10.0.0.10:11211, #Rops=3, #Wops=0,
#iq=0,
topRop=net.spy.memcached.protocol.ascii.OptimizedGetImpl@5c058b7f,
topWop=null, toWrite=0, interested=1}, attempt 0.
2011-05-11 08:45:44.336 WARN
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Discarding
partially completed op:
net.spy.memcached.protocol.ascii.OptimizedGetImpl@5c058b7f
2011-05-11 08:45:44.354 WARN
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Discarding
partially completed op:
net.spy.memcached.protocol.ascii.StoreOperationImpl@28bf3bbb
2011-05-11 08:45:44.354 WARN
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Discarding
partially completed op:
net.spy.memcached.protocol.ascii.StoreOperationImpl@22ea86e3
2011-05-11 08:45:46.341 INFO net.spy.memcached.MemcachedConnection:
Reconnecting {QA sa=/10.0.0.10:11211, #Rops=0, #Wops=956, #iq=0,
topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@19285e7b,
toWrite=0, interested=0}
2011-05-11 08:45:46.342 INFO net.spy.memcached.MemcachedConnection:
Connection state changed for sun.nio.ch.SelectionKeyImpl@2952ea24
2011-05-11 08:45:46.355 INFO net.spy.memcached.MemcachedConnection:
Reconnecting due to exception on {QA sa=/10.0.0.10:11211, #Rops=2,
#Wops=0, #iq=0,
topRop=net.spy.memcached.protocol.ascii.OptimizedGetImpl@3b8e2477,
topWop=null, toWrite=0, interested=1}
java.io.IOException: Disconnected unexpected, will reconnect.
at
net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:
392)
at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:
333)
at
net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:
219)
at net.spy.memcached.MemcachedClient.run(MemcachedClient.java:
1591)
2011-05-11 08:45:46.356 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=/10.0.0.10:11211, #Rops=2, #Wops=0,
#iq=0,
topRop=net.spy.memcached.protocol.ascii.OptimizedGetImpl@3b8e2477,
topWop=null, toWrite=0, interested=1}, attempt 0.
2011-05-11 08:45:46.356 WARN
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Discarding
partially completed op:
net.spy.memcached.protocol.ascii.OptimizedGetImpl@3b8e2477
2011-05-11 08:45:46.375 WARN
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Discarding
partially completed op:
net.spy.memcached.protocol.ascii.OptimizedGetImpl@5a0deddc
2011-05-11 08:45:48.361 INFO net.spy.memcached.MemcachedConnection:
Reconnecting {QA sa=/10.0.0.10:11211, #Rops=0, #Wops=597, #iq=0,
topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@2e1972bf,
toWrite=0, interested=0}
2011-05-11 08:45:48.362 INFO net.spy.memcached.MemcachedConnection:
Connection state changed for sun.nio.ch.SelectionKeyImpl@8bc713e
2011-05-11 08:45:48.365 INFO net.spy.memcached.MemcachedConnection:
Reconnecting due to exception on {QA sa=/10.0.0.10:11211, #Rops=2,
#Wops=0, #iq=0,
topRop=net.spy.memcached.protocol.ascii.OptimizedGetImpl@3dbd83b2,
topWop=null, toWrite=0, interested=1}
java.io.IOException: Disconnected unexpected, will reconnect.

... and so on. Some times connection can't be restored for 5-6
minutes.

The application receives following exceptions:

net.spy.memcached.OperationTimeoutException: Timeout waiting for value
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:
924) ~[memcached-2.5.jar:]
at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:
939) ~[memcached-2.5.jar:]
<application stacktrace>

At the same time I able to connect to the memcached server from the
application machine using telnet. The memcached server is running with
-v key and doesn't report anything in the log.

Spymemcache version: 2.5, memcache server version: 1.4.2 (ubuntu
package 1.4.2-1ubuntu3). This problem starts to appear under load
around 400 reads and 200 writes per second.

I would appreciate any ideas about the source of this problem or about
possible workaround.

Thanks,
Alexey.

Matt Ingenthron

unread,
May 11, 2011, 12:31:52 PM5/11/11
to spymem...@googlegroups.com, Alexey N. Zlobin
Hi Alexey,

Receiving an IOException there does seem to indicate something did go
wrong with the connection to the server.

Two questions:
What does your constructor (or connection factory builder) for the
MemcachedClient look like?
Are you doing synchronous or asynchronous requests in your code? If
asynch, what kind of timeout?

I don't think I can explain why the connection is getting upset, but I
can tell you that there is a change in 2.6rc2 (posted on the google code
site) which should lessen the impact of this connection hiccup. It's
not guaranteed to eliminate timeouts from occurring in your application
code, but what it should happen is you'd receive less timeouts in a
shorter interval, followed by a quicker recovery.

The problem there (issue 136) was something that would impede recovery,
which would cause timeouts, which would cause a connection drop. I
can't be certain this is what's happening from what you post though.

Can you try 2.6rc2 to see if it helps?

Matt

Alexey Zlobin

unread,
May 12, 2011, 3:53:15 AM5/12/11
to Matt Ingenthron, spymem...@googlegroups.com
On Wed, May 11, 2011 at 8:31 PM, Matt Ingenthron <inge...@cep.net> wrote:
> Hi Alexey,
>
> Receiving an IOException there does seem to indicate something did go
> wrong with the connection to the server.
>
> Two questions:
> What does your constructor (or connection factory builder) for the
> MemcachedClient look like?

I use the simplest constructor I found, it looks like
new MemcachedClient( new InetSocketAddress(host, port) )

> Are you doing synchronous or asynchronous requests in your code?  If
> asynch, what kind of timeout?

I am using synchronous get (MemcachedClient#get) and asynchronous set
(MemcachedClient#set) methods. Set uses the default timeout value,
which is 1 second.

I've just reviewed my code - the set result is never read (i.e. I just
call the set method and not save returned future). May it cause the
problems?

> I don't think I can explain why the connection is getting upset, but I
> can tell you that there is a change in 2.6rc2 (posted on the google code
> site) which should lessen the impact of this connection hiccup.  It's
> not guaranteed to eliminate timeouts from occurring in your application
> code, but what it should happen is you'd receive less timeouts in a
> shorter interval, followed by a quicker recovery.
>
> The problem there (issue 136) was something that would impede recovery,
> which would cause timeouts, which would cause a connection drop.  I
> can't be certain this is what's happening from what you post though.
>
> Can you try 2.6rc2 to see if it helps?

Unfortunately I can't try to reproduce the problem with a rc build. I
forgot to mention that this problem repeats only in the production
environment, and I can't reproduce it on the test environment. The
main difference between them is that in the test env. memcached
installed on the same machine with application server and in the
production on an another.

I dig a bit into the spymemcached code and found that connection is
dropped if no bytes was read after receiving a socket ready
notification [1]. Why this case is considered as a connection failure?
Sorry if this question is too dumb, i'm a total newbee in async IO
handling.

[1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392

--
С уважением,
Алексей Злобин.

Matt Ingenthron

unread,
May 13, 2011, 3:39:26 AM5/13/11
to spymem...@googlegroups.com
On 5/12/11 12:53 AM, Alexey Zlobin wrote:
> On Wed, May 11, 2011 at 8:31 PM, Matt Ingenthron <inge...@cep.net> wrote:
>> Hi Alexey,
>>
>> Receiving an IOException there does seem to indicate something did go
>> wrong with the connection to the server.
>>
>> Two questions:
>> What does your constructor (or connection factory builder) for the
>> MemcachedClient look like?
> I use the simplest constructor I found, it looks like
> new MemcachedClient( new InetSocketAddress(host, port) )

Thanks, that'd mean you have a default of a 1sec timeout.

>> Are you doing synchronous or asynchronous requests in your code? If
>> asynch, what kind of timeout?
> I am using synchronous get (MemcachedClient#get) and asynchronous set
> (MemcachedClient#set) methods. Set uses the default timeout value,
> which is 1 second.
>
> I've just reviewed my code - the set result is never read (i.e. I just
> call the set method and not save returned future). May it cause the
> problems?

No, that shouldn't cause any problems.

>> I don't think I can explain why the connection is getting upset, but I
>> can tell you that there is a change in 2.6rc2 (posted on the google code
>> site) which should lessen the impact of this connection hiccup. It's
>> not guaranteed to eliminate timeouts from occurring in your application
>> code, but what it should happen is you'd receive less timeouts in a
>> shorter interval, followed by a quicker recovery.
>>
>> The problem there (issue 136) was something that would impede recovery,
>> which would cause timeouts, which would cause a connection drop. I
>> can't be certain this is what's happening from what you post though.
>>
>> Can you try 2.6rc2 to see if it helps?
> Unfortunately I can't try to reproduce the problem with a rc build. I
> forgot to mention that this problem repeats only in the production
> environment, and I can't reproduce it on the test environment. The
> main difference between them is that in the test env. memcached
> installed on the same machine with application server and in the
> production on an another.

Well, if you can't change out the code, perhaps you can at least look at
network statistics for details on what's gone in the wrong direction?
For instance, are there lots of TCP retransmits?

At the moment, my guess is you're pushing something past saturated into
drop-a-packet land in production. I know you may have a modest load,
but maybe something else running at the same time doesn't?

It's just a guess though. We're operating without much data here.

By the way, 2.6 is nnot a candidate any longer.

> I dig a bit into the spymemcached code and found that connection is
> dropped if no bytes was read after receiving a socket ready
> notification [1]. Why this case is considered as a connection failure?
> Sorry if this question is too dumb, i'm a total newbee in async IO
> handling.
>
> [1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392

No problem, but it is a newbie question. :) What that actually says is
if you read negative bytes, drop the connection. Reading nothing is a
regular occurrence. Reading a negative tells us that the connection is
gone, which we don't expect, so we queue a reconnect on the next time
through.

Regards,

Matt

Alexey Zlobin

unread,
May 13, 2011, 5:43:15 AM5/13/11
to spymem...@googlegroups.com

Do you mean that this looks like I've ran out of my network bandwidth
or tcp stack performance? Actually we do have a lot of other things
producing network traffic on the same machine. But traffic during the
test was much smaller than the interface capacity (100 Mb and 1000
Mb).

> It's just a guess though.  We're operating without much data here.
>
> By the way, 2.6 is nnot a candidate any longer.

That's great news about 2.6 release. I think I'll try to switch on it,
keep it for a few days on the test env. and switch the production in
order to repeat my tests.

>> I dig a bit into the spymemcached code and found that connection is
>> dropped if no bytes was read after receiving a socket ready
>> notification [1]. Why this case is considered as a connection failure?
>> Sorry if this question is too dumb, i'm a total newbee in async IO
>> handling.
>>
>> [1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392
>
> No problem, but it is a newbie question.  :)  What that actually says is
> if you read negative bytes, drop the connection.  Reading nothing is a
> regular occurrence.  Reading a negative tells us that the connection is
> gone, which we don't expect, so we queue a reconnect on the next time
> through.

Oh, i'm so ashamed. I checked this line 10 times, 10 times saw the '<'
mark and 10 times read it as 'equals'. Sorry again.

> --
> You received this message because you are subscribed to the Google Groups "spymemcached" group.
> To post to this group, send email to spymem...@googlegroups.com.
> To unsubscribe from this group, send email to spymemcached...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/spymemcached?hl=en.
>
>

--
Best regards,
Alexey Zlobin.

Matt Ingenthron

unread,
May 14, 2011, 1:47:56 AM5/14/11
to spymem...@googlegroups.com, Alexey Zlobin
> Do you mean that this looks like I've ran out of my network bandwidth
> or tcp stack performance? Actually we do have a lot of other things
> producing network traffic on the same machine. But traffic during the
> test was much smaller than the interface capacity (100 Mb and 1000
> Mb).
>
>> It's just a guess though. We're operating without much data here.
>>

Well, the behavior you describe sounds like some resource has been overrun.

Now that you mention it being a test, does it perhaps generate a lot of
garbage? If so, GC pauses can also take long enough to cause timeouts.
You may want to run your test with GC logging on to see if you can
correlate.

See the issue tracker for some work I'd done on that previously.


> Oh, i'm so ashamed. I checked this line 10 times, 10 times saw the '<'
> mark and 10 times read it as 'equals'. Sorry again.
>
>

Don't be. I've done far sillier things. And that was just earlier this
week!

Matt

Alexey Zlobin

unread,
May 17, 2011, 1:17:17 PM5/17/11
to Matt Ingenthron, spymem...@googlegroups.com
Hello again,

I found a way to reconfigure my test environment and upgrade it to 2.6.

The problem appears on it too. But there is a big reconnect speed
improvement indeed, and it now reconnects within few seconds instead
of several minutes.

GC pauses is around 200-300 ms once per 3-4 seconds and CPU is
underutilized (60-70% according to different measurement tools). So I
don't think that GC can cause my problems.

For tcp debugging I've found nothing better than tcpdump/tcptrace
tools (see the tcptrace output below). It looks like the only
exceptional parameter value is "segs cum acked".

TCP connection info:
7 TCP connections traced:
TCP connection 1:
host a: app:53679
host b: mem:11211
complete conn: RESET (SYNs: 0) (FINs: 1)
first packet: Tue May 17 15:40:34.385020 2011
last packet: Tue May 17 15:54:40.199197 2011
elapsed time: 0:14:05.814176
total packets: 587890
filename: tcpdump.out
a->b: b->a:
total packets: 262928 total packets: 324962
resets sent: 0 resets sent: 2
ack pkts sent: 262928 ack pkts sent: 324960
pure acks sent: 81011 pure acks sent: 5848
sack pkts sent: 6 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 1 max sack blks/ack: 0
unique bytes sent: 20252801 unique bytes sent: 15186202
actual data pkts: 181917 actual data pkts: 319111
actual data bytes: 20252801 actual data bytes: 15186202
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 1
pushed data pkts: 181716 pushed data pkts: 319111
SYN/FIN pkts sent: 0/0 SYN/FIN pkts sent: 0/1
req 1323 ws/ts: N/Y req 1323 ws/ts: N/Y
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 0 bytes mss requested: 0 bytes
max segm size: 2896 bytes max segm size: 2740 bytes
min segm size: 1 bytes min segm size: 5 bytes
avg segm size: 111 bytes avg segm size: 47 bytes
max win adv: 126 bytes max win adv: 6408 bytes
min win adv: 123 bytes min win adv: 6408 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 125 bytes avg win adv: 6407 bytes
initial window: 49 bytes initial window: 0 bytes
initial window: 1 pkts initial window: 0 pkts
ttl stream length: NA ttl stream length: NA
missed data: NA missed data: NA
truncated data: 15887132 bytes truncated data: 10540422 bytes
truncated packets: 181895 pkts truncated packets: 133100 pkts
data xmit time: 845.814 secs data xmit time: 845.814 secs
idletime max: 6894.7 ms idletime max: 6928.6 ms
throughput: 23945 Bps throughput: 17955 Bps
post-loss acks: 0 post-loss acks: 1
segs cum acked: 6293 segs cum acked: 166280
duplicate acks: 1 duplicate acks: 4
triple dupacks: 0 triple dupacks: 1
max # retrans: 0 max # retrans: 0
min retr time: 0.0 ms min retr time: 0.0 ms
max retr time: 0.0 ms max retr time: 0.0 ms
avg retr time: 0.0 ms avg retr time: 0.0 ms
sdv retr time: 0.0 ms sdv retr time: 0.0 ms

--
Best regards,
Alexey Zlobin.

Reply all
Reply to author
Forward
0 new messages