Receiving an IOException there does seem to indicate something did go
wrong with the connection to the server.
Two questions:
What does your constructor (or connection factory builder) for the
MemcachedClient look like?
Are you doing synchronous or asynchronous requests in your code? If
asynch, what kind of timeout?
I don't think I can explain why the connection is getting upset, but I
can tell you that there is a change in 2.6rc2 (posted on the Google Code
site) which should lessen the impact of this connection hiccup. It's
not guaranteed to eliminate timeouts in your application
code, but what should happen is that you'd see fewer timeouts over a
shorter interval, followed by a quicker recovery.
The problem there (issue 136) was something that would impede recovery,
which would cause timeouts, which would cause a connection drop. I
can't be certain this is what's happening from what you post though.
Can you try 2.6rc2 to see if it helps?
Matt
I use the simplest constructor I found; it looks like:
new MemcachedClient( new InetSocketAddress(host, port) )
> Are you doing synchronous or asynchronous requests in your code? If
> asynch, what kind of timeout?
I am using synchronous get (MemcachedClient#get) and asynchronous set
(MemcachedClient#set) methods. Set uses the default timeout value,
which is 1 second.
I've just reviewed my code - the set result is never read (i.e. I just
call the set method and don't save the returned future). Could that
cause problems?
> I don't think I can explain why the connection is getting upset, but I
> can tell you that there is a change in 2.6rc2 (posted on the Google Code
> site) which should lessen the impact of this connection hiccup. It's
> not guaranteed to eliminate timeouts in your application
> code, but what should happen is that you'd see fewer timeouts over a
> shorter interval, followed by a quicker recovery.
>
> The problem there (issue 136) was something that would impede recovery,
> which would cause timeouts, which would cause a connection drop. I
> can't be certain this is what's happening from what you post though.
>
> Can you try 2.6rc2 to see if it helps?
Unfortunately I can't try to reproduce the problem with an RC build. I
forgot to mention that this problem occurs only in the production
environment; I can't reproduce it in the test environment. The
main difference between them is that in the test environment memcached
is installed on the same machine as the application server, while in
production it runs on another machine.
I dug a bit into the spymemcached code and found that the connection is
dropped if no bytes were read after receiving a socket-ready
notification [1]. Why is this case considered a connection failure?
Sorry if this question is too dumb, I'm a total newbie at async IO
handling.

[1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392
--
Best regards,
Alexey Zlobin.
Thanks, that'd mean you have a default of a 1sec timeout.
>> Are you doing synchronous or asynchronous requests in your code? If
>> asynch, what kind of timeout?
> I am using synchronous get (MemcachedClient#get) and asynchronous set
> (MemcachedClient#set) methods. Set uses the default timeout value,
> which is 1 second.
>
> I've just reviewed my code - the set result is never read (i.e. I just
> call the set method and don't save the returned future). Could that
> cause problems?
No, that shouldn't cause any problems.
>> I don't think I can explain why the connection is getting upset, but I
>> can tell you that there is a change in 2.6rc2 (posted on the Google Code
>> site) which should lessen the impact of this connection hiccup. It's
>> not guaranteed to eliminate timeouts in your application
>> code, but what should happen is that you'd see fewer timeouts over a
>> shorter interval, followed by a quicker recovery.
>>
>> The problem there (issue 136) was something that would impede recovery,
>> which would cause timeouts, which would cause a connection drop. I
>> can't be certain this is what's happening from what you post though.
>>
>> Can you try 2.6rc2 to see if it helps?
> Unfortunately I can't try to reproduce the problem with an RC build. I
> forgot to mention that this problem occurs only in the production
> environment; I can't reproduce it in the test environment. The
> main difference between them is that in the test environment memcached
> is installed on the same machine as the application server, while in
> production it runs on another machine.
Well, if you can't change out the code, perhaps you can at least look at
network statistics for details on what's gone in the wrong direction?
For instance, are there lots of TCP retransmits?
At the moment, my guess is you're pushing something past saturated into
drop-a-packet land in production. I know you may have a modest load,
but maybe something else running at the same time doesn't?
It's just a guess though. We're operating without much data here.
By the way, 2.6 is not a release candidate any longer.
> I dug a bit into the spymemcached code and found that the connection is
> dropped if no bytes were read after receiving a socket-ready
> notification [1]. Why is this case considered a connection failure?
> Sorry if this question is too dumb, I'm a total newbie at async IO
> handling.
>
> [1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392
No problem, but it is a newbie question. :) What that actually says is
if you read negative bytes, drop the connection. Reading nothing is a
regular occurrence. Reading a negative tells us that the connection is
gone, which we don't expect, so we queue a reconnect on the next time
through.
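A minimal sketch of those read() semantics with a plain SocketChannel (the
loopback setup, buffer size, and class name are illustrative, not spymemcached
code): a non-blocking read() returns 0 when nothing is available, and -1 once
the peer has closed the connection.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReadSemantics {
    public static void main(String[] args) throws IOException {
        // Loopback socket pair so we can observe both read() outcomes locally.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel client = SocketChannel.open(server.getLocalAddress());
        SocketChannel peer = server.accept();

        client.configureBlocking(false);
        ByteBuffer buf = ByteBuffer.allocate(64);

        // Nothing has been sent yet: a non-blocking read returns 0.
        // This is a normal, harmless case and not a failure.
        System.out.println("read with no data: " + client.read(buf));

        // The peer closes the connection: read returns -1 (end of stream),
        // which is the signal that triggers the reconnect logic.
        peer.close();
        int n;
        while ((n = client.read(buf)) == 0) { /* spin until the close is observed */ }
        System.out.println("read after close: " + n);

        client.close();
        server.close();
    }
}
```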
Regards,
Matt
Do you mean it looks like I've run out of network bandwidth
or TCP stack capacity? We actually do have a lot of other things
producing network traffic on the same machine, but the traffic during
the test was much smaller than the interface capacity (100 Mbit and
1000 Mbit).
> It's just a guess though. We're operating without much data here.
>
> By the way, 2.6 is not a release candidate any longer.
That's great news about the 2.6 release. I think I'll try switching to
it: keep it for a few days in the test environment, then switch
production in order to repeat my tests.
>> I dug a bit into the spymemcached code and found that the connection is
>> dropped if no bytes were read after receiving a socket-ready
>> notification [1]. Why is this case considered a connection failure?
>> Sorry if this question is too dumb, I'm a total newbie at async IO
>> handling.
>>
>> [1] https://github.com/dustin/java-memcached-client/blob/2.5/src/main/java/net/spy/memcached/MemcachedConnection.java#L392
>
> No problem, but it is a newbie question. :) What that actually says is
> if you read negative bytes, drop the connection. Reading nothing is a
> regular occurrence. Reading a negative tells us that the connection is
> gone, which we don't expect, so we queue a reconnect on the next time
> through.
Oh, I'm so ashamed. I checked this line 10 times, 10 times saw the '<'
sign, and 10 times read it as 'equals'. Sorry again.
--
Best regards,
Alexey Zlobin.
Well, the behavior you describe sounds like some resource has been overrun.
Now that you mention it being a test, does it perhaps generate a lot of
garbage? If so, GC pauses can also take long enough to cause timeouts.
You may want to run your test with GC logging on to see if you can
correlate.
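For instance, GC logging on the HotSpot JVMs of that era can be turned on
with flags along these lines (the log path and jar name are placeholders):

```shell
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/myapp/gc.log \
     -jar myapp.jar
```

Pause times in the log can then be lined up against the timestamps of the
client timeouts.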
See the issue tracker for some work I'd done on that previously.
> Oh, I'm so ashamed. I checked this line 10 times, 10 times saw the '<'
> sign, and 10 times read it as 'equals'. Sorry again.
>
>
Don't be. I've done far sillier things. And that was just earlier this
week!
Matt
I found a way to reconfigure my test environment and upgrade it to 2.6.
The problem appears there too, but there is indeed a big improvement in
reconnect speed: it now reconnects within a few seconds instead of
several minutes.
GC pauses are around 200-300 ms once per 3-4 seconds and the CPU is
underutilized (60-70% according to different measurement tools), so I
don't think GC can be causing my problems.
For TCP debugging I've found nothing better than the tcpdump/tcptrace
tools (see the tcptrace output below). It looks like the only
exceptional parameter value is "segs cum acked".
TCP connection info:
7 TCP connections traced:
TCP connection 1:
host a: app:53679
host b: mem:11211
complete conn: RESET (SYNs: 0) (FINs: 1)
first packet: Tue May 17 15:40:34.385020 2011
last packet: Tue May 17 15:54:40.199197 2011
elapsed time: 0:14:05.814176
total packets: 587890
filename: tcpdump.out
a->b: b->a:
total packets: 262928 total packets: 324962
resets sent: 0 resets sent: 2
ack pkts sent: 262928 ack pkts sent: 324960
pure acks sent: 81011 pure acks sent: 5848
sack pkts sent: 6 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 1 max sack blks/ack: 0
unique bytes sent: 20252801 unique bytes sent: 15186202
actual data pkts: 181917 actual data pkts: 319111
actual data bytes: 20252801 actual data bytes: 15186202
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 1
pushed data pkts: 181716 pushed data pkts: 319111
SYN/FIN pkts sent: 0/0 SYN/FIN pkts sent: 0/1
req 1323 ws/ts: N/Y req 1323 ws/ts: N/Y
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 0 bytes mss requested: 0 bytes
max segm size: 2896 bytes max segm size: 2740 bytes
min segm size: 1 bytes min segm size: 5 bytes
avg segm size: 111 bytes avg segm size: 47 bytes
max win adv: 126 bytes max win adv: 6408 bytes
min win adv: 123 bytes min win adv: 6408 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 125 bytes avg win adv: 6407 bytes
initial window: 49 bytes initial window: 0 bytes
initial window: 1 pkts initial window: 0 pkts
ttl stream length: NA ttl stream length: NA
missed data: NA missed data: NA
truncated data: 15887132 bytes truncated data: 10540422 bytes
truncated packets: 181895 pkts truncated packets: 133100 pkts
data xmit time: 845.814 secs data xmit time: 845.814 secs
idletime max: 6894.7 ms idletime max: 6928.6 ms
throughput: 23945 Bps throughput: 17955 Bps
post-loss acks: 0 post-loss acks: 1
segs cum acked: 6293 segs cum acked: 166280
duplicate acks: 1 duplicate acks: 4
triple dupacks: 0 triple dupacks: 1
max # retrans: 0 max # retrans: 0
min retr time: 0.0 ms min retr time: 0.0 ms
max retr time: 0.0 ms max retr time: 0.0 ms
avg retr time: 0.0 ms avg retr time: 0.0 ms
sdv retr time: 0.0 ms sdv retr time: 0.0 ms
--
Best regards,
Alexey Zlobin.