Memcached Connection Failure


nsheth

Sep 15, 2009, 7:44:41 PM
to memcached
Hello,

About once a day, usually during peak traffic times, I hit some major
load issues. I'm running memcached on the same boxes as my
webservers. Load usually spikes to 35-50, and I see the apache error
log flooded with messages like the following:

[Sun Sep 13 14:54:34 2009] [error] [client 10.0.0.2] PHP Warning:
memcache_pconnect() [<a href='function.memcache-pconnect'>function.
memcache-pconnect</a>]: Can't connect to 10.0.0.5:11211, Unknown error
(0) in /var/www/html/memcache.php on line 174, referer: xxxx

Any thoughts? Restart apache, and everything clears up.

I'm running memcached 1.2.5 currently (which looks to be a bit out of
date at this point, so perhaps an update is in order).

Thanks!

Vladimir

Sep 15, 2009, 8:27:28 PM
to memc...@googlegroups.com
nsheth wrote:
> About once a day, usually during peak traffic times, I hit some major
> load issues. I'm running memcached on the same boxes as my
> webservers. Load usually spikes to 35-50, and I see the apache error
> log flooded with messages like the following:
>
> [Sun Sep 13 14:54:34 2009] [error] [client 10.0.0.2] PHP Warning:
> memcache_pconnect() [<a href='function.memcache-pconnect'>function.
> memcache-pconnect</a>]: Can't connect to 10.0.0.5:11211, Unknown error
> (0) in /var/www/html/memcache.php on line 174, referer: xxxx
>
> Any thoughts? Restart apache, and everything clears up.
>


It's PHP. I have seen something similar, but in the last couple of weeks
it has "cleared" itself. That could be coincidental with moving to
memcached 1.4.1, code changes, etc. I actually have some Ganglia
snapshots of the behavior you are describing here:

http://2tu.us/pgr

The reason load goes to 35-50 is that Apache starts consuming greater
and greater amounts of memory, which indicates a PHP memory leak. Granted,
it could also have something to do with session garbage collection.

> I'm running memcached 1.2.5 currently (which looks to be a bit out of
> date at this point, so perhaps an update is in order).
>

I think that would be a wise choice.

Vladimir

Stephen Johnston

Sep 15, 2009, 8:45:11 PM
to memc...@googlegroups.com
This is a total long shot, but we spent a lot of time figuring out a similar issue that ended up being ephemeral port exhaustion.

Stephen Johnston

Vladimir

Sep 15, 2009, 8:48:39 PM
to memc...@googlegroups.com
Too many connections in CLOSE_WAIT state?

Anyway, I would highly recommend installing something like Ganglia to get some metrics.

Also, at a load of 35-50 the machine is not doing much other than swapping.

Eric Day

Sep 15, 2009, 9:08:07 PM
to memc...@googlegroups.com
Hi!

If you discover this is a TIME_WAIT issue (too many TCP sockets
waiting around in kernel), you can tweak this in the kernel:

# cat /proc/sys/net/ipv4/tcp_fin_timeout
60

# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

61000 - 32768 = 28232

(these are the defaults on Debian Linux).

So you only have a pool of 28232 sockets to work with, and each will
linger around for 60 seconds in a TIME_WAIT state even after being
close()d on both ends. You can increase your port range and lower
your TIME_WAIT value to buy you a larger window. Something to keep
in mind though for any clients/servers that have a high connect rate.
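To put a number on that window, here is a minimal sketch of the arithmetic, assuming the Debian defaults quoted above and assuming every closed connection holds its local port for the full 60 seconds. The function names are made up for illustration.

```python
# Rough capacity estimate for outbound connections, using the defaults
# quoted in this message. Function names are hypothetical.

def ephemeral_port_count(low: int, high: int) -> int:
    """Size of the port pool, using the same subtraction as above."""
    return high - low


def max_connect_rate(port_count: int, hold_seconds: float) -> float:
    """Connections per second that can be opened and closed indefinitely
    before every port in the pool is still being held by a closed socket."""
    return port_count / hold_seconds


ports = ephemeral_port_count(32768, 61000)
rate = max_connect_rate(ports, 60)
print(f"{ports} ports, roughly {rate:.0f} new connections per second")
```

Two caveats, both hedged: the ceiling applies to connections toward a single destination address and port, and on Linux the TIME_WAIT interval is, as far as I know, fixed at 60 seconds in the kernel (tcp_fin_timeout governs FIN_WAIT_2), so widening ip_local_port_range is probably the more dependable of the two knobs.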

-Eric

Vladimir

Sep 15, 2009, 9:19:40 PM
to memc...@googlegroups.com
I do question whether those would actually cause the load to spike.
Connection refused, perhaps, but I suspect the two, i.e. the load spike
and the connection refusals, are linked. Please correct me if I am wrong.
I just checked my tcp_time_wait metrics and they peak around 600 even
during these load spikes.

nsheth

Sep 16, 2009, 1:36:50 AM
to memcached
The machine isn't swapping, actually. I'll try to "catch" it
happening next time and see if I can get more information about the
connections used . . . and also look into upgrading to 1.4.1,
hopefully that helps.

nsheth

Sep 18, 2009, 6:30:14 PM
to memcached
We weren't experiencing any abnormal connection levels.

I did upgrade to the latest client and server version 1.4.1. So far
so good . . .

nsheth

Sep 22, 2009, 6:59:39 PM
to memcached
Hmm, just saw the same issue occur again. Load spiked to 35-40.
(I've set MaxClients to 40 in apache, and looking at the status page,
I see it basically using every thread, so that may explain that load
level).

Going back on the connections, it looks like we've got about 1.2k
connections in various states, so nowhere near any of these limits.
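For anyone who wants the same breakdown by state, a minimal sketch that tallies the "st" column of /proc/net/tcp on Linux. It only covers IPv4 sockets (IPv6 lives in /proc/net/tcp6), and the function name is made up for illustration.

```python
import os
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Only read the live table where it exists (Linux).
if os.path.exists("/proc/net/tcp"):
    with open("/proc/net/tcp") as f:
        for state, n in count_tcp_states(f.read()).most_common():
            print(f"{n:6d} {state}")
```

A large TIME_WAIT count would point toward the port-pool theory, while a large CLOSE_WAIT count would suggest the application is not closing sockets the peer has already closed.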

Any other thoughts?

Thanks!

dormando

Sep 22, 2009, 8:31:19 PM
to memcached
Hey,

Can you troubleshoot it more carefully without thinking it's specific to
memcached? How'd you track it down to memcached in the first place?

When your load is spiking, what requests are hitting your server? Can you
look at an apache server-status page to see what's in flight, or
re-assemble such a view from the logs?

It smells like you're getting a short flood of traffic. If you can see
what type of traffic you're getting at the time of the load spike you can
reproduce it yourself... Load the page yourself, time how long it takes to
render, then break it down and see what it's doing.

If it's related to memcached, it's still likely to be a bug in how you're
using it internally (looping wrong, or something) - since your load is
related to the number of apache procs, and you claim it's not swapping,
it's either doing disk io or running CPU hard.

-Dormando

nsheth

Sep 22, 2009, 8:45:18 PM
to memcached
I've already looked in some detail at that, but haven't been able to
discern any real pattern. I'll look again, though.

I suspect memcache, as whenever I experience this, I get a flood of
messages in my error log like:

[Sun Sep 13 14:54:34 2009] [error] [client 10.0.0.2] PHP Warning:
memcache_pconnect() [<a href='function.memcache-pconnect'>function.
memcache-pconnect</a>]: Can't connect to 10.0.0.5:11211, Unknown error
(0) in /var/www/html/memcache.php on line 174, referer: xxxx

Vladimir Vuksan

Sep 22, 2009, 8:47:19 PM
to memc...@googlegroups.com
I don't think running the CPU hard would explain it. You could have 100% CPU utilization and a load of one. A load of 35-40 is usually related to some type of IO: in most cases disk IO, though network IO is not out of the question. I would suggest installing something like Ganglia to get some actionable metrics. My money is on Apache consuming ever-increasing amounts of memory.

dormando

Sep 22, 2009, 8:50:48 PM
to memc...@googlegroups.com
Wrong;

for omg in `seq 1 30` ; do yes > /dev/null & done

observe load hit 30.
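The reason this works is that the Linux load average counts tasks that are runnable as well as tasks in uninterruptible sleep, so 30 CPU-bound processes push it to about 30 with no IO at all. A minimal sketch for reading the numbers, assuming the usual five-field /proc/loadavg layout; the function name and dictionary keys are made up for illustration.

```python
import os


def parse_loadavg(text: str) -> dict:
    """Split the five fields of /proc/loadavg into named values."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load_1min": float(one),
        "load_5min": float(five),
        "load_15min": float(fifteen),
        "runnable": int(runnable),    # scheduling entities runnable right now
        "total_tasks": int(total),    # scheduling entities that exist
        "last_pid": int(last_pid),
    }


# Only read the live file where it exists (Linux).
if os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
```

If the runnable count tracks the load average during a spike, the box is likely CPU-bound; if load is high while few tasks are runnable, blocked IO is the more likely suspect.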

-Dormando

dormando

Sep 22, 2009, 8:53:02 PM
to memcached
Okay,

Smells like you're leaking memcached connection objects somewhere, or you
have a ton of servers? During these spikes, can you telnet to memcached
and run the 'stats' command, or can you not connect either?

Try restarting memcached with -c (connection limit) set to 32767 or
somesuch. See if that changes things.

Is your pecl/memcache library fully upgraded?

If you're using memcached 1.2.8 or later the 'stats' output has a value
'listen_disabled_num' - if that value is nonzero, or incrementing, you're
hitting the connection limit on memcached.
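If telnet is awkward to use in the middle of a spike, the same check can be scripted. A minimal sketch that speaks the memcached text protocol directly ("stats" is answered with "STAT name value" lines and a final "END"); the function names are made up for illustration, and the host in the usage note is simply the address from the error log above.

```python
import socket


def parse_stats(response: str) -> dict:
    """Turn 'STAT name value' lines into a dict, stopping at END."""
    stats = {}
    for line in response.splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str, port: int = 11211, timeout: float = 2.0) -> dict:
    """Connect to memcached, send 'stats', and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection early
                break
            data += chunk
    return parse_stats(data.decode("ascii", errors="replace"))


def hitting_connection_limit(stats: dict) -> bool:
    """True when listen_disabled_num is present and nonzero."""
    return int(stats.get("listen_disabled_num", 0)) > 0
```

Calling fetch_stats("10.0.0.5") a few times during a spike and comparing listen_disabled_num between calls would show whether it is incrementing. If the connect itself times out, that is already an answer to the "can you not connect either?" question.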


Vladimir Vuksan

Sep 22, 2009, 8:53:52 PM
to memc...@googlegroups.com
I stand corrected :-)

Neil Sheth

Sep 22, 2009, 8:59:05 PM
to memc...@googlegroups.com
I've only got 2 servers hitting this.  Currently, the connection limit is set to 1024, but I can increase that.

I'm running now.  Looking at memcached stats, the value of listen_disabled_num is 0.

My pecl/memcache library is 2.2.5, latest stable.

Vladimir, I do have a cacti installation. Looking at that, I see a CPU peak at that time, but that may just be a result of having 40 apache threads actively churning?





Vladimir Vuksan

Sep 22, 2009, 9:06:36 PM
to memc...@googlegroups.com
Neil Sheth wrote:
> I'm running now. Looking at memcached stats, the value of
> listen_disabled_num is 0.
>
> My pecl/memcache library is 2.2.5, latest stable.
>
> Vladimir, I do have a cacti installation. Looking at that, I see a cpu
> peak at that time, but that may just be a result of having 40 apache
> threads actively churning?


I'd be interested in what the memory utilization is at the time.

Vladimir

Vladimir Vuksan

Sep 22, 2009, 9:19:27 PM
to memc...@googlegroups.com
Also, does dmesg show anything interesting?

Neil Sheth

Sep 24, 2009, 2:05:43 AM
to memc...@googlegroups.com
Vladimir -

I don't see anything odd about the memory utilization, nothing swapping, no huge spike in usage.  It does climb slightly, but doesn't look significant at all.

I ran dmesg. I'm not sure how to make sense of it, but it does show some segfaults, in both php and httpd. I'm not sure how to tell whether this is related:

httpd[24013]: segfault at 00007fff4b88b368 rip 00002b4367a61046 rsp 00007fff4b88b370 error 6
httpd[24522]: segfault at 00007fff4b88b368 rip 00002b4367a61046 rsp 00007fff4b88b370 error 6
httpd[24519]: segfault at 00007fff4b88b368 rip 00002b4367a61046 rsp 00007fff4b88b370 error 6
httpd[32640]: segfault at 00007fff9cf661f0 rip 00002b1a165d121d rsp 00007fff9cf661d0 error 6
php[32216]: segfault at 0000000000000015 rip 00002ae33cc53704 rsp 00007fff71ce6128 error 4
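One way to read those lines, as a hedged sketch: the trailing "error N" is the x86 page-fault error code (bit 0 set means a protection violation rather than an unmapped page, bit 1 set means a write, bit 2 set means user mode), and comparing the fault address with rsp shows whether the access was on the stack. The regex and function names are made up for illustration and only target the line format shown above.

```python
import re

# Matches kernel log lines of the form shown above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def decode_error_code(code: int) -> str:
    """Decode the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not mapped"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def describe(line: str):
    """Return a summary dict for a segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    rsp = int(m.group("rsp"), 16)
    return {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "fault": decode_error_code(int(m.group("err"))),
        "addr_minus_rsp": addr - rsp,  # small values mean a stack access
    }
```

Read this way, the httpd faults are user-mode writes to unmapped memory within a few bytes of the stack pointer, which is often what stack exhaustion (for example runaway recursion) looks like, while the php fault is a read at a near-null address. That is a hint about where to look, not a diagnosis.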