Thousands of CLOSE_WAIT connections


Nic Jansma

Sep 30, 2013, 11:36:18 AM
to memc...@googlegroups.com
Using memcached 1.4.15
 
Over the past few months I've been having a few issues with memcached, notably the issue here: http://code.google.com/p/memcached/issues/detail?id=260
 
However, another, possibly related, issue is that once every few days, clients (websites running PHP) cannot connect to the localhost memcached instance.
 
I'm running with a connection limit of 4096:
memcached -d -p 11211 -m 1024 -c 4096 -P /var/run/memcached/memcached.pid -l 127.0.0.1 -k
 
When the error started happening today, I looked at the log and saw several "Too many open connections" lines.  Here's the first one:
...
>4094 Writing bin response:
...
<4094 connection closed.
Too many open connections
...
 
Connection #4094 is right around my 4096 connection limit...
 
Looking at netstat and lsof, I see around 4000 connections -- most of them are in CLOSE_WAIT.
 
lsof:
...
memcached 477 root 1366u  IPv4          622713568        0t0       TCP localhost:memcache->localhost:43324 (CLOSE_WAIT)
memcached 477 root 1367u  IPv4          622713571        0t0       TCP localhost:memcache->localhost:43325 (CLOSE_WAIT)
...
 
netstat:
...
tcp       39      0 localhost:memcache          localhost:45768             CLOSE_WAIT
tcp       25      0 localhost:memcache          localhost:42766             CLOSE_WAIT
...
 
The rest of the connections are in FIN_WAIT1, ESTABLISHED, etc, but the vast majority are the ~4,000 CLOSE_WAIT connections.
 
I'm going to try bumping up the connection limit to 32k to avoid this for now, but clearly there's something going on where memcached isn't closing connections.
 
How can I debug this further?
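For reference, a quick way to put numbers on the lsof/netstat output is a small script along these lines. It's only a minimal sketch, assuming Linux with iproute2's ss and the default port 11211 (both assumptions, not anything memcached ships):

import collections
import subprocess

PORT = ":11211"

def tally_states(port=PORT):
    # "ss -tan" columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    out = subprocess.check_output(["ss", "-tan"]).decode()
    counts = collections.Counter()
    for line in out.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) >= 5 and (fields[3].endswith(port) or fields[4].endswith(port)):
            counts[fields[0]] += 1             # tally by TCP state
    return counts

if __name__ == "__main__":
    for state, n in sorted(tally_states().items(), key=lambda kv: -kv[1]):
        print("%-12s %d" % (state, n))

Run from cron or a loop, this makes it easy to see whether CLOSE_WAIT creeps up slowly or jumps all at once.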

Les Mikesell

Sep 30, 2013, 12:23:42 PM
to memc...@googlegroups.com
On Mon, Sep 30, 2013 at 10:36 AM, Nic Jansma <nicj...@gmail.com> wrote:
> ...
>
> The rest of the connections are in FIN_WAIT1, ESTABLISHED, etc, but the vast
> majority are the ~4,000 CLOSE_WAIT connections.
>
> I'm going to try bumping up the connection limit to 32k to avoid this for
> now, but clearly there's something going on where memcached isn't closing
> connections.

That sounds like your client is opening persistent connections but not
reusing them.

--
Les Mikesell
lesmi...@gmail.com

Nic Jansma

Sep 30, 2013, 12:44:30 PM
to memc...@googlegroups.com

On Monday, September 30, 2013 12:23:42 PM UTC-4, LesMikesell wrote:
That sounds like your client is opening persistent connections but not
reusing them.
 
Doesn't the CLOSE_WAIT state indicate that the client has sent a FIN and yet memcached hasn't internally closed/cleaned up the connection?
 
I don't fully understand how pecl memcache/memcached's persistent connections feature uses the socket, but I don't imagine they would send a FIN then still try to re-use the connection later.
 
- Nic

Les Mikesell

Sep 30, 2013, 1:42:32 PM
to memc...@googlegroups.com
Yes, I guess I had it backwards. CLOSE_WAIT should mean the other end
has sent the FIN and is done sending. The server still needs to send
the requested data before closing its side - maybe the client isn't
reading fast enough and the tcp window is full at this point.
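One way to check that theory is to look at the kernel queues on the stuck sockets: a nonzero Send-Q means the server still has response bytes it can't deliver, while a nonzero Recv-Q means bytes arrived that the server process never read. A minimal sketch (again assuming Linux with iproute2's ss, and port 11211):

import subprocess

def close_wait_queues(port=":11211"):
    # "ss -tan" columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    out = subprocess.check_output(["ss", "-tan"]).decode()
    rows = []
    for line in out.splitlines()[1:]:          # skip the header row
        f = line.split()
        if len(f) >= 5 and f[0] == "CLOSE-WAIT" and f[3].endswith(port):
            # (local, peer, bytes unread by the server, bytes undelivered to the client)
            rows.append((f[3], f[4], int(f[1]), int(f[2])))
    return rows

if __name__ == "__main__":
    for local, peer, recvq, sendq in close_wait_queues():
        print("%s <- %s  Recv-Q=%d  Send-Q=%d" % (local, peer, recvq, sendq))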

--
Les Mikesell
lesmi...@gmail.com

Hirotaka Yamamoto

Oct 1, 2013, 1:34:56 AM
to memc...@googlegroups.com
You may try php-yrmcds, which is another memcached extension for PHP.

Nic Jansma

Oct 3, 2013, 5:50:17 PM
to memc...@googlegroups.com
I don't think the problem is caused by the PHP extension at this point, since sockets stuck in a CLOSE_WAIT state most likely indicate that something in the memcached server is not closing the connection after it gets a FIN.
 
I've been monitoring memcached over the past few days, and it normally doesn't get into this state, i.e. I don't normally see ANY CLOSE_WAIT connections open.
 
However, when it starts not responding to connection requests, it appears to be in this state where there are thousands of CLOSE_WAIT connections.
 
Does anyone have suggestions for how I can debug this further?  Memcached is not proving reliable enough for production for me right now (including the weekly crash described in the original post, which may or may not be related).

Les Mikesell

Oct 3, 2013, 6:39:54 PM
to memc...@googlegroups.com
On Thu, Oct 3, 2013 at 4:50 PM, Nic Jansma <nicj...@gmail.com> wrote:
> I don't think the problem is caused by the PHP extension at this point,
> since having the sockets in a CLOSE_WAIT state is most likely indicative of
> something in the memcached server not closing the connection after it gets a
> FIN.
>
The send/receive sides of a socket are closed separately. The
client's FIN means it isn't going to send any more. That's almost
unrelated to the server having to finish sending the requested data
before closing its side. It can very well be that the client isn't
reading the response it requested and your tcp window is full as the
server tries to send.
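A toy illustration of that half-close (plain Python sockets on localhost, nothing to do with memcached's own code): the client announces it is done sending via shutdown(), and the server-side socket then sits in CLOSE_WAIT until the server process itself calls close().

import socket
import subprocess
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))              # any free port
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.create_connection(("127.0.0.1", port))
conn, _ = srv.accept()

cli.shutdown(socket.SHUT_WR)            # client FIN: "I'm done sending"
time.sleep(0.5)

# Expect CLOSE-WAIT on the server-side socket (and FIN-WAIT-2 on the client
# side); it stays that way until the server closes its end below.
subprocess.call("ss -tan | grep %d" % port, shell=True)

conn.close()                            # server finally closes its side
cli.close()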

> I've been monitoring memcached over the past few days, and it normally
> doesn't get in this state. i.e. I don't normally see ANY CLOSE_WAIT
> connections open.
>
> However, when it starts not responding to connection requests, it appears to
> be in this state where there's thousands of CLOSE_WAIT connections.

You have a finite number of sockets.

> Does anyone have suggestions for how I can debug this further? Memcached is
> not being production-ready reliable for me right now (including the weekly
> crash listed in the original post, which may or may not be related).

Lots of people are using various versions of the server in production.
Can you try to reproduce the behavior with some other client or
benchmark/load test tool? My guess is that it is something unrelated
in the code calling the client library that can cause it to request
but not consume the response.

--
Les Mikesell
lesmi...@gmail.com

Nic Jansma

Oct 6, 2013, 1:14:09 PM
to memc...@googlegroups.com
I tried running the mc_conn_tester script, but wasn't able to trigger the problem with that.
 
The interesting thing is that even if I close down all apps that are communicating with memcached, eg. Apache (due to the pecl memcached modules), the memcached daemon still has thousands of CLOSE_WAIT connections.  There are no other outstanding apps or connections talking to it, yet there are thousands of sockets stuck in a CLOSE_WAIT state.
 
Shouldn't memcached be cleaning those up as the sockets hard close?
 

Roberto Spadim

Oct 6, 2013, 2:01:57 PM
to memc...@googlegroups.com
Hi guys,
I had a similar problem in the past, though not with memcached. It was solved with some TCP options; I don't remember exactly how, but it involved SO_LINGER and some other tuning.
Maybe something like that could help.



2013/10/6 Nic Jansma <nicj...@gmail.com>
I tried running the mc_conn_tester script, but wasn't able to trigger the problem with that.
 
The interesting thing is that even if I close down all apps that are communicating with memcached, eg. Apache (due to the pecl memcached modules), the memcached daemon still has thousands of CLOSE_WAIT connections.  There are no other outstanding apps or connections talking to it, yet there are thousands of sockets stuck in a CLOSE_WAIT state.
 
Shouldn't memcached be cleaning those up as the sockets hard close?
 




--
Roberto Spadim

Nic Jansma

Oct 6, 2013, 4:13:17 PM
to memc...@googlegroups.com
memcached.c appears to disable SO_LINGER

     ...
     struct linger ling = {0, 0};   /* l_onoff = 0: lingering disabled, i.e. the default close behavior */
     error = setsockopt(sfd, SOL_SOCKET, SO_LINGER, (void *)&ling, sizeof(ling));
     ...

dormando

Oct 7, 2013, 12:14:52 AM
to memc...@googlegroups.com
Is memcached handling any traffic at all when you see it in this state? If
you strace it, is it still reading/writing/anything happening?

It sort of sounds like it's hardlocking and your clients are trying to
close off (not persistent or whatever), but it's just not doing anything
at all, which means it could be related to the aforementioned bug.

If existing connections do still work, it's possibly something else
entirely.

You can see this any number of ways: does bandwidth to the memcached box
drop to nothing? If you write a test script that connects and does
sets/fetches once per second, does it hang? Etc.
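A minimal sketch of that kind of probe (ascii protocol over a plain socket; the host, port, key name and timeout are all assumptions): once per second it opens a fresh connection, does a set and a get, and reports how long it took or that it failed/hung.

import socket
import time

def probe(host="127.0.0.1", port=11211, timeout=5.0):
    start = time.time()
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        # ascii protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
        s.sendall(b"set probe:key 0 60 5\r\nhello\r\n")
        assert s.recv(1024).startswith(b"STORED")
        s.sendall(b"get probe:key\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = s.recv(1024)
            if not chunk:
                raise RuntimeError("connection closed mid-response")
            buf += chunk
        assert b"hello" in buf
        s.close()
        return "ok %.3fs" % (time.time() - start)
    except Exception as e:
        return "FAIL after %.3fs: %s" % (time.time() - start, e)

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), probe())
        time.sleep(1)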

Nic Jansma

Oct 15, 2013, 5:49:04 PM
to memc...@googlegroups.com

On Monday, October 7, 2013 12:14:52 AM UTC-4, Dormando wrote:
Is memcached handling any traffic at all when you see it in this state? If
you strace it is it still reading/writing/anything happening?
 
memcached isn't deadlocked at this point, as a small number of connections seem to connect OK.
 
Since I haven't found a way to reduce the occurrence of the problem, I switched from memcached to APC's user cache (I don't need the distributed cache aspect).  APC has been reliable for me thus far.
 
It's probably just something strange with the hardware/OS/software combination I'm using, since the problem doesn't seem widespread.  Once I upgrade hardware I might try again (memcached had been working nearly flawlessly in the past 8 years of projects on older hardware).

mr.ji...@gmail.com

Apr 22, 2018, 9:42:34 PM
to memcached
Hello, I had the same problem as you.
Server OS: Windows Server 2008
memcached 1.4.15
There are many CLOSE_WAIT connections, and memcached doesn't work.
Did you resolve this issue?
Thanks.

On Monday, September 30, 2013 at 11:36:18 PM UTC+8, Nic Jansma wrote:

jjo...@smugmug.com

Oct 12, 2018, 8:40:32 PM
to memcached
We're testing version 1.5.10 as a possible upgrade candidate for our older memcached servers, using a pool of 9 servers.  They are running in parallel with the production pool, also 9 servers.  For the test, all read requests are going to the production pool, and all updates (set, delete, etc.) are sent to one server in the production pool and one server in the 1.5.10 pool via the key hashing algorithm.

That setup had been running without incident for about 12 days; then yesterday two of the servers experienced the mass of CLOSE_WAIT connections similar to what's been described here.  We were able to collect some data, but not enough to figure out what's happening, so I'm hoping to kickstart a discussion here about how to diagnose what's going on.  Until we can find a way to explain (and prevent) another problem like this, we're unable to upgrade.

I can provide more information about our configuration.  I'm just not sure what bits are useful/interesting.  I will note that we're using "extstore" functionality on the new servers.

-jj

dormando

Oct 12, 2018, 9:24:56 PM
to memcached
Hey,

Probably unrelated to the original thread. I'm highly interested in
understanding what's going on here though, and would appreciate any help
you can spare.

What data did you collect? The more information you can share the better:

1) start arguments for the daemon
2) stats output (stats, stats items, stats slabs, and importantly "stats
conns") - you will probably have to redact some of the output.
3) were you still able to connect to and run commands on those daemons
while they had a mass of CLOSE_WAIT?
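For collecting item 2, something along these lines works over the ascii protocol (a sketch only; host/port are assumptions, and if the daemon is truly hung the connect or recv will simply time out, which also answers item 3):

import socket

COMMANDS = ["stats", "stats items", "stats slabs", "stats conns"]

def dump_stats(host="127.0.0.1", port=11211, timeout=5.0):
    for cmd in COMMANDS:
        try:
            s = socket.create_connection((host, port), timeout=timeout)
            s.sendall((cmd + "\r\n").encode())
            buf = b""
            while not buf.endswith(b"END\r\n"):     # ascii stats replies end with END
                chunk = s.recv(4096)
                if not chunk:
                    break
                buf += chunk
            s.close()
            print("=== %s ===" % cmd)
            print(buf.decode(errors="replace"))
        except socket.error as e:
            print("=== %s === FAILED: %s" % (cmd, e))

if __name__ == "__main__":
    dump_stats()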

Thanks,
-Dormando

Jim Jones

Oct 13, 2018, 2:07:31 AM
to memc...@googlegroups.com
The commandline arguments used are:

-u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64

And we have some data, but frankly, when the issue was happening we only focused on the memcached servers late in the process.  The initial errors suggested the problem was in a very different part of the collection of technical components in our larger service.  When we realized the "memcached" processes were not accepting new connections, we wanted to correct the behavior quickly, since a fair amount of time had already passed.

First, sockets in use on the servers...

One system, call it server-248, shows TCP sockets on the system hovering around 1900 after traffic ramped up for the day.  It held at that level from ~6:45am until 10:06am.  We collect SAR data every 2 minutes, so the next reading was at 10:08 and the TCP sockets jumped to 63842.  Meaning it didn't grow slowly over time; it jumped from 1937 to 63842 in a 2-minute window.  That number stayed at 63842-63844 until 12:06pm, when we restarted the "memcached" process.  After that the number dropped over time back to a more typical level.

10:02am   1937
10:04am   1937
10:06am   1937
10:08am  63842
10:10am  63842
...etc...
12:04pm  63843
12:06pm  63844
12:08pm  18415
12:10pm  17202
12:12pm  16333
12:14pm  16197
12:16pm  16134
12:18pm  16099
12:20pm   1617


The other system that ran into trouble, which I'll call server-85, exhibited similar behavior but started later.  Here's a sample of TCP socket counts from that server.

11:30am   1805
11:32am   1801
11:34am   1830
11:36am   1817
11:38am  63905
11:40am  63905
...etc...
12:20pm  63908
12:22pm  63908
12:24pm   1708
12:26pm   1720
12:28pm   1747


There were other network-centric data points that show the systems grinding to a halt in terms of accepting new connections, like the bandwidth going in/out of the NICs, etc.  But it's all in support of the same idea: the "memcached" server stopped accepting new connections.

Second, details about the sockets...

During the incident, we did capture summary information on the socket states from various servers on the network, and a full "netstat -an" listing from one of the servers.  Both server-248 and server-85 showed tens of thousands of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD state, and a small number of ESTABLISHED sockets.

There may continue to be traffic on the ESTABLISHED connections to the "memcached" servers, but if there is, it's a trivial amount.  Multiple people were running "memkeys" at the time and reported seeing no activity.

Third, "stats" from the incapacitated "memcached" processes...

We do not have stats from either server-248 or server-85 during the time they were in trouble.  In hindsight, that was a big oversight.
It's not clear that we could have gotten a connection to the server to pull the stats, but I'd really like to know what those counters said!

I do have the results from "stats", "stats slabs", "stats items" and "stats conns" from 17:10 the previous evening.  That doesn't show any obvious errors/problems slowly building up, waiting for some event to trigger a massive failure.  But it's from ~15 hours before the server got into trouble so I don't think it's all that helpful.

-jj

dormando

Oct 13, 2018, 1:07:23 PM
to memc...@googlegroups.com
Sounds like the daemon hardlocked. Some of your start arguments are fairly
aggressive (ext_item_size and -n especially); I'll double-check that those
won't cause problems like this.

First, to confirm: these two hung machines were only getting writes the
whole time? no reads?

Any info you can share about the client? What library, protocol (ascii or
binary, etc)?

If you can privately share with me any of your stats snapshots that would
help a lot as well, since I can better determine which features are being
exercised.

Aside from that, there don't seem to be many hints here. If it happens
again, please do:

1) see if you can still connect to it and run commands, grab stats if so.
2) grab a GDB backtrace before killing the daemon:
gdb -p $(pidof memcached)
thread apply all bt
^ full thread backtrace.

If it's hardlocked the backtrace *should* immediately tell me what's going
on.

Sorry for the trouble!
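If it's hard to catch by hand, a small watcher can automate the two steps above. This is only a sketch of the idea (the threshold, poll interval and output path are arbitrary; it assumes Linux with ss, pidof and gdb installed): when the CLOSE_WAIT count explodes, it saves a full thread backtrace of the still-running daemon before anyone restarts it.

import subprocess
import time

THRESHOLD = 1000           # assumed trip point; tune for your connection limit
INTERVAL = 30              # seconds between polls

def close_wait_count(port=":11211"):
    out = subprocess.check_output(["ss", "-tan"]).decode()
    return sum(1 for line in out.splitlines()
               if line.split()[:1] == ["CLOSE-WAIT"] and port in line)

def capture_backtrace(path="/tmp/memcached-bt.txt"):
    pid = subprocess.check_output(["pidof", "memcached"]).decode().split()[0]
    bt = subprocess.check_output(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"])
    with open(path, "wb") as f:
        f.write(bt)

if __name__ == "__main__":
    while True:
        if close_wait_count() > THRESHOLD:
            capture_backtrace()       # grab the backtrace while it's still hung
            break
        time.sleep(INTERVAL)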

Jim Jones

Oct 13, 2018, 9:17:14 PM
to memc...@googlegroups.com
The "-n" and "ext_item_size" options were picked to help handle the large volume of small key values our service generates.  But if increasing either will prevent potential deadlocks, I'm sure we'd prefer to accept the efficiency hit.

The two servers that stopped accepting connections were only getting write operations, that's correct.  They are operating as shadow copies of the cache during this parallel testing period.  I do see trivial numbers of "get" requests, but it's a tiny number compared to the number of "set" and "delete" commands.

The code generating the memcache requests is PHP, with "twemproxy" (v0.4.1) handling all the communications with the memcache servers.  I can dig into the code to get more specific information on the PHP setup if that's helpful.

I'll relay stats from one of the servers via private e-mail in case those counters help suggest things we can try.

And we'll definitely gather the info you described in the event we hit this problem again.

-jj

dormando

Oct 13, 2018, 9:41:10 PM
to memc...@googlegroups.com
Inline responses. Also, I dunno if I missed it, but what version were you
on originally? Are the start arguments the same?

On Sat, 13 Oct 2018, Jim Jones wrote:

> The "-n" and "ext_item_size" options were picked to help handle the large
> volume of small key values our service generates.  But if increasing either
> will prevent potential deadlocks, I'm sure we'd prefer to accept the
> efficiency hit.

Those values don't necessarily translate to doing what you want; I'll
illustrate that once I get a gander at your stats.

> The two servers that stopped accepting connections were only getting write
> operations, that's correct.  They are operating as shadow copies of the
> cache during this parallel testing period.  I do see trivial numbers of
> "get" requests, but it's a tiny number compared to the number of "set" and
> "delete" commands.

Got it.

> The code generating the memcache requests is PHP, with "twemproxy" (v0.4.1)
> handling all the communications with the memcache servers.  I can dig into
> the code to get more specific information on the PHP setup if that's
> helpful.

Ok, so ascii only, then. Just need to know the general commands being used
and the protocol, at this point.

> I'll relay stats from one of the servers via private e-mail in case those
> counters help suggest things we can try.
>
> And we'll definitely gather the info you described in the event we hit this
> problem again.

Thanks! Appreciate your patience.