The command-line arguments used are:
-u memcached -m 236544 -c 64000 -p 11211 -t 32 -C -n 5 -f 1.05 -o ext_path=/mnt/memcache:1700G,ext_path=/mnt1/memcache:1700G,ext_path=/mnt2/memcache:1700G,ext_path=/mnt3/memcache:1700G,ext_threads=32,ext_item_size=64
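For reference, the effective values can be read back from a running instance with "stats settings"; a minimal sketch, assuming nc is available on the box and the daemon is still answering on 11211 (the trailing "quit" just closes the connection so nc exits on its own):

printf 'stats settings\r\nquit\r\n' | nc server-248 11211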
And we have some data, but frankly, while the issue was happening we didn't focus on the Memcache servers until late in the process. The initial errors suggested the problem was in a very different part of our larger service. When we realized the "memcached" processes were not accepting new connections, we wanted to correct the behavior quickly, since a fair amount of time had already passed.
First, sockets in use on the servers...
One system, call it server-248, shows TCP sockets on the system hovering around 1900 after traffic ramped up for the day. It held at that level from ~6:45am until 10:06am. We collect SAR data every 2 minutes, so the next reading was at 10:08am, and the TCP socket count jumped to 63842. Meaning it didn't grow slowly over time; it jumped from 1937 to 63842 in a 2-minute window. The count stayed at 63842-63844 until 12:06pm, when we restarted the "memcached" process. After that the number dropped over time back to a more typical level.
10:02am 1937
10:04am 1937
10:06am 1937
10:08am 63842
10:10am 63842
...etc...
12:04pm 63843
12:06pm 63844
12:08pm 18415
12:10pm 17202
12:12pm 16333
12:14pm 16197
12:16pm 16134
12:18pm 16099
12:20pm 1617
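The counts above, and the ones below for the other server, are the tcpsck column from our 2-minute SAR samples. Pulling them back out of the archive is roughly this, assuming the default sysstat daily files (the path and day file are placeholders for whichever file covers the incident):

sar -n SOCK -f /var/log/sa/saDD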
The other system that ran into trouble, which I'll call server-85, exhibited similar behavior but started later. Here's a sample of TCP socket counts from that server.
11:30am 1805
11:32am 1801
11:34am 1830
11:36am 1817
11:38am 63905
11:40am 63905
...etc...
12:20pm 63908
12:22pm 63908
12:24pm 1708
12:26pm 1720
12:28pm 1747

There were other network-centric datapoints showing the systems grinding to a halt in terms of accepting new connections, like the bandwidth going in/out of the NICs, etc. But it all supports the same idea: the "memcached" servers stopped accepting new connections.
Second, details about the sockets...
During the incident, we did capture summary information on the socket states from various servers on the network, and a full "netstat -an" listing from one of the servers. Both server-248 and server-85 showed tens of thousands of sockets in a CLOSE_WAIT state, hundreds in a SYN_RCVD state, and a small number of ESTABLISHED sockets.
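For anyone wanting to reproduce that kind of summary from a "netstat -an" capture, it boils down to grouping the state column; something along these lines, assuming Linux-style netstat output:

netstat -an | awk '$1 ~ /^tcp/ {print $6}' | sort | uniq -c | sort -rn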
There may have continued to be some traffic on the ESTABLISHED connections to the "memcached" servers, but if there was, it was a trivial amount. Multiple people were running "memkeys" at the time and reported seeing no activity.
Third, "stats" from the incapacitated "memcached" processes...
We do not have stats from either server-248 or server-85 during the time they were in trouble. In hindsight, that was a big oversight.
It's not clear that we could have gotten a connection to the server to pull the stats, but I'd really like to know what those counters said!
I do have the results from "stats", "stats slabs", "stats items" and "stats conns" from 17:10 the previous evening. That doesn't show any obvious errors/problems slowly building up, waiting for some event to trigger a massive failure. But it's from ~15 hours before the server got into trouble, so I don't think it's all that helpful.
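For what it's worth, capturing that kind of snapshot is nothing more exotic than piping those commands at the daemon; a rough sketch, assuming nc and the port from the command line above (the host name and output file are just examples):

{ printf 'stats\r\n'; printf 'stats slabs\r\n'; printf 'stats items\r\n'; printf 'stats conns\r\n'; printf 'quit\r\n'; } | nc server-248 11211 > memcached-stats.txt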