Re: Memcached crashing under load

Dustin

Mar 12, 2009, 9:33:31 PM
to memcached

For clarity -- are you saying the server is crashing, or the client?

On Mar 12, 6:19 pm, meppum <mmep...@gmail.com> wrote:
> I was load testing memcached and have been experiencing consistent
> crashing when approaching 11k total connections (not concurrent).
> Below is a sample of some python code I have developed to isolate this
> problem as well as my setup and the error I get. I searched google and
> couldn't seem to find an answer.
>
> ------------------------------------------
>
> Python Code:
>
> import cmemcache
>
> c = cmemcache.Client(["127.0.0.1:11211"])
> c.set('abc', '123')
> c.disconnect_all()
>
> for i in range(20000):
>         c = cmemcache.Client(["127.0.0.1:11211"])
>         c.get('abc')
>         c.disconnect_all()
>
> -----------------------------------------
>
> Error:
>
> [W...@1236906533.149172] mcm_server_connect_next_avail():2338
> [NOT...@1236906533.149172] mcm_server_connect_next_avail():2328
> [W...@1236906537.889442] mcm_server_writable():3178: timeout:
> Operation now in progress: write select(2) call timed out
> [W...@1236906537.889442] mcm_server_connect():2295: select(2) failed:
> Operation now in progress: select(2) timed out on establishing
> connection
> connect(): -1
> [NOT...@1236906537.889442] mcm_server_connect():2302: Operation
> already in progress
> [NOT...@1236906537.889442] mcm_server_connect_next_avail():2333:
> Operation already in progress
> [W...@1236906537.889442] mcm_server_connect_next_avail():2338
> [NOT...@1236906537.889442] mcm_server_connect_next_avail():2328
>
> -----------------------------------------
>
> Setup:
>
> -Ubuntu Intrepid
> -Libmemcached 1.4.0.rc2-1
> -Cmemcache 0.95
> -Memcached 1.2.6
> -Python 2.5
>
> -meppum

meppum

Mar 15, 2009, 8:51:25 PM
to memcached
I've created a Python script that should trigger this error on any
machine running python-memcached or cmemcache. Hopefully this will
clear things up. It looks like neither the client nor the daemon
crashes; rather, the daemon cannot respond in time, so an error is
thrown. I find this odd because the curr_connections stat never
exceeds the number of possible connections. If anyone can shed some
light on this, I'd appreciate it.

This is how I started memcached:
memcached -m 8 -c 1024 -v -l 127.0.0.1 -d

The script:
try:
    import cmemcache as memcache
except ImportError:
    import memcache

c = memcache.Client(["127.0.0.1:11211"])
c.set('abc', '123')
c.disconnect_all()

for i in range(20000):
    if i % 1000 == 0:
        print "iteration: %s" % i

    c = memcache.Client(["127.0.0.1:11211"])
    c.get('abc')
    c.disconnect_all()

On Mar 13, 7:32 am, meppum <mmep...@gmail.com> wrote:
> That's a good question. To be clear it's not a hard crash, it's just
> that one or the other starts throwing errors regarding connection
> timeouts when there should be more than enough connections available.
> How can I tell if it's the client or server throwing the errors?

Chris Goffinet

Mar 15, 2009, 9:04:16 PM
to memc...@googlegroups.com
As a simple test, try raising memcached's listen backlog above the
default (1024).

http://linux.die.net/man/2/listen

> The backlog parameter defines the maximum length the queue of
> pending connections may grow to. If a connection request arrives
> with the queue full the client may receive an error with an
> indication of ECONNREFUSED or, if the underlying protocol supports
> retransmission, the request may be ignored so that retries succeed.


--
Chris Goffinet
MyBlogLog Senior Performance Engineer

Yahoo!
San Francisco, CA
United States

meppum

Mar 15, 2009, 9:20:09 PM
to memcached
I should have mentioned this: the backlog setting doesn't seem to have
any effect, even if I set it as high as 20k. I think this is because
the number of concurrent connections never actually goes above 4.
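
One way to check this from a script is to read the counters straight from the daemon. The sketch below parses the text reply to memcached's "stats" command, which is a series of "STAT <name> <value>" lines terminated by "END". The helper name parse_stats and the sample values are made up for illustration; in practice the reply text would come from a socket connected to 127.0.0.1:11211 after sending "stats\r\n".

```python
def parse_stats(raw):
    """Parse the text reply to memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        # Each data line has the form: STAT <name> <value>
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


# Illustrative reply text only; these numbers are not from a real run.
sample = "STAT curr_connections 4\r\nSTAT total_connections 11012\r\nEND\r\n"

stats = parse_stats(sample)
curr = int(stats["curr_connections"])    # connections open right now
total = int(stats["total_connections"])  # connections opened since startup
```

A low curr_connections next to a rapidly growing total_connections would suggest that connections are being opened and closed in quick succession rather than held open concurrently.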

David Stanek

Mar 15, 2009, 9:29:06 PM
to memc...@googlegroups.com
On Sun, Mar 15, 2009 at 8:49 PM, meppum <mme...@gmail.com> wrote:
>
> The script:
> try:
>        import cmemcache as memcache
> except ImportError:
>        import memcache
>
> c = memcache.Client(["127.0.0.1:11211"])

> c.set('abc', '123')
> c.disconnect_all()
>
> for i in range(20000):
>        if i % 1000 == 0:
>                print "iteration: %s" % i
>
>        c = memcache.Client(["127.0.0.1:11211"])
>        c.get('abc')
>        c.disconnect_all()
>

This script will not run multiple memcached requests in parallel. Is
that what you were going for?

--
David
blog: http://www.traceback.org
twitter: http://twitter.com/dstanek

Chris Goffinet

Mar 15, 2009, 9:44:28 PM
to memc...@googlegroups.com
When I ran your script against the daemon, I noticed some timeouts.
After adjusting these sysctl settings, I could run the script over and
over without seeing any timeouts:

sudo /sbin/sysctl -w net.ipv4.tcp_tw_recycle=1
sudo /sbin/sysctl -w net.ipv4.tcp_tw_reuse=1
sudo /sbin/sysctl -w net.ipv4.tcp_fin_timeout=10

On your system, what happens when you do the same? Do you see any
improvement?
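
To see whether closed sockets are piling up, it may help to count sockets by TCP state while the script runs. This is a minimal sketch, assuming a Linux machine where /proc/net/tcp is readable; the function name count_tcp_states and the sample rows are made up for illustration. The fourth column of each row is the state in hex (06 is TIME_WAIT, 0A is LISTEN, 01 is ESTABLISHED).

```python
from collections import Counter

# Subset of the kernel's TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3]
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Illustrative input only: one listener on 127.0.0.1:11211 (0x2BCB)
# and two client sockets left in TIME_WAIT.
sample = (
    "  sl  local_address rem_address   st\n"
    "   0: 0100007F:2BCB 00000000:0000 0A\n"
    "   1: 0100007F:8B1E 0100007F:2BCB 06\n"
    "   2: 0100007F:8B20 0100007F:2BCB 06\n"
)
counts = count_tcp_states(sample)

# On a real Linux host one would instead read the live table:
# counts = count_tcp_states(open("/proc/net/tcp").read())
```

If the TIME_WAIT count climbs into the thousands during the test and the timeouts start around the same time, that would be consistent with the sysctl changes above making a difference.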

--
Chris Goffinet
MyBlogLog Senior Performance Engineer

Yahoo!
San Francisco, CA
United States

On Mar 15, 2009, at 6:39 PM, meppum wrote:

>
> Yes, i went for the simplest script that caused the error on my
> machine.
>
> On Mar 15, 9:29 pm, David Stanek <dsta...@dstanek.com> wrote:

Henrik Schröder

Mar 16, 2009, 6:15:32 AM
to memc...@googlegroups.com
On Fri, Mar 13, 2009 at 02:19, meppum <mme...@gmail.com> wrote:

> I was load testing memcached and have been experiencing consistent
> crashing when approaching 11k total connections (not concurrent).


Load testing of course has its uses, but is your scenario even remotely likely to happen in your live system? What your script is actually testing is how fast you can recycle sockets, but if you use a client that supports connection pooling, this is never going to become an issue.

Here's some stats from one of our cache servers:

uptime                 449124
time                   1237196926
version                1.2.5
curr_items             168078
curr_connections       24
total_connections      334
connection_structures  38
cmd_get                642209176
cmd_set                6052589
get_hits               499909822
get_misses             142299354

It's averaging about 1400 gets per second (cmd_get divided by uptime). Many of those are multi-gets, so the actual number of requests per second is maybe a tenth of that, somewhere between 100 and 150 req/s. Because all clients use connection pooling, the number of concurrent connections is only 24, and in the five days it's been up it has gone through a total of 334 connections. Those exist because the connection pools vary in size according to load, so there's some recycling going on.

I cannot imagine what kind of load we would need to put on our systems to run into the problem you are testing for, but I'm sure that we'd encounter many other problems before we reached that point.
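
For reference, the rates quoted here can be recomputed from the stats listing. This is just a worked calculation on the numbers shown above; the variable names are chosen for illustration.

```python
# Values copied from the stats listing above.
uptime = 449124            # seconds since the daemon started
cmd_get = 642209176
get_hits = 499909822
get_misses = 142299354
total_connections = 334

gets_per_second = cmd_get / float(uptime)      # roughly 1430
hit_ratio = get_hits / float(cmd_get)          # roughly 0.78
uptime_days = uptime / 86400.0                 # roughly 5.2
connections_per_day = total_connections / uptime_days  # roughly 64
```

By comparison, the test script in this thread opens 20000 connections in a single run.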

You didn't say much about your application, but doesn't the Python client support connection pooling? Why don't you make sure you use that, instead of lowering some TCP timeout values on your servers so that they can recycle sockets faster? It seems to me that you are looking for a complex solution to a non-problem.
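
Short of a full connection pool, simply reusing one client object already avoids the socket churn. The sketch below compares the two patterns. CountingClient is a hypothetical stand-in, not a real library class; it only mimics the get and disconnect_all calls used in the script earlier in the thread and counts how many connections would be opened, so the example runs without a memcached server.

```python
class CountingClient(object):
    """Hypothetical stand-in for a memcached client; counts connections."""
    connections_opened = 0

    def __init__(self, servers):
        self.servers = servers
        self._connected = False

    def _connect(self):
        # Open a (pretend) connection only if one is not already open.
        if not self._connected:
            CountingClient.connections_opened += 1
            self._connected = True

    def get(self, key):
        self._connect()
        return None  # a real client would return the cached value

    def disconnect_all(self):
        self._connected = False


def reconnect_every_time(n):
    """Pattern from the test script: new client and socket per request."""
    for _ in range(n):
        c = CountingClient(["127.0.0.1:11211"])
        c.get("abc")
        c.disconnect_all()


def reuse_one_client(n):
    """Create the client once and keep its connection open."""
    c = CountingClient(["127.0.0.1:11211"])
    for _ in range(n):
        c.get("abc")
    c.disconnect_all()


CountingClient.connections_opened = 0
reconnect_every_time(1000)
opened_reconnecting = CountingClient.connections_opened

CountingClient.connections_opened = 0
reuse_one_client(1000)
opened_reusing = CountingClient.connections_opened
```

With a real client the same shape applies: construct the client once at startup and share it, rather than constructing and disconnecting it around every request.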


/Henrik