memcached-1.4.20 stuck when "Too many open connections"

Samdy Sun

unread,

Oct 29, 2014, 2:47:23 AM10/29/14

to memc...@googlegroups.com

Hello,

I got a "memcached-1.4.20 stuck" problem when EMFILE happen.

Here are my memcached's cmdline "memcached -s /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o slab_reassign -t 1 -p 11201".

cat /proc/version

Linux version 2.6.32-358.el6.x86_64 (mock...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013

memcached-1.4.20 stuck and don't work any more when it runs for a period of time.

Here are some information for gdb:

(gdb) p stats

$2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0,

__list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 149156,

total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns = 39738, rejected_conns = 0,

malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0,

get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0,

started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17,

hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0,

slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false,

disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0,

evicted_write_reply_timeout_times = 0}

(gdb) p allow_new_conns

$4 = false

And I found that "allow_new_conns" just set to false when "accept" failed and errno is "EMFILE".

Here are the codes:

static void drive_machine(conn *c) {

……

} else if (errno == EMFILE) {

if (settings.verbose > 0)

fprintf(stderr, "Too many open connections\n");

accept_new_conns(false);

stop = true;

} else {

……

}

If I change the flag "allow_new_conns", it can work again. As below:

(gdb) set allow_new_conns=1

(gdb) p allow_new_conns

$5 = true

(gdb) c

Continuing.

I know that "allow_new_conns" will be set to "true" when "conn_close" called. But how could it happen for the case that when "accept" failed , and errno is "EMFILE", and this connection is the only one for accepting. Notes that curr_conns = 5.

Not run out of fd:

ls /proc/1748(memcached_pid)/fd | wc -l

17

dormando

unread,

Oct 29, 2014, 2:50:29 AM10/29/14

to memc...@googlegroups.com

You're absolutely sure the running version was 1.4.20? that looks like a
bug that was fixed in .19 or .20

hmmm... maybe a unix domain bug?

> --
>
> ---
> You received this message because you are subscribed to the Google Groups "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>

Samdy Sun

unread,

Oct 29, 2014, 3:01:43 AM10/29/14

to memc...@googlegroups.com

Thanks for the reply.
I'am absolutely sure the running version was 1.4.20. And I don't know how can this happen.

Really, I use unix domain for running, and inet domain just for telnet easily.

在 2014年10月29日星期三UTC+8下午2时50分29秒，Dormando写道：

Message has been deleted

Samdy Sun

unread,

Oct 29, 2014, 3:13:43 AM10/29/14

to memc...@googlegroups.com

Addition, the libevent version is "1.4.14b-stable".

Samdy Sun

unread,

Oct 29, 2014, 10:50:52 PM10/29/14

to memc...@googlegroups.com

There are no deadlocks,

(gdb) info thread

* 5 Thread 0xf7771b70 (LWP 24962) 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48)

at memcached.c:4044

4 Thread 0xf6d70b70 (LWP 24963) 0x007ad430 in __kernel_vsyscall ()

3 Thread 0xf636fb70 (LWP 24964) 0x007ad430 in __kernel_vsyscall ()

2 Thread 0xf596eb70 (LWP 24965) 0x007ad430 in __kernel_vsyscall ()

1 Thread 0xf77b38d0 (LWP 24961) 0x007ad430 in __kernel_vsyscall ()

(gdb) t 1

[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0 0x007ad430 in __kernel_vsyscall ()

(gdb) bt

#0 0x007ad430 in __kernel_vsyscall ()

#1 0x005c5366 in epoll_wait () from /lib/libc.so.6

#2 0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198

#3 0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538

#4 0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795

(gdb)

(gdb) t 2

[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0 0x007ad430 in __kernel_vsyscall ()

(gdb) bt

#0 0x007ad430 in __kernel_vsyscall ()

#1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0

#2 0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859

#3 0x00a61a49 in start_thread () from /lib/libpthread.so.0

#4 0x005c4aee in clone () from /lib/libc.so.6

(gdb) t 3

[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0 0x007ad430 in __kernel_vsyscall ()

(gdb) bt

#0 0x007ad430 in __kernel_vsyscall ()

#1 0x005838b6 in nanosleep () from /lib/libc.so.6

#2 0x005836e0 in sleep () from /lib/libc.so.6

#3 0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819

#4 0x00a61a49 in start_thread () from /lib/libpthread.so.0

#5 0x005c4aee in clone () from /lib/libc.so.6

(gdb) t 4

[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0 0x007ad430 in __kernel_vsyscall ()

(gdb) bt

#0 0x007ad430 in __kernel_vsyscall ()

#1 0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0

#2 0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251

#3 0x00a61a49 in start_thread () from /lib/libpthread.so.0

#4 0x005c4aee in clone () from /lib/libc.so.6

(gdb) t 5

[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0 0x007ad430 in __kernel_vsyscall ()

(gdb) bt

#0 0x007ad430 in __kernel_vsyscall ()

#1 0x00a68998 in sendmsg () from /lib/libpthread.so.0

#2 0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044

#3 drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370

#4 event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441

#5 0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395

#6 event_base_loop (base=0x9310658, flags=0) at event.c:547

#7 0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471

#8 0x00a61a49 in start_thread () from /lib/libpthread.so.0

#9 0x005c4aee in clone () from /lib/libc.so.6

(gdb)

strace info, there is the only event named maxconnsevent on epoll?

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0

epoll_wait(4, {}, 32, 10) = 0

clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0

在 2014年10月29日星期三UTC+8下午2时47分23秒，Samdy Sun写道：

dormando

unread,

Oct 31, 2014, 3:01:06 AM10/31/14

to memc...@googlegroups.com

Hey,

How are you reproducing this? How many connections do you typically have
open?

It's really bizarre that your curr_conns is "5", but your connections are
disabled? Even if there's still a race, as more connections close they
each have an opportunity to flip the acceptor back on.

Can you print what "stats settings" shows? If it's adjusting your actual
maxconns downward it should show there...

Samdy Sun

unread,

Oct 31, 2014, 5:40:35 AM10/31/14

to memc...@googlegroups.com

@Dormando,

I try my best to reproduce this in my environment, but failed. This just happened on my servers.

I use "stats" command to check the memcached if it is available or not. If the memcached is unavailable, we will not send request to it.

This is what I feel strange when my curr_conns is "5" and memcached can't recover itself. I think "conn_new" call maybe fail, and it call "close(fd)" directly, not "conn_close()"? Such as below?

1. malloc fails when "conn_new()"

2. event_add fails when "conn_new()"

3. other case?

Take notice of that I build "memcached" on 32-bit system and it runs on 64-bit system. Additionally, I use "-m 4000" for memcached's start.

Thanks,

Samdy Sun

在 2014年10月31日星期五UTC+8下午3时01分06秒，Dormando写道：

dormando

unread,

Oct 31, 2014, 12:30:29 PM10/31/14

to memc...@googlegroups.com

Hey,

32-bit memcached with -m 4000 will never work. the best you can do is
probably -m 1600. 32bit applications typically can only allocate up to 2G
of ram.

memcached isn't protected from a lot of malloc failure scenarios, so what
you're doing will never work.

-m 4000 only limits the slab memory usage. there're a lot of buffers/etc
outside of that. Also the hash table, which is measured separately.

Samdy Sun

unread,

Nov 3, 2014, 1:53:00 AM11/3/14

to memc...@googlegroups.com

Thanks, Dormando. I will try with smaller ram memory.

32bit applications can malloc about 3.8G of ram on my 64bit system, so "-m 3200" may be ok?

在 2014年11月1日星期六UTC+8上午12时30分29秒，Dormando写道：

dormando

unread,

Nov 3, 2014, 1:56:01 AM11/3/14

to memc...@googlegroups.com

I don't really know offhand. why are you running 32bit memcached at all?
Just run it in 64bit mode?

Samdy Sun

unread,

Nov 3, 2014, 8:03:40 AM11/3/14

to memc...@googlegroups.com

……

Because we also have 32-bit systems and 64-bit applications don't work on them. So we build memcached on 32-bit system conformably.

在 2014年11月3日星期一UTC+8下午2时56分01秒，Dormando写道：

Samdy Sun

unread,

Nov 5, 2014, 6:58:29 AM11/5/14

to memc...@googlegroups.com

Hey,

I also got a stuck when specifing "-m 200".

As mentioned previously, that case could happend as below?

1. malloc fails when "conn_new()"

2. event_add fails when "conn_new()"

3. other case?

And I find another case after code reviewing. Here is, memcached stuck for a while, for which our client close the connection because 200ms-timeout. So, if the previous 1023 connections get timeout and memcached calls "transmit" to write, "Broken pipe" error will happend. And then, memcached get "TRANSMIT_HARD_ERROR" error and calls "conn_close" immediately.

So, it will happend as below?

accept(), errno == EMFILE

fd1 close,

fd2 close,

fd3 close,

……

fd1023 close,

accept_new_conns(false) for EMFILE

That just is a supposition, but I will try to log some infomation to prove it.

Any way, is it better to call "conn_close" after for a while, such as waiting for next event when getting TRANSMIT_HARD_ERROR error then to conn_close immediately?

在 2014年10月31日星期五UTC+8下午3时01分06秒，Dormando写道：

dormando

unread,

Nov 5, 2014, 7:08:50 PM11/5/14

to memc...@googlegroups.com

There're too many things that will go wrong if malloc fails...

There's a stats counter covering some of them. Is that going up for you?

Have you disabled overcommit memory? Have you observed the process size
when it hangs? malloc should almost never actually fail under normal
conditions...

Samdy Sun

unread,

Nov 6, 2014, 2:21:31 AM11/6/14

to memc...@googlegroups.com

@Dormando,

I think it's not because of malloc fails.

I use "-m 200" and there is 48G ram of my server.

After close(c->sfd), I printf a message as "fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");"

Here are the codes,

static void conn_close(conn *c) {

……

close(c->sfd);

pthread_mutex_lock(&conn_lock);

allow_new_conns = true;

pthread_mutex_unlock(&conn_lock);

// XXX: tracing fd close

fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");

……

}

Here are the logs,

……

INFO`fd=816`reason=conn_close`allow_new_conns=true

INFO`fd=817`reason=conn_close`allow_new_conns=true