memcached-1.4.20 stuck when "Too many open connections"

1,184 views
Skip to first unread message

Samdy Sun

unread,
Oct 29, 2014, 2:47:23 AM10/29/14
to memc...@googlegroups.com
Hello,
  I got a "memcached-1.4.20 stuck" problem when EMFILE happen.
  Here are my memcached's cmdline "memcached -s /xxx/mc_usock.11201 -c 1024 -m 4000 -f 1.05 -o slab_automove -o slab_reassign  -t 1 -p 11201".
 
  cat /proc/version 
  Linux version 2.6.32-358.el6.x86_64 (mock...@x86-022.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:47:41 EST 2013

  memcached-1.4.20 stuck and don't work any more when it runs for a period of time.

  Here are some information for gdb:
  (gdb) p stats
  $2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, 
        __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 149156, 
  total_items = 9876811, curr_bytes = 3712501870, curr_conns = 5, total_conns = 39738, rejected_conns = 0, 
  malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0, 
  get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0, 
  started = 0, accepting_conns = false, listen_disabled_num = 1, hash_power_level = 17, 
  hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0, 
  slab_reassign_running = false, slabs_moved = 20, lru_crawler_running = false, 
  disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0, 
  evicted_write_reply_timeout_times = 0}

  (gdb) p allow_new_conns
  $4 = false

  And I found that "allow_new_conns" just set to false when "accept" failed and errno is "EMFILE". 
  Here are the codes:  
static void drive_machine(conn *c) {
                 ……
                 } else if (errno == EMFILE) {
                   if (settings.verbose > 0)
                         fprintf(stderr, "Too many open connections\n");
                   accept_new_conns(false);
                   stop = true;
                 } else {
                 ……
}
  
  If I change the flag "allow_new_conns", it can work again. As below:
  (gdb) set allow_new_conns=1
  (gdb) p allow_new_conns
  $5 = true
  (gdb) c
  Continuing.

  I know that "allow_new_conns" will be set to "true" when "conn_close" called. But how could it happen for the case that when "accept" failed , and errno is "EMFILE", and this connection is the only one for accepting. Notes that curr_conns = 5.
  Not run out of fd:
  ls /proc/1748(memcached_pid)/fd | wc -l
  17
  

dormando

unread,
Oct 29, 2014, 2:50:29 AM10/29/14
to memc...@googlegroups.com
You're absolutely sure the running version was 1.4.20? that looks like a
bug that was fixed in .19 or .20

hmmm... maybe a unix domain bug?
> --
>
> ---
> You received this message because you are subscribed to the Google Groups "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>

Samdy Sun

unread,
Oct 29, 2014, 3:01:43 AM10/29/14
to memc...@googlegroups.com

I'am absolutely sure the running version was 1.4.20.  And I don't know how can this happen.
Really, I use unix domain for running, and inet domain just for telnet easily.


在 2014年10月29日星期三UTC+8下午2时50分29秒,Dormando写道:
Message has been deleted

Samdy Sun

unread,
Oct 29, 2014, 3:13:43 AM10/29/14
to memc...@googlegroups.com
Addition, the libevent version is "1.4.14b-stable".

Samdy Sun

unread,
Oct 29, 2014, 10:50:52 PM10/29/14
to memc...@googlegroups.com
There are no deadlocks
(gdb) info thread
* 5 Thread 0xf7771b70 (LWP 24962)  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48)
    at memcached.c:4044
  4 Thread 0xf6d70b70 (LWP 24963)  0x007ad430 in __kernel_vsyscall ()
  3 Thread 0xf636fb70 (LWP 24964)  0x007ad430 in __kernel_vsyscall ()
  2 Thread 0xf596eb70 (LWP 24965)  0x007ad430 in __kernel_vsyscall ()
  1 Thread 0xf77b38d0 (LWP 24961)  0x007ad430 in __kernel_vsyscall ()
(gdb) t 1
[Switching to thread 1 (Thread 0xf77b38d0 (LWP 24961))]#0  0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005c5366 in epoll_wait () from /lib/libc.so.6
#2  0x0074a750 in epoll_dispatch (base=0x9305008, arg=0x93053c0, tv=0xff8e0cdc) at epoll.c:198
#3  0x0073d714 in event_base_loop (base=0x9305008, flags=0) at event.c:538
#4  0x08054467 in main (argc=19, argv=0xff8e2274) at memcached.c:5795
(gdb) 

(gdb) t 2
[Switching to thread 2 (Thread 0xf596eb70 (LWP 24965))]#0  0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x08055662 in slab_rebalance_thread (arg=0x0) at slabs.c:859
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 3
[Switching to thread 3 (Thread 0xf636fb70 (LWP 24964))]#0  0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x005838b6 in nanosleep () from /lib/libc.so.6
#2  0x005836e0 in sleep () from /lib/libc.so.6
#3  0x08056f6e in slab_maintenance_thread (arg=0x0) at slabs.c:819
#4  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#5  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 4
[Switching to thread 4 (Thread 0xf6d70b70 (LWP 24963))]#0  0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a652bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x080599f5 in assoc_maintenance_thread (arg=0x0) at assoc.c:251
#3  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#4  0x005c4aee in clone () from /lib/libc.so.6
(gdb) t 5
[Switching to thread 5 (Thread 0xf7771b70 (LWP 24962))]#0  0x007ad430 in __kernel_vsyscall ()
(gdb) bt
#0  0x007ad430 in __kernel_vsyscall ()
#1  0x00a68998 in sendmsg () from /lib/libpthread.so.0
#2  0x080509dd in transmit (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4044
#3  drive_machine (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4370
#4  event_handler (fd=431, which=2, arg=0xfef8ce48) at memcached.c:4441
#5  0x0073d9e4 in event_process_active (base=0x9310658, flags=0) at event.c:395
#6  event_base_loop (base=0x9310658, flags=0) at event.c:547
#7  0x08059fee in worker_libevent (arg=0x930c698) at thread.c:471
#8  0x00a61a49 in start_thread () from /lib/libpthread.so.0
#9  0x005c4aee in clone () from /lib/libc.so.6
(gdb) 

strace info, there is the only event named maxconnsevent on epoll?
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 10084037}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 20246365}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 30382098}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 40509766}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 50657403}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 60823841}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 71013006}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 81234264}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 91407508}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 101581187}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 111752457}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 121919049}) = 0
epoll_wait(4, {}, 32, 10)               = 0
clock_gettime(CLOCK_MONOTONIC, {8374269, 132057597}) = 0



在 2014年10月29日星期三UTC+8下午2时47分23秒,Samdy Sun写道:

dormando

unread,
Oct 31, 2014, 3:01:06 AM10/31/14
to memc...@googlegroups.com
Hey,

How are you reproducing this? How many connections do you typically have
open?

It's really bizarre that your curr_conns is "5", but your connections are
disabled? Even if there's still a race, as more connections close they
each have an opportunity to flip the acceptor back on.

Can you print what "stats settings" shows? If it's adjusting your actual
maxconns downward it should show there...

Samdy Sun

unread,
Oct 31, 2014, 5:40:35 AM10/31/14
to memc...@googlegroups.com
@Dormando,
  
  I try my best to reproduce this in my environment, but failed. This just happened on my servers. 

  I use "stats" command to check the memcached if it is available or not. If the memcached is unavailable, we will not send request to it. 

  This is what I feel strange when my curr_conns is "5" and memcached can't recover itself. I think "conn_new" call maybe fail, and it call "close(fd)" directly, not "conn_close()"? Such as below?

  1. malloc fails when "conn_new()"
  2. event_add fails when "conn_new()"
  3. other case?

  Take notice of that I build "memcached" on 32-bit system and it runs on 64-bit system. Additionally, I use "-m 4000" for memcached's start.

  Thanks,
  Samdy Sun

在 2014年10月31日星期五UTC+8下午3时01分06秒,Dormando写道:

dormando

unread,
Oct 31, 2014, 12:30:29 PM10/31/14
to memc...@googlegroups.com
Hey,

32-bit memcached with -m 4000 will never work. the best you can do is
probably -m 1600. 32bit applications typically can only allocate up to 2G
of ram.

memcached isn't protected from a lot of malloc failure scenarios, so what
you're doing will never work.

-m 4000 only limits the slab memory usage. there're a lot of buffers/etc
outside of that. Also the hash table, which is measured separately.

Samdy Sun

unread,
Nov 3, 2014, 1:53:00 AM11/3/14
to memc...@googlegroups.com
  Thanks, Dormando. I will try with smaller ram memory.
  
  32bit applications can malloc about 3.8G of ram on my 64bit system, so "-m 3200" may be ok?
  

在 2014年11月1日星期六UTC+8上午12时30分29秒,Dormando写道:

dormando

unread,
Nov 3, 2014, 1:56:01 AM11/3/14
to memc...@googlegroups.com
I don't really know offhand. why are you running 32bit memcached at all?
Just run it in 64bit mode?

Samdy Sun

unread,
Nov 3, 2014, 8:03:40 AM11/3/14
to memc...@googlegroups.com
……
Because we also have 32-bit systems and 64-bit applications don't work on them. So we build memcached on 32-bit system conformably.

在 2014年11月3日星期一UTC+8下午2时56分01秒,Dormando写道:

Samdy Sun

unread,
Nov 5, 2014, 6:58:29 AM11/5/14
to memc...@googlegroups.com
Hey,
  
  I also got a stuck when specifing "-m 200". 
  As mentioned previously, that case could happend as below?
  1. malloc fails when "conn_new()"
  2. event_add fails when "conn_new()"
  3. other case?

  And I find another case after code reviewing. Here is, memcached stuck for a while, for which our client close the connection because 200ms-timeout. So, if the previous 1023 connections get timeout and memcached calls "transmit" to write, "Broken pipe" error will happend. And then, memcached get "TRANSMIT_HARD_ERROR" error and calls "conn_close" immediately.
  So, it will happend as below?
  accept(), errno == EMFILE
  fd1 close,
  fd2 close,
  fd3 close,
  ……
  fd1023 close,
  accept_new_conns(false) for EMFILE

  That just is a supposition, but I will try to log some infomation to prove it.
  
  Any way, is it better to call "conn_close" after for a while, such as waiting for next event when getting TRANSMIT_HARD_ERROR error then to conn_close immediately?
  

在 2014年10月31日星期五UTC+8下午3时01分06秒,Dormando写道:

dormando

unread,
Nov 5, 2014, 7:08:50 PM11/5/14
to memc...@googlegroups.com
There're too many things that will go wrong if malloc fails...

There's a stats counter covering some of them. Is that going up for you?

Have you disabled overcommit memory? Have you observed the process size
when it hangs? malloc should almost never actually fail under normal
conditions...

Samdy Sun

unread,
Nov 6, 2014, 2:21:31 AM11/6/14
to memc...@googlegroups.com
@Dormando,
  
  I think it's not because of malloc fails. 
  I use "-m 200" and there is 48G ram of my server.
  After close(c->sfd), I printf a message as "fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");"
  Here are the codes,
  static void conn_close(conn *c) {
     ……
     close(c->sfd);
 
     pthread_mutex_lock(&conn_lock);
     allow_new_conns = true;
     pthread_mutex_unlock(&conn_lock);
     // XXX: tracing fd close
     fprintf(stderr, "INFO`fd=%d`reason=%s`allow_new_conns=true\n", c->sfd, "conn_close");
     ……
  }

  Here are the logs,
  ……
INFO`fd=816`reason=conn_close`allow_new_conns=true
INFO`fd=817`reason=conn_close`allow_new_conns=true
Failed to write, and not due to blocking: Broken pipe
INFO`fd=818`reason=conn_close`allow_new_conns=true
INFO`fd=819`reason=conn_close`allow_new_conns=true
Failed to write, and not due to blocking: Broken pipe
INFO`fd=820`reason=conn_close`allow_new_conns=true
Failed to write, and not due to blocking: Broken pipe
INFO`fd=821`reason=conn_close`allow_new_conns=true
Failed to write, and not due to blocking: Broken pipe
INFO`fd=822`reason=conn_close`allow_new_conns=true
Failed to write, and not due to blocking: Broken pipe
  ……
INFO`fd=1011`reason=conn_close`allow_new_conns=true
INFO`fd=1012`reason=conn_close`allow_new_conns=true
INFO`fd=1013`reason=conn_close`allow_new_conns=true
INFO`fd=1014`reason=conn_close`allow_new_conns=true
INFO`fd=1015`reason=conn_close`allow_new_conns=true
INFO`fd=1016`reason=conn_close`allow_new_conns=true
INFO`fd=1017`reason=conn_close`allow_new_conns=true
INFO`fd=1018`reason=conn_close`allow_new_conns=true
INFO`fd=1019`reason=conn_close`allow_new_conns=true
INFO`fd=1020`reason=conn_close`allow_new_conns=true
INFO`fd=1021`reason=conn_close`allow_new_conns=true
INFO`fd=1022`reason=conn_close`allow_new_conns=true
INFO`fd=1023`reason=conn_close`allow_new_conns=true
accept(): Too many open files
Too many open connections
INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections
allow_new_conns=false


It's very strange that "accept()" fail after close(fd) has called. The flag called "allow_new_conns" is finally set to false, so memcached is stuck.
But when I found that memcached was stuck, the used fd count are small.
ls /proc/6873/fd|wc -l
17



在 2014年11月6日星期四UTC+8上午8时08分50秒,Dormando写道:

Samdy Sun

unread,
Nov 6, 2014, 2:33:18 AM11/6/14
to memc...@googlegroups.com
Additionally,  the stats counter of malloc_fails is ok.

(gdb) p stats

$1 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers = 0, {__spins = 0, 

        __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0}, curr_items = 190673, 

  total_items = 7728815, curr_bytes = 3765464159, curr_conns = 5, total_conns = 28838, rejected_conns = 0, 

  malloc_fails = 0, reserved_fds = 5, conn_structs = 1012, get_cmds = 0, set_cmds = 0, touch_cmds = 0, 

  get_hits = 0, get_misses = 0, touch_hits = 0, touch_misses = 0, evictions = 0, reclaimed = 0, 

  started = 0, accepting_conns = falselisten_disabled_num = 1, hash_power_level = 17, 

  hash_bytes = 524288, hash_is_expanding = false, expired_unfetched = 0, evicted_unfetched = 0, 

  slab_reassign_running = false, slabs_moved = 19, lru_crawler_running = false, 

  disable_write_by_exptime = 0, disable_write_by_length = 0, disable_write_by_access = 0, 

  evicted_write_reply_timeout_times = 0}


  
在 2014年11月6日星期四UTC+8下午3时21分31秒,Samdy Sun写道:

Samdy Sun

unread,
Nov 6, 2014, 8:42:07 AM11/6/14
to memc...@googlegroups.com
Here is my assumption,

Firstly, accept() is interrupted by a signal after it has got EMFILE error.
Secondly, the worker threads call the first 1023 conn_close()  while accept() is interrupting.
Thirdly, accept() is recovered and still return EMFILE error because it is reentrant.

Could this happend? It's very strange.

在 2014年11月6日星期四UTC+8上午8时08分50秒,Dormando写道:

Samdy Sun

unread,
Nov 6, 2014, 10:35:09 PM11/6/14
to memc...@googlegroups.com
hmmm... I'm not sure about the assumption. 
But here is another case log,
26042 INFO`fd=568`reason=conn_close`allow_new_conns=true
26043 INFO`fd=569`reason=conn_close`allow_new_conns=true
26044 accept(): Too many open files
26045 INFO`fd=563`reason=conn_close`allow_new_conns=true
26046 Too many open connections
26047 INFO`fd=571`reason=conn_close`allow_new_conns=true
26048 INFO`fd=565`reason=conn_close`allow_new_conns=true
26049 Failed to write, and not due to blocking: Broken pipe
……
28050 INFO`fd=1020`reason=conn_close`allow_new_conns=true
28051 Failed to write, and not due to blocking: Broken pipe
28052 INFO`fd=1021`reason=conn_close`allow_new_conns=true
28053 Failed to write, and not due to blocking: Broken pipe
28054 INFO`fd=1022`reason=conn_close`allow_new_conns=true
28055 Failed to write, and not due to blocking: Broken pipe
28056 INFO`fd=1023`reason=conn_close`allow_new_conns=true
28057 INFO`fd=-1`errno=24(Too many open files)`reason=Too many open connections
28058 allow_new_conns=false

It seem to be that the main thread can't acquire the conn_lock/etc while worker thread is closing the connections, so the "allow_new_conns" flag is set to "false" finally.

在 2014年11月6日星期四UTC+8下午9时42分07秒,Samdy Sun写道:
Reply all
Reply to author
Forward
0 new messages