Can you give me a list (privately, if need be) of a few things:
- The exact OS your server is running (centos/redhat release/etc)
- The exact kernel version (and where it came from? centos/rh proper or a
3rd party repo?)
- Full list of your 3rd party repos, since I know you had some random
French thing in there.
- Full list of packages installed from 3rd party repos.
It is extremely important that all of the software matches.
- Hardware details:
  - Network card(s), speeds
  - CPU type, number of cores (hyperthreading?)
  - Amount of RAM
  - Is this a hardware machine, or a VM somewhere? If a VM, what provider?
- memcached stats snapshots again, from your machine after it's been
  running a while:
  - "stats", "stats slabs", "stats items", "stats settings",
    "stats conns".
  ^ That's five commands, don't forget any.
It's too difficult to try to debug the issue when you hit it. Usually
when I'm at a gdb console I'm issuing a command every second or two, but
it takes us 10 minutes to get through 3-4 commands. It'd be nice if I
could attempt to reproduce it here.
I went digging more and there are some dup() bugs with epoll, but your
libevent is new enough to have those patched. Plus, we're not using dup()
in a way that would cause the bug.
There was also an EPOLL_CTL_MOD race condition in the kernel, but as far
as I can tell, even with libevent 2.x, libevent isn't using that feature
for us.
The issue does smell like the bug that happens with dup(): the events
keep happening and the fd sits half closed. But again, we're never closing
those sockets.
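For reference, here's roughly what that dup() behaviour looks like in
isolation. It's a Linux-only sketch using Python's select.epoll rather
than libevent, with a socketpair standing in for a client connection;
the function name is made up. It is not memcached's code path, just the
general epoll behaviour: a registration follows the underlying open file
description, so it only goes away once every descriptor pointing at it
has been closed.

```python
import os
import select
import socket


def events_after_close(with_dup):
    """Register an fd with epoll, close it, and report what epoll returns.

    If with_dup is True, a dup() of the fd is kept open across the close,
    so the underlying open file description stays alive.
    """
    a, b = socket.socketpair()
    ep = select.epoll()
    fd = a.detach()  # raw fd; we manage its lifetime by hand from here
    ep.register(fd, select.EPOLLIN)

    b.send(b"x")  # make the registered side readable

    dup_fd = os.dup(fd) if with_dup else None
    os.close(fd)  # close the descriptor that was registered

    # With no other descriptor open, the close drops the registration.
    # With the dup still open, epoll keeps reporting the closed fd number.
    events = ep.poll(0.2)

    if dup_fd is not None:
        os.close(dup_fd)
    b.close()
    ep.close()
    return fd, events
```

With with_dup=False the poll comes back empty. With with_dup=True the
closed fd number is still reported as readable, which is the "events
keep happening" symptom.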
I can also make a branch with the new dup() calls explicitly removed, but
this continues to be obnoxious multi-week-long debugging.
I'm convinced that the code in memcached is correct and the bug exists
outside of it (libevent or the kernel). There's simply no way for it to
hit that code path without closing the socket, and doubly so: epoll
automatically deletes an event when the socket is closed. We delete it,
then close it, and it still comes back.
It's not possible for a connection to end up in the wrong thread, since
both connection initialization and close happen local to a thread. We would
need to have a new connection come in with a duplicated fd. If that
happens, nothing on your machine would work.
thanks.