Issue 363 in memcached: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

memc...@googlecode.com

Apr 22, 2014, 6:33:38 PM

to memc...@googlegroups.com

Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 363 by i...@nodesocket.com: MemcachePool::get(): Server 127.0.0.1
(tcp 11211, udp 0) failed with: Network timeout
http://code.google.com/p/memcached/issues/detail?id=363

We just upgraded to memcached 1.4.18 using PHP 5.4.27 and the memcache
extension 3.0.8. After the upgrade we are getting intermittent:

MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

Using the script suggested in the wiki
(http://consoleninja.net/code/memcached/mc_conn_tester.pl), I immediately
see timeouts:

➜ ~ ./mc_conn_tester.pl
Fail: (timeout: 1) (elapsed: 1.00027108) (conn: 0.00198221) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00025678) (conn: 0.00038099) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024295) (conn: 0.00049210) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00023890) (conn: 0.00036407) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00027013) (conn: 0.00049710) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00025988) (conn: 0.00034285) (set: 0.00000000) (get: 0.00000000)
^CAverages: (conn: 0.00070960) (set: 0.00047751) (get: 0.00041979)

Is this a bug in the new version of memcached?


memc...@googlecode.com

Apr 22, 2014, 8:41:06 PM

to memc...@googlegroups.com

Comment #1 on issue 363 by dorma...@rydia.net:

I know we talked it over in IRC a bit, but one more question:

does mc_conn_tester.pl always fail, or is it intermittent?

memc...@googlecode.com

Apr 22, 2014, 9:46:44 PM

to memc...@googlegroups.com

Comment #2 on issue 363 by notifica...@commando.io:

It always failed, but after restarting memcached so far no problems. Will
advise if anything comes up again.

memc...@googlecode.com

Apr 22, 2014, 9:48:57 PM

to memc...@googlegroups.com

Comment #3 on issue 363 by dorma...@rydia.net:

I'm deeply suspicious if mc_conn_tester.pl is failing 100% but you can
successfully telnet to it :/

memc...@googlecode.com

Apr 23, 2014, 1:17:31 PM

to memc...@googlegroups.com

Comment #4 on issue 363 by i...@nodesocket.com:

The memcached timeout issues are back on the same host, here is the output
of mc_conn_tester.pl.

➜ ~ ./mc_conn_tester.pl
Fail: (timeout: 1) (elapsed: 1.00027609) (conn: 0.00065994) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024891) (conn: 0.00054693) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00022912) (conn: 0.00031996) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024796) (conn: 0.00060511) (set: 0.00000000) (get: 0.00000000)
^CAverages: (conn: 0.00058240) (set: 0.00062472) (get: 0.00041881)

Also, I'm seeing warnings thrown from PHP:

E_NOTICE: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout (0)

What else can I get you to help? Thanks.

memc...@googlecode.com

Apr 23, 2014, 1:28:34 PM

to memc...@googlegroups.com

Comment #5 on issue 363 by i...@nodesocket.com:

I had to restart memcached because of the errors (this is a production host);
however, I ran:

stats and stats conns

Before restarting: https://gist.github.com/nodesocket/fe822ef51eb0abfb8f9b

Hopefully that helps debug this.

memc...@googlecode.com

Apr 23, 2014, 10:20:04 PM

to memc...@googlegroups.com

Comment #6 on issue 363 by dorma...@rydia.net:

I'm really concerned that you're getting mc_conn_tester to fail each time
but can still connect via telnet. That should be impossible.

I see there's a timing value for conn: but not set:, which means it's
timing out on the set command.
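
That reading of the tester output can be checked mechanically. The following is only a minimal sketch, assuming the tester reports its stages in the order conn, set, get and leaves a stage at 0.00000000 when it never completed; the function name is made up for illustration.

```python
import re

# Stage order is an assumption based on the output format shown in this thread.
STAGES = ("conn", "set", "get")

def first_stalled_stage(line: str):
    """Return the first stage whose timing is zero, or None if every stage ran."""
    timings = {name: float(value)
               for name, value in re.findall(r"\((\w+): ([0-9.]+)\)", line)}
    for stage in STAGES:
        if timings.get(stage, 0.0) == 0.0:
            return stage
    return None

sample = ("Fail: (timeout: 1) (elapsed: 1.00027108) (conn: 0.00198221) "
          "(set: 0.00000000) (get: 0.00000000)")
print(first_stalled_stage(sample))  # set
```

For the failing lines in this thread the connect succeeds in well under a millisecond and the first stage with no timing is the set.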

If it happens again, can you telnet and try some set or get commands?

set foo 0 0 2
hi

(hit enter twice after typing hi). Or catch me in IRC.
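
The same manual check can be scripted. This is a rough sketch, assuming a memcached listening on 127.0.0.1:11211 and the plain text protocol; the helper names and the one-second timeout are illustrative choices, not part of any client library.

```python
import socket

def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a text-protocol set: header line, then the data block, each ending in CRLF."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"

def set_and_get(host="127.0.0.1", port=11211, timeout=1.0):
    """Store 'hi' under 'foo' and read it back; raises socket.timeout if the server hangs."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_set("foo", b"hi"))
        stored = sock.recv(1024)      # a healthy server answers STORED
        sock.sendall(b"get foo\r\n")
        fetched = sock.recv(1024)     # a single recv is usually enough for a reply this small
        return stored, fetched

if __name__ == "__main__":
    print(build_set("foo", b"hi"))    # the bytes that the telnet session above sends
```

If the set hangs the way mc_conn_tester.pl reports, `set_and_get()` should raise a timeout on the first `recv` rather than on the connect.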

If it does happen again, can you attach a gdb session to it and get a stack
trace?

gdb -p $(pidof memcached)
thread apply all bt

.. that'll show if something is holding onto a lock, hopefully.
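
For a production host where an interactive session is awkward, the same backtrace can be taken non-interactively. A small sketch, assuming gdb is installed and the caller is allowed to attach to the process; the function names and the `rounds` parameter are hypothetical.

```python
import subprocess

def gdb_backtrace_argv(pid: int):
    """argv for a one-shot 'thread apply all bt' against a running process."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

def capture_backtraces(pid: int, rounds: int = 3):
    """Run gdb several times; a thread sitting in the same frame every time is suspect."""
    outputs = []
    for _ in range(rounds):
        done = subprocess.run(gdb_backtrace_argv(pid), capture_output=True, text=True)
        outputs.append(done.stdout)
    return outputs
```

Taking several samples rather than one makes it easier to tell a thread that is spinning from one that merely happened to be busy.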

memc...@googlecode.com

Apr 24, 2014, 2:15:40 AM

to memc...@googlegroups.com

Comment #7 on issue 363 by jus...@commando.io:

Will do, I expect this issue to pop up sometime tomorrow. I have gdb ready
when it does.

memc...@googlecode.com

Apr 26, 2014, 2:29:06 PM

to memc...@googlegroups.com

Comment #8 on issue 363 by dorma...@rydia.net:

no luck?

memc...@googlecode.com

Apr 26, 2014, 4:12:18 PM

to memc...@googlegroups.com

Comment #9 on issue 363 by i...@nodesocket.com:

So far no errors, very strange. Will keep you posted, thanks for the help,
really appreciated.

memc...@googlecode.com

Apr 28, 2014, 3:26:37 AM

to memc...@googlegroups.com

Comment #10 on issue 363 by dorma...@rydia.net:

I'm planning on releasing 1.4.19 sometime tomorrow. So far I haven't had
any other reports of a similar issue (~5,000 downloads of .18). If you get
some gdb info before then I'll try to sneak in a fix.

Still pretty suspicious of it not being our issue. The connection code
changed a bit but you wouldn't be able to telnet in if that broke... having
the sets hang is pretty bizarre.

but, I won't rule it out until we see more of what's going on with you.

memc...@googlecode.com

Apr 28, 2014, 5:40:48 AM

to memc...@googlegroups.com

Comment #11 on issue 363 by notifica...@commando.io:

Speak of the devil. Just got an alert:

E_NOTICE: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout (0)

Also running mc_conn_tester.pl on the host in question is failing again:

➜ ~ ./mc_conn_tester.pl
Fail: (timeout: 1) (elapsed: 1.00025582) (conn: 0.00079894) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00021482) (conn: 0.00055695) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00023007) (conn: 0.00053406) (set: 0.00000000) (get: 0.00000000)
^CAverages: (conn: 0.00052620) (set: 0.00053876) (get: 0.00034102)

What would you like me to do?

memc...@googlecode.com

Apr 28, 2014, 6:54:28 PM

to memc...@googlegroups.com

Comment #12 on issue 363 by dorma...@rydia.net:

Just more notes: we worked this out a bit in IRC.

He's not running from a source build... which I think should have (some?)
line info.

If it doesn't, he'll have to start it again with the memcached-debug binary.

Before you do that though, try to catch me in IRC. We can do
instruction-stepping if the line numbering is still missing... it'll just
be a lot more painful to look at.

I've also e-mailed steven in case he has any ideas, but I haven't had time
to stare more at the code today (and might not, busy this week).

memc...@googlecode.com

Apr 28, 2014, 7:51:31 PM

to memc...@googlegroups.com

Comment #13 on issue 363 by notifica...@commando.io:

Thanks for the update. I am now running from source from the home
directory. When it happens again, I'll jump into IRC and we can try
instruction-stepping.

Thanks.

notifi...@commando.io

Apr 29, 2014, 10:27:09 AM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

It is happening right now, heading into IRC.

notifi...@commando.io

Apr 29, 2014, 11:04:02 AM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

I had to restart memcached. Here is the result of gdb, hope it helps without stepping through.

https://gist.github.com/nodesocket/31d32d838c08c47aa0d7

memc...@googlecode.com

Apr 29, 2014, 12:51:57 PM

to memc...@googlegroups.com

Comment #14 on issue 363 by dorma...@rydia.net:

heh.. if I'm not in IRC you might want to update here :P

I made a few more mistakes... I should've asked you to save a core, and
also run gdb a few times.

However, what I see in the backlog is useful: there's a line number!

That line is:
switch(c->state) {

... which might seem useless, and would be a lot better with some stepping
or printing of variables (really wishing I knew what c->state is..). GDB is
a skill you should pick up.

Anyway, between 1.4.17 and 1.4.18 one new c->state was added to the
routine, conn_closed. Under the debug binary that asserts. Under the
non-debug binary however it'll endlessly loop. So pretty high odds a
connection is ending up in conn_closed state within there.
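
The difference between the two builds can be modelled in a few lines. This is a toy model of the behaviour described above, not memcached's actual code: the state names mirror conn_closing and conn_closed, while `max_iterations` is an artificial cap standing in for "forever" so that the sketch terminates.

```python
from enum import Enum, auto

class State(Enum):
    CLOSING = auto()
    CLOSED = auto()

def drive_machine(state: State, debug: bool, max_iterations: int = 1000) -> int:
    """Toy model of the state loop; returns how many iterations ran before it stopped."""
    stop = False
    iterations = 0
    while not stop and iterations < max_iterations:
        iterations += 1
        if state is State.CLOSING:
            state = State.CLOSED   # closing the connection marks it closed
            stop = True
        elif state is State.CLOSED:
            if debug:
                raise AssertionError("entered the loop with a closed connection")
            # non-debug build: nothing sets stop here, so the loop spins
    return iterations

print(drive_machine(State.CLOSING, debug=False))  # 1
print(drive_machine(State.CLOSED, debug=False))   # 1000
```

A connection that enters in the closing state leaves after one pass; one that enters already closed only leaves because of the artificial cap, which corresponds to the endless loop in the non-debug binary.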

Find me in IRC and we'll walk through a quick patch to confirm this. (or if
it happens again before you see this, try to find me).

thanks!

memc...@googlecode.com

Apr 30, 2014, 3:44:26 AM

to memc...@googlegroups.com

Updates:
Status: Accepted

Comment #15 on issue 363 by dorma...@rydia.net:

So it turned out that c->state was set to "conn_closed".

Looking at this a bit more closely:

conn_closed is only ever set from conn_close().

conn_close() is only called from two spots: once, before drive_machine() is
entered (and early returned from), and once from within drive_machine(),
directly above the conn_closed case and with a stop/break before it:

    case conn_closing:
        if (IS_UDP(c->transport))
            conn_cleanup(c);
        else
            conn_close(c);
        stop = true;
        break;

    case conn_closed:

... conn_close() always deletes the event from the stack, closes the
filehandle, etc.

So, I don't see how this could happen... yet...

None of the code changed between 1.4.17 and 1.4.18 seems to add new paths
which could cause a connection to re-fire.

If conn_close() is called there's no way for that to loop again (stop =
true).

What would have to happen is the event_handler firing again on the closed
connection, whose fd is closed and whose event is deleted, but whose
memory is not reused just yet... It would then enter drive_machine with the
state already set to conn_closing, and never trip a stop = true, and not
assert since it's not a debug binary.

Which is fucking terrifying. If this happened in the old code it'd just
keep running into conn_closing and re-closing itself. though I was pretty
sure that calling event_del() twice causes a crash.

The other possibility is that a UDP socket is getting closed... except that
also deletes the event, and closes the socket, so no new events should
happen.

I did a quick test and added a second conn_close() call and.... it doesn't
cause a crash. it causes the curr_connections counter to slowly drift. That
is wild.
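
Why a stray second close shows up as counter drift rather than a crash can be illustrated with a toy counter. This is only a sketch under the assumption that each close decrements the stat unconditionally; the class and its numbers are invented for illustration and are not memcached's real bookkeeping.

```python
class ConnCounter:
    """Toy stand-in for the curr_connections stat."""

    def __init__(self):
        self.curr_connections = 0

    def open(self):
        self.curr_connections += 1

    def close(self):
        # No guard against closing twice, mirroring the experiment described above.
        self.curr_connections -= 1

def simulate(connections: int, double_close_every: int) -> int:
    """Open and close connections, closing every Nth one a second time."""
    counter = ConnCounter()
    for i in range(1, connections + 1):
        counter.open()
        counter.close()
        if i % double_close_every == 0:
            counter.close()   # the accidental second close
    return counter.curr_connections

print(simulate(1000, double_close_every=100))  # -10
```

Each extra close pulls the counter one further below the true value, so an occasional double close would show up as a slow drift that eventually goes negative.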

I just pushed:
https://github.com/memcached/memcached/commit/ee961e456457728ba78057961eca357edaea1ec1

...up to master.

I'm still a bit suspicious that I'm missing something important here... so
I'm not doing a release tonight, but I might do one early tomorrow anyway.

Reporter is currently running a version of this patch in production; if his
thing hangs up again, or doesn't self-recover once hitting the condition,
we'll have a better idea I guess.

memc...@googlecode.com

Apr 30, 2014, 3:54:44 AM

to memc...@googlegroups.com

Comment #16 on issue 363 by dorma...@rydia.net:

Er clarifying some typos:

"It would then enter drive_machine() with the state already in conn_closed"

-> in the old code, conn_closing is the final state (not edited again
during a close() call), so if it refired it'd immediately drop back through
the closing bits and re-close itself.

we're probably just not using libevent right. I bet fixing this bug makes
maxconns_fast a bit more reliable... I've never seen someone with a
negative curr_connections counter though. So I'm still suspicious.

memc...@googlecode.com

Apr 30, 2014, 5:06:36 PM

to memc...@googlegroups.com

Comment #17 on issue 363 by dorma...@rydia.net:

Been a bit under a day. Any output in your screen session yet?

memc...@googlecode.com

Apr 30, 2014, 8:08:27 PM

to memc...@googlegroups.com

Comment #18 on issue 363 by i...@nodesocket.com:

Nothing yet, no output in the screen process.

notifi...@commando.io

May 1, 2014, 12:39:45 AM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

I am in IRC now.

Seeing stderr logged over and over again in screen session: http://i.imgur.com/FrXRGZW.png
No timeouts with ./mc_conn_tester.pl though high CPU load in memcached process. See htop: http://i.imgur.com/qw93Jgv.png

memc...@googlecode.com

May 1, 2014, 6:06:06 PM

to memc...@googlegroups.com

Comment #19 on issue 363 by dorma...@rydia.net:

Turned out the upstream packager somehow built .18 against libevent-1.4.13
despite earlier versions being against 2.x. It's probably just a broken
build.

Further fiddling with gdb showed that the event_del() routine had either
not happened or happened incorrectly, so it was still linked in the
libevent queue. The patch I put in prevented it from causing hangs but it
still spun cpu. I might leave that in just in case.

I'd been over the code a bunch of times and it seemed pretty impossible...
so that's probably why.

but it seems like it's not our bug. leaving this open for a few more days
just in case though.

memc...@googlecode.com

May 3, 2014, 1:20:26 AM

to memc...@googlegroups.com

Updates:
Status: Invalid

Comment #20 on issue 363 by dorma...@rydia.net:

Any repeat crashes? I'm going to close this. it looks like remi
shipped .19. reopen or open a new one if it hangs in the same way somehow...

Well, .19 won't be printing anything, and it won't hang, but if it's
actually our bug and not libevent it would end up spinning CPU. Keep an eye
out I guess.

notifi...@commando.io

May 4, 2014, 2:55:31 AM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Just upgraded all 5 web-servers to memcached 1.4.19 with libevent 2.0.18. Will advise if I see memcached timeouts. Should be good though.

Thanks so much for all the help and patience. Really appreciated.

notifi...@commando.io

May 4, 2014, 12:33:55 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Damn it, got network timeout. CPU 3 is using 100% cpu from memcached.

Here is the result of stats, to verify that I'm using the new versions of memcached and libevent:

STAT version 1.4.19
STAT libevent 2.0.18-stable
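
Checking those two fields across several hosts is easy to script. A minimal sketch, assuming the reply consists of `STAT <name> <value>` lines terminated by `END`, as in the output above; the function name is illustrative.

```python
def parse_stats(raw: str) -> dict:
    """Turn 'STAT <name> <value>' lines into a dict, ignoring anything else (e.g. END)."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

sample = "STAT version 1.4.19\r\nSTAT libevent 2.0.18-stable\r\nEND\r\n"
stats = parse_stats(sample)
print(stats["version"], stats["libevent"])  # 1.4.19 2.0.18-stable
```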

dormando

May 4, 2014, 1:12:08 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

I'm stumped. (also, your e-mails aren't updating the ticket...).

It's impossible for a connection to get into the closed state without
having event_del() and close() called on the socket. A socket slot isn't
event_add()'ed again until after the state is reset to 'init_state'.

There was no code path for event_del to actually fail so far as I could
see.

I've e-mailed steven grimm for ideas but either that's not his e-mail
anymore or he's not going to respond.

I really don't know. I guess the old code would've just called conn_close
again by accident... I don't see how the logic changed in any significant
way in .18. Though again, if it happened with any frequency people's
curr_conns stat would go negative.

So... either that always happened and we never noticed, or your particular
OS is corrupt. There're probably 10,000+ installs of .18+ now and only one
complaint, so I'm a little hesitant to spend a ton of time on this until
we get more reports.

You should downgrade to .17.


notifi...@commando.io

May 4, 2014, 6:39:06 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

I'm going to try switching threads from 4 to 1. This host, web2, is the only one I am seeing it on, but it is also the only host that gets any real traffic. Super frustrating.

dormando

May 6, 2014, 3:07:08 AM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

and how'd that work out?

Still no other reports :/ a few thousand more downloads of .19...

notifi...@commando.io

May 6, 2014, 5:11:45 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Changing from 4 threads to 1 seems to have resolved the problem. No timeouts since. Should I set to 2 threads and wait and see how things go?

dormando

May 7, 2014, 8:19:13 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Hey,

try this branch:
https://github.com/dormando/memcached/tree/double_close

so far as I can tell that emulates the behavior in .17...

to build:
./autogen.sh && ./configure && make

run it in screen like you were doing with the other tests, see if it
prints "ERROR: Double Close [somefd]". If it prints that once then stops,
I guess that's what .17 was doing... if it print spams, then something
else may have changed.

I'm mostly convinced something about your OS or build is corrupt, but I
have no idea what it is. The only other thing I can think of is to
instrument .17 a bit more and have you try that (with the connection code
laid out the old way, but with a conn_closed flag to detect a double close
attempt), and see if the old .17 still did it.

notifi...@commando.io

May 7, 2014, 8:34:19 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Bumped up to 2 threads and so far no timeout errors. I'm going to let it run for a few more days, then revert back to 4 threads and see if timeout errors come up again. That will tell us whether the problem lies in spawning more than 2 threads.

dormando

May 7, 2014, 8:38:47 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

That doesn't really tell us anything about the nature of the problem
though. With 2 threads it might still happen, but is a lot less likely.

notifi...@commando.io

May 8, 2014, 5:15:21 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

I am just speculating, and by no means have any idea what I am really talking about here. :)

With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days. Increasing from 2 threads to 4 does not generate any more traffic or requests to memcached. Thus I am speculating perhaps it is a race condition of some sort, only hitting with > 2 threads.

Why do you say it will be less likely to happen with 2 threads than 4?

dormando

May 8, 2014, 6:18:46 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

> I am just speculating, and by no means have any idea what I am really talking about here. :)
> With 2 threads, still solid, no timeouts, no runaway 100% cpu. It's been days. Increasing from 2 threads to 4 does not generate any more traffic or
> requests to memcached. Thus I am speculating perhaps it is a race condition of some sort, only hitting with > 2 threads.

Doesn't tell me anything useful, since I'm already looking for potential
races and don't see any possibility outside of libevent.

> Why do you say it will be less likely to happen with 2 threads than 4?

Nature of race conditions: the more threads you have running the more
likely you are to hit them, sometimes by orders of magnitude.
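
One crude way to picture this, offered only as an illustration and not as a measurement of memcached: if a race needs two threads to interleave badly, the number of thread pairs that could do so grows faster than the thread count. The function name below is made up.

```python
from math import comb

def thread_pairs(threads: int) -> int:
    """Number of distinct pairs of threads that could interleave badly."""
    return comb(threads, 2)

for n in (1, 2, 4):
    print(n, thread_pairs(n))
```

With one thread there is no pair at all, with two there is one, and with four there are six, which fits the pattern of "never at 1, not yet at 2, regularly at 4" without proving anything about where the race actually lives.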

It doesn't really change the fact that this has worked for many years and
the code *barely* changed recently. I just don't see it.

dormando

May 8, 2014, 6:19:22 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

To that note, it *is* useful if you try that branch I posted, since so far
as I can tell that should emulate the .17 behavior.

dormando

May 9, 2014, 8:56:48 PM

to memc...@googlegroups.com, codesite...@google.com, memc...@googlecode.com

Can you give me a list (privately, if need be) of a few things:

- The exact OS your server is running (centos/redhat release/etc)
- The exact kernel version (and where it came from? centos/rh proper or a
3rd party repo?)
- Full list of your 3rd party repos, since I know you had some random
french thing in there.
- Full list of packages installed from 3rd party repos.

It is extremely important that all of the software matches.

- Hardware details:
- Network card(s), speeds
- CPU type, number of cores (hyperthreading?)
- Amount of RAM

- Is this a hardware machine, or a VM somewhere? If a VM, what provider?

- memcached stats snapshots again, from your machine after it's been
running a while:
- "stats", "stats slabs", "stats items", "stats settings", "stats
conns".
^ That's five commands, don't forget any.
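
Collecting all five snapshots in one go avoids forgetting any. A rough sketch, assuming a memcached on 127.0.0.1:11211 that speaks the text protocol and ends every stats reply with `END`; the helper names and the timeout are illustrative.

```python
import socket

STATS_COMMANDS = ("stats", "stats slabs", "stats items", "stats settings", "stats conns")

def read_until_end(sock) -> str:
    """Read one stats reply; the text protocol terminates it with an END line."""
    buf = b""
    while not buf.endswith(b"END\r\n"):
        chunk = sock.recv(4096)
        if not chunk:          # server closed the connection
            break
        buf += chunk
    return buf.decode("ascii", errors="replace")

def snapshot(host="127.0.0.1", port=11211, timeout=2.0) -> dict:
    """Return {command: raw reply} for all five commands."""
    replies = {}
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for command in STATS_COMMANDS:
            sock.sendall(command.encode("ascii") + b"\r\n")
            replies[command] = read_until_end(sock)
    return replies
```

Calling `snapshot()` on the affected host after it has been running for a while, and again once the timeouts start, would give two comparable sets of output.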

It's too difficult to try to debug the issue when you hit it. usually
when I'm at a gdb console I'm issuing a command every second or two, but
it takes us 10 minutes to get through 3-4 commands. It'd be nice if I
could attempt to reproduce it here.

I went digging more and there're some dup() bugs with epoll, except your
libevent is new enough to have those patched.. plus we're not using dup()
in such a way to cause the bug.

There was also an EPOLL_CTL_MOD race condition in the kernel, but so far
as I can tell even with libevent 2.x libevent's not using that feature for
us.

The issue does smell like the bug that happens with dup()'s - the events
keep happening and the fd sits half closed, but again we're never closing
those sockets.
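
For readers unfamiliar with the dup() behaviour mentioned here: epoll tracks the underlying open file description, so closing one descriptor does not remove the registration while a duplicate is still open. The sketch below is a Linux-only illustration of that general behaviour, not of anything memcached does; the function name is made up.

```python
import os
import select
import socket

def events_after_close(keep_dup: bool):
    """Register a socket with epoll, close it, and report what epoll still delivers."""
    a, b = socket.socketpair()
    ep = select.epoll()
    fd = a.fileno()
    ep.register(fd, select.EPOLLIN)
    dup_fd = os.dup(fd) if keep_dup else None
    b.send(b"x")            # make the registered side readable
    a.close()               # closes fd, but not the duplicate
    try:
        return ep.poll(0.1)
    finally:
        if dup_fd is not None:
            os.close(dup_fd)
        b.close()
        ep.close()

if hasattr(select, "epoll"):   # epoll exists on Linux only
    print(events_after_close(keep_dup=False))
    print(events_after_close(keep_dup=True))
```

Going by epoll(7), the first call should report nothing, because the close removed the registration, while the second should still report an event for a descriptor number the program has already closed. That is the "events keep happening" symptom described above.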

I can also make a branch with the new dup() calls explicitly removed, but
this continues to be obnoxious multi-week-long debugging.

I'm convinced that the code in memcached is correct and the bug exists
outside of it (libevent or the kernel). There's simply no way for it to
hit that code path without closing the socket, and doubly so: epoll
automatically deletes an event when the socket is closed. We delete it
then close it, and it still comes back.

It's not possible a connection ends up in the wrong thread, since both
connection initialization and close happen local to a thread. We would
need to have a new connection come in with a duplicated fd. If that
happens, nothing on your machine would work.

thanks.

memc...@googlecode.com

May 21, 2014, 5:52:24 AM

to memc...@googlegroups.com

Comment #21 on issue 363 by pavel.hl...@profimedia.com:

Hi, I have exactly the same problem described here. My error message is:

MemcachePool::set(): Server host.domain.net (tcp 11211, udp 0) failed with: Network timeout (0)

I'm using the latest PHP 5.5.12 and memcache 3.0.8 from the remi repo on
CentOS 6.5. We have no problems with memcached 1.4.4 from the base repo.
When I upgrade memcached from the remi repo, I get this message with every
version from 1.4.17 through 1.4.20. It is not related to network issues.

Regards,

Pavel

memc...@googlecode.com

May 21, 2014, 5:59:07 AM

to memc...@googlegroups.com

Comment #22 on issue 363 by pavel.hl...@profimedia.cz:

Hi, I have exactly the same problem described here. My error message is:

MemcachePool::set(): Server host.domain.net (tcp 11211, udp 0) failed with: Network timeout (0)

I'm using the latest PHP 5.5.12 and memcache 3.0.8 from the remi repo on
CentOS 6.5. We have no problems with memcached 1.4.4 from the base repo.
When I upgrade memcached from the remi repo, I get this message with every
version from 1.4.17 through 1.4.20. It is not related to network issues.

I have two versions of libevent on system:

libevent.x86_64 1.4.13-4.el6 @base
libevent-last.x86_64 2.0.21-4.el6.remi @remi

memc...@googlecode.com

May 21, 2014, 10:52:17 AM

to memc...@googlegroups.com

Comment #23 on issue 363 by i...@nodesocket.com:

This was a race condition fixed with:

https://github.com/memcached/memcached/commit/e73bc2e5c0794cccd6f8ece63bc16433c40ed766

Upgrade to 1.4.20. Works for us.
