Re: Pools randomly hang


Jérôme Loyet

May 23, 2013, 2:33:26 AM
to highloa...@googlegroups.com
Hi,

Is it possible for you to update to the latest PHP 5.4 version to see if the problem still occurs?

++ Jerome


2013/5/21 Michael Tabolsky <mtab...@gmail.com>
Hi List,

I really hope someone can help debug this problem, since I am running out of options here ...

I have a setup of two nodes, one running nginx, the other php-fpm, with about 30 hosts/pools. All was good for a few months, but since some upgrade (I just can't track down which one) I've started to hit a problem with random pools at random intervals: children get stuck like this:
[pid  7939] write(3, "122\"><a href=\"http://www.b"..., 456 <unfinished ...>
[pid  7672] write(3, "122\"><a href=\"http://www.b"..., 456^C <unfinished ...>

The fd is the connection to nginx (TCP, naturally), which nginx has already dropped because of a timeout. There are no errors or warnings in the debug log. As soon as the pool hits the children limit, the master starts refusing connections from nginx. Just before the stuck write of the response begins, the php processes don't do anything suspicious; they just mmap files normally, without any errors. If I kill these children, the master spawns new ones as it should, and they immediately get stuck in the same way. This doesn't affect other pools running under different or the same UIDs; they keep going. The only way to "recover" the "broken" pool is to restart the master.

PHP (5.3.23) is running on CentOS 6.4 x86_64 with memcache for sessions and no accelerators.

I also cannot correlate the problem to any external factor, like high loads or network outages.

Any guesses, please?

Thanks a lot in advance!

--
 
---
You received this message because you are subscribed to the Google Groups "highload-php-en" group.
To unsubscribe from this group and stop receiving emails from it, send an email to highload-php-...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Michael Tabolsky

May 27, 2013, 1:42:18 PM
to highloa...@googlegroups.com
Hi,

thanks for the reply.
After reading all the php-fpm related bugs, out of desperation I tried upgrading just now.
5.4.15 changes the symptoms somewhat but doesn't solve the problem.
I have one single pool (there are 5 other pools under the same UID) that keeps stalling.
With 5.4.15, it serves some requests and then stops in:
[pid 12104] recvfrom(3,  <unfinished ...>
[pid 12097] recvfrom(3, 

where fd 3 in each process is:
tcp        0      0 10.x:9003          10.x:52258         FIN_WAIT2   12104/php-fpm       
tcp        0      0 10.x:9003          10.x:52059         FIN_WAIT2   12097/php-fpm

New connections from nginx are getting backlogged:
tcp     1704      0 10.x:9003          10.x:52311         ESTABLISHED -                   

and nothing works :(

i don't know what to do with it, really ...
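
For anyone trying to spot the same state, the netstat lines above can be tallied with a small script. This is only a sketch, assuming the `netstat -tnp` column layout shown above (proto, Recv-Q, Send-Q, local, foreign, state, PID/program). The name `summarize` is made up for illustration, and the sample lines are copied from this message with the addresses left elided.

```python
from collections import Counter

# Sample lines copied from the netstat output above (addresses left elided).
SAMPLE = """\
tcp        0      0 10.x:9003          10.x:52258         FIN_WAIT2   12104/php-fpm
tcp        0      0 10.x:9003          10.x:52059         FIN_WAIT2   12097/php-fpm
tcp     1704      0 10.x:9003          10.x:52311         ESTABLISHED -
"""


def summarize(text):
    """Count sockets per TCP state and list connections with no owning process."""
    states = Counter()
    unowned = []
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        recv_q = int(fields[1])
        foreign, state, owner = fields[4], fields[5], fields[6]
        states[state] += 1
        # netstat prints "-" when it finds no owning process. A connection
        # still waiting in the listen queue looks like this (so can a run
        # without enough privileges). Recv-Q is data not yet read.
        if owner == "-":
            unowned.append((foreign, recv_q))
    return states, unowned


if __name__ == "__main__":
    states, unowned = summarize(SAMPLE)
    print(dict(states))
    print(unowned)
```

A growing list of unowned ESTABLISHED entries with a non-zero Recv-Q would suggest that no worker is calling accept() any more.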

Michael Tabolsky

May 27, 2013, 2:07:41 PM
to highloa...@googlegroups.com
Oh, there is also another difference with 5.4.15:
if I kill the stalled processes, the newly spawned ones serve requests for some time. With 5.3 they were stuck on the first request.

Michael Tabolsky

Jun 8, 2013, 3:32:12 AM
to highloa...@googlegroups.com
Update,

It seems I've managed to narrow down the circumstances of this strange behavior. The chain of events is as follows:
1. A php worker receives a request from nginx and takes some time to work on it. During this time, nginx reaches the fastcgi timeout and tries to close the connection.
2. The php worker either doesn't receive or ignores the RST packet from nginx and carries on.
3. When done, it starts writing to the socket, fills the buffer, and the socket goes into CLOSE_WAIT.
4. At this point, the worker is stuck in:
# cat /proc/16983/stack 
[<ffffffff8140b756>] sk_stream_wait_memory+0x186/0x270
[<ffffffff8144f585>] tcp_sendmsg+0x705/0xa30
[<ffffffff81400ef1>] sock_aio_write+0x151/0x160
[<ffffffff8116d05a>] do_sync_write+0xfa/0x140
[<ffffffff8116d424>] vfs_write+0x184/0x1a0
[<ffffffff8116dd91>] sys_write+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
# strace  -f -p  16983
Process 16983 attached - interrupt to quit
write(3, "ef=\"http://www.\" >"..., 51048

tcp        9  14608 php:9019          nginx:36970         CLOSE_WAIT  16983/php-fpm       

and the kernel is backlogging new connections.
5. Now, if I kill the process, the situation recurs immediately, because the freshly spawned process picks up a backlogged connection that nginx has already abandoned, and again the socket ends up in CLOSE_WAIT.
6. The only way to make it work again is to restart the master, so that the backlogged connections are dropped.
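
Steps 3 and 4 can be reproduced in miniature. The sketch below is a stand-in under loud assumptions, not the php-fpm code path: a loopback peer that simply never reads takes the place of the nginx side whose RST never arrived, and the writer is set non-blocking so the script reports where a blocking write() would hang instead of hanging itself. The name `fill_send_buffer` is made up.

```python
import socket


def fill_send_buffer(chunk_size=4096):
    """Write to a peer that never reads; return the bytes accepted before
    the kernel reports that a further write would block."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    writer = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()  # accepted, but deliberately never read from
    writer.setblocking(False)
    sent = 0
    try:
        while True:
            # Succeeds while there is room in the socket buffers.
            sent += writer.send(b"x" * chunk_size)
    except BlockingIOError:
        # Buffers are full: a blocking socket would now sleep in the kernel
        # waiting for send-buffer memory.
        pass
    finally:
        writer.close()
        peer.close()
        listener.close()
    return sent


if __name__ == "__main__":
    print("bytes queued before the write would block:", fill_send_buffer())
```

On Linux, a blocking socket running the same loop would park in the kernel, much like the stack trace above, until the peer drains data or the connection is reset. A send timeout on the socket is one way to turn that indefinite wait into an error.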

There are two problems, and I don't know what to blame for either. First, I don't know why backlogged connections remain in the ESTABLISHED state and are not dropped. Second, I don't know why the RST packets from nginx never arrive at the php host.

The latter obviously seems to be a network problem. I think I can mitigate it for now by setting up a point-to-point tunnel between the hosts, so that no switch or firewall interferes with the traffic. Most probably I've hit some "intelligent" configuration in Amazon's EC2 network. It's been 3 days since I started tunneling the traffic and the problem hasn't shown up again. I'll give it a week and then try to ask Amazon for clarification. I don't expect much to come out of that conversation.

As for the first problem, I am not sure what's wrong here. I have a gut feeling that something is wrong with it. Any comments anyone?
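
One likely piece of the puzzle on the first problem: on Linux the kernel completes the TCP handshake for connections in the listen backlog before the application ever calls accept(), so from the client's side they are established and can even carry data (the 1704 bytes in Recv-Q in the earlier netstat output would fit that). A minimal sketch, assuming loopback behaves like the real network in this respect; `connect_without_accept` is a hypothetical name:

```python
import socket


def connect_without_accept(payload=b"x" * 1704):
    """Connect and send to a listener that never calls accept().

    Returns (connected, bytes_sent).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(8)               # backlog only; accept() is never called
    client = socket.create_connection(listener.getsockname(), timeout=5)
    try:
        # The kernel finished the handshake on the listener's behalf, so the
        # client is connected and its data is queued on the server side.
        sent = client.send(payload)
        connected = client.getpeername() == listener.getsockname()
    finally:
        client.close()
        listener.close()
    return connected, sent


if __name__ == "__main__":
    print(connect_without_accept())
```

As far as I know, such queued connections stay in the backlog until they are accepted, reset by the peer, or the listening socket is closed. The last case would match a master restart being the only thing that clears them when the peer's FIN/RST never arrives.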

Thanks!  

tech manjoy

Dec 25, 2013, 7:17:59 PM
to highloa...@googlegroups.com
Was creating the tunnel a full resolution for this? Or did you find another solution, or is it still continuing?

Michael Tabolsky

Dec 27, 2013, 3:35:35 PM
to highloa...@googlegroups.com
A few months ago I had to restart, and I haven't brought the tunnel back up since, but it still happens once or twice a month.

