Update:
I seem to have succeeded in narrowing down the circumstances of this strange behavior. The chain of events is as follows:
1. A php worker receives a request from nginx and takes some time to work on it. During this time, nginx reaches the fastcgi timeout and tries to close the connection.
2. The php worker either doesn't receive or ignores the RST packet from nginx and keeps working.
3. When done, it starts writing to the socket and fills the send buffer, and the socket ends up in CLOSE_WAIT.
4. At this point, the worker is stuck in a write:
# cat /proc/16983/stack
[<ffffffff8140b756>] sk_stream_wait_memory+0x186/0x270
[<ffffffff8144f585>] tcp_sendmsg+0x705/0xa30
[<ffffffff81400ef1>] sock_aio_write+0x151/0x160
[<ffffffff8116d05a>] do_sync_write+0xfa/0x140
[<ffffffff8116d424>] vfs_write+0x184/0x1a0
[<ffffffff8116dd91>] sys_write+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
# strace -f -p 16983
Process 16983 attached - interrupt to quit
tcp 9 14608 php:9019 nginx:36970 CLOSE_WAIT 16983/php-fpm
and the kernel is backlogging new connections.
5. Now, if I kill the process, the situation recurs immediately, because the freshly spawned process picks up a backlogged connection that nginx has already abandoned, and once again the socket ends up in CLOSE_WAIT.
6. The only way to make it work again is to restart the master, so that the backlogged connections are dropped.
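The socket state from step 4 can also be watched without netstat. Below is a minimal sketch, assuming Linux's /proc/net/tcp layout, where the state column is a hex code (08 is CLOSE_WAIT) and the queue column is tx_queue:rx_queue in hex. The function name close_wait_sockets and the SAMPLE row are hypothetical, with the row made up to mirror the netstat line above.

```python
CLOSE_WAIT = "08"  # hex TCP state code used in /proc/net/tcp

def close_wait_sockets(proc_net_tcp_text):
    """Return (local_port, remote_port, send_q, recv_q) for each CLOSE_WAIT row."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5 or fields[3] != CLOSE_WAIT:
            continue
        # Addresses look like HEXIP:HEXPORT; only the port is decoded here.
        local_port = int(fields[1].split(":")[1], 16)
        remote_port = int(fields[2].split(":")[1], 16)
        # The queue column is tx_queue:rx_queue, both in hex.
        tx_hex, rx_hex = fields[4].split(":")
        rows.append((local_port, remote_port, int(tx_hex, 16), int(rx_hex, 16)))
    return rows

# Hypothetical sample: one CLOSE_WAIT row and one LISTEN (0A) row.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:233B 0100007F:906A 08 00003910:00000009 00:00000000 00000000     0        0 0
   1: 0100007F:233B 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without /proc
    for row in close_wait_sockets(text):
        print(row)
```

A CLOSE_WAIT row whose send queue stays non-zero over repeated runs would match the stuck worker described in step 4.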
There are two problems here, and I don't know what to blame for either of them. Firstly, I don't know why backlogged connections remain in the established state and are not dropped. Secondly, I don't know why the RST packets from nginx never reach php.
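On the first point, it may help to reproduce the behavior in isolation. As far as I understand it, the kernel completes the TCP handshake for a listening socket on its own and queues the connection until the application calls accept(); a client that gives up in the meantime only sends a FIN, which does not remove the connection from the queue. The sketch below assumes Linux and the loopback interface; the function name abandoned_backlog_demo is hypothetical, and this is an illustration of the mechanism, not of php-fpm's actual code.

```python
import socket

def abandoned_backlog_demo():
    """Accept a connection that the client opened, wrote to, and closed earlier."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))   # any free port on loopback
    srv.listen(8)                # from here on the kernel queues connections

    # The client connects, sends a request, and gives up before accept() runs.
    cli = socket.create_connection(srv.getsockname())
    cli.sendall(b"request")
    cli.close()                  # sends a FIN; the server side is still queued

    # accept() still hands out the abandoned connection.
    conn, _ = srv.accept()
    conn.settimeout(5)
    chunks = []
    while True:
        chunk = conn.recv(1024)
        if not chunk:            # empty read: the peer has already closed
            break
        chunks.append(chunk)
    conn.close()
    srv.close()
    return b"".join(chunks)

if __name__ == "__main__":
    print(abandoned_backlog_demo())
```

The accepted socket delivers the queued request and then end-of-file, which is the situation a freshly spawned worker finds itself in when it picks up a connection that nginx has already abandoned.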
The second problem appears to be a network problem. I think I can mitigate it for now by setting up a point-to-point tunnel between the hosts, so that no switch or firewall interferes with the traffic. Most probably I've hit some "intelligent" configuration in Amazon's EC2 network. It's been 3 days since I started tunneling the traffic, and the problem hasn't shown up again. I'll give it a week and then ask Amazon for clarification, though I don't expect much to come out of that conversation.
As for the first problem, I am not sure what's going on. My gut feeling is that connections abandoned by the client should not be sitting in the backlog, but I can't point to what is wrong. Any comments, anyone?
Thanks!