passenger-status not responding

findchris

Aug 6, 2009, 6:11:36 PM
to Phusion Passenger Discussions
One of our servers stopped responding, so I checked passenger-status
and received no output. Ctrl+C yields this dump: http://pastie.org/574708

Also, from the apache error log:
>>>
The HTTP client closed the connection before the response could be
completely sent. As a result, you will probably see a 'Broken Pipe'
error in this log file. Please ignore it, this is normal.
[ pid=7097 file=ext/apache2/Hooks.cpp:638 time=2009-08-06 14:44:28.291 ]:
No data received from the backend application (process 5569) within
300000 msec. Either the backend application is frozen, or your TimeOut
value of 300 seconds is too low. Please check whether your application
is frozen, or increase the value of the TimeOut configuration
directive.
>>>

Any tips for isolating the cause of this?

Thanks.
-Chris

findchris

Aug 6, 2009, 6:32:43 PM
to Phusion Passenger Discussions
Here is some sample output from an strace on some of the stuck rails
processes:
select(64, [3 6 7 63], [], [], {32, 440000}
select(65, [3 6 7 64], [], [], {26, 212000}

Perhaps more useful to the skilled eye is this full stack trace,
generated when strace reported ERESTARTNOHAND (the select call being
interrupted by a signal):
http://pastie.org/574737

Any ideas?

Tom Clarke

Aug 8, 2009, 10:12:46 PM
to Phusion Passenger Discussions
> On Aug 6, 3:11 pm, findchris <findch...@gmail.com> wrote:
> > One of our servers stopped responding, so I checked out passenger-
> > status and received no output.  Ctrl+C yields this dump:  http://pastie.org/574708
>
> > Also, from the apache error log:
>
> > The HTTP client closed the connection before the response could be
> > completely sent. As a result, you will probably see a 'Broken Pipe'
> > error in this log file. Please ignore it, this is normal.
> > [ pid=7097 file=ext/apache2/Hooks.cpp:638 time=2009-08-06
> > 14:44:28.291 ]:
> >   No data received from the backend application (process 5569) within
> > 300000 msec. Either the backend application is frozen, or your TimeOut
> > value of 300 seconds is too low. Please check whether your application
> > is frozen, or increase the value of the TimeOut configuration
> > directive.

This sounds quite a lot like an issue we had 4 or 5 times within the
space of 24 hours a week or so ago on:
passenger 2.2.4
REE 1.8.6-20081215

We upgraded passenger to the current tip of master, and the problem
hasn't occurred since. We were inspired to do so by this comment:
http://stackoverflow.com/questions/1082166/exception-errnoepipe-in-passenger-requesthandler-broken-pipe

However, I don't know whether one of these changes fixed our issue, or
if we've just been lucky.

We're also planning to upgrade to a recent REE due to the 'Fixed a
possible infinite looping bug in the garbage collector' fix in the
20090421 release.

-Tom

Tom Clarke

Aug 9, 2009, 12:22:33 PM
to Phusion Passenger Discussions
On Aug 8, 10:12 pm, Tom Clarke <t...@u2i.com> wrote:
> However, I don't know whether one of these changes fixed our issue, or
> if we've just been lucky.

Looks like we had the issue occur this morning, so lucky (and now not
so much) it was.

> We're also planning to upgrade to a recent REE due to the 'Fixed a
> possible infinite looping bug in the garbage collector' fix in the
> 20090421 release.

So this is next.

-Tom

Tom Clarke

Aug 9, 2009, 12:45:37 PM
to Phusion Passenger Discussions
Looking through the archives, it looks like there might be an issue
with RMagick causing a problem possibly only with older versions of
REE, per the following thread:
http://groups.google.com/group/phusion-passenger/browse_thread/thread/85a51677c89db7e8#

We've seen the problem occur almost exclusively on the server (out of
4) that does most of the RMagick related activity.

We'll do the REE upgrade this week. I'll report back as to whether
that helps.

-Tom

Tom Clarke

Aug 12, 2009, 2:59:47 PM
to Phusion Passenger Discussions
On Aug 9, 12:45 pm, Tom Clarke <t...@u2i.com> wrote:
> We'll do the REE upgrade this week. I'll report back as to whether
> that helps.

This didn't help; we got the issue again this morning. However, I
might have figured it out.

Here's a log from the affected server.
http://pastie.org/581681

What it shows is the start of the request ('START:'), followed a couple
of hours later by the stack trace dumped when the locked process
restarted, and then the log message showing the completion of the
request.

No other passenger workers handled a request in the meantime, so it
seems this process was somehow causing a lock up across all the
processes. Seeing that the issue was in drb, I did a bit more googling
and found this:
http://pennysmalls.com/2009/03/02/using-acts-as-ferret-with-phusion-passenger-mod_rails/

So I can certainly see that sharing a socket across workers would be a
bad idea (and we'll definitely fix), but my question is, would having
a shared socket cause the 'all passenger processes to become
unresponsive' type problem?

-Tom

Hongli Lai

Aug 12, 2009, 3:21:24 PM
to phusion-...@googlegroups.com
On Wed, Aug 12, 2009 at 8:59 PM, Tom Clarke<t...@u2i.com> wrote:
> This didn't help, we got the issue again this morning. However, I
> might have figured it out.
>
> Here's a log from the affected server.
> http://pastie.org/581681
>
> What it shows is the start of the request 'START:', followed a couple
> of hours later by the stack trace dumped when the locked process
> restarted. Followed by the log message showing the completion of the
> request.
>
> No other passenger workers handled a request in the meantime, so it
> seems this process was somehow causing a lock up across all the
> processes. Seeing that the issue was in drb, I did a bit more googling
> and found this:
> http://pennysmalls.com/2009/03/02/using-acts-as-ferret-with-phusion-passenger-mod_rails/
>
> So I can certainly see that sharing a socket across workers would be a
> bad idea (and we'll definitely fix), but my question is, would having
> a shared socket cause the 'all passenger processes to become
> unresponsive' type problem?

That depends on what the socket is doing. In case of a DRb socket,
yes, that can definitely cause problems. You should reestablish the
DRb connection in the worker process. There are a few examples in the
Phusion Passenger users guide on how to do this.

--
Phusion | The Computer Science Company

Web: http://www.phusion.nl/
E-mail: in...@phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)

findchris

Aug 12, 2009, 3:47:20 PM
to Phusion Passenger Discussions
Thanks for the feedback.

I am not using DRb, nor do I have other daemons or processes running
on this machine that would be interacting with REE.

Does the output given in my earlier post yield any clues?

Thanks.

Tom Clarke

Aug 12, 2009, 3:51:02 PM
to Phusion Passenger Discussions
On Aug 12, 3:21 pm, Hongli Lai <hon...@phusion.nl> wrote:
> That depends on what the socket is doing. In case of a DRb socket,
> yes, that can definitely cause problems. You should reestablish the
> DRb connection in the worker process. There are a few examples in the
> Phusion Passenger users guide on how to do this.

Yes, fixing it should be straightforward.

In the interests of fully understanding the issue though, the point
where it's stuck in drb is line 575.
564   def load(soc) # :nodoc:
565     begin
566       sz = soc.read(4) # sizeof (N)
567     rescue
568       raise(DRbConnError, $!.message, $!.backtrace)
569     end
570     raise(DRbConnError, 'connection closed') if sz.nil?
571     raise(DRbConnError, 'premature header') if sz.size < 4
572     sz = sz.unpack('N')[0]
573     raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
574     begin
575       str = soc.read(sz)
576     rescue
577       raise(DRbConnError, $!.message, $!.backtrace)
578     end

So it makes sense that the worker would be stuck: some other process
has come along and read from the socket, leaving this reader waiting
for data that never arrives. The read is wrapped in a timeout block
(Timeout, not SystemTimer), but that doesn't reliably interrupt
blocking system calls (another thing to fix).
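
For later readers, here is a minimal sketch of one way to bound such a
read without relying on Timeout. It assumes a modern Ruby (2.1 or later,
for IO::WaitReadable and Process.clock_gettime), not the REE 1.8 discussed
in this thread. read_with_deadline is a hypothetical helper, not part of
DRb: it waits for readability with IO.select against a deadline, then
reads with read_nonblock, so the process never sits in a blocking read.

```ruby
# Hypothetical helper (not part of DRb): read exactly `size` bytes from `io`,
# raising IOError once `timeout` seconds have elapsed in total.
def read_with_deadline(io, size, timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  buffer = String.new # binary buffer for the bytes read so far
  while buffer.bytesize < size
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise IOError, 'read timed out' if remaining <= 0
    # IO.select returns nil if nothing became readable within `remaining`.
    next unless IO.select([io], nil, nil, remaining)
    begin
      buffer << io.read_nonblock(size - buffer.bytesize)
    rescue IO::WaitReadable
      next # readable state vanished; go back to waiting
    rescue EOFError
      raise IOError, 'connection closed'
    end
  end
  buffer
end
```

If another process drains the shared socket first, this raises after the
deadline instead of hanging the worker indefinitely.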

What doesn't quite make sense to me is why it should lock up all the
passenger processes. Is there some reason why passenger could end up
blocking indefinitely here across all processes?

-Tom

Tom Clarke

Aug 12, 2009, 4:07:40 PM
to Phusion Passenger Discussions
On Aug 12, 2:59 pm, Tom Clarke <t...@u2i.com> wrote:
> No other passenger workers handled a request in the meantime, so it
> seems this process was somehow causing a lock up across all the
> processes. Seeing that the issue was in drb, I did a bit more googling
> and found this:http://pennysmalls.com/2009/03/02/using-acts-as-ferret-with-phusion-p...

I just started getting a 404 on the above link, so for future
reference it suggests closing and re-opening the connections in the
environment.rb as follows:

if defined?(PhusionPassenger)
  # monkey patch drb so we can close its connections
  class DRb::DRbConn
    def self.close_all
      @mutex.synchronize do
        @pool.each {|c| c.close}
        @pool = []
      end
    end
  end

  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    if forked
      # We're in smart spawning mode.
      CACHE.reset # memcached
      DRb::DRbConn.close_all # ferret
    else
      # We're in conservative spawning mode. We don't need to do anything.
    end
  end
end

-Tom

Tom Clarke

Aug 12, 2009, 5:27:40 PM
to Phusion Passenger Discussions
On Aug 12, 3:51 pm, Tom Clarke <t...@u2i.com> wrote:
> What doesn't quite make sense to me, is why it should lock up all the
> passenger processes. Is there some reason why passenger could end up
> blocking indefinitely here across all processes?

Taking a look around the passenger code to see if there were any
opportunities for locking, I noticed this in:
lib/phusion_passenger/abstract_server.rb

Timeout::timeout(3) do
  Process.waitpid(@pid) rescue nil
end

Having just been burnt by the 'timeout isn't always reliable with
syscalls' issue, I'm curious whether there's ever a scenario where this
might not actually time out.

It's just a thought. I certainly haven't traced all the locking code
to tell whether, even if this were a realistic scenario, it could
cause a global lock.
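
For reference, a sketch of an alternative that avoids wrapping a
blocking waitpid in Timeout: poll with Process::WNOHANG until a deadline.
wait_for_exit is a hypothetical helper, not Passenger's code; the 0.05
second poll interval is an arbitrary choice, and Process.clock_gettime
assumes Ruby 2.1 or later.

```ruby
# Hypothetical helper: wait up to `timeout` seconds for child `pid` to exit.
# Returns true if the child has been reaped, false if it is still running.
def wait_for_exit(pid, timeout, interval = 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    # With WNOHANG, waitpid returns nil at once if the child has not exited.
    return true if Process.waitpid(pid, Process::WNOHANG)
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
rescue Errno::ECHILD
  true # no such child: it was already reaped elsewhere
end
```

Because each waitpid call returns immediately, the wait cannot outlive
the deadline by more than one poll interval.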

-Tom

Tom Clarke

Aug 12, 2009, 6:49:38 PM
to Phusion Passenger Discussions
On Aug 12, 5:27 pm, Tom Clarke <t...@u2i.com> wrote:
> On Aug 12, 3:51 pm, Tom Clarke <t...@u2i.com> wrote:
>
> > What doesn't quite make sense to me, is why it should lock up all the
> > passenger processes. Is there some reason why passenger could end up
> > blocking indefinitely here across all processes?

Sorry to keep responding to myself, but I was rechecking the logs and
I realise I misread them (bad assumptions about their chronological
order).

In fact the other passenger processes were doing fine - just one dead
process due to me not using drb correctly. So, maybe the recent REE
and/or passenger helped there. Given that, I haven't yet had an issue
with passenger/REE with the latest versions which is good news.

Thanks,

-Tom

findchris

Sep 8, 2009, 4:49:12 PM
to Phusion Passenger Discussions
Hey Tom.

Still free of this issue? We just had two Apache server instances stop
responding over the weekend.

We aren't using DRb, so I am unsure of the culprit.

Here is some sample output for what it's worth:
>>
[ pid=16267 file=ext/apache2/Hooks.cpp:638 time=2009-09-06 21:12:45.877 ]:
No data received from the backend application (process 31830) within
300000 msec. Either the backend application is frozen, or your TimeOut
value of 300 seconds is too low. Please check whether your application
is frozen, or increase the value of the TimeOut configuration
directive.
>>

Hey Hongli, any chance to sponsor a fix for this issue?

Thanks,
Chris

trustfundbaby

Sep 9, 2009, 4:23:55 AM
to Phusion Passenger Discussions
I had the exact same problem with Passenger and didn't get any
conclusive answer.
But I've upgraded to Passenger 2.2.5 and haven't had the problem since
(fingers crossed).
This changelog entry was what made me think there might be a bug in
2.2.4 causing the problem:
---------------------
[Apache] Fixed I/O timeouts for communication with backend processes
Got rid of the code for enforcing I/O timeouts when reading from or
writing to a backend process. This caused more problems than it
solved.
------------------------
What version of passenger are you running?

Hongli Lai

Sep 9, 2009, 7:17:31 AM
to phusion-...@googlegroups.com
On Tue, Sep 8, 2009 at 10:49 PM, findchris<find...@gmail.com> wrote:
> Hey Hongli, any chance to sponsor a fix for this issue?

Hey Chris.

Well I'm not sure whether this is really a bug in Phusion Passenger.
It might also be a problem in your application. In any case, solving
the problem requires investigation. If you're interested in commercial
support, please email in...@phusion.nl and we can discuss the
possibilities.

With kind regards,
Hongli Lai

findchris

Sep 9, 2009, 7:36:18 PM
to Phusion Passenger Discussions
Thanks Hongli. My coworker might be contacting you soon.
-Chris
