One of our servers stopped responding, so I checked out passenger-status
and received no output. Ctrl+C yields this dump: http://pastie.org/574708
Also, from the apache error log:
The HTTP client closed the connection before the response could be
completely sent. As a result, you will probably see a 'Broken Pipe'
error in this log file. Please ignore it, this is normal.
[ pid=7097 file=ext/apache2/Hooks.cpp:638 time=2009-08-06
14:44:28.291 ]:
No data received from the backend application (process 5569) within
300000 msec. Either the backend application is frozen, or your TimeOut
value of 300 seconds is too low. Please check whether your application
is frozen, or increase the value of the TimeOut configuration
directive.
Here is some sample output from an strace on some of the stuck rails
processes:
select(64, [3 6 7 63], [], [], {32, 440000}
select(65, [3 6 7 64], [], [], {26, 212000}
Perhaps more useful to the skilled eye is this full stack trace,
generated when the blocked select is interrupted by a signal (strace
reports ERESTARTNOHAND):
http://pastie.org/574737
Any ideas?
This sounds quite a lot like an issue we had 4 or 5 times within the
space of 24 hours a week or so ago on:
passenger 2.2.4
REE 1.8.6-20081215
However, I don't know whether one of these changes fixed our issue, or
if we've just been lucky.
We're also planning to upgrade to a recent REE due to the 'Fixed a
possible infinite looping bug in the garbage collector' fix in the
20090421 release.
On Aug 8, 10:12 pm, Tom Clarke <t...@u2i.com> wrote:
> However, I don't know whether one of these changes fixed our issue, or
> if we've just been lucky.
Looks like we had the issue occur this morning, so lucky (and now not
so much) it was.
On Aug 9, 12:22 pm, Tom Clarke <t...@u2i.com> wrote:
> On Aug 8, 10:12 pm, Tom Clarke <t...@u2i.com> wrote:
> > We're also planning to upgrade to a recent REE due to the 'Fixed a
> > possible infinite looping bug in the garbage collector' fix in the
> > 20090421 release.
What it shows is the start of the request ('START:'), followed a couple
of hours later by the stack trace dumped when the locked process
restarted, and then the log message showing the completion of the
request.
No other passenger workers handled a request in the meantime, so it
seems this process was somehow causing a lock up across all the
processes. Seeing that the issue was in drb, I did a bit more googling
and found this:
http://pennysmalls.com/2009/03/02/using-acts-as-ferret-with-phusion-p...
So I can certainly see that sharing a socket across workers would be a
bad idea (and we'll definitely fix), but my question is, would having
a shared socket cause the 'all passenger processes to become
unresponsive' type problem?
On Wed, Aug 12, 2009 at 8:59 PM, Tom Clarke <t...@u2i.com> wrote:
> This didn't help, we got the issue again this morning. However, I
> might have figured it out.
> So I can certainly see that sharing a socket across workers would be a
> bad idea (and we'll definitely fix), but my question is, would having
> a shared socket cause the 'all passenger processes to become
> unresponsive' type problem?
That depends on what the socket is doing. In case of a DRb socket, yes, that can definitely cause problems. You should reestablish the DRb connection in the worker process. There are a few examples in the Phusion Passenger users guide on how to do this.
--
Phusion | The Computer Science Company
Web: http://www.phusion.nl/
E-mail: i...@phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)
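For readers wondering why a connection opened before forking is a problem: a forked worker inherits the parent's open sockets, so two processes end up reading from the same stream. Below is a minimal sketch of that effect, not DRb or Passenger code. `demo_shared_socket` is a hypothetical helper, and a `UNIXSocket.pair` stands in for the DRb connection. It needs a platform with `fork`.

```ruby
require "socket"

# Hypothetical demo helper. Returns [bytes the child read,
# whether the parent still has data available to read].
def demo_shared_socket
  writer, reader = UNIXSocket.pair   # stands in for a connection opened before fork
  result_r, result_w = IO.pipe       # lets the child report what it consumed

  writer.write("reply")              # a 5-byte response meant for one reader

  pid = fork do
    result_r.close
    # The child inherited `reader`, so it shares the parent's socket.
    result_w.write(reader.read(5))
    result_w.close
    exit!(0)
  end

  result_w.close
  Process.waitpid(pid)
  stolen = result_r.read
  result_r.close

  # The bytes are gone: a blocking reader.read(5) in the parent would now hang.
  parent_has_data = !IO.select([reader], nil, nil, 0.1).nil?
  [stolen, parent_has_data]
ensure
  [writer, reader].each { |io| io.close if io && !io.closed? }
end
```

The child consumes the bytes, and the parent finds nothing left to read on its copy of the socket, which is the situation a worker ends up in when another process reads its reply.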
On Aug 12, 3:21 pm, Hongli Lai <hon...@phusion.nl> wrote:
> That depends on what the socket is doing. In case of a DRb socket,
> yes, that can definitely cause problems. You should reestablish the
> DRb connection in the worker process. There are a few examples in the
> Phusion Passenger users guide on how to do this.
Yes, fixing it should be straightforward.
In the interests of fully understanding the issue though, the point
where it's stuck in drb is line 575.
564 def load(soc) # :nodoc:
565   begin
566     sz = soc.read(4) # sizeof (N)
567   rescue
568     raise(DRbConnError, $!.message, $!.backtrace)
569   end
570   raise(DRbConnError, 'connection closed') if sz.nil?
571   raise(DRbConnError, 'premature header') if sz.size < 4
572   sz = sz.unpack('N')[0]
573   raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
574   begin
575     str = soc.read(sz)
576   rescue
577     raise(DRbConnError, $!.message, $!.backtrace)
578   end
So it makes sense that the worker would be stuck: some other process
has come along and read from the socket, so this reader is left
waiting for data that never arrives. The call is contained within a
timeout block (not SystemTimer), but that doesn't work with system
calls (another thing to fix).
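One way to bound a read like the one at line 575 without relying on a timeout block is to wait on the socket with `IO.select` and read non-blockingly. The following is a rough sketch under assumptions, not DRb's implementation: `read_exactly`, `read_frame`, and `ReadDeadlineExceeded` are hypothetical names, the framing (4-byte big-endian length, then payload) mirrors the excerpt above, and it assumes Ruby 2.4 or later rather than the 1.8.6 discussed in this thread.

```ruby
class ReadDeadlineExceeded < StandardError; end

# Reads exactly `length` bytes from `io`, giving up once `deadline`
# (a monotonic clock value in seconds) has passed.
def read_exactly(io, length, deadline)
  buffer = "".b
  while buffer.bytesize < length
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if remaining <= 0
      raise ReadDeadlineExceeded, "got #{buffer.bytesize} of #{length} bytes"
    end
    # IO.select returns nil on timeout, so a silent peer cannot block us forever.
    next unless IO.select([io], nil, nil, remaining)
    chunk = io.read_nonblock(length - buffer.bytesize, exception: false)
    raise EOFError, "connection closed" if chunk.nil?
    buffer << chunk unless chunk == :wait_readable
  end
  buffer
end

# Reads one frame: a 4-byte big-endian length header followed by the payload.
def read_frame(io, timeout_seconds)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout_seconds
  size = read_exactly(io, 4, deadline).unpack1("N")
  read_exactly(io, size, deadline)
end
```

With this shape, a reader whose payload was consumed by another process raises `ReadDeadlineExceeded` after the deadline instead of hanging in `read`.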
What doesn't quite make sense to me, is why it should lock up all the
passenger processes. Is there some reason why passenger could end up
blocking indefinitely here across all processes?
On Aug 12, 2:59 pm, Tom Clarke <t...@u2i.com> wrote:
> No other passenger workers handled a request in the meantime, so it
> seems this process was somehow causing a lock up across all the
> processes. Seeing that the issue was in drb, I did a bit more googling
> and found this: http://pennysmalls.com/2009/03/02/using-acts-as-ferret-with-phusion-p...
I just started getting a 404 on the above link, so for future
reference it suggests closing and re-opening the connections in the
environment.rb as follows:
if defined?(PhusionPassenger)
  require 'drb' # make sure DRb::DRbConn is loaded before reopening it

  # Monkey patch DRb so we can close its pooled connections.
  class DRb::DRbConn
    def self.close_all
      @mutex.synchronize do
        @pool.each { |c| c.close }
        @pool = []
      end
    end
  end

  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    if forked
      # We're in smart spawning mode.
      CACHE.reset            # memcached
      DRb::DRbConn.close_all # ferret
    else
      # We're in conservative spawning mode. We don't need to do anything.
    end
  end
end
On Aug 12, 3:51 pm, Tom Clarke <t...@u2i.com> wrote:
> What doesn't quite make sense to me, is why it should lock up all the
> passenger processes. Is there some reason why passenger could end up
> blocking indefinitely here across all processes?
Taking a look around the passenger code to see if there were any
opportunities for locking, I noticed this in:
lib/phusion_passenger/abstract_server.rb
Timeout::timeout(3) do
  Process.waitpid(@pid) rescue nil
end
Having just been burnt by the "timeout isn't always reliable with
syscalls" issue, I'm curious whether there's ever a scenario where this
might not actually time out.
It's just a thought. I certainly haven't traced all the locking code
to tell whether, even if this were a realistic scenario, it could
cause a global lock.
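For comparison, a bounded wait can also be written without a timeout block by polling with `Process::WNOHANG`, which makes the wait call return immediately while the child is still running. This is only a sketch of that alternative, not what Passenger does. `wait_for_child` and its `poll_interval` parameter are hypothetical, and it assumes a Ruby that has `Process.clock_gettime` (2.1 or later).

```ruby
# Waits up to `seconds` for child `pid` to exit. Returns its Process::Status,
# or nil if the child is still running when the deadline passes.
def wait_for_child(pid, seconds, poll_interval = 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  loop do
    # With WNOHANG, wait2 returns nil right away if the child has not exited.
    reaped = Process.wait2(pid, Process::WNOHANG)
    return reaped.last if reaped
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(poll_interval)
  end
end
```

The trade-off is a small polling delay in exchange for never sitting inside a blocking wait call.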
On Aug 12, 5:27 pm, Tom Clarke <t...@u2i.com> wrote:
> On Aug 12, 3:51 pm, Tom Clarke <t...@u2i.com> wrote:
> > What doesn't quite make sense to me, is why it should lock up all the
> > passenger processes. Is there some reason why passenger could end up
> > blocking indefinitely here across all processes?
Sorry to keep responding to myself, but I was rechecking the logs and
realised I had misread them (bad assumptions about their
chronological order).
In fact the other passenger processes were doing fine - just one dead
process due to me not using drb correctly. So, maybe the recent REE
and/or passenger helped there. Given that, I haven't yet had an issue
with passenger/REE with the latest versions which is good news.
Still free of this issue? We just had two Apache server instances stop
responding over the weekend.
We aren't using DRb, so I am unsure of the culprit.
Here is some sample output for what it's worth:
[ pid=16267 file=ext/apache2/Hooks.cpp:638 time=2009-09-06
21:12:45.877 ]:
No data received from the backend application (process 31830) within
300000 msec. Either the backend application is frozen, or your TimeOut
value of 300 seconds is too low. Please check whether your application
is frozen, or increase the value of the TimeOut configuration
directive.
Hey Hongli, any chance to sponsor a fix for this issue?
Thanks,
Chris
I had the same exact problem with Passenger and didn't get any
conclusive answer.
But I've upgraded to Passenger 2.2.5 and haven't had the problem since
(fingers crossed)
This was what made me think there might be a bug in 2.2.4 that might
have been causing the problem
---------------------
[Apache] Fixed I/O timeouts for communication with backend processes
Got rid of the code for enforcing I/O timeouts when reading from or
writing to a backend process. This caused more problems than it
solved.
------------------------
What version of passenger are you running?
On Tue, Sep 8, 2009 at 10:49 PM, findchris <findch...@gmail.com> wrote:
> Hey Hongli, any chance to sponsor a fix for this issue?
Hey Chris.
Well I'm not sure whether this is really a bug in Phusion Passenger. It might also be a problem in your application. In any case, solving the problem requires investigation. If you're interested in commercial support, please email i...@phusion.nl and we can discuss the possibilities.
With kind regards,
Hongli Lai