I have just deployed 7 rails applications using switchpipe onto a
Solaris 11 machine.
I now have a major problem where switchpipe just stops responding to
requests and I have to run "./script/switchpipe restart" to get
it going again. It seems to happen after a random amount of time,
between 1 and 3 hours. It has never lasted more than 3 hours.
Has anybody come across this before?
Any help is greatly appreciated as I really want to use switchpipe.
On Mar 27, 5:47 pm, ChrisR <EvilGeen...@gmail.com> wrote:
> I have just deployed 7 rails applications using switchpipe onto a
> Solaris 11 machine.
> I now have a major problem where switchpipe just stops responding to
> requests and i have to give it a "./script/switchpipe restart" to get
> it going again. It seems to happen after a random amount of time,
> 1hr~3hours. It has never lasted more than 3 hours.
> Has anybody come across this before?
This was a common issue prior to 1.04. I imagine you are running 1.04
or higher (trunk) though, which in my own tests has been running
uninterrupted for over a month now on one of my machines. Someone else has
reported the same issue with the trunk version, however, and it began
after they upgraded Apache. I'm still waiting for more information
from them so I can replicate this issue, but if you can provide more
info, it might help triangulate the issue :) Essentially, I need to
know what's in front of your SwitchPipe: Apache? Nginx?
Once I can replicate the problem, it'll probably be a pretty easy fix.
The pre-1.04 issue was related to some UNIX quirks I wasn't taking
into account, so that could also be the issue here, since I have not
extensively tested on anything other than BSD or Linux so far.
I'm having a similar problem, but it takes anywhere from a few hours to over 24 to show up. The last time I restarted, I think it lasted about 8-10 hours (I just had to restart it again).
This is happening on a Gentoo box (2.6.18-xen kernel), Ruby 1.8.6-111. I did notice that the daemons gem I'm running is 1.0.9 vs. 1.0.10, which is the latest. All other gems that the site asks you to install are at their latest version.
Apache was a minor upgrade - it's now 2.2.8, and was formerly working well on 2.2.7(-something). I'm not certain that this is the cause, but it's one of the only things to have changed that I can think of.
I've just gone through and put in a bunch of logging requests throughout the code in the hope that one of them will lead me to where it's stalling. I'll keep you posted if I find anything interesting.
Best regards, Jason
On 28/03/2008, Peter Cooper <pcoo...@gmail.com> wrote:
> [...]
The Delegate proxy (http://www.delegate.org) is the only thing in
front of switchpipe. When it crashes, I can't even access it directly
(without going through the proxy).
Maybe you could give me a modified version of switchpipe.rb that
created a log file with more information, then I could give you back
the log file today after it hangs.
On Mar 28, 4:32 pm, ChrisR <EvilGeen...@gmail.com> wrote:
> I'm sure it's probably nothing but it says:
> VERSION = "1.03"
> at the top of the switchpipe.rb in the trunk and in v1.04
Thanks for the extra updates. I'm travelling at Ruby conferences at
the moment but when I get back I'll rig up SwitchPipe with Apache 2 in
front of it (I have only used it with Apache 1.3) and see if I can
recreate some of these issues.
Brief status report on my logging exercises. Last failure took about 48 hours to come about, which made it a bit slow to work out where it's dying.
Data was still coming in, but a backend instance was never getting launched. I've since added more logging between the point where the request comes in and the point where it's processed, in the hope of working out exactly where it's aborting.
Chris contacted me off list, and I've sent him a diff with the additional logger calls. If his SwitchPipe is dying as often as he says, hopefully it will be able to give us a really good idea of where it's dying without the 48 hour-or-so wait that I'm battling against.
I've attached the diff just so that you can stay clued in to where I'm looking. Nothing fancy (just calls to the logger).
Enjoy the conference!
Best regards, Jason
On 31/03/2008, Peter Cooper <pcoo...@gmail.com> wrote:
> [...]
Sorry about the delay - I've been waiting for things to crash again.
I'm getting exactly the same thing as you. It's like EventMachine.defer() is swallowing the request, never to be seen again.
I've hacked the EventMachine.defer code to give me some logs of what's actually happening. EventMachine starts 20 threads that sit around and watch a Queue that has all the incoming requests. I'm wondering if something's happening that's making these 20 threads slowly die, or block or something like that. The logging I've added just tells me which thread is handling each request, how many items are waiting, etc.
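For anyone following along, here is a minimal sketch of the kind of instrumentation described above. It is not EventMachine's code: `LoggedPool` and its methods are hypothetical names, and the pool is a plain Ruby stand-in (a fixed set of threads popping jobs off a shared Queue) that logs which thread took each job and how many jobs are still waiting.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for a deferred-work pool: a fixed set of threads
# popping jobs off a shared Queue. This is NOT EventMachine's actual code.
class LoggedPool
  def initialize(size, logger)
    @queue  = Queue.new
    @logger = logger
    @threads = Array.new(size) do |i|
      Thread.new do
        # Queue#pop blocks until a job arrives; nil is our stop signal.
        while (job = @queue.pop)
          @logger.info "thread #{i} took a job, #{@queue.size} still waiting"
          job.call
        end
      end
    end
  end

  # Hand a block to the pool, roughly what a defer-style call does.
  def defer(&job)
    @queue << job
  end

  # How many worker threads are still running.
  def alive_count
    @threads.count(&:alive?)
  end

  # Push one stop signal per thread, then wait for all of them to exit.
  def shutdown
    @threads.size.times { @queue << nil }
    @threads.each(&:join)
  end
end
```

Logging `alive_count` periodically alongside the per-job lines would show directly whether the pool is shrinking over time.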
Hopefully I'll have something to share within a few days.
In the meantime, it's a dirty dirty hack, but have you thought about restarting switchpipe via cron every hour or so, before it dies? That's what I'm currently looking at as the alternative if we can't track down the bug. Nasty "solution", I know.
All the best, Jason
On 01/04/2008, ChrisR <EvilGeen...@gmail.com> wrote:
I believe I've found the problem, and created a fix. Diff against r40 attached.
I finally managed to reproduce the error once I could see what was happening with the EventMachine threads.
It seems that "GET / HTTP/1.0" requests were being issued to the SwitchPipe process, presumably sent by mod_proxy trying to check that the back-end was still alive. I remember seeing log history of this before, but I'm unable to find the exact configuration as of yet.
When these requests come in, SwitchPipe tries to work out which application to launch: first based upon the host-name of the request (which my particular configuration doesn't define), and then by the path of the request. The code that works out the path (app_from_path) will return false for a request for "/". This then causes an exception to be raised in App.find_by_path as it tries to call to_sym() on false. The exception kills the Thread, with no notification. As EventMachine creates 20 Threads initially and never replaces them, each request for "/" causes a Thread to die. Eventually, EventMachine runs out of Threads and the requests just keep getting queued up. This is when SwitchPipe stalls.
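The failure mode described above can be reproduced in miniature with plain Ruby threads. This is only a sketch under assumptions: `DEMO_APPS`, `find_by_path_unpatched` and `simulate_pool` are hypothetical names, the pool is not EventMachine's, and `report_on_exception` is switched off to mimic the silent thread death seen on Ruby 1.8.

```ruby
# Hypothetical stand-in for SwitchPipe's table of applications.
DEMO_APPS = { blog: "blog app" }

# Unpatched lookup: calling to_sym on false raises NoMethodError.
def find_by_path_unpatched(path)
  DEMO_APPS[path.to_sym]
end

# Run a fixed pool of workers over the given paths and report how many
# workers survive and how many requests are left waiting in the queue.
def simulate_pool(pool_size, paths)
  queue = Queue.new
  paths.each { |path| queue << path }

  workers = Array.new(pool_size) do
    Thread.new do
      # Mimic the silent failure: do not print the exception on death.
      Thread.current.report_on_exception = false
      # An uncaught exception escapes the loop and ends this thread.
      loop { find_by_path_unpatched(queue.pop) }
    end
  end

  # Wait until every live worker is idle, or no worker is left at all.
  loop do
    alive = workers.count(&:alive?)
    break if (queue.empty? || alive.zero?) && queue.num_waiting == alive
    sleep 0.01
  end

  result = { alive: workers.count(&:alive?), queued: queue.size }
  workers.each(&:kill)
  result
end
```

Calling `simulate_pool(2, [false, false, "blog"])` leaves no live workers and one request stuck in the queue, which is the stall in small scale: the bad requests use up the pool, and everything after them waits forever.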
The attached patch changes App.find_by_path so that if to_sym cannot be called on the supplied argument, it returns nil. This then causes the rest of the process() code to realise that an application could not be found, and the connection is dropped. It also adds a few extra lines of logging to identify when some errors occur.
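As a self-contained illustration of the guard in the patch, here is a cut-down sketch. Only the body of `find_by_path` is taken from the diff; the surrounding `App` class, the `register` helper and the sample data are hypothetical and exist only so the snippet runs on its own.

```ruby
# Cut-down, hypothetical version of the App class; only find_by_path
# mirrors the patched method from the diff.
class App
  @@apps = {}

  # Hypothetical helper so the example has something to look up.
  def self.register(name, app)
    @@apps[name.to_sym] = app
  end

  # Return an app matching on the path, or nil when the argument
  # cannot be symbolized (for example false, as returned for "/").
  def self.find_by_path(path)
    @@apps[path.to_sym] if path.respond_to?(:to_sym)
  end
end
```

With this guard, `App.find_by_path(false)` evaluates to nil rather than raising, so the caller can treat it the same way as an unknown application.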
With the patch applied, issuing "GET / HTTP/1.0" requests to the SwitchPipe process no longer causes Threads to die, but rather abandons the connection cleanly.
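To check this on your own setup, a bare request like the one above can be sent with a few lines of Ruby. This is a rough sketch: `probe_request` and `send_probe` are hypothetical helper names, and the host and port are whatever your SwitchPipe instance listens on (not given in this thread).

```ruby
require 'socket'

# The bare request that triggered the problem: no Host header, path "/".
def probe_request
  "GET / HTTP/1.0\r\n\r\n"
end

# Send the probe to the given host and port and return whatever comes
# back before the other side closes the connection.
def send_probe(host, port)
  TCPSocket.open(host, port) do |sock|
    sock.write(probe_request)
    sock.read
  end
end
```

Sending the probe more times than there are worker threads, and then making a normal request, should show whether an unpatched instance stalls.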
Please let me know if the patch works for you - it appears to be working here. Many thanks to Chris for his help.
> On 01/04/2008, ChrisR <EvilGeen...@gmail.com> wrote:
> > I applied the diff file and waited for it to hang, here are the last
> > few entries of the log file:
   # Return an app matching on the path
   def self.find_by_path(path)
-    @@apps[path.to_sym]
+    # NOTE: We need to ignore when we can't symbolize the path, as
+    # app_from_path() returns FALSE in some cases
+    @@apps[path.to_sym] if path.respond_to?(:to_sym)
   end
   # Return an app matching on the hostname
@@ -419,6 +421,7 @@
   rescue
     # When all else fails..
+    LOG.error "Caught #{$!} - close_connection called"
     close_connection
   end