Switchpipe stops responding after a random period of time

ChrisR

Mar 27, 2008, 1:47:26 PM

to SwitchPipe

I have just deployed 7 rails applications using switchpipe onto a
Solaris 11 machine.

I now have a major problem where switchpipe just stops responding to
requests and I have to run "./script/switchpipe restart" to get
it going again. It seems to happen after a random amount of time,
anywhere from 1 to 3 hours. It has never lasted more than 3 hours.

Has anybody come across this before?

Any help is greatly appreciated as I really want to use switchpipe.

Thanks
Chris

Peter Cooper

Mar 27, 2008, 2:46:59 PM

to SwitchPipe

On Mar 27, 5:47 pm, ChrisR <EvilGeen...@gmail.com> wrote:
> I have just deployed 7 rails applications using switchpipe onto a
> Solaris 11 machine.
>
> I now have a major problem where switchpipe just stops responding to
> requests and i have to give it a "./script/switchpipe restart" to get
> it going again. It seems to happen after a random amount of time,
> 1hr~3hours. It has never lasted more than 3 hours.
>
> Has anybody come across this before?

This was a common issue prior to 1.04. I imagine you are running 1.04
or higher (trunk) though, which in my own tests has been running
uninterrupted for over a month now on one of my machines. Someone else
has reported the same issue with the trunk version, however, and it
began after they upgraded Apache. I'm still waiting for more information
from them so I can replicate this issue, but if you can provide more
info, it might help triangulate the issue :) Essentially, I need to
know what's in front of your SwitchPipe. Apache? Nginx?

Once I can replicate the problem, it'll probably be a pretty easy fix.
The pre-1.04 issue was related to some UNIX quirks I wasn't taking
into account, so that could also be the issue here, since I have not
extensively tested on anything other than BSD or Linux so far.

Cheers,
Pete

Jason Stirk

Mar 28, 2008, 12:14:12 AM

to SwitchPipe

Ooh! My ears are burning!

I'm having a similar problem, but it takes anywhere from a few hours to over 24 to appear. The last time I restarted, I think it lasted about 8-10 hours (I just had to restart again).

This is happening on a Gentoo box (2.6.18-xen kernel), Ruby 1.8.6-111. I did notice that the daemons gem I'm running is 1.0.9, whereas 1.0.10 is the latest. All other gems that the site asks to install are at their latest version.

Apache was a minor upgrade - it's now 2.2.8, and was formerly working well on 2.2.7(-something). I'm not certain that this is the cause, but it's one of the only things to have changed that I can think of.

I've just gone through and put in a bunch of logging requests throughout the code in the hope that one of them will lead me to where it's stalling. I'll keep you posted if I find anything interesting.

Best regards,
Jason

ChrisR

Mar 28, 2008, 6:36:30 AM

to SwitchPipe

The Delegate proxy (http://www.delegate.org) is the only thing in
front of switchpipe. When it crashes, I can't even access it directly
(without going through the proxy).

Maybe you could give me a modified version of switchpipe.rb that
created a log file with more information, then I could give you back
the log file today after it hangs.

Thanks
Chris

ChrisR

Mar 28, 2008, 11:32:35 AM

to SwitchPipe

I'm sure it's probably nothing, but it says:
VERSION = "1.03"

at the top of switchpipe.rb, both in the trunk and in v1.04.

Peter Cooper

Mar 31, 2008, 7:43:34 AM

to SwitchPipe

Thanks for the extra updates. I'm travelling at Ruby conferences at
the moment but when I get back I'll rig up SwitchPipe with Apache 2 in
front of it (I have only used it with Apache 1.3) and see if I can
recreate some of these issues.

Cheers,
Pete

Jason Stirk

Mar 31, 2008, 8:03:36 AM

to Peter Cooper, SwitchPipe

Pete,

Brief status report on my logging exercises. The last failure took about 48 hours to come about, which made it a bit slow to work out where it's dying.

Data was still coming in, but a backend instance was never getting launched. I've since added more logging between the point where the request comes in and the point where it's processed, in the hope of working out exactly where it's aborting.

Chris contacted me off list, and I've sent him a diff with the additional logger calls. If his SwitchPipe is dying as often as he says, hopefully it will give us a really good idea of where it's dying without the 48-hour-or-so wait that I'm battling against.

I've attached the diff so that you can stay clued in to where I'm looking. Nothing fancy (just calls to the logger).

Enjoy the conference!

Best regards,
Jason

switchpipe-logging-31Mar2008.diff

ChrisR

Mar 31, 2008, 10:34:14 AM

to SwitchPipe

I applied the diff file and waited for it to hang. Here are the last
few entries of the log file:

D, [2008-03-31T14:58:25.260526 #4153] DEBUG -- : Received data (494 bytes)
D, [2008-03-31T14:58:25.260786 #4153] DEBUG -- : Got request (494 bytes)
D, [2008-03-31T14:58:25.261239 #4153] DEBUG -- : Entered process()
D, [2008-03-31T14:58:26.670501 #4153] DEBUG -- : Received data (698 bytes)
D, [2008-03-31T14:58:26.670759 #4153] DEBUG -- : Got request (698 bytes)
D, [2008-03-31T14:58:26.671257 #4153] DEBUG -- : Entered process()
D, [2008-03-31T14:58:47.779328 #4153] DEBUG -- : Received data (494 bytes)
D, [2008-03-31T14:58:47.779589 #4153] DEBUG -- : Got request (494 bytes)
D, [2008-03-31T14:58:47.780121 #4153] DEBUG -- : Entered process()
D, [2008-03-31T14:58:49.857491 #4153] DEBUG -- : Received data (669 bytes)
D, [2008-03-31T14:58:49.857720 #4153] DEBUG -- : Got request (669 bytes)
D, [2008-03-31T14:58:49.858210 #4153] DEBUG -- : Entered process()
D, [2008-03-31T15:00:04.834368 #4153] DEBUG -- : Received data (1018 bytes)
D, [2008-03-31T15:00:04.834629 #4153] DEBUG -- : Got request (1018 bytes)
D, [2008-03-31T15:00:04.835248 #4153] DEBUG -- : Entered process()

So it's getting stuck after "Entered process()" for every request, which
is logged just before the line

EventMachine.defer(before, after)

The mystery continues. Is it EventMachine's fault?

Thanks

Chris

Jason Stirk

Apr 3, 2008, 10:35:36 PM

to ChrisR, SwitchPipe

Hi Chris,

Sorry about the delay - I've been waiting for things to crash again.

I'm getting exactly the same thing as you. It's like EventMachine.defer() is swallowing the request, never to be seen again.

I've hacked the EventMachine.defer code to give me some logs of what's actually happening. EventMachine starts 20 threads that sit around and watch a Queue that holds all the incoming requests. I'm wondering if something is making these 20 threads slowly die, or block, or something like that. The logging I've added just tells me which thread is handling each request, how many items are waiting, etc.

Hopefully I'll have something to share within a few days.

In the meantime, it's a dirty, dirty hack, but have you thought about restarting switchpipe via cron every hour or so, before it dies? That's what I'm currently looking at as the alternative if we can't track down the bug. Nasty "solution", I know.

All the best,
Jason

Jason Stirk

Apr 4, 2008, 12:12:41 AM

to ChrisR, Peter Cooper, SwitchPipe

Hi all,

I believe I've found the problem, and created a fix. Diff against r40 attached.

I finally managed to reproduce the error once I could see what was happening with the EventMachine threads.

It seems that "GET / HTTP/1.0" requests were being issued to the SwitchPipe process, presumably sent by mod_proxy to check that the back-end was still alive. I remember seeing this in log history before, but I haven't yet been able to find the exact configuration responsible.

When these requests come in, SwitchPipe tries to work out which application to launch: first based upon the host-name of the request (which my particular configuration doesn't define), and then by the path of the request. For a request for "/", the code that works out the path (app_from_path) returns false. This then causes an exception to be raised in App.find_by_path as it tries to call to_sym() on false. The exception kills the Thread, with no notification. As EventMachine creates 20 Threads initially and never replaces them, each request for "/" causes a Thread to die. Eventually, EventMachine runs out of Threads and the requests just keep getting queued up. This is the point at which SwitchPipe stalls.
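
To make the failure mode above concrete, here is a minimal sketch using only Ruby's standard library rather than EventMachine itself. The names app_from_path and simulate_pool, and the pool sizes, are hypothetical stand-ins and not SwitchPipe's or EventMachine's actual code; the sketch only assumes the behaviour described above (a fixed pool of worker threads reading from a Queue, with nothing replacing a worker that dies). It uses report_on_exception, so it needs a reasonably modern Ruby.

```ruby
require "thread"

# Hypothetical stand-in for the path lookup: yields the first path segment,
# or false when the request is for "/" (as described in the post above).
def app_from_path(path)
  path.split("/").reject(&:empty?).first || false
end

# Starts a fixed pool of workers, feeds it `root_requests` requests for "/",
# then one ordinary request, and reports what is left of the pool.
def simulate_pool(pool_size, root_requests)
  queue = Queue.new
  workers = Array.new(pool_size) do
    Thread.new do
      Thread.current.report_on_exception = false # die silently
      # false.to_sym raises NoMethodError, which ends this thread for good.
      loop { app_from_path(queue.pop).to_sym }
    end
  end

  root_requests.times { queue << "/" }
  expected_dead = [root_requests, pool_size].min
  sleep 0.01 until workers.count { |t| !t.alive? } == expected_dead

  queue << "/blog/posts" # an ordinary request arriving afterwards
  result = { alive: workers.count(&:alive?), queued: queue.size }
  workers.each(&:kill)
  result
end
```

Calling simulate_pool(3, 3) leaves no live workers, and the ordinary request stays in the queue forever, which matches the stall seen in the logs: requests are received but never processed.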

The attached patch changes App.find_by_path so that if to_sym cannot be called on the supplied argument, it returns nil. This then causes the rest of the process() code to realise that an application could not be found, and the connection is dropped. It also adds a few extra lines of logging to identify when some errors occur.
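
The actual change is in the attached diff, which is not reproduced in this thread. As a rough illustration only, a guard of the kind described might look like the sketch below. The REGISTRY constant and its contents are hypothetical; only the names App.find_by_path and to_sym come from the description above.

```ruby
class App
  # Hypothetical registry of configured applications, keyed by symbol.
  REGISTRY = { blog: "blog application" }.freeze

  # Returns the matching application, or nil when the argument cannot be
  # turned into a symbol (for example false, as produced for "/").
  def self.find_by_path(name)
    return nil unless name.respond_to?(:to_sym)
    REGISTRY[name.to_sym]
  end
end
```

With a guard like this, App.find_by_path(false) returns nil instead of raising, so the calling code can treat it the same way as any other unknown application.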

With the patch applied, issuing "GET / HTTP/1.0" requests to the SwitchPipe process no longer causes Threads to die, but rather abandons the connection cleanly.
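
For anyone wanting to check this themselves, the probe request can be sent by hand with a few lines of Ruby. This is only a sketch: send_root_probe is a hypothetical helper name, and the host and port are whatever your own SwitchPipe instance listens on.

```ruby
require "socket"

# Sends a bare HTTP/1.0 request for "/" and returns whatever comes back.
# An empty string means the server closed the connection without replying.
def send_root_probe(host, port)
  TCPSocket.open(host, port) do |sock|
    sock.write("GET / HTTP/1.0\r\n\r\n")
    sock.close_write # signal that the request is complete
    sock.read.to_s
  end
end
```

Note that the read blocks until the server closes the connection, so against a stalled process it may never return; wrapping the call in a timeout would be sensible for real use.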

Please let me know if the patch works for you - it appears to be working here. Many thanks to Chris for his help.

Best regards,
Jason Stirk
Achernar Solutions
http://achernarsolutions.com.au/

switchpipe-fix-04Apr2008.diff.txt

ChrisR

Apr 4, 2008, 10:49:50 AM

to SwitchPipe

Hi Jason,

Thanks for the solution. Now hopefully I can redeploy my apps on
switchpipe without worry!

In my network the Delegate proxy (http://www.delegate.org) is being
used instead of Apache's mod_proxy, so I'm assuming delegate exhibits
this same behavior.

I'm applying the patch now so I should see very soon if it works or
not.

Thanks again Jason!

Just out of interest, Jason, how exactly did you figure it out? Did you
just log the GET requests coming in?

Peter Cooper

Apr 4, 2008, 11:08:04 AM

to SwitchPipe

Wow! I've got home and someone has already fixed the problem.. I love
you guys!

I can now appreciate what the issue was. mod_proxy /does/ issue some
sort of "heartbeat" type request to monitor the backends in certain
situations. I think this might be an Apache 2.x versus 1.x thing (or
maybe not, but as I primarily use 1.x, I have not seen it very often).
That would certainly cause the issue you have debugged.

As such, I have committed your fix to trunk and given it a brief test.
Please upgrade to use this new version and let me know how it goes. I
will progress from here to a 1.05 release very shortly.

Cheers,
Peter Cooper
