Passenger hangs at PassengerWatchdog on reboot; killing watchdog allows Passenger start to continue

562 views
Skip to first unread message

Tom Harrison

unread,
May 21, 2012, 1:56:51 PM5/21/12
to phusion-...@googlegroups.com
Several weeks ago, I reported a problem here (https://groups.google.com/forum/?fromgroups#!topic/phusion-passenger/RWgxl6sRycY) which was never resolved.

I am now seeing what appears to be a related problem: passenger will not start completely and errors are sent to the /var/log/kern.log.  However, the symtoms are a little different, and I have a reproducible case.

Since my original post, I installed the rails stack, including passenger on an earlier version of Linux (Ubuntu Natty/11-04) running on Amazon EC2, and everything has been fine until May 18th.  I am running a pretty standard Rails application: Apache 2.2.x, MySql 5.1, Ruby 1.9.3 using RVM

After a clean boot, the system processes relating to Apache and Passenger look like this:

0 S root       719   539  0  80   0 -  1056 wait   16:03 ?        00:00:00 /bin/sh /etc/rc2.d/S91apache2 start
0 S root       728   719  0  80   0 -  1056 wait   16:03 ?        00:00:00 /bin/sh /usr/sbin/apache2ctl start
4 S root       733   728  0  80   0 - 22649 wait   16:03 ?        00:00:00 /usr/sbin/apache2 -k start
4 S root       734   733  0  80   0 -  5926 hrtime 16:03 ?        00:00:00 PassengerWatchdog
0 Z root       737   734  0  80   0 -     0 exit   16:03 ?        00:00:00 [PassengerHelper] <defunct>
4 Z nobody     746   734  0  80   0 -     0 exit   16:03 ?        00:00:00 [PassengerLoggin] <defunct>


A trace is generated in kern.log (see below).  After some experimentation, I found that killing the PassengerWatchdog process just after reboot clears up the problem -- Passenger and Apache continue their start and run without any problems.  After this: restarting apache (e.g. sudo apache2ctl restart, sudo /etc/init.d/apache2 restart, sudo service apache2 restart) all work fine.  The problem appears only to occur at boot time.

One clue: I changed the contents of /etc/hostname between when everything was working, and now, when I have this problem is.  Formerly it was the internal DNS name of the server provided by Amazon, now it's a more meaningful name related to our domain.  Based on my notes, and on bash history, this appears to be the only system change I made.  

Could hostname have anything to do with this issue?

(Also, to be clear, this problem does not occur if I disable the Passenger module -- apache starts fine.)

I am mystified.  Any help the leads to understanding or resolution is welcome!

Thanks --

Tom Harrison

=== Trace from kern.log ===

May 21 15:50:48 test kernel: [    3.843709] alignment check: 0000 [#2] SMP 
May 21 15:50:48 test kernel: [    3.843716] last sysfs file: /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map
May 21 15:50:48 test kernel: [    3.843722] CPU 0 
May 21 15:50:48 test kernel: [    3.843726] Modules linked in: acpiphp
May 21 15:50:48 test kernel: [    3.843733] 
May 21 15:50:48 test kernel: [    3.843736] Pid: 723, comm: PassengerWatchd Tainted: G      D     2.6.38-11-virtual #50-Ubuntu  
May 21 15:50:48 test kernel: [    3.843747] RIP: e030:[<ffffffff8100b2b5>]  [<ffffffff8100b2b5>] __setup_rt_frame+0x105/0x360
May 21 15:50:48 test kernel: [    3.843757] RSP: e02b:ffff88002391ddd8  EFLAGS: 00050246
May 21 15:50:48 test kernel: [    3.843762] RAX: ffff88002391dfd8 RBX: ffff88002391dee8 RCX: ffff880023912da0
May 21 15:50:48 test kernel: [    3.843769] RDX: ffff88002391de68 RSI: ffff88002391dfd8 RDI: 0000000000000000
May 21 15:50:48 test kernel: [    3.843774] RBP: ffff88002391de28 R08: 0000000000000000 R09: 0000000000000008
May 21 15:50:48 test kernel: [    3.843780] R10: 0000000000000001 R11: 0000000000000202 R12: 00007f5529a87b38
May 21 15:50:48 test kernel: [    3.843786] R13: ffff88002391df58 R14: 00007f5529a87d00 R15: ffff8800239132d8
May 21 15:50:48 test kernel: [    3.843794] FS:  00007f5529a88700(0000) GS:ffff8800266a8000(0000) knlGS:0000000000000000
May 21 15:50:48 test kernel: [    3.843801] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
May 21 15:50:48 test kernel: [    3.843806] CR2: 000000000048f55f CR3: 00000000021ac000 CR4: 0000000000002620
May 21 15:50:48 test kernel: [    3.843812] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 21 15:50:48 test kernel: [    3.843819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
May 21 15:50:48 test kernel: [    3.843825] Process PassengerWatchd (pid: 723, threadinfo ffff88002391c000, task ffff880023912da0)
May 21 15:50:48 test kernel: [    3.843831] Stack:
May 21 15:50:48 test kernel: [    3.843834]  ffff88002391de68 00007f5529a87b38 ffff880023912da0 0000000a2391df58
May 21 15:50:48 test kernel: [    3.843845]  ffff880023912da0 ffff88002391df58 ffff88002391dee8 00007f5529a889c0
May 21 15:50:48 test kernel: [    3.843856]  000000000000000a 0000000000000003 ffff88002391de58 ffffffff8100b821
May 21 15:50:48 test kernel: [    3.843867] Call Trace:
May 21 15:50:48 test kernel: [    3.843872]  [<ffffffff8100b821>] handle_signal+0x1d1/0x260
May 21 15:50:48 test kernel: [    3.843878]  [<ffffffff8100b923>] do_signal+0x73/0x180
May 21 15:50:48 test kernel: [    3.843885]  [<ffffffff811643a0>] ? vfs_read+0x120/0x180
May 21 15:50:48 test kernel: [    3.843891]  [<ffffffff8116447c>] ? sys_read+0x7c/0x90
May 21 15:50:48 test kernel: [    3.843897]  [<ffffffff8100bab5>] do_notify_resume+0x65/0x80
May 21 15:50:48 test kernel: [    3.843903]  [<ffffffff8105f667>] ? schedule_tail+0x27/0xb0
May 21 15:50:48 test kernel: [    3.843909]  [<ffffffff8100c290>] int_signal+0x12/0x17
May 21 15:50:48 test kernel: [    3.843914] Code: ff 0f 85 67 02 00 00 f6 43 08 04 0f 85 05 02 00 00 65 48 8b 04 25 88 cc 00 00 44 8b 80 88 e0 ff ff c7 80 88 e0 ff ff 00 00 00 00 <48> 8b 05 68 54 ab 00 a9 00 00 00 04 0f 85 89 01 00 00 49 c7 44 
May 21 15:50:48 test kernel: [    3.844001] RIP  [<ffffffff8100b2b5>] __setup_rt_frame+0x105/0x360
May 21 15:50:48 test kernel: [    3.844009]  RSP <ffff88002391ddd8>
May 21 15:50:48 test kernel: [    3.844014] ---[ end trace 47e2fb6364f8c188 ]---

Ronie Gilberto Henrich

unread,
May 21, 2012, 2:17:09 PM5/21/12
to phusion-...@googlegroups.com
I have the same issue and I reported it on Wed Dec 21 2011, but this bug was never solved.
Below is the last email I sent to the list, Cc to Hongli Lai

Best regards,
Ronie
-------- Original Message  --------
Subject: Re: [phusion-passenger] PassengerWatchd /proc/2999/oom_adj is deprecated, please use /proc/2999/oom_score_adj instead
From: Ronie Gilberto Henrich <ro...@ronie.com.br>
To: phusion-...@googlegroups.com
Cc: Hongli Lai <hon...@phusion.nl>
Date: Wed Dec 21 2011 16:30:26 GMT-0500 (Eastern Standard Time)

PassengerHelperAgent (see strace below) also gets into a never ending loop when PassengerWatchd is in a loop.

=====
Process 3007 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x79a9b0, FUTEX_WAKE_PRIVATE, 1)  = 0
tgkill(3007, 3012, SIGUSR1)             = 0
gettimeofday({1324502761, 95625}, NULL) = 0
futex(0x79a9dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 164845, {1324502761, 105625000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x79a9b0, FUTEX_WAKE_PRIVATE, 1)  = 0
tgkill(3007, 3012, SIGUSR1)             = 0
gettimeofday({1324502761, 105951}, NULL) = 0
futex(0x79a9dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 164847, {1324502761, 115951000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x79a9b0, FUTEX_WAKE_PRIVATE, 1)  = 0
tgkill(3007, 3012, SIGUSR1)             = 0
gettimeofday({1324502761, 116330}, NULL) = 0
futex(0x79a9dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 164849, {1324502761, 126330000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x79a9b0, FUTEX_WAKE_PRIVATE, 1)  = 0
tgkill(3007, 3012, SIGUSR1)             = 0
gettimeofday({1324502761, 126691}, NULL) = 0
futex(0x79a9dc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 164851, {1324502761, 136691000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x79a9b0, FUTEX_WAKE_PRIVATE, 1)  = 0
tgkill(3007, 3012, SIGUSR1)             = 0
=====


-------- Original Message  --------
Subject: Re: [phusion-passenger] PassengerWatchd /proc/2999/oom_adj is deprecated, please use /proc/2999/oom_score_adj instead
From: Ronie Gilberto Henrich <ro...@ronie.com.br>
To: phusion-...@googlegroups.com
Cc: Hongli Lai <hon...@phusion.nl>
Date: Wed Dec 21 2011 16:23:21 GMT-0500 (Eastern Standard Time)
"oom_adj is deprecated warning" is gone, but PassengerWatchdog still gets into a never ending loop right after it is started.

I am running passenger 3.0.11, Apache 2.2.21, Gentoo Linux 3.0.6
Is there any other information you need or tests you want me to do to figure out what kernel changes is triggering this loop in PassengerWatchdog?

Thanks,
Ronie



Tom Harrison wrote:
--
You received this message because you are subscribed to the Google Groups "Phusion Passenger Discussions" group.
To view this discussion on the web visit https://groups.google.com/d/msg/phusion-passenger/-/_1wgVUfTscwJ.
To post to this group, send email to phusion-...@googlegroups.com.
To unsubscribe from this group, send email to phusion-passen...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/phusion-passenger?hl=en.

Hongli Lai

unread,
May 21, 2012, 2:29:37 PM5/21/12
to phusion-...@googlegroups.com
I'm guessing this is a kernel bug. Userspace processes should never be
able to crash the kernel to such an extend. However the log output is
completely unclear on what the problem is. I see that it's probably
something related to signal handling but not much other information.

What happens if you postpone the starting of Apache to a later time?
Does that help?
> --
> You received this message because you are subscribed to the Google Groups
> "Phusion Passenger Discussions" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/phusion-passenger/-/_1wgVUfTscwJ.
> To post to this group, send email to phusion-...@googlegroups.com.
> To unsubscribe from this group, send email to
> phusion-passen...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/phusion-passenger?hl=en.



--
Phusion | Ruby & Rails deployment, scaling and tuning solutions

Web: http://www.phusion.nl/
E-mail: in...@phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)

Tom Harrison

unread,
May 21, 2012, 2:37:12 PM5/21/12
to phusion-...@googlegroups.com
I will try to postpone server starting and report back.

Tom


On Monday, May 21, 2012 2:29:37 PM UTC-4, Hongli Lai wrote:
I'm guessing this is a kernel bug. Userspace processes should never be
able to crash the kernel to such an extend. However the log output is
completely unclear on what the problem is. I see that it's probably
something related to signal handling but not much other information.

What happens if you postpone the starting of Apache to a later time?
Does that help?

> To post to this group, send email to phusion-passenger@googlegroups.com.
> To unsubscribe from this group, send email to

Tom Harrison

unread,
May 21, 2012, 3:11:26 PM5/21/12
to phusion-...@googlegroups.com
Postponing server startup by adding a "sleep 60" then a "sleep 120" in the init.d script did not change the behavior.  However as I believe the rc scripts are synchronous, this does not rule out the possibility that a dependency that runs later has not yet executed.

So, I tested manual startup (after everything else had booted).  I got the same behavior. Running /etc/init.d/apache2 hung (same process status as in the failure case), and when I cancelled the init script, the PassengerWatchdog was still running.  Killing PassengerWatchdog, then starting apache again worked.  In other words, I got the same symptoms when running apache manually as when it runs from rc.d bootup scripts.

What state could exist at boot time that prevents Passenger from starting?  While I understand that the logs provide little information, and may indicate a kernel bug, I hope there's a way that I can reasonably work around the issue.  I have run similar configurations on the same version of Ubuntu (the same AMI, I believe) and have not encountered this problem.  Obviously, security fixes and other changes to all software could have introduced this problem. 

I am building a production system, so I need to understand the cause of the problem.

Thank you for helping me make passenger work properly on this far-from-unusual configuration!

Tom Harrison

Hongli Lai

unread,
May 21, 2012, 3:34:57 PM5/21/12
to phusion-...@googlegroups.com
On Mon, May 21, 2012 at 9:11 PM, Tom Harrison <tom.har...@gmail.com> wrote:
> Postponing server startup by adding a "sleep 60" then a "sleep 120" in the
> init.d script did not change the behavior.  However as I believe the rc
> scripts are synchronous, this does not rule out the possibility that a
> dependency that runs later has not yet executed.
>
> So, I tested manual startup (after everything else had booted).  I got the
> same behavior. Running /etc/init.d/apache2 hung (same process status as in
> the failure case), and when I cancelled the init script, the
> PassengerWatchdog was still running.  Killing PassengerWatchdog, then
> starting apache again worked.  In other words, I got the same symptoms when
> running apache manually as when it runs from rc.d bootup scripts.
>
> What state could exist at boot time that prevents Passenger from starting?
>  While I understand that the logs provide little information, and may
> indicate a kernel bug, I hope there's a way that I can reasonably work
> around the issue.  I have run similar configurations on the same version of
> Ubuntu (the same AMI, I believe) and have not encountered this problem.
>  Obviously, security fixes and other changes to all software could have
> introduced this problem.
>
> I am building a production system, so I need to understand the cause of the
> problem.
>
> Thank you for helping me make passenger work properly on this
> far-from-unusual configuration!

I've read that some EC2 machines run a very old version of Xen which
could cause all kinds of kernel crashes.

Could you 'gem install crash-watch' and 'apt-get install gdb' and
'crash-watch --dump <PID_OF_WATCHDOG>' and send me the output?

Tom Harrison

unread,
May 21, 2012, 4:46:21 PM5/21/12
to phusion-...@googlegroups.com
(for other readers, sent crash-dump output by email)

On Monday, May 21, 2012 3:34:57 PM UTC-4, Hongli Lai wrote:

Tom Harrison

unread,
May 22, 2012, 5:45:27 PM5/22/12
to phusion-...@googlegroups.com
(for other readers, crash dump provided no useful information.  Hongli Lai asserts that it must be a kernel bug.)  

I have the following workaround, which is a total hack, but here goes

first: create a script, perhaps in /root/workaround_passenger_hang_at_boot.sh

#!/bin/bash
/bin/date
/bin/echo "Running $0"
/bin/sleep 10
/bin/pidof PassengerWatchdog
if [ $? -eq 0 ]; then
    /usr/bin/killall PassengerWatchdog
    /bin/echo "Killall PassengerWatchdog with return code $?"
else
    /bin/echo "PassengerWatchdog not running"
fi
/usr/sbin/apache2ctl restart
/bin/sleep 5
/bin/pidof apache2
if [ $? -eq 1 ]; then
    /bin/echo "ERROR: Apache not restarted"
else
    /bin/echo "OK: Apache is running"
fi

Then, run that script from /etc/rc.local by adding the line

sh /root/workaround_passenger_boot_hang.sh >> /root/workaround_passenger_boot_hang.log 2>&1

This will append output to a log file.  Seems to be working for me.

Ronie Gilberto Henrich

unread,
May 22, 2012, 6:24:18 PM5/22/12
to phusion-...@googlegroups.com
I've been using also a workaround for this in the last 3 monts, also in rc.local:
kill -9 $(ps -C PassengerWatchdog -o pid --no-headers)
sleep 10
/etc/init.d/apache2 restart


Tom Harrison wrote:
--
You received this message because you are subscribed to the Google Groups "Phusion Passenger Discussions" group.
To view this discussion on the web visit https://groups.google.com/d/msg/phusion-passenger/-/pS-sDqQBLcQJ.
To post to this group, send email to phusion-...@googlegroups.com.
To unsubscribe from this group, send email to phusion-passen...@googlegroups.com.

Hongli Lai

unread,
Jun 1, 2012, 2:22:02 AM6/1/12
to phusion-...@googlegroups.com
I got a bug report at http://code.google.com/p/phusion-passenger/issues/detail?id=761
Do you also happen to be running an x86 userland on an x86_64 kernel?

Tom Harrison

unread,
Jun 1, 2012, 8:14:24 AM6/1/12
to phusion-...@googlegroups.com
not that I know. The bug occurs on a 64-bit instance on Amazon EC2, pretty much straight up Ubuntu installl. I could probably get you access to an instance with the issue, if that would be helpful. 

Tom
--
You received this message because you are subscribed to the Google Groups "Phusion Passenger Discussions" group.
To view this discussion on the web visit https://groups.google.com/d/msg/phusion-passenger/-/z5ufnDmCUhYJ.
Reply all
Reply to author
Forward
0 new messages