Weird problem with timeouts on Nginx/Phusion/Redmine

576 views
Skip to first unread message

Dan Sullivan

unread,
Nov 11, 2013, 5:25:35 PM11/11/13
to phusion-...@googlegroups.com
Hi, Everyone,

I have kind of a weird problem with Nginx/Passenger/Redmine.  I don't necessarily think this problem to be Redmine related (my suspicion is that it is related to the integration between Nginx and Passenger, however this has yet to be determined).  To start, I would like to say that this problem appears to be consistently reproducible.  Basically, here is what I do to reproduce the problem:
  1. I load Redmine (i.e. my passenger driven rails app) via a web browser and let it sit on the login screen.
  2. I let my web browser sit there for an hour or so.
  3. I click the reload button.  After about 5-10 minutes the attempt times out and Nginx throws a 500 error.  In certain situations after about 5-10 minutes the page will actually load, although it will never be quick (i.e. the wait duration significantly longer than the app loading into memory).
  4. If I click reload again, the page always loads fine and loads immediately.  This behavior is consistent.
Here is what I have verified:
  1. I am fairly certain that the inbound HTTP request is hitting the web server because when I do a tcpdump, I see traffic coming in on port 443.
  2. Similarly, I am fairly confident that although the inbound HTTP traffic is hitting the server no reply traffic is being sent back out on TCP port 443 (I discern this from inspecting the tcpdump logs as well).
  3. I am tailing both the nginx-access log and nginx-error log, and when this request (when I click reload) hits the server, it does not appear an entry is either written to the nginx-access or nginx-error logs.
  4. When the request does hit the server (i.e. when I click refresh), I definitely see activity in the passenger-debug.log file (immediately), although nothing is transmitted back to the end user.  The log data in this is somewhat cryptic, I have posted it to pastebin in case anybody would like to see it (http://pastebin.com/biZbe7tK).
Also, I would like to note that occasionally I will experience slow load times when I first visit the site, even if I did not have a browser open.  I am a little bit concerned about this because I have configured passenger_pool_idle_time=0 as well as passenger_min_instances, and passenger_pre_start).  I have posted my Nginx configuration here (http://pastebin.com/PrLAp97k).  I can definitely see the applications loaded in memory by typing passenger-memory-stats as root.
At this point I figure my options are:
  1. Recompiling Nginx with debug support to discern more visibility to what is going on in Nginx (presumably this would require rebuilding passenger Phusion).  Based on the output of nginx –V, it does not look like my Nginx binary was built with --debug.  I can do this, but it’s a production system and I’m concerned that even if I do I won’t be able to use this information in meaningful way.
  2. Switch to a different application server and or web server (i.e. Apache and or thin).  I really would like to understand what the actual problem is and resolve it as opposed to taking this course of action, although I am willing to go down this road as a last resort.
  3. Email passenger and or Redmine message boards to see if somebody can help me troubleshoot this further. 
To summarize:

Although I am doing passenger_pre_start,  passenger_pool_idle_time=0,  and passenger_min_instances=high number, and definitely see that Redmine RackApp processes appear to be loaded in memory, I am seeing timeouts when visiting a Redmine  instance.  This appears sometimes occurs sporadically when I either open a browser for the first time, or try to use a session that has been idle on an already open browser for an extended period of time (such as 1 hour).  It seems to me like there is some  breakdown between Nginx and Passenger, although I can’t quite put my finger on the root cause of this issue.

Here are the software versions that I am using:

Redmine 2.3.3
Passenger 4.0.19
Nginx 1.4.2 (built by Passenger).
RHEL 6.4

If anybody could provide some insight into what is actually occurring here or how I could possibly troubleshoot this further I would greatly appreciate it.  Thank you so much I wish you a wonderful day.

Dan Sullivan

Hongli Lai

unread,
Nov 15, 2013, 9:23:29 AM11/15/13
to phusion-passenger
On Mon, Nov 11, 2013 at 11:25 PM, Dan Sullivan <dansu...@gmail.com> wrote:
> I am fairly certain that the inbound HTTP request is hitting the web server
> because when I do a tcpdump, I see traffic coming in on port 443.
> Similarly, I am fairly confident that although the inbound HTTP traffic is
> hitting the server no reply traffic is being sent back out on TCP port 443
> (I discern this from inspecting the tcpdump logs as well).
> I am tailing both the nginx-access log and nginx-error log, and when this
> request (when I click reload) hits the server, it does not appear an entry
> is either written to the nginx-access or nginx-error logs.
> When the request does hit the server (i.e. when I click refresh), I
> definitely see activity in the passenger-debug.log file (immediately),
> although nothing is transmitted back to the end user. The log data in this
> is somewhat cryptic, I have posted it to pastebin in case anybody would like
> to see it (http://pastebin.com/biZbe7tK).

According to the log data, Redmine never returned a response. It's as
if Redmine got stuck. There could be all kinds of reasons why this
happens (database problems, firewall problems, bugs in the app), but
it's outside Passenger's control.

If you're using Enterprise, you may use the passenger_max_request_time
option to kill the Redmine process if it takes too long to respond. It
keeps the site available but doesn't solve the core problem.

If you want to debug the problem in Redmine, try this:
http://blog.phusion.nl/2012/09/21/the-right-way-to-deal-with-frozen-processes-on-unix/

--
Phusion | Ruby & Rails deployment, scaling and tuning solutions

Web: http://www.phusion.nl/
E-mail: in...@phusion.nl
Chamber of commerce no: 08173483 (The Netherlands)

Dan Sullivan

unread,
Nov 19, 2013, 2:42:47 PM11/19/13
to
Hi, Hongli,

Thank you very much for taking the time to respond to me.  I have read through the documentation you provided, and it has been helpful, however I am seeing what I believe to be somewhat of an unusual issue.

Basically, what I did was bound to the actual rack process (from the output of passenger-status), and reproduced the issue.  I actually got some good visibility into what was going on.  When the error presented itself I receievd the following system calls over and over again:

--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = -1 EINTR (Interrupted system call)
read(10, 0x37810c0, 16384)              = -1 EINTR (Interrupted system call)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = -1 EINTR (Interrupted system call)
read(10, 0x37810c0, 16384)              = -1 EINTR (Interrupted system call)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = -1 EINTR (Interrupted system call)
read(10, 0x37810c0, 16384)              = -1 EINTR (Interrupted system call)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = -1 EINTR (Interrupted system call)
read(10, 0x37810c0, 16384)              = -1 EINTR (Interrupted system call)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0x1a)                      = -1 EINTR (Interrupted system call)
read(10,

Based on the limited amount of research I did on this particular issue, I was experiencing the problem documented in the following URL (I am using RHEL 6.4 and verified that the default system ruby is ruby is ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux].).

http://timetobleed.com/ruby-threading-bugfix-small-fix-goes-a-long-way/

I understand that puhsion develop REE http://www.rubyenterpriseedition.com/download.html, but to be perfectly honest since 1.8.7 has fallen out of support, and since I am using a new version of Redmine, I figured it would be easiest to just not use 1.8.7 and instead use RVM to use a newer version of ruby in conjunction with Nginx.  So what I did was build a new RVM as root (i.e. ruby-2.0.0-p247), and reference this in my Nginx configuration, i.e.

passenger_root /usr/local/rvm/gems/ruby-2.0.0-p247/gems/passenger-4.0.24;
passenger_ruby /usr/local/rvm/wrappers/ruby-2.0.0-p247/ruby;

Now, when I do this and look at the passenger-debug logs, it is clear to me, that the new phusion processes are using this particular gem.  Unfortunately, the same problem persists (i.e. I let the browser sit for an hour and then a user clicks refresh and weird timeout-like errors occur).

Now, when I spawn passenger (via running Nginx) I confident my RVM ruby is loading, i.e. the following log entriess are in

Pool2/Spawner.h:411 ]: [App 4289 stdin >>] passenger_root: /usr/local/rvm/gems/ruby-2.0.0-p247/gems/passenger-4.0.24
Pool2/Spawner.h:411 ]: [App 4289 stdin >>] start_command: /usr/local/rvm/wrappers/ruby-2.0.0-p247/ruby    /usr/local/rvm/gems/ruby-2.0.0-p247/gems/passenger-4.0.24/helper-scripts/rack-loader.rb

When I enumerate my passenger processes with passenger_stats, I see a few pids (because I have passenger_min_instances set).  What I will do is trace several processes (i.e. all of the processes enumerated through passenger-stats), for example strace -p 4254 -p 4261 -p 4268 -p 4275 -p 4282 -p 4289 (from the output of the command below).

----------- Application groups -----------
/opt/rails/redmine-2.3.3#default:
  App root: /opt/rails/redmine-2.3.3
  Requests in queue: 0
  * PID: 4254    Sessions: 0       Processed: 11      Uptime: 1h 20m 18s
    CPU: 0%      Memory  : 65M     Last used: 50m 32s
  * PID: 4261    Sessions: 0       Processed: 0       Uptime: 1h 20m 18s
    CPU: 0%      Memory  : 8M      Last used: 1h 20m 1
  * PID: 4268    Sessions: 0       Processed: 0       Uptime: 1h 20m 18s
    CPU: 0%      Memory  : 7M      Last used: 1h 20m 1
  * PID: 4275    Sessions: 0       Processed: 0       Uptime: 1h 20m 18s
    CPU: 0%      Memory  : 7M      Last used: 1h 20m 1
  * PID: 4282    Sessions: 0       Processed: 0       Uptime: 1h 20m 17s
    CPU: 0%      Memory  : 7M      Last used: 1h 20m 1
  * PID: 4289    Sessions: 0       Processed: 0       Uptime: 1h 20m 17s
    CPU: 0%      Memory  : 8M      Last used: 1h 20m 1

#

What happens is when I do a trace of the processes, I get this (and nothing else), even after continuing to use the app.  For some reason it appears that system tracing does not function properly when the app is spawned via RVM.

# strace -p 4254 -p 4261 -p 4268 -p 4275 -p 4282 -p 4289
Process 4254 attached - interrupt to quit
Process 4261 attached - interrupt to quit
Process 4268 attached - interrupt to quit
Process 4275 attached - interrupt to quit
Process 4282 attached - interrupt to quit
Process 4289 attached - interrupt to quit
[pid  4289] select(11, [0 10], NULL, NULL, NULL <unfinished ...>
[pid  4282] select(11, [0 10], NULL, NULL, NULL <unfinished ...>
[pid  4275] select(11, [0 10], NULL, NULL, NULL <unfinished ...>
[pid  4268] select(11, [0 10], NULL, NULL, NULL <unfinished ...>
[pid  4261] select(11, [0 10], NULL, NULL, NULL <unfinished ...>
[pid  4254] select(11, [0 10], NULL, NULL, NULL

As I said, at this point I can use the app and it functions fine, however I never see a single system call from any of these processes.   The problem is of course, if I let it sit for a while, I see the exact same error message that I was seeing before.  Has anybody seen an inability to trace passenger phusion processes when spawned from RVM?  I can trace the processes fine using the system ruby (/usr/bin/ruby).  Thank you so much for taking the time to answer my questions and to help me troubleshoot this problem. 

In the meantime I am going to try REE, just to see what happens...  Thank you again.

Dan Sullivan

Hongli Lai

unread,
Nov 19, 2013, 2:52:28 PM11/19/13
to phusion-passenger
Try running strace with -f. Otherwise strace does not properly trace
system calls by other threads in the process.
> Dan Sullivan
>
>
>
>
> On Friday, November 15, 2013 8:23:29 AM UTC-6, Hongli Lai wrote:
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Phusion Passenger Discussions" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to phusion-passen...@googlegroups.com.
> To post to this group, send email to phusion-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/phusion-passenger.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/phusion-passenger/5c97543d-c449-4021-aaee-5d3a8c22f4ed%40googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

Dan Sullivan

unread,
Nov 19, 2013, 5:38:34 PM11/19/13
to phusion-...@googlegroups.com
Thank you sir.  This worked.

Dan

Janowski Marcin

unread,
Apr 30, 2014, 6:39:53 PM4/30/14
to phusion-...@googlegroups.com
Hello, anyone know how to fix this problem? I have this same issue on Ubuntu 12.04, Ruby 1.9.3 and Redmine 2.5.

Dan Sullivan

unread,
May 8, 2014, 10:17:14 AM5/8/14
to phusion-...@googlegroups.com
For what it is worth after troubleshooting for hours upon end I was never able to resolve this problem (I was the original author of this thread).  One thing I didn't try was making sure that the database server was on the same layer 3 VLAN as the web frontend (not sure if your deployment has a layer 3 hop involved, or if your database is local).  Have you tried putting the DB server on the same box / VLAN, or is it already?  Somebody pretty high up in the Redmine / Passenger Phusion support community on IRC (I can’t remember who) recommended that I localize the database and or remove all firewalls in-between.  I never did this, mostly because I got lazy and tired of looking at this problem.

Also, depending on how long this takes for things to hang up on your system you might be able to set a session activity timeout (screenshot attached).  I’ve never tested this but it might work.

If you find a real / elegant solution to this problem please let me know, I’m still interested :-)

Good luck,

Dan



--
You received this message because you are subscribed to a topic in the Google Groups "Phusion Passenger Discussions" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/phusion-passenger/HAtFaSI3dwY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to phusion-passen...@googlegroups.com.

To post to this group, send email to phusion-...@googlegroups.com.
Visit this group at http://groups.google.com/group/phusion-passenger.
Reply all
Reply to author
Forward
0 new messages