I recently deployed an app with Passenger and MRI, and I've been
getting seemingly random spikes to 100% CPU usage, which, judging from
some other threads, has happened to others too.
This definitely happens to me and is the main reason why I have to
keep switching from REE back to normal Ruby (I really wish I could use
REE consistently; it's much faster and more memory-efficient).
Does anyone know how to debug an already running Passenger process? It
happens so sporadically that I have no idea how to reproduce it. Is
there some way to attach to a running Passenger process à la gdb?
Btw, Phusion do great work; Passenger is amazing, and once I can use
REE consistently, it will be amazing as well!
Kennon Ballou wrote:
> This definitely happens to me and is the main reason why I have to
> keep switching from REE back to normal Ruby
shmay, Kennon, what platforms are you running REE on?
-- Phusion | The Computer Science Company
Web: http://www.phusion.nl/ E-mail: i...@phusion.nl Chamber of commerce no: 08173483 (The Netherlands)
This happens to us a lot. Some of the things we've tried to do to fix
it:
- used 'conservative' spawning method
- used regular ruby
- wrote a bash script to find and kill these runaway processes (we've
now isolated the runaway processes to machines that handle media
uploads)
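amos's script isn't included in the thread. As a hypothetical illustration of the same idea, here is a small Ruby sketch that filters `ps` output for processes whose command line matches a pattern and whose CPU figure is above a threshold. The pattern `/Rails:|ApplicationSpawner/` and the 90% threshold are assumptions about how your Passenger processes appear in `ps`; adjust them to your setup. Note that `ps`'s `pcpu` column is CPU time divided by the process's lifetime, not an instantaneous reading, so a process that has only just started spinning may not show up yet.

```ruby
# Hypothetical sketch: list (and optionally kill) runaway app processes.
RunawayProcess = Struct.new(:pid, :cpu, :command)

# Parses the output of `ps -eo pid=,pcpu=,args=`, one process per line,
# e.g. "12345 99.5 Rails: /var/www/app", and keeps the lines whose command
# matches +pattern+ and whose %CPU is at least +min_cpu+.
def runaway_processes(ps_output, pattern:, min_cpu: 90.0)
  ps_output.each_line.filter_map do |line|
    pid, cpu, command = line.strip.split(/\s+/, 3)
    next if command.nil? || !command.match?(pattern)
    next if cpu.to_f < min_cpu

    RunawayProcess.new(Integer(pid), cpu.to_f, command)
  end
end

# Command-line use: ruby runaway.rb --list   (or --kill to send SIGTERM)
if __FILE__ == $PROGRAM_NAME && (ARGV.include?('--list') || ARGV.include?('--kill'))
  # The pattern is an assumption about how Passenger names its processes.
  candidates = runaway_processes(`ps -eo pid=,pcpu=,args=`,
                                 pattern: /Rails:|ApplicationSpawner/)
  candidates.each do |proc_info|
    puts "#{proc_info.pid}\t#{proc_info.cpu}%\t#{proc_info.command}"
    Process.kill('TERM', proc_info.pid) if ARGV.include?('--kill')
  end
end
```

Listing first and killing only with an explicit `--kill` makes it easy to capture a backtrace of a candidate before it disappears.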
I tried once to use gdb to identify the issue. I got as far as seeing
an infinite loop in Ruby GC's finalize_list, but that was with REE on
Rails 1.2 (we've since upgraded). I haven't used REE on production
machines since. If you're looking for better performance with regular
Ruby, try applying Stefan Kaes' GC patch.
On Nov 18, 1:03 am, Kennon Ballou <ken...@angryturnip.com> wrote:
> Does anyone know how to debug an already running passenger thread?
amos wrote:
> I tried once to use gdb to identify the issue. I got as far as seeing
> an infinite loop in Ruby GC's finalize_list, but that was with REE on
> Rails 1.2 (we've since upgraded).
Hi Amos.
Do you happen to have a spare staging server around on which you can reproduce this problem? So far we haven't been able to find the cause of this problem, or even to see the problem ourselves. If you saved the gdb backtrace somewhere, we'd be happy to take a look at it.
Also, finalize_list does not have any infinite loops. :) But I don't exclude the possibility that something very weird might be going on which causes finalize_list to freeze anyway.
Finally, what platform are your servers running on?
shmay wrote:
> I'm using regular MRI, on a Linode Ubuntu 360. Is that what you
> meant?
Actually I was referring to Ruby Enterprise Edition.
But if you need to debug an application that has gone crazy, please try to generate a backtrace for it and post it to this mailing list. You can do it as follows:
1. Identify the PID of the process that has gone crazy.
2. Type:
    sudo gdb
    attach 12345    <--- replace "12345" with the actual PID
    thread apply all bt
Please copy & paste the result.
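If attaching interactively is awkward (for example when you want a watchdog script to capture the backtrace before killing the process), gdb can run the same `thread apply all bt` command non-interactively via its `-p`, `-batch` and `-ex` options. The Ruby wrapper below is a sketch; the script name in the usage comment is arbitrary, and attaching normally requires root, which is why the steps above use `sudo`.

```ruby
require 'open3'

# Builds a gdb command line that attaches to +pid+, prints a backtrace of
# every thread, and exits. When gdb exits it detaches from a process it
# attached to, so the target keeps running.
def gdb_backtrace_command(pid)
  pid = Integer(pid) # raises ArgumentError for anything that isn't a number
  ['gdb', '-p', pid.to_s, '-batch', '-ex', 'thread apply all bt']
end

# Runs gdb and returns everything it printed (stdout and stderr combined).
def dump_backtrace(pid)
  output, status = Open3.capture2e(*gdb_backtrace_command(pid))
  warn "gdb exited with status #{status.exitstatus}" unless status.success?
  output
end

# Command-line use: sudo ruby dump_backtrace.rb 12345
if __FILE__ == $PROGRAM_NAME && ARGV.size == 1
  puts dump_backtrace(ARGV.first)
end
```

Passing the command as an array keeps the PID from ever being interpreted by a shell.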
We're using:
CentOS 5 x86_64
Rails 2.1.0 (although the above trace was in Rails 1.2.3)
Some libraries:
FiveRuns' memcache-client with our crc32 extension (http://github.com/fiveruns/memcache-client/tree/master)
our version of data_fabric (which had some issues with passenger
reopening db connections in spawn_server since data_fabric does not
define a :production db connection)
rmagick
hpricot
I can't reproduce this problem on a staging server, only production
ones. I can possibly give you access to a live machine in a hanging
state if you contact me directly.
On Nov 18, 4:10 pm, Hongli Lai <hon...@phusion.nl> wrote:
> But if you need to debug an application that has gone crazy, please try
> to generate a backtrace for it and post it to this mailing list.
We're running REE and Passenger on several 64-bit CentOS 5.2 boxes and have
not experienced this issue. That said, the applications on those boxes do
not use memcache.
Thanks for the link. I seem to have missed that message.
> We're using:
> Centos 5 x86_64
> Rails 2.1.0 (although the above trace was in Rails 1.2.3)
> Some libraries:
> five_runs memcache-client with our crc32 extension (http://github.com/fiveruns/memcache-client/tree/master)
> our version of data_fabric (which had some issues with passenger
> reopening db connections in spawn_server since data_fabric does not
> define a :production db connection)
> rmagick
> hpricot
I've been spending some time stress testing a sample Rails app which uses FiveRuns's memcache-client and RMagick with REE on 64-bit Ubuntu 8.10 server. So far I haven't been able to find any stability issues.
amos wrote:
> I'm running a test with REE again. I'll try to give you another stack
> trace if I see a runaway process.
OK.
The stack trace you gave previously might not be accurate because of compiler optimizations. Could you install REE without optimizations? You can do that by setting the CFLAGS environment variable to an empty string, then running the installer:
    export CFLAGS=
    ./installer
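To double-check that a rebuilt interpreter really was compiled without optimizations, you can ask it for the flags it was built with: `RbConfig::CONFIG['CFLAGS']` holds the compiler flags from the interpreter's own build configuration. The helper below is a small sketch that just picks out the `-O` flags; run it with the REE binary you want to check (e.g. `/path/to/ree/bin/ruby`). An empty result, or only `-O0`, suggests no optimizations were applied. It's a quick sanity check, not a guarantee about how every native extension was built.

```ruby
require 'rbconfig'

# Returns the optimization flags (-O0, -O2, -Os, ...) found in a CFLAGS
# string. By default it inspects the flags the running interpreter was
# built with, as recorded in its own build configuration.
def optimization_flags(cflags = RbConfig::CONFIG['CFLAGS'].to_s)
  cflags.split.grep(/\A-O/)
end

# Command-line use: /path/to/ree/bin/ruby check_cflags.rb
if __FILE__ == $PROGRAM_NAME
  flags = optimization_flags
  puts flags.empty? ? 'no -O flags in CFLAGS' : "built with #{flags.join(' ')}"
end
```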
And I'm also wondering whether your REE installation uses the system Ruby's gems. By default, REE adds the system Ruby's gem path to its own gem path, allowing it to use already-installed gems. But I've received a few reports from users who say that this can cause crashes when REE tries to load a native extension that's compiled for the system's Ruby. So you are advised to reinstall all your gems for REE, e.g. with '/path/to/ree/bin/gem install rmagick'.
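Reinstalling by hand gets tedious with many gems. A sketch under some assumptions: feed it the system Ruby's `gem list --local` output, and it builds one `gem install NAME -v VERSION` command per installed version for REE's own `gem` binary. The REE path in the usage comment is the same placeholder as above, and the parser only handles the usual `name (1.0.0, 0.9.0)` format. It prints the commands rather than running them, so you can review the list first.

```ruby
# Parses `gem list --local` output such as "rmagick (2.7.1, 2.6.0)" into
# { "rmagick" => ["2.7.1", "2.6.0"] }. Header and blank lines are skipped.
def parse_gem_list(output)
  output.each_line.each_with_object({}) do |line, gems|
    match = line.match(/\A(\S+) \(([^)]*)\)/)
    next unless match

    # Drop markers like "default: " and platform suffixes like "x86_64-linux".
    gems[match[1]] = match[2].split(/,\s*/).map { |v| v.sub(/\Adefault:\s*/, '').split.first }
  end
end

# One install command per gem version, for the given (REE) gem binary.
def reinstall_commands(gems, gem_binary)
  gems.flat_map do |name, versions|
    versions.map { |version| [gem_binary, 'install', name, '-v', version] }
  end
end

# Command-line use (prints the commands; review them before running):
#   gem list --local | ruby reinstall_for_ree.rb /path/to/ree/bin/gem
if __FILE__ == $PROGRAM_NAME && ARGV.size == 1
  reinstall_commands(parse_gem_list($stdin.read), ARGV.first).each do |cmd|
    puts cmd.join(' ')
  end
end
```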
unfortunately or fortunately i haven't been able to reproduce that old
infinite loop. i have a hunch it might happen when trying to allocate a bunch
of memory really quickly by loading in a ton of AR objects from a db
table. we've since optimized our code and removed a few memory leaks
by limiting the number of models we initialize at one time. i'll try
to do some debugging with the compilation options you gave me.
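amos doesn't show the code change, but the general idea of limiting how many model objects are alive at once can be sketched as loading and processing records in fixed-size batches. In this sketch, `load_batch` is a hypothetical stand-in for a real query (for example an ActiveRecord find over a list of ids); it's a plain lambda here so the example runs on its own. Batching only lowers peak memory if the caller doesn't hold on to every record it is given.

```ruby
# Loads records +batch_size+ ids at a time and yields them one by one, so at
# most one batch of freshly loaded records is referenced by this method at
# any moment. +load_batch+ is a hypothetical loader: it takes an array of ids
# and returns the matching records.
def each_record_in_batches(ids, batch_size:, load_batch:)
  raise ArgumentError, 'batch_size must be positive' unless batch_size.positive?

  ids.each_slice(batch_size) do |slice|
    load_batch.call(slice).each { |record| yield record }
  end
end

# Usage with a fake loader standing in for the database:
fake_loader = ->(ids) { ids.map { |id| "record #{id}" } }
each_record_in_batches((1..5).to_a, batch_size: 2, load_batch: fake_loader) do |record|
  puts record
end
```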
in the meantime, we still get the 100% usage on our upload servers.
i'm looking at one now. here is the gdb backtrace:
#0 0x00000037f0ad1e5b in lseek64 () from /lib64/libc.so.6
#1 0x00000037f0a6b536 in _IO_new_do_write () from /lib64/libc.so.6
#2 0x00000037f0a6c9b2 in _IO_new_file_xsputn () from /lib64/libc.so.6
#3 0x00000037f0a61d8b in fwrite () from /lib64/libc.so.6
#4 0x00000037f1a50835 in rb_io_fptr_finalize () from /usr/lib64/libruby.so.1.8
#5 0x00000037f1a5673b in rb_io_eof () from /usr/lib64/libruby.so.1.8
#6 0x00000037f1a31731 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#7 0x00000037f1a31c38 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#8 0x00000037f1a3bd86 in rb_apply () from /usr/lib64/libruby.so.1.8
#9 0x00000037f1a3d65d in rb_apply () from /usr/lib64/libruby.so.1.8
#10 0x00000037f1a31753 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#11 0x00000037f1a31c38 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#12 0x00000037f1a3bd86 in rb_apply () from /usr/lib64/libruby.so.1.8
#13 0x00000037f1a3fc0c in rb_apply () from /usr/lib64/libruby.so.1.8
#14 0x00000037f1a3c08b in rb_apply () from /usr/lib64/libruby.so.1.8
#15 0x00000037f1a3d65d in rb_apply () from /usr/lib64/libruby.so.1.8
#16 0x00000037f1a31753 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#17 0x00000037f1a31c38 in rb_exc_jump () from /usr/lib64/libruby.so.1.8
#18 0x00000037f1a3bd86 in rb_apply () from /usr/lib64/libruby.so.1.8
#19 0x00000037f1a3e00a in rb_apply () from /usr/lib64/libruby.so.1.8
....
this is with:
ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
i don't know if it's something specific to our system or passenger,
but i don't think it ever happened with mongrel. any suggestions on
how to debug this one further?
On Nov 20, 1:48 am, Hongli Lai <hon...@phusion.nl> wrote:
> The stack trace you gave previously might not be accurate because of
> compiler optimizations. Could you install REE without optimizations?
amos wrote:
> i don't know if it's something specific to our system or passenger,
> but i don't think it ever happened with mongrel. any suggestions on
> how to debug this one further?
I've been spending some time reading the Ruby source code and glibc source code for clues. I haven't found any so far, but I suspect it might have something to do with the way we handle Unix sockets.
Could you try the following:
1. Create a script, say '/usr/bin/ruby-wrapper-script', which sets PASSENGER_NO_ABSTRACT_NAMESPACE_SOCKETS=1 and executes Ruby, like this:
Over the last two weeks I have also had the 100% CPU usage problem on
my servers. Traffic has gone up during the last two weeks, and the
number of CPUs on these machines has also been increased.
I found out that these cases have always been related to an illegal
seek in my production log. This seems similar to the previously
mentioned backtrace.
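For reference, "Illegal seek" is the standard message for `ESPIPE`, the error `lseek()` reports when the file descriptor refers to a pipe, FIFO or socket rather than a regular file; the top frame of the backtrace posted earlier in the thread is `lseek64` called from `fwrite`. The small Ruby example below only demonstrates what that error is, by seeking on a pipe. Whether the production log's descriptor really ended up pointing at something unseekable is not established in this thread.

```ruby
# Seeking on a pipe fails with ESPIPE ("Illegal seek"), because a pipe has
# no file position. Returns the exception Ruby raises for that failure.
def seek_error_on_pipe
  reader, writer = IO.pipe
  reader.seek(0)
  nil # not reached on a POSIX system: the seek above raises
rescue Errno::ESPIPE => e
  e
ensure
  reader&.close
  writer&.close
end

error = seek_error_on_pipe
puts "#{error.class}: #{error.message}"
```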
MarcoJ wrote:
> I found an interesting post on ep's blog; he had the same problem
> and published a monkey patch that resolved it for him. He mentioned
> that it is related to uploading a file: http://ep.blogware.com/blog/_archives/2008/10/14/3930392.html
> I haven't tried the monkey patch myself yet. Oh, and I am using
> FiveRuns as well, but I am not sure if that is related to the problem.
> Hope this helps,
> Marco
Very interesting. Can others confirm whether this monkey patch works and whether the problem description is correct?
> I just implemented the patch on one of our production machines. I'll
> respond back later if I feel like it's working.
> On Nov 24, 8:20 am, Hongli Lai <hon...@phusion.nl> wrote:
> > Very interesting. Can others confirm whether this monkey patch works and
> > whether the problem description is correct?
I found an ApplicationSpawner process that had been at 99% CPU for over 5 minutes. I had to kill it, but before I did, I did what you asked: https://gist.github.com/1337203