rails3: running script/delayed_job immediately dies


nathanvda

Sep 15, 2010, 3:53:36 AM
to delayed_job
I am using Rails 3 and delayed_job (the version from git). In
development all was fine and dandy, as I was using rake jobs:work.

Now in production, if I start script/delayed_job, it always dies
immediately. The daemon starts, I get a pid, I see that the log is
opened, and then the process is gone.

There is no outstanding work waiting.

If I run RAILS_ENV=production rake jobs:work, it does work.

I have changed the last line in script/delayed_job as follows:

Delayed::Command.new(ARGV).run

so no daemonize, and then it works (but not as a daemon, of course).
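
For reference, the stock generated script ends with a daemonize call,
so the change is just (quoting from memory):

    # Delayed::Command.new(ARGV).daemonize  # original last line, via the Daemons gem
    Delayed::Command.new(ARGV).run          # changed: run in the foreground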

Does anybody have any clues?

Sung Pae

Sep 15, 2010, 8:47:24 AM
to delay...@googlegroups.com
On Sep 15, 2:53 am, nathanvda <nathan...@gmail.com> wrote:

> Does anybody have any clues?

tl;dr: try this for now:
http://github.com/guns/delayed_job/blob/delayed_job_daemon/lib/delayed/daemon_tasks.rb

There are a lot of people having the same issue as you. It seems the
Daemons-backed `delayed_job' script has developed some issues in the
push towards 2.1.0.

Presently there are two solutions:

1) Run `rake jobs:work' in production and use a process monitoring
daemon like monit or god to manage your worker processes

2) Rearchitect the delayed_job script as a classic pre-forking daemon
that dispatches jobs to worker processes. See
http://github.com/collectiveidea/delayed_job/issues#issue/25

The second solution is the one we should be pushing towards, but it
doesn't exist presently. When it is developed, though, I suspect the
Daemons library will not be involved; I believe it is at the center of
this issue.

The design of the Daemons gem hinges upon using process groups and
pidfiles to manage daemonized scripts. Besides being brittle, it seems
like an odd decision, since forking daemons usually involve a master
process that forks child processes and manages them with SIGTERM,
SIGHUP, and SIGCLD.

However, Daemons does not fork proper children, but instead double-forks
standalone daemons, managing them via signals to the process
group. Also, if a monitoring process is spawned, it checks whether
your processes are still alive every 30s. This is necessary because
the daemons process neither receives SIGCLD signals nor can use a
simple blocking `Process.wait' call to detect the death of child
processes.

Finally, though it may not actually be part of the problem, the author
of the gem consistently traps signals without reraising them or
restoring the default signal handlers. If you work with processes, you
should always propagate signals (see
http://www.cons.org/cracauer/sigint.html); `exit(130)` is not the same
as `Process.kill :TERM, $$`.
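
To make the difference concrete (a minimal Ruby illustration; `cleanup'
is a placeholder):

    def cleanup
      # release locks, flush logs, etc.
    end

    # Swallowing the signal: the parent only sees exit status 130
    # trap(:INT) { cleanup; exit 130 }

    # Propagating the signal: the parent sees a process killed *by* SIGINT
    trap(:INT) do
      cleanup
      trap(:INT, 'DEFAULT')            # restore the default handler
      Process.kill(:INT, Process.pid)  # re-send the signal to ourselves
    end

    sleep # wait for a signal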

I don't mean to harsh on the Daemons gem. It seems like it could work
fine for simple scripts, but I don't think I would write a daemonizing
library in the same way. If I did write such a library, these are the
steps I would follow (a rough sketch in code appears after the list):

* silence the standard streams
* double fork the master to detach from the controlling process
* call setsid(2) to create a new process group
* create some child processes
* resurrect dead children on receipt of SIGCLD, or when wait(2)
returns a pid
* restart children on SIGHUP
* reopen logfiles on SIGUSR1 (optional but nice)
* terminate and reap children on SIGTERM, restore the default handler
for TERM, and resend TERM to self
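
A rough sketch of that lifecycle in Ruby (illustrative only; pidfile
handling, logging, and the SIGUSR1 logfile reopening are omitted):

    #!/usr/bin/env ruby
    WORKER_COUNT = 2

    def run_worker
      trap(:TERM) { exit 0 }
      loop { sleep 1 } # a real worker would reserve and run jobs here
    end

    # silence the standard streams
    $stdin.reopen '/dev/null'
    $stdout.reopen '/dev/null', 'a'
    $stderr.reopen '/dev/null', 'a'

    # double fork to detach, with setsid(2) in between for a new session
    exit!(0) if fork
    Process.setsid
    exit!(0) if fork

    children = Array.new(WORKER_COUNT) { fork { run_worker } }
    shutdown = false

    # restart children on SIGHUP: TERM them and let the wait loop respawn
    trap(:HUP)  { children.each { |pid| Process.kill(:TERM, pid) rescue nil } }

    # terminate and reap children on SIGTERM, then propagate TERM to self
    trap(:TERM) do
      shutdown = true
      children.each { |pid| Process.kill(:TERM, pid) rescue nil }
    end

    # resurrect dead children when wait(2) returns a pid
    until shutdown && children.empty?
      pid = Process.wait
      children.delete(pid)
      children << fork { run_worker } unless shutdown
    end

    trap(:TERM, 'DEFAULT')
    Process.kill(:TERM, Process.pid)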

--

Right now I am waiting for Brandon to finalize the API for 2.1.0, and
after that, time permitting, I would like to attempt to write a
replacement for script/delayed_job.

Until then however, I do have a forking daemon that I wrote some weeks
ago for a file-uploading site. It doesn't dispatch jobs like the final
solution should (the workers just fight for a db lock), but it does
reliably launch and manage multiple worker processes. You can see the
code here (I recommend reading the header):

http://github.com/guns/delayed_job/blob/delayed_job_daemon/lib/delayed/daemon_tasks.rb

It's just a beginning, it's not beautiful, and it should be more
configurable, but I have it running flawlessly in production right
now. Also, I've been keeping it up to date with all of Brandon's
changes.

Directions:

Point your Gemfile at the repo:

    gem 'delayed_job', :git    => 'git://github.com/guns/delayed_job.git',
                       :branch => 'delayed_job_daemon'

Run as a rake task:

    WORKERS=n RAILS_ENV=production rake jobs:daemon:start

where `n' is the number of processes you'd like to spawn.

Further directions are in the header of `daemon_tasks.rb', the file
linked above.

Cheers,
guns

Brandon Keepers

Sep 15, 2010, 10:18:22 AM
to delay...@googlegroups.com
guns,

wow, you are my hero.  Very insightful post.  See comments below…

On Sep 15, 2010, at 8:47 AM, Sung Pae wrote:

> There are a lot of people having the same issue as you. It seems the
> Daemons-backed `delayed_job' script has developed some issues in the
> push towards 2.1.0.

There have been issues for a LONG time, it's just getting worse. :)


> I don't mean to harsh on the Daemons gem. It seems like it could work
> fine for simple scripts, but I don't think I would write a daemonizing
> library in the same way. If I did write such a library, these are the
> steps I would follow:
>
> * silence the standard streams
> * double fork the master to detach from the controlling process
> * call setsid(2) to create a new process group
> * create some child processes
> * resurrect dead children on receipt of SIGCLD, or when wait(2)
>   returns a pid
> * restart children on SIGHUP
> * reopen logfiles on SIGUSR1 (optional but nice)
> * terminate and reap children on SIGTERM, restore the default handler
>   for TERM, and resend TERM to self

Are there any libraries out there that do this well? Unicorn is the only example I've seen, but it's a little too dense for me to wrap my mind around, no matter how many times I read http://tomayko.com/writings/unicorn-is-unix.

I would love to see someone write a great blog post about how to do this properly. *hint hint*

> Right now I am waiting for Brandon to finalize the API for 2.1.0, and
> after that, time permitting, I would like to attempt to write a
> replacement for script/delayed_job.

The single biggest issue with delayed_job right now is the daemon, so I'm happy to put all other work aside to focus on this.  Heck, if we could get this thing rewritten soonish, I'd love to just skip 2.1 and go straight to 3.0 with a killer daemon.

Here are my only requirements:

* preforking worker
* master dispatches jobs to the worker processes




Thanks for your great work on this.  My "unix process management"-fu is weak, so I really appreciate you heading this up.

=b

David Genord II

Sep 15, 2010, 11:35:16 AM
to delay...@googlegroups.com
I have actually done a second go-round at just this. For this round I modeled my work on Phusion Passenger. I still have to rework the test suite, but I have been running it on our production server for about a month with no issues.

The basic overview is that script/delayed_job creates a server which spawns a producer to find and reserve jobs. These jobs are sent back to the server, which spawns consumers on demand to run them. When the consumers are no longer needed, they are killed by the server. This also changes the meaning of -n to the maximum number of consumers.
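
In spirit, the shape is something like this toy sketch (not the actual branch code; the real work is in the branch linked below):

    reader, writer = IO.pipe

    producer = fork do                 # producer: finds and reserves jobs
      reader.close
      5.times { |id| writer.puts id }  # stand-in for reserved job ids
      writer.close
    end

    writer.close
    reader.each_line do |job_id|       # server: spawn a consumer per job
      Process.wait(fork { puts "consumer #{Process.pid} ran job #{job_id.chomp}" })
    end
    Process.wait(producer)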

I like the suggestions for handling the different signals as well, so I will probably try and work that in when I get the chance.

One of the other things I would like to implement with this change is a configuration file, so that you could do something like the following and the server would handle all the scheduling:
max_total_consumers: 10
pools:
- :max_consumers: 8
  :min_consumers: 1
  :max_priority: 0
  :min_priority: 5
- :max_consumers: 6
  :min_consumers: 2
  :max_priority: 6
  :min_priority: 10
- :max_consumers: 3
  :min_consumers: 0
  :max_priority: 11

This would allow the server to juggle fluctuating workloads at different priorities without needing to commit a complete DJ stack to each priority level or load more consumers than you want.

The work I have done so far can be found at http://github.com/xspond/delayed_job/tree/multi-process-rework

David Genord II

Sung Pae

Sep 15, 2010, 5:35:28 PM
to delay...@googlegroups.com
On 15 Sep 2010, at 9:18 AM, Brandon Keepers wrote:

> Are there any libraries out there that do this well? Unicorn is the
> only example I've seen, but it's a little too dense for me to wrap my
> mind around, no matter how many times I read
> http://tomayko.com/writings/unicorn-is-unix.

I don't know of any ruby libraries, but writing a forking Unix daemon
is really well-travelled territory, and a big project like delayed_job
would really benefit from pulling the logic in-house.

> The single biggest issue with delayed_job right now is the daemon,
> so I'm happy to put all other work aside to focus on this. Heck, if
> we could get this thing rewritten soonish, I'd love to just skip 2.1
> and go straight to 3.0 with a killer daemon.

That would be awesome! My schedule has been tight recently, though, so
I don't know if I'd be able to get a final version out the door before
the end of the month. We should look at David's branch and try
polishing that if he's already most of the way there.


On 15 Sep 2010, at 10:35 AM, David Genord II wrote:

> I have actually done a second go round at just this. For this round
> I modeled my work off of phusion passenger.

Extracting from Passenger sounds like an excellent idea. I cloned your
branch and did a quick `git diff -w v2.1.0.pre' to scan your code.
Unfortunately, I wasn't able to run script/delayed_job from your
branch (I think it has a conflict with the Rails 3 release or Ruby
1.9.2), but I think the overall design is solid. I do have some
nitpicks wrt Passenger's paranoid signal handling, and I think the
command line script needs to be overhauled, but these are little
issues.

One clarification:

> The basic overview is that script/delayed_job creates a server which
> spawns a producer to find and reserve jobs. These jobs are sent back
> to the server, which spawns consumers on demand to run them.
>
> One of the other things I would like to implement with this change
> is a configuration file, so that you could do something like the
> following and the server would handle all the scheduling

So the goal would be to have a machine-global spawn server that can
handle multiple producers for different apps, and then manage the
global worker pool based on an rc file? That would be really awesome.

It would require, though, that the delayed_job script be divorced from
any single app, in favor of a single master script with a single
global config file. This would make administration of delayed_job dead
simple. :)

Is your branch mostly there, or would you like some help implementing
the details?

guns

David Genord II

Sep 15, 2010, 7:44:50 PM
to delay...@googlegroups.com
The conflict must be against Ruby 1.9.2. I'll see what I can find out.

The config setup and spawn server would not be machine global, just within the app. While what you describe would be more useful, it would also be exponentially more difficult.

The functionality in my branch is mostly complete, but I need to do a lot of work on the test suite. Also, I think Brandon recently pulled the alternate backends out into their own gems, so I need to catch up with that. I'll dig through it tonight and see what I need and what I could use help with.

David Genord II

Sung Pae

Sep 15, 2010, 9:23:12 PM
to delay...@googlegroups.com
On 15 Sep 2010, at 6:44 PM, David Genord II wrote:

> The config setup and spawn server would not be machine global, just within the app. While what you describe would be more useful, it would also be exponentially more difficult.

Oh I don't think it would be _exponentially_ more difficult. :)

> I'll dig through it tonight and see what I need and what I could use help with.

Well then I'll hang back and keep an eye on your branch. I hope we'll have a nice solution for the next release.

guns

David Genord II

Sep 15, 2010, 11:36:48 PM
to delay...@googlegroups.com
Spec suite passes! I still need specs for daemonizing, but I did generate a new Rails 3 app, loaded a simple job, and successfully ran both script/delayed_job and rake jobs:work. Everything worked, including log file output. I was not able to load 1.9.2 on my home computer (rvm was complaining about not being able to connect to the download sites), so I will give that a whirl at the office, hopefully Thursday.
For now good night,

David Genord II

David Genord II

Sep 16, 2010, 10:18:13 AM
to delay...@googlegroups.com
I just tested Ruby 1.9.2-p0 with Rails 3 and everything worked. What OS/DB/backend were you testing my branch against?

If we are going to proceed with using my branch, I would suggest we add a couple of deprecations to 2.1 (for example, work_off does nothing in my branch and will be removed), and then release 3.0 pretty quickly on the heels of 2.1. This gives people who want the old-style workers most of the new features, and lets us ship a fairly substantial internal change as the next major release.
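
A deprecation shim might look something like this (purely illustrative; work_off is the existing Worker method):

    # in Delayed::Worker
    def work_off(num = 100)
      warn '[DEPRECATION] work_off is a no-op in the new daemon and will be removed in 3.0'
    end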

David Genord II

Sung Pae

Sep 16, 2010, 10:30:59 AM
to delay...@googlegroups.com
On 16 Sep 2010, at 9:18 AM, David Genord II wrote:

> I just tested ruby 1.9.2-p0 with rails 3 and everything worked. What
> OS/DB/Backend were you testing my branch against?

It was on OS X 10.6 / MySQL 5 / ActiveRecord.

I didn't spend much time looking into it; I'll give it another go
later today, with a new app and an existing one.

nathanvda

Sep 17, 2010, 8:10:20 AM
to delayed_job
Hi guys,

it's awesome how much work you guys have put into this area. Thank you
for your efforts!

I did a test with

    gem "delayed_job", :git => 'http://github.com/xspond/delayed_job.git', :branch => 'multi-process-rework'

first, because I understood this thread to be aiming towards that
solution. But then I noticed that, while it does process jobs
correctly, the script/daemon does not return and keeps running.
Is that the intention? (I am using Ruby REE 1.8.7 and Rails 3.)
How should I use it then, if I want to start a job through
Capistrano? (I am a bit of a Linux newbie.)
Should I start a new shell, with that command, as a background job?
But since it depends on the Capistrano SSH session, it will die once
that session is done. Any suggestions? I must be overlooking
something obvious, I guess.

Then I tried the option guns created:

    gem "delayed_job", :git => 'http://github.com/guns/delayed_job.git', :branch => 'delayed_job_daemon'

and, using the rake tasks

    rake jobs:daemon:start

and

    rake jobs:daemon:stop

I can see a new process being created nicely, with a worker process.
The only trouble is that this process seems to kill itself continuously
without ever doing anything. I do not know where to look to see what
could be going wrong. If I look in delayed_job.development.log, I see
something like:

2010-09-17T13:50:39+0200: [delayed_worker.master] Spawning 1 worker(s)
2010-09-17T13:50:40+0200: [Worker(host:nvaws pid:10125)] Starting job worker
2010-09-17T13:50:44+0200: [delayed_worker.master] SIGHUP received! Restarting workers.
2010-09-17T13:50:44+0200: [Worker(host:nvaws pid:10125)] Exiting...
2010-09-17T13:50:45+0200: [delayed_worker.master] SIGHUP received! Restarting workers.
2010-09-17T13:50:45+0200: [delayed_worker.master] Restarting dead worker: delayed_worker.0
2010-09-17T13:50:46+0200: [Worker(host:nvaws pid:10129)] Starting job worker
2010-09-17T13:50:54+0200: [delayed_worker.master] SIGHUP received! Restarting workers.
2010-09-17T13:50:54+0200: [Worker(host:nvaws pid:10129)] Exiting...
2010-09-17T13:50:56+0200: [delayed_worker.master] SIGHUP received! Restarting workers.
2010-09-17T13:50:56+0200: [delayed_worker.master] Restarting dead worker: delayed_worker.0
2010-09-17T13:50:57+0200: [Worker(host:nvaws pid:10132)] Starting job worker
2010-09-17T13:51:04+0200: [delayed_worker.master] SIGHUP received! Restarting workers.
2010-09-17T13:51:04+0200: [Worker(host:nvaws pid:10132)] Exiting...

continuously. Any clues why that might be?

If I do

    rake jobs:work

it does work.

So I am a bit at a loss as to which option I should take, or how to
proceed from here. I hope you guys can shed some light. :)

Thanks in advance!

David Genord II

Sep 17, 2010, 8:31:33 AM
to delay...@googlegroups.com
On what platform are you running "./script/delayed_job start" against the xspond rework and not having it detach from the console?

David Genord II

nathanvda

Sep 17, 2010, 8:51:36 AM
to delayed_job
A small addition: I did a deploy to my production environment, and
there the solution from guns is working fine.

Any hints/tips/suggestions on how I could trace down what goes wrong
in my development environment?

David Genord II

Sep 17, 2010, 10:15:48 AM
to delay...@googlegroups.com
Are you running REE on both machines, and at what patch level? What OS/version is each system running?

David Genord II

Sung Pae

Sep 17, 2010, 3:01:27 PM
to delay...@googlegroups.com
On 17 Sep 2010, at 7:10 AM, nathanvda wrote:

> i can see a new process being created nicely, with a worker process.
> Only the trouble is: this process seems to kill itself continuously
> without ever doing anything.

> 2010-09-17T13:50:44+0200: [delayed_worker.master] SIGHUP received!
> Restarting workers.
> [...]

> A small addition: I did a deploy to my production environment, and
> there the solution from guns is working fine.

Oops!

Well, we are moving towards David's branch, but if you're still interested in kicking around with mine, I pushed a bugfix that corrects the issue you were having.

Currently the master polls tmp/restart.txt for timestamp updates and sends a SIGHUP to itself on update (like Passenger). I thought I was correctly handling the case where tmp/restart.txt didn't exist, but I was wrong! It's fixed now.
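
The mechanism is roughly this (a simplified sketch, not the actual branch code):

    restart_file = 'tmp/restart.txt'
    last_mtime = File.exist?(restart_file) ? File.mtime(restart_file) : nil

    loop do
      sleep 5
      mtime = File.exist?(restart_file) ? File.mtime(restart_file) : nil
      if mtime && mtime != last_mtime
        last_mtime = mtime
        Process.kill(:HUP, Process.pid) # the master restarts its workers on HUP
      end
    end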

guns

Brandon Keepers

Sep 18, 2010, 10:20:31 PM
to delayed_job
David and guns,

First, great work so far on the new daemon.

I had some free time this week and wanted to play around with this
stuff a little. I haven't done much Unix programming, so I wanted to
see if I could test-drive the development of a new daemon.

I wanted to start from scratch, because it already drives me nuts that
the current Worker class is so messy. guns' rake task was amazingly
clean, so I started with the code from that, commented it all out, and
just started writing tests, uncommenting code, and refactoring.

Here's where I'm at:

http://github.com/collectiveidea/delayed_job/compare/master...satanize

The #run method is not well tested and there's definitely a bug in
it. I got lazy yesterday and just wanted to get to a working daemon.

One thing of interest: I wanted to see how well it would work to have
the daemon spawn a new child for each job. Resque does this, and I
love the idea, since leaky jobs could never bloat a long-running
worker. It would also significantly simplify the daemon, because we
wouldn't have to worry about inter-process communication; we can lock
a job in the master process, fork, and run the job in the child. There
are some performance drawbacks to this, but I'd like to get it working
and benchmark it to see just how significant they are. Here are some
quick benchmarks to see how much overhead there is in forking:

http://gist.github.com/584214

Those numbers don't encourage me, but it's interesting.
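
For reference, the loop I have in mind is roughly this (`reserve', `invoke_job', and `unlock' echo delayed_job's worker API, but treat this as a sketch, not actual branch code):

    worker = Delayed::Worker.new({}) # assumed setup; ActiveRecord backend

    loop do
      job = Delayed::Job.reserve(worker)      # lock a job in the master
      (sleep 5; next) unless job

      pid = fork do
        # NB: the child inherits the parent's database connection
        job.invoke_job                        # any leak dies with the child
        exit!(0)
      end
      Process.wait(pid)

      $?.success? ? job.destroy : job.unlock  # assumed failure handling
    end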

What do you think?

=b


David Genord II

Sep 19, 2010, 12:20:41 AM
to delay...@googlegroups.com
I actually tried what you propose; that was my first iteration (for the results, see xspond/master). The complications I ran into, and the reason for my latest incarnation, mostly involve maintaining a database connection. My current incarnation does not maintain a database connection in the server process; the producer and workers set up and maintain their own connections. If you do not disconnect from the database before forking the child processes, you can be left with hanging or closed connections when you do not expect them. Also, in disconnecting before each fork, I encountered an odd mutex deadlock that I don't believe I ever fully solved: after running for a while, the workers would just hang waiting for something to unlock. My best guess for what was happening goes as follows:

1. main process starts the fork procedure
2. main process disconnects all connections
3. main process queries for more jobs, triggering a new connection, and takes the connection pool's lock
4. main process waits for the new connection
5. main process forks off a new worker while waiting for the new connection
6. the new worker tries to get a new connection but is blocked, because the main process holds the connection pool lock
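
For anyone following along, the usual fork-safe pattern is something like this (a sketch against ActiveRecord of this era; adapt as needed):

    ActiveRecord::Base.connection.disconnect! # parent: drop the shared socket

    pid = fork do
      ActiveRecord::Base.establish_connection # child: open a fresh connection
      # ... run jobs ...
    end

    ActiveRecord::Base.establish_connection   # parent: reconnect for itself
    Process.wait(pid)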

In regard to your memory-leak prevention: one of the features I would like to add to my latest iteration is a maximum number of jobs before restart, and possibly allowing a job to trigger a restart of a consumer process, either immediately or on completion.
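
In a consumer's run loop that might look something like this (names hypothetical):

    MAX_JOBS = 100 # recycle the consumer after this many jobs

    def run_one_job
      # placeholder: reserve and run a single job; return nil when no work is left
    end

    jobs_run = 0
    while jobs_run < MAX_JOBS
      break unless run_one_job
      jobs_run += 1
    end
    exit 0 # the server notices the exit and spawns a fresh consumer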

David Genord II