Hi Keith,
1. "See what sorts of failures happen in production": yes, I have logging for my workers. In fact, my webapp is half http, half cli, and the cli half also takes care of the workers. I have app-wide logging to various channels, so the workers are logged too. However, how do you relate errors from jobs to your log? I want to create a web interface with information about the buried jobs. I would have to open my log file, parse it, peek the buried jobs (preferably all at once), relate the job ids to the logging information and show that to the user.
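The parse-and-relate step could be sketched roughly like this, assuming log lines carry a "job=<id>" token (that token and the function names are my own invention, so the regex would need adapting to the real log format):

```javascript
// Build an index from job id -> matching log lines, so buried job ids
// can be looked up quickly for a web interface.
function indexLogByJobId(logText) {
  const index = {};
  for (const line of logText.split('\n')) {
    const match = line.match(/\bjob=(\d+)\b/); // assumed log format
    if (match) {
      const id = match[1];
      (index[id] = index[id] || []).push(line);
    }
  }
  return index;
}

// Given the ids of the buried jobs (e.g. collected by peeking the bury
// list), pick out the matching log lines for display.
function logsForBuriedJobs(buriedIds, index) {
  return buriedIds.map(id => ({ id, lines: index[id] || [] }));
}
```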
I log the exceptions too, but to make the process more comfortable, I see nothing against storing the exception type + message in the job data itself as well. It only gets more complicated with the reasoning I see here a lot: "just delete the job and put a new one back". Ids are lost and tracking becomes near-impossible.
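A minimal sketch of what I mean, assuming a JSON job body; the field names (lastError, type, message, at) are my own choice, nothing beanstalkd prescribes:

```javascript
// When a worker catches an exception, enrich the job body with the error
// details before putting the replacement job back (or burying it), so that
// peeking the buried job already shows why it failed.
function attachError(jobBody, err) {
  const data = JSON.parse(jobBody);
  data.lastError = {
    type: err.name,        // e.g. "TypeError"
    message: err.message,  // short human-readable reason
    at: Date.now(),        // when it failed
  };
  return JSON.stringify(data);
}
```

The full stack trace still goes to the real log; the job carries only enough for a quick overview.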
2. "It might also make sense to retry some jobs only a limited number of times before deleting them": the only method I can come up with is to delete the job and put a clone back. Releasing the job with a delay can't be done for the above reason (ie, where do you store the counter?).
3. "For retries, [...], but do add a time delay with exponential backoff": same as above, where do you store this logic?
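The only place I see for that state is the job body itself. A sketch of the delete-and-put-a-clone-back pattern with the counter and the backoff computed client-side (the field names and the 15-second base delay are assumptions, not anything from beanstalkd):

```javascript
// Decide what to do with a failed job: either put back a clone with an
// incremented retry counter and an exponentially growing delay, or give
// up and bury it for inspection. beanstalkd itself stores no counter,
// so it lives in the JSON body.
function nextAttempt(jobBody, { maxRetries = 5, baseDelay = 15 } = {}) {
  const data = JSON.parse(jobBody);
  const retries = (data.retries || 0) + 1;
  if (retries > maxRetries) {
    return { action: 'bury' }; // give up; leave it for inspection
  }
  data.retries = retries;
  return {
    action: 'put', // delete the old job, put this clone with the delay below
    body: JSON.stringify(data),
    delay: baseDelay * Math.pow(2, retries - 1), // 15s, 30s, 60s, ...
  };
}
```

The drawback is exactly the one above: the clone gets a new job id, so any tracking keyed on the original id breaks.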
Ad 1: if you want to minimize errors (always good) and thus shrink the bury queue (also good), you probably need more data, and the complete stack trace belongs in your real logging service; for that goal you process the log. But for a quick look at why a job was buried, I think that route is much too complicated.
Instead of adding features to beanstalkd to handle other
things (such as logging stack traces, or tracking the
application's long-term state), it's better to keep beanstalkd
itself focused on scheduling work to be done, and leave
those other things to other tools.
I completely agree. The simplicity of beanstalk is something I really appreciate. However, I think the described enhancements could improve beanstalk's usability without losing focus, compromising the memory footprint, or ending up as a clumsy one-size-fits-all solution.
If you have a work item that has two separate phases
of execution, and those phases can fail independently,
it might make sense to break it apart into two jobs.
That's what we do already :) We currently have a custom solution: a nodejs app that schedules jobs using Redis as the queue service. All jobs are atomic, to avoid mistakes with chains of failures.
Having said all that, we definitely need ways to get better
visibility into beanstalkd's internal state while it's running
in production.
I'd really like to hear more of your thoughts on this. If you have more information, ideas, or developments, please share :)
---
Jurian Sluiman