Background job system for Cloud Controller


Matthew Kocher

Jul 26, 2013, 1:10:52 AM
to vcap...@cloudfoundry.org
Hi All,

We're looking to add a background job / queue system to the Cloud Controller. The current use case is managing the complex lifecycle of Services, which happens via third-party web transactions that occasionally fail. Further use cases could include staging & starting in the background, scheduled jobs, moving some processes that are a better fit for the background job model out of the event-driven code, etc.

We looked at 4 different libraries: Delayed Job (https://github.com/collectiveidea/delayed_job), Resque (https://github.com/resque/resque), Sidekiq (https://github.com/mperham/sidekiq), and BeanStalkd (http://kr.github.io/beanstalkd/).

When looking through the choices, we looked across the following dimensions: 
  • Durability
  • Performance
  • Scalability
  • Retry w/ back-off
  • Scheduling/recurring jobs
  • Ability to query job status
  • Admin interface
  • Database dependencies
  • Queues local / non-local to the Cloud Controller box
  • Monitoring (statsd)
  • Licensing 
We decided against BeanStalkd because its Ruby plugin ecosystem is small and its persistence model is based on flat files, which doesn't meet our requirements for durability, scalability, and, importantly, the ability to run jobs either local or non-local to the Cloud Controller.

We ruled out Delayed Job primarily due to its scalability issues. One upcoming use case is uploading app files from the CC to the blobstore, which requires workers that are local to each CC instance and listen on their own queue. With Delayed Job, adding new CC instances would require a DB migration for each new queue. Furthermore, Delayed Job only works with ActiveRecord, which is not currently a dependency. This left us with only the two Redis-based systems.
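For concreteness, here is a minimal sketch of the per-instance-queue idea with a Redis-backed library such as Resque; the job class, queue naming, and environment variable are hypothetical, and Resque.enqueue_to is assumed to be available (it is in recent Resque releases):

    require "resque"

    # Hypothetical job: upload an app package that only exists on this CC's local disk.
    class AppBitsUpload
      def self.perform(app_guid, package_path)
        # upload from local disk to the blobstore would happen here
      end
    end

    # Each CC instance enqueues onto (and works off) its own queue, e.g. "cc-0", "cc-1",
    # so the worker colocated with the files picks the job up.
    local_queue = "cc-#{ENV.fetch('CC_INSTANCE_INDEX', '0')}"
    Resque.enqueue_to(local_queue, AppBitsUpload, "app-guid", "/tmp/staged/app.zip")

    # The colocated worker would then be started with something like:
    #   QUEUE=cc-0 bundle exec rake resque:work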

For all intents and purposes the two remaining systems are highly similar. Resque is the older, more mature, and more starred/forked of the two, and has a larger ecosystem. Sidekiq is newer, has built on the lessons of Resque, and has a fancy Pro version (which could be considered a pro or a con). If we did use Sidekiq, then obviously in the standard cf-release we would use the free LGPLv3 version (Resque uses MIT). Resque is based on a fork model, while Sidekiq is based on a threading model, which, since we use MRI, means using green threads. Sidekiq seems to have more functionality out of the box, but Resque makes up for it with plugins.
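For anyone who hasn't used them, the day-to-day APIs are nearly interchangeable; a minimal sketch (the class, queue, and argument names are made up, not actual CC code):

    require "resque"
    require "sidekiq"

    # Resque: class-level perform, queue named via @queue; each job runs in a forked child.
    class ProvisionService
      @queue = :services

      def self.perform(service_guid)
        # provisioning logic would go here
      end
    end
    Resque.enqueue(ProvisionService, "service-guid")

    # Sidekiq: instance-level perform; jobs run on threads inside a single worker process.
    class ProvisionServiceWorker
      include Sidekiq::Worker
      sidekiq_options queue: "services"

      def perform(service_guid)
        # provisioning logic would go here
      end
    end
    ProvisionServiceWorker.perform_async("service-guid")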

There is a lot of online lore on the pros and cons of the two, but we would love to hear what the CF community sees as the best for our particular requirements.

Matthew Boedicker

Jul 26, 2013, 1:39:14 AM
to vcap...@cloudfoundry.org
BOSH has been using Resque with rufus-scheduler for a while and it's been working well. It's a part of the infrastructure that quietly works and doesn't require much attention.
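For reference, that combination looks roughly like the following; a sketch only, with a hypothetical job class, using the current rufus-scheduler API (older 2.x versions spell construction Rufus::Scheduler.start_new):

    require "resque"
    require "rufus-scheduler"

    # Hypothetical recurring maintenance job handled by ordinary Resque workers.
    class OrphanSweep
      @queue = :maintenance

      def self.perform
        # periodic cleanup logic would go here
      end
    end

    scheduler = Rufus::Scheduler.new

    # The scheduler process only enqueues; the Resque workers do the real work.
    scheduler.every "30m" do
      Resque.enqueue(OrphanSweep)
    end

    scheduler.join # keep the scheduler process alive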

David Williams

Jul 26, 2013, 9:41:24 AM
to vcap...@cloudfoundry.org
A cold chill went down my spine when you mentioned LGPL. :)  My vote would be Resque, based on Matthew's comment about it already being part of the infrastructure today and its being MIT licensed. In the spirit of not deviating from cf-release's current licensing model, CF components should be Apache, MIT, or BSD (no LGPL or GPL).

Is there any use case or requirement that Resque doesn't meet?

Luke Bakken

Jul 26, 2013, 10:11:00 AM
to vcap...@cloudfoundry.org
https://github.com/blog/542-introducing-resque

If it's good enough for GitHub ....

al...@leadkarma.com

Jul 26, 2013, 10:19:53 AM
to vcap...@cloudfoundry.org
You may be leaving a lot of performance on the table by passing on beanstalkd. We are moving from Resque to Backburner (via Beaneater to beanstalkd), mainly to get multi-threaded workers that reduce memory overhead per worker (and worker class). Sidekiq will give you similar benefits and pains around worker thread safety. We found Backburner's workers warm up much faster than what we saw with Resque, and this in turn speeds up our deploy cycles. That said, Backburner is pretty new, and we have a couple of important defects that we are trying to square away before moving to production use (workers not being killed off, and deadlocks with multiple workers).
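For comparison with the Redis options, a Backburner job looks roughly like this (a sketch based on the Backburner README; the connection string, tube name, and job class are made up):

    require "backburner"

    Backburner.configure do |config|
      config.beanstalk_url  = "beanstalk://127.0.0.1"
      config.tube_namespace = "cc.jobs"
    end

    # Jobs are plain classes with a class-level perform, much like Resque.
    class UploadBits
      include Backburner::Queue
      queue "uploads"

      def self.perform(app_guid, package_path)
        # upload logic would go here
      end
    end

    Backburner.enqueue(UploadBits, "app-guid", "/tmp/app.zip")

    # A worker process for that tube is then started with:
    Backburner.work("uploads")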


mpe...@gmail.com

Jul 26, 2013, 10:37:32 AM
to vcap...@cloudfoundry.org
Author of Sidekiq here.

You'll find Sidekiq to be MUCH faster than Resque and far more RAM efficient; we're talking 10-20x. Sidekiq has a lot more functionality built in, all integrated cleanly together. The documentation is also far more extensive.

I have every intention of supporting Sidekiq for years to come because of that fancy Pro version you mentioned.

Hope this helps.
Mike

PS MRI 1.9+ uses native threads, just with a GIL.
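To make the "more functionality built in" point concrete: retry with exponential backoff, for example, is just a worker option rather than a plugin. A minimal sketch (the worker and queue names are made up):

    require "sidekiq"

    class DeprovisionServiceWorker
      include Sidekiq::Worker
      # Failed jobs are retried on an exponential backoff schedule by default;
      # the retry option here just caps the number of attempts.
      sidekiq_options queue: "services", retry: 5

      def perform(service_guid)
        # call out to the third-party service broker here
      end
    end

    DeprovisionServiceWorker.perform_async("service-guid")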

Jesse Zhang

Jul 26, 2013, 11:58:48 AM
to vcap-dev
Off the top of my head, Resque sounds like a better fit because it forks. We've had terrible experiences with Ruby handling lots of string objects (send_file, anyone?), and worked around the issue by fronting the Cloud Controller with nginx. Uploading files sounds like precisely that. The last thing we want is something that "works in dev" and passes all the tests but leaves a large chunk of uncollected memory in the main process on an unexpected Saturday 3 weeks later.

Jesse,
Engineer
Cloud Foundry



ma...@backupify.com

Jul 26, 2013, 12:40:40 PM
to vcap...@cloudfoundry.org
To add more fuel to the fire, check out Qless - https://github.com/seomoz/qless

It's new, but it has most of your requirements already built in. It uses the Lua scripting built into Redis to ensure that its data structures remain consistent in highly concurrent environments, so ++ on your reliability requirement.

I'm in the process of replacing Resque with it, and so far am very happy (not in production yet, but we process about 6M Resque jobs a day).
It uses the fork model of Resque, which I like due to better isolation between jobs, but it shouldn't be too hard to do things in a threaded fashion if you prefer.
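A rough sketch of what that looks like, based on the Qless README (the queue and job names are made up, and exact method names may differ slightly):

    require "qless"

    client = Qless::Client.new        # connects to a local Redis by default
    queue  = client.queues["services"]

    # Job classes expose a class-level perform that receives the Qless job object.
    class ProvisionJob
      def self.perform(job)
        job.data["service_guid"]      # payload supplied at enqueue time
      end
    end

    jid = queue.put(ProvisionJob, { "service_guid" => "abc-123" })

    # Jobs can be fetched back by id to inspect their state, which maps nicely
    # onto the "ability to query job status" requirement.
    client.jobs[jid].state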

I've ported a couple of the resque plugins we use to it as well:


st...@steveklabnik.com

Jul 26, 2013, 1:36:39 PM
to vcap...@cloudfoundry.org
Resque maintainer here.

Just so you're all aware, Resque is working on a big 2.0 release, so things are changing. That said, huge companies use Resque to run a zillion jobs per day, just like Sidekiq, and we have a much, much bigger installed base: companies like GitHub, 37signals, LivingSocial, and many more. At a conference last week, a user told me about how they push a few thousand jobs per minute through Resque.

That said, Mike's comment about memory usage is true: you will use less memory with Sidekiq, due to threading. Resque 2.1 is slated to have N:M processes:threads, which will let you choose whatever model you want. I still think the forking model is more useful by default. His comment about the documentation is also unfortunately true at the moment, though we recently achieved 100% API documentation on master, and are now working on guide-style docs.

If there's anything I can do to help, please let me know. Whichever choice you pick, both Resque and Sidekiq are great, so you're in good hands either way.

Alexis Richardson

Jul 26, 2013, 4:02:21 PM
to vcap...@cloudfoundry.org, Ask Solem
Another option might be Celery. It is maintained by Pivotal, in the person of Ask Solem, cc'd here. It is popular for background job work and web-site development. Celery is designed to work with Redis, Rabbit, and Mongo... "take your pick".

Links
> http://celeryproject.org/
> https://devcenter.heroku.com/articles/background-jobs-queueing


Requirements:

> Durability
Depends on the broker; with RabbitMQ, messages can be transient or durable, and RabbitMQ publisher confirms are also supported.

> Performance
Depends on the broker, but with RabbitMQ, on my generic office desktop PC, a single worker (1 thread) can do 100k jobs a second using non-persistent messages.

> Scalability
Depends on the broker; workers communicate by message-passing.

> Retry w/ back-off
Supported

> Scheduling/recurring jobs
Supported (one-off, interval or crontab expressions)

> Ability to query job status
Celery supports pluggable "result backends", which let you track the status of a job. Many options are supported, both RPC and persistent storage (RabbitMQ, Redis, SQL, …).

> Admin interface
Flower (github.com/mher/flower) is a web-based real-time monitor and admin interface for Celery.

Monitoring is not the same as result backends, and workers can enable/disable monitoring at runtime.

Note that Celery does not let you modify the queue directly, e.g. you cannot reorder messages in the queue or delete a specific message. Celery does let you "revoke" a message, so that the message is ignored by workers, and you can implement most of these operations without directly modifying the queue. Jobs in Celery are considered a stream, i.e. they may be in-flight, in the queue, or reserved by a worker, so these operations are very difficult to implement using direct access at scale.

> Database dependencies
Does not have any external service requirements except for whatever you choose as a broker (and optionally a result backend).

> Queues local / non-local to the Cloud Controller box
Not sure what this means, but I suspect it depends on the broker used.

> Monitoring (statsd).
There are plugins for many monitoring tools; I think I have seen third-party statsd scripts.

> Licensing
Celery is BSD licensed (3-clause).


Note that Celery is Python-centric, as that is the language it's written in and what most users use it with. It can be used from other languages via webhooks or by other means.

Native support for other languages is planned (you can already do this, but it does require some effort).


a

don...@crystalcommerce.com

Jul 26, 2013, 5:59:44 PM
to vcap...@cloudfoundry.org
At CrystalCommerce, we're using both Sidekiq and Resque in production (on different apps). I wish we could move everything to Sidekiq; the threading model and default retry with backoff are both great. They're both great projects with great maintainers and communities. For us, the only thing keeping us from 100% Sidekiq is thread safety. In Rails 2.3, ActiveRecord is not thread-safe. Also, and I just learned this yesterday, Nokogiri on MRI is not thread-safe (see issue #881). If all your code and libraries are thread safe, go with Sidekiq.

Cheers,
Donald Plummer



Chris Ferris

Jul 27, 2013, 7:44:44 AM
to vcap...@cloudfoundry.org, vcap...@cloudfoundry.org
I must admit to the same reaction to the mention of LGPL. Of course, it depends on a number of considerations as to whether it really presents an issue. My preference would be to explore alternative options to an LGPL licensed component.

Sent from my iPhone

Brian Martin

Jul 29, 2013, 12:56:48 PM
to vcap...@cloudfoundry.org


On Friday, July 26, 2013 1:10:52 AM UTC-4, Matthew Kocher wrote:
Hi All,

We're looking to add a background job / queue system to the Cloud Controller. The current use case is managing the complex lifecycle of Services, which happens via third-party web transactions that occasionally fail. Further use cases could include staging & starting in the background, scheduled jobs, moving some processes that are a better fit for the background job model out of the event-driven code, etc.


How do you envision this complex service lifecycle integrating with the current synchronous API for create-service? In other words, today I can create a service synchronously on the CLI using the create-service verb or during push. What if my service creation falls into the category of something more complex that requires a sort of mini-workflow or possible retries, as you have stated? Would these services enter into some sort of "pending" state until creation completed? Would service binding be deferred until creation could succeed? Would service binding also have cases where this same queueing system could be utilized?

Brian Martin
IBM

Matthew Kocher

Jul 30, 2013, 2:02:01 AM
to vcap...@cloudfoundry.org
In light of the license considerations (I don't think LGPL vs MIT is a huge difference, but we'd rather not fight that battle), we're going to go forward with Resque for the time being. It is "Proven Technology", as they say.

We don't expect this to handle huge numbers of jobs, so forking shouldn't be an issue. We've got plenty of experience with the scheduler and status add-ons (and from what I hear they work together painlessly now), so we have a fairly clear course for the work.
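Roughly what those add-ons give us, as a sketch only; the job classes and option keys here are hypothetical, and the require paths are per the resque-scheduler and resque-status READMEs:

    require "resque"
    require "resque-scheduler"  # adds Resque.enqueue_in / Resque.enqueue_at
    require "resque-status"     # adds Resque::Plugins::Status

    # resque-status: a job whose progress can be polled by id.
    class ProvisionServiceJob
      include Resque::Plugins::Status

      def perform
        broker_url = options["broker_url"]
        at(1, 2, "provision requested")
        # ...call the third-party service broker here...
        at(2, 2, "provision confirmed")
      end
    end

    job_id = ProvisionServiceJob.create("broker_url" => "https://broker.example.com")
    Resque::Plugins::Status::Hash.get(job_id).status  # "queued", "working", "completed", ...

    # resque-scheduler: delayed (and recurring, via a schedule file) enqueues.
    class OrphanSweep
      @queue = :maintenance
      def self.perform; end
    end
    Resque.enqueue_in(10 * 60, OrphanSweep)  # run ten minutes from now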

David Stevenson can speak more to the direction of the services API, but the main thing we're looking at now is moving orphan detection into the Cloud Controller so that every gateway doesn't have to implement the exact same logic. We want to move away from everyone having to inherit from vcap-services-base, and instead make the API a service has to implement simpler. Part of this work could be async service creation, but I don't know when that work might happen.


Dr Nic Williams

Jul 30, 2013, 2:14:10 AM
to vcap...@cloudfoundry.org, vcap...@cloudfoundry.org
Thanks Matt for this process & the update.
--
Dr Nic Williams
Stark & Wayne LLC - the consultancy for Cloud Foundry
http://starkandwayne.com
+1 415 860 2185
twitter: drnic

Jamie Van Dyke

Jul 30, 2013, 5:50:49 AM
to vcap...@cloudfoundry.org
Indeed, thank you!

st...@steveklabnik.com

Jul 30, 2013, 1:51:30 PM
to vcap...@cloudfoundry.org
Great! As I mentioned before, please let me know if I can help in any way.