Best Practices for Rate Limiting Tasks

Joseph Purcell

Mar 18, 2014, 11:28:04 AM
to action...@googlegroups.com
First of all, I'm new to Action Hero but have been extremely pleased with my experience so far. Thank you for keeping this forum up to date; I look forward to participating.

Now, I'm having trouble settling on the best way to rate limit tasks. To keep the conversation sensible, I'll describe my current implementation and then a few possible alternatives I see; I would love to hear any suggestions, comments, questions, or concerns.

Current Implementation

I am writing a web crawler that I initiate through an action called "crawl". The action is passed a URL and enqueues a task, also called "crawl". When the crawl task runs, it calls a function getRateLimitWait to get the number of milliseconds the worker should wait before requesting the given URL, waits that long with setTimeout, and then makes the HTTP request. When the request completes, it parses the HTML for links and enqueues each of them as a separate crawl task. The crawl is finished when the queue has been emptied.

The basic code for the crawl task looks like this:

// inside the crawl task's run(api, params, next)
getRateLimitWait(function(error, wait){
  setTimeout(function(){
    makeRequest(params.url, function(error, response, body){
      // <code for parsing the body into an array of urls>
      _.each(urls, function(url){
        api.tasks.enqueue('crawl', {url: url}, 'default');
      });
      next();
    });
  }, wait);
});

Pretty simple. Now, the getRateLimitWait function simply calculates how long from the current time the crawl task should wait before making a request.

function getRateLimitWait(next) {
  var key = 'last_crawl_time';
  var time_limit = 500; // rate limit of 2 requests per second
  api.cache.load(key, {}, function(error, last_time, ttl, created_at, read_at){
    var curr_time = Date.now();
    var next_time;
    if (error) {
      // no request has been made yet, so the next request can go out now
      next_time = curr_time;
    } else {
      // if we are past the time limit, the next request can go out now;
      // otherwise it must wait until the last scheduled time plus the limit
      next_time = ((curr_time - last_time) > time_limit) ? curr_time : (last_time + time_limit);
    }
    api.cache.save(key, next_time, ttl, function(error, saved) {
      var wait = next_time - curr_time;
      next(!saved, wait);
    });
  });
}


I see some refactoring that could be done to that function, but you get the idea. All in all, I think this solution works well: no matter how many task workers you have, each worker will always start its request at least 500ms after the previous one. There is one possible exception, though. I'm not sure whether this can actually happen, but if two task workers call getRateLimitWait at the same time, they call api.cache.load at the same time, both read the same "last_time" value, and therefore compute the same "wait" value, so both requests go out together. I haven't figured out how to close that race with this solution.

Possible Alternatives

1. Use Enqueue Task At

Instead of api.tasks.enqueue you could use api.tasks.enqueueAt. But to support multiple workers you would still need to keep track of what to set the "at" time to; instead of a getRateLimitWait function you would have something like getRateLimitNextTime that returns the next allowed request time. I see no advantage of this over what I've implemented, and you still have the same race if two workers call getRateLimitNextTime at the same time.
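
For reference, the enqueueing step might look something like this (just a sketch: getRateLimitNextTime is the hypothetical function described above, and I'm assuming api.tasks.enqueueAt takes an absolute millisecond timestamp as its first argument):

_.each(urls, function(url){
  getRateLimitNextTime(function(error, next_time){
    // schedule the crawl for the next allowed slot instead of sleeping in a worker
    api.tasks.enqueueAt(next_time, 'crawl', {url: url}, 'default');
  });
});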

2. Recursive Task Enqueueing

Perhaps a fully asynchronous design isn't the best fit for crawling. You could make each crawl task enqueue its child tasks and wait on them before completing itself. That code might look like:

setTimeout(function(){
  makeRequest(params.url, function(error1, response, body){
    // <code for parsing the body>
    async.eachSeries(urls, function(url, callback){
      api.tasks.enqueue('crawl', {url:url}, 'default', function(){
        callback();
      });
    }, function(){
      // we are done, free this worker for next task
      next();
    });
  });
}, 500);

But if you do this, you have something close to a fork bomb: you need at least as many workers as the deepest chain of links you find. For example, you crawl http://example.com and find a link to page-2, which links to page-3, which links to page-4. The task crawling / won't finish until page-2 is crawled, which won't finish until page-3 is crawled, which won't finish until page-4 is crawled, so in this case you would need at least 4 workers. This is not an ideal scenario.

3. Use a Recurring Task

The obvious option is a recurring task. You could put all the URLs in a custom queue, set the task frequency to 500ms, and each time the task runs it pulls a URL from the queue, scrapes it for URLs, and adds those new URLs back into the queue. However, this has two problems. First, if you have multiple workers I don't know whether each of them would run the task at 500ms intervals (I don't know AH well enough to know whether the frequency applies per worker or globally). Second, it is not optimal if you want to crawl multiple sites at the same time: crawling two sites should let you run the task at a 250ms interval instead of 500ms.
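
Roughly, such a task definition might look like this (just a sketch, assuming the exports.task format; popUrl is a made-up helper that pulls one URL out of whatever backs the custom queue):

exports.task = {
  name: 'crawlNext',
  description: 'Pull one URL from the crawl queue and fetch it',
  queue: 'default',
  frequency: 500, // run every 500ms
  run: function(api, params, next){
    popUrl(api, function(error, url){
      if(error || !url){ return next(); } // nothing to crawl right now
      makeRequest(url, function(error, response, body){
        // <code for parsing the body and pushing new urls into the custom queue>
        next();
      });
    });
  }
};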

Suggestions, Comments, Questions, or Concerns?

So, what is the best practice for rate limiting tasks? The solution I've implemented works, with the one exception noted above, and I see the alternatives as lesser options. I would love to hear your thoughts.

Brian Corrigan

Mar 18, 2014, 12:24:17 PM
to Joseph Purcell, action...@googlegroups.com
Hey Joseph -

I'll throw my two cents in here (for whatever that's worth!). I think I'd go down the enqueuing route and have each successful job enqueue the next one. I like this for several reasons.

1) Tasks are persistent and can recover in case of some sort of outage.
2) Tasks can be distributed horizontally to any number of workers as the number of tasks grows.
3) This gives you many scheduling options, e.g. every x minutes from the last success, every y minutes from the last failure, x times per z minutes, etc. You can even change the schedule dynamically based on operating conditions (perhaps you detect that content has changed and don't want to check again for a longer duration). Some websites also provide sitemap.xml files which suggest how often crawlers should poll for changes; with a dynamic frequency you could use a different value for each URL.
4) Finally, a queue is very transparent and easy to monitor with the provided resque interface.
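
As a rough sketch of what I mean (assuming api.tasks.enqueueIn takes a delay in milliseconds; makeRequest and the 'crawl' task are from your example):

var delay = 500; // or computed dynamically from last success/failure, sitemap hints, etc.
makeRequest(params.url, function(error, response, body){
  // <code for parsing the body / saving results>
  // on success, schedule the next crawl of this URL for `delay` ms from now
  api.tasks.enqueueIn(delay, 'crawl', {url: params.url}, 'default', function(){
    next();
  });
});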

It sounds like a fun project :)

Cheers!
Brian
Brian Corrigan
MadGlory Interactive
http://www.madglory.com
Tw: @madgloryint
Ph: 518.867.1439

Evan Tahler

Mar 18, 2014, 1:27:29 PM
to action...@googlegroups.com, Joseph Purcell, br...@madglory.com
A pattern I often see used in Resque (which is the backend to actionhero's task system) is the notion of a job being able to re-enqueue itself if it isn't time to run yet. You rarely want to have a sleep/timeout within your job, because you are pausing a worker from working... and that means you need more workers to get the same throughput... and that costs more.

I'm guessing the rate limit is per-host, as you don't want to DDoS whoever you are crawling. If that's the case, just store the base domain/url in redis directly with a TTL on the key (using api.cache or redis directly). Your job starts up and checks if the key is present. If it is not, you know you can work the URL, so write the key to cache and work. Otherwise, just re-enqueueIn the job and exit. You can rely on the natural expiration of stale keys to act as a semaphore for you.
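
Roughly, and untested (a sketch assuming api.cache.save takes an expiration in milliseconds and api.tasks.enqueueIn takes a delay in milliseconds; getBaseDomain is a made-up helper that pulls the host out of the URL), the crawl task's run function could do something like:

var key = 'crawl_lock_' + getBaseDomain(params.url);
api.cache.load(key, {}, function(error, locked){
  if(!error && locked){
    // the host was hit within the last 500ms; push the job back and free this worker
    api.tasks.enqueueIn(500, 'crawl', {url: params.url}, 'default', function(){
      next();
    });
  }else{
    // no lock present; claim it for 500ms and do the work now
    api.cache.save(key, true, 500, function(error, saved){
      makeRequest(params.url, function(error, response, body){
        // <code for parsing the body and enqueueing new crawl tasks>
        next();
      });
    });
  }
});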

Joseph Purcell

Mar 19, 2014, 5:10:45 PM
to action...@googlegroups.com, Joseph Purcell, br...@madglory.com
Thank you, Evan. I will work through implementing your suggestion and post back my findings.