First of all, I'm new to Action Hero but have been extremely pleased with my experience. Thank you for keeping this forum up-to-date, I look forward to participating.
Now, I'm having difficulty settling on the best way to rate limit tasks. To keep the conversation sensible, I'll describe my current implementation, then discuss a few possible alternatives I see, and then I'd love to hear any suggestions, comments, questions, or concerns.
Current Implementation
I am writing a web crawler that I initiate through an action called "crawl". The crawl action is passed a URL, which it enqueues as a task, also called "crawl". When the crawl task runs, it calls a function getRateLimitWait to retrieve the number of milliseconds the task worker should wait before making a request to the given URL. Once the wait value is retrieved, it sets a setTimeout for that duration and then makes the HTTP request. When the HTTP request completes, it parses the HTML for links and enqueues each of them as a separate crawl task. The crawl completes when all tasks have been emptied from the queue.
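For context, the initiating action is just a thin wrapper around the task enqueue. A minimal sketch, with the inputs block and error handling assumed rather than taken from my actual code:

exports.action = {
  name: 'crawl',
  description: 'enqueue a crawl task for the given URL',
  inputs: { required: ['url'], optional: [] },
  run: function(api, connection, next){
    // hand the URL off to the task queue and respond immediately
    api.tasks.enqueue('crawl', {url: connection.params.url}, 'default', function(error, toRun){
      connection.error = error;
      next(connection, true);
    });
  }
};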
The basic code for the crawl task looks like this:
getRateLimitWait(function(error, wait){
  if (error) { return next(error); }
  // hold this worker until our reserved slot arrives
  setTimeout(function(){
    makeRequest(params.url, function(error, response, body){
      // <code for parsing the body into an array of urls>
      _.each(urls, function(url){
        api.tasks.enqueue('crawl', {url: url}, 'default');
      });
      // free this worker for the next task
      next();
    });
  }, wait);
});
Pretty simple. The getRateLimitWait function computes how many milliseconds from the current time the crawl task should wait before making a request.
function getRateLimitWait(next) {
  var key = 'last_crawl_time';
  var time_limit = 500; // rate limit of 2 requests per second
  api.cache.load(key, {}, function(error, last_time){
    var curr_time = Date.now();
    var next_time;
    if (error) {
      // cache miss: no request has been scheduled yet, so go now
      next_time = curr_time;
    } else if ((curr_time - last_time) > time_limit) {
      // we are already past the time limit, so go now
      next_time = curr_time;
    } else {
      // otherwise, book the slot one time_limit after the last one
      next_time = last_time + time_limit;
    }
    // persist the reserved slot (null = no expiration) and report the wait
    api.cache.save(key, next_time, null, function(error, saved){
      var wait = next_time - curr_time;
      next(!saved, wait);
    });
  });
}

I see some refactoring that could be done in that function, but you get the idea. All in all, I think this solution works well: no matter how many task workers you have, each worker will always start its request at least 500ms after the previous one. However, there is one possible exception. I'm not sure whether this can happen, but if two task workers call getRateLimitWait at the same time, they both call api.cache.load at the same time, end up with the same last_time value, and therefore compute the same wait. I haven't quite solved that with this solution.
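One way I can imagine closing that race, assuming your actionhero version exposes api.cache.lock and api.cache.unlock (an assumption worth verifying against your version's docs), is to serialize the whole load-compute-save sequence behind a lock:

// A sketch only: api.cache.lock/unlock are assumed to be available.
function getRateLimitWait(next) {
  var key = 'last_crawl_time';
  var time_limit = 500;
  api.cache.lock(key, 1000, function(error, lockOk){
    if (error) { return next(error); }
    if (!lockOk) {
      // another worker holds the lock; retry shortly
      return setTimeout(function(){ getRateLimitWait(next); }, 50);
    }
    api.cache.load(key, {}, function(error, last_time){
      var curr_time = Date.now();
      // on a cache miss (error), start now; otherwise book the next free slot
      var next_time = (error || (curr_time - last_time) > time_limit)
        ? curr_time
        : (last_time + time_limit);
      api.cache.save(key, next_time, null, function(error, saved){
        api.cache.unlock(key, function(){
          next(!saved, next_time - curr_time);
        });
      });
    });
  });
}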
Possible Alternatives
1. Use Enqueue Task At
Instead of using enqueueTask you could use enqueueTaskAt. But to support multiple workers you still need to keep track of what to set the "at" time to. Instead of a getRateLimitWait function you would have a function like getRateLimitNextTime that returns the next allowed time, as sketched below. I see no advantage of this over what I've implemented, and you still have the same race if two workers call getRateLimitNextTime at the same time.
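A rough sketch of that variant, where getRateLimitNextTime is a hypothetical helper doing the same cache bookkeeping as getRateLimitWait but returning an absolute timestamp:

// getRateLimitNextTime is hypothetical: same bookkeeping as getRateLimitWait,
// but it hands back the absolute timestamp of the reserved slot.
getRateLimitNextTime(function(error, next_time){
  if (error) { return next(error); }
  // resque's scheduler runs the task at (or after) next_time
  api.tasks.enqueueAt(next_time, 'crawl', {url: params.url}, 'default', function(error, toRun){
    next(error);
  });
});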
2. Recursive Task Enqueueing
Perhaps an asynchronous application isn't the best design for crawling. You could make a recursive call to enqueueTask that waits for its child task to complete before completing itself. This code might look like:
setTimeout(function(){
  makeRequest(params.url, function(error, response, body){
    // <code for parsing the body into an array of urls>
    async.eachSeries(urls, function(url, callback){
      api.tasks.enqueue('crawl', {url: url}, 'default', function(){
        callback();
      });
    }, function(){
      // we are done, free this worker for the next task
      next();
    });
  });
}, 500);
But if you do this, you have a near fork bomb: you need enough workers to handle the deepest chain of links found. For example, you crawl http://example.com and find a link to page-2, which links to page-3, which links to page-4. The task crawling the root page won't finish until crawling page-2 is complete, which won't finish until crawling page-3 is complete, which won't finish until crawling page-4 is complete. So in this case you would need at least 4 workers. This is not an ideal scenario.
3. Use a Recurring Task
The obvious option is a recurring task: put all the URLs in a custom queue, set the task frequency to 500ms, and on each run pull a URL from the queue, scrape it for links, and add the new URLs back into the queue (a sketch follows below). However, this has two problems. First, if you have multiple workers, I don't know whether each worker would run the task at 500ms intervals (I don't know AH well enough to know if the frequency applies per worker or across ALL workers). Second, this is not optimized for crawling multiple sites at the same time: crawling two sites should let you run the task at a 250ms interval instead of 500ms.
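A minimal sketch of that recurring task; the crawl:pending list name is made up, and api.redis.client is an assumption about how your actionhero version exposes its redis connection:

exports.task = {
  name: 'crawlNext',
  description: 'pop one pending URL and crawl it',
  queue: 'default',
  frequency: 500, // re-enqueued ~500ms after each run completes
  run: function(api, params, next){
    // 'crawl:pending' and api.redis.client are assumptions for this sketch
    api.redis.client.lpop('crawl:pending', function(error, url){
      if (error || !url) { return next(error); } // nothing pending yet
      makeRequest(url, function(error, response, body){
        // <parse body and rpush any discovered urls onto crawl:pending>
        next(error);
      });
    });
  }
};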
Suggestions, Comments, Questions, or Concerns?
So, what is the best practice for rate limiting tasks? The solution I've implemented does work, with the one exception described above, and the alternatives strike me as lesser options. I would love to hear your thoughts.