How to solve rate_limit for celery based crawler

How to solve rate_limit for celery based crawler Tomáš Mikula 12/2/16 5:06 AM
Hi,

I am writing a crawler with Celery, and I am thinking about creating a queue of tasks for downloading URLs.

The problem is that I want to keep a time delay between requests to the same domain. I could use Celery's global rate_limit on the task, but that would slow my crawler down globally: if I am downloading from 4 different domains, those requests could safely run at the same time.

Does anyone have some tips? Thanks
Re: How to solve rate_limit for celery based crawler Mark Heppner 12/20/16 7:46 AM
I'd use some sort of database or semi-persistent cache. Celery's rate_limit applies to the task as a whole, regardless of the args you send, so you'll need your own control mechanism:

@app.task
def crawl(site, rate_limit=timedelta(minutes=10)):
    # cache/database stand in for whatever shared store you use
    last_crawled = cache.get(site) or database.get(site)
    if last_crawled is None or now() - last_crawled >= rate_limit:
        do_something()
        cache.set(site, now())  # record this crawl for the next check


If the site has been crawled within the last 10 minutes, your task will still "run" but nothing will happen. Just do your own rate limiting inside the task instead of relying on Celery's rate_limit, which is global.
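The per-domain check above can be sketched as a small in-memory limiter. This is only an illustration: `DomainRateLimiter` and `try_acquire` are made-up names, and in a real multi-worker deployment you'd back the timestamp dictionary with Redis or a database so that every worker shares the same state.

```python
import time


class DomainRateLimiter:
    """Tracks the last crawl time per domain and decides whether a new
    request to that domain is allowed yet."""

    def __init__(self, min_delay):
        self.min_delay = min_delay  # seconds required between hits to one domain
        self._last_hit = {}         # domain -> timestamp of the last allowed crawl

    def try_acquire(self, domain, now=None):
        """Return True and record the hit if the domain may be crawled now,
        otherwise return False. `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        last = self._last_hit.get(domain)
        if last is not None and now - last < self.min_delay:
            return False            # too soon for this domain
        self._last_hit[domain] = now
        return True


limiter = DomainRateLimiter(min_delay=10)
print(limiter.try_acquire("a.com", now=0.0))   # first hit is allowed
print(limiter.try_acquire("a.com", now=5.0))   # 5s later: blocked
print(limiter.try_acquire("b.com", now=5.0))   # other domains unaffected
print(limiter.try_acquire("a.com", now=10.0))  # delay elapsed: allowed again
```

Inside the Celery task you would call `try_acquire(site)` and either return immediately or re-enqueue the task for later when it returns False, which keeps different domains running in parallel while throttling each one individually.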