| How to solve rate_limit for celery based crawler | Tomáš Mikula | 12/2/16 5:06 AM | Hi, I am writing a crawler with Celery and I am thinking about creating a queue of tasks for downloading URLs. The problem is that I want to keep some delay between requests to the same domain. I can use a global rate_limit on the task, but that slows my crawler down globally: if I am downloading from 4 different domains, those requests could run at the same time. Does anyone have some tips? Thanks |
| Re: How to solve rate_limit for celery based crawler | Mark Heppner | 12/20/16 7:46 AM | I'd use some sort of database or semi-persistent cache. The rate_limit option applies to the task as a whole, regardless of the args you send, so you'll need your own control mechanism:

@app.task
def crawl(site, rate_limit=timedelta(minutes=10)):
    last_crawled = cache.get(site)  # or look it up in your database
    if last_crawled is None or now() - last_crawled >= rate_limit:
        do_something(site)
        cache.set(site, now())  # record when this site was last hit

If the site has been crawled within the last 10 minutes, your task will still "run" but nothing will happen. Just do your own rate limiting inside the task and don't rely on the global rate_limit. |
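A somewhat fuller sketch of the same per-domain throttling idea, assuming a local Redis instance is available as both broker and cache; fetch_page(), the "crawled:" key prefix, and DOMAIN_DELAY are illustrative placeholders, not anything from the thread. Instead of silently doing nothing when a domain was hit recently, the task retries itself later so the URL is not dropped:

from celery import Celery
import redis
import time

app = Celery('crawler', broker='redis://localhost:6379/0')
store = redis.StrictRedis(host='localhost', port=6379, db=1)

DOMAIN_DELAY = 60  # seconds to wait between requests to the same domain

@app.task(bind=True, max_retries=None)
def crawl(self, url, domain):
    # SET with nx/ex acts as a short-lived per-domain lock: it only succeeds
    # if no other task has touched this domain within DOMAIN_DELAY seconds.
    if not store.set('crawled:%s' % domain, time.time(), nx=True, ex=DOMAIN_DELAY):
        # Domain was hit recently; re-queue this task instead of dropping the URL.
        raise self.retry(countdown=DOMAIN_DELAY)
    fetch_page(url)  # placeholder for the actual download/parse logic

Because the delay is keyed on the domain, tasks for different domains still run concurrently; only requests to the same domain are spaced out.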