Scrapy 0.14 - DOWNLOAD_DELAY doesn't seem to work?


Mimino

Oct 4, 2012, 3:56:59 PM
to scrapy...@googlegroups.com
Hi guys,
I need a little advice.

In my spider I first download the sitemap, which contains hundreds of links that I then yield as Requests. I don't want to behave badly towards the site, so I want to download those links one by one. But the problem is that even with the following settings:

DOWNLOAD_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1

the slowdown doesn't seem to work. It issues 16 requests in parallel, which means CONCURRENT_REQUESTS is in charge. And indeed, changing CONCURRENT_REQUESTS to 1 works (but there is still no delay between the requests). However, that also limits the total number of concurrent requests across all spiders, which is definitely not what I want. Can anyone please give me any advice on what is happening here? :-o

Note: I do NOT use AutoThrottle extension or any other non-default extension/middleware.
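For context, the spider is roughly like this (a simplified sketch; the class name, sitemap URL and callback body are placeholders, and I'm assuming a standard XML sitemap):

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector


class SitemapLinksSpider(BaseSpider):
    name = 'sitemap_links'
    start_urls = ['http://example.com/sitemap.xml']  # placeholder URL

    def parse(self, response):
        # Pull every <loc> URL out of the sitemap and yield it as a Request;
        # DOWNLOAD_DELAY / CONCURRENT_REQUESTS_PER_DOMAIN should pace these.
        xxs = XmlXPathSelector(response)
        xxs.register_namespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9')
        for url in xxs.select('//s:url/s:loc/text()').extract():
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        pass  # actual item extraction happens here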

Pablo Hoffman

Nov 4, 2012, 5:12:31 PM
to scrapy...@googlegroups.com
Hi Mimino,

Could you upgrade to 0.16 and tell us if you're still experiencing the same problem? What you're doing is fine. You could also consider setting CONCURRENT_REQUESTS to 1 to force global concurrency to 1, but that shouldn't be needed unless the domains of those requests differ.
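In other words, something along these lines in settings.py (just a sketch of what I mean, not copied from your project):

# Pace requests to the same domain
DOWNLOAD_DELAY = 10                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain

# Only if the sitemap links span several domains and you really want a
# single request in flight globally:
# CONCURRENT_REQUESTS = 1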

Pablo.




Mimino

Nov 7, 2012, 7:34:21 AM
to scrapy...@googlegroups.com
Hi Pablo,

DOWNLOAD_DELAY seems to work as expected now. I'm a little confused, though. I have the following downloader middleware:

from scrapy import log

class LogMiddleware(object):
    def process_request(self, request, spider):
        # log the outgoing request; returning None lets it continue downstream
        log.msg('Sending request: %s' % request, level=log.INFO)

    def process_response(self, request, response, spider):
        # log the incoming response; a response middleware must return it
        log.msg('Receiving response: %s' % response, level=log.INFO)
        return response


Its middleware order is 1000 (i.e. it is the last one to see outgoing requests). When I start crawling (assuming there are many requests coming from start_requests()) with:
CONCURRENT_REQUESTS = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY=10
I instantly receive five "Sending request: ..." messages. They are probably queued, because the responses come in 10-second intervals. But after reading http://doc.scrapy.org/en/0.16/topics/architecture.html, either the requests are queued inside the Downloader component or the flow is not exactly as depicted. Which one is correct?
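For completeness, the middleware is enabled like this (the module path is just my project's layout):

# settings.py -- a high order value places the middleware closest to the downloader,
# so it sees requests last and responses first
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.LogMiddleware': 1000,
}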


Anyway, one small bugfix plus a couple of suggestions:
- In scrapy/core/scraper.py, line 215, replace:
log.err(output, 'Error processing %(item)s', item=item, spider=spider)
with:
log.err(output, 'Error processing %s' % item, spider=spider)
- I would also suggest adding a validate argument to DjangoItem's save() that runs full_clean() on the model when validate is True (see the sketch after this list).
- And for the "shell" command, add a "--spider" option to select which spider to use in the shell.
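To make the DjangoItem suggestion concrete, this is roughly what I have in mind (a rough sketch only; the attribute names are from memory and may not match the real implementation exactly):

# Sketch of the proposed DjangoItem.save(validate=...) -- not the actual Scrapy code
def save(self, commit=True, validate=False):
    modelargs = dict((k, self.get(k)) for k in self._values)
    model = self.django_model(**modelargs)
    if validate:
        model.full_clean()   # let Django raise ValidationError on invalid data
    if commit:
        model.save()
    return model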

I have these things implemented in my local repository, so perhaps I could contribute them?

Regards,
Mimino

Vasili Reikh

Nov 7, 2012, 10:50:50 AM
to scrapy...@googlegroups.com
Hi everybody,
I'm a newbie with Scrapy and I have a problem with the scrapy.log settings.
I need logging, but not at DEBUG level; I'd prefer LOG_LEVEL set to ERROR.
It works fine if I run my spider from the command line (it respects the LOG_LEVEL setting in my_project.settings).
But when I schedule my spiders on the scrapyd server, it ignores the LOG_LEVEL setting in my_project.settings.

I've tried setting LOG_LEVEL = scrapy.log.ERROR in my_project.settings, but when it runs and I open localhost:6800/logs/my_project/XXX.log in the browser, it still logs at DEBUG level :(
Then I tried to schedule the spider with an inline LOG_LEVEL setting:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=LOG_LEVEL=scrapy.log.ERROR
The result did not change: it logs at DEBUG level every time (~2 GB per hour).


Please, help!
 

Steven Almeroth

Nov 10, 2012, 2:00:11 PM
to scrapy...@googlegroups.com
@Mimino, to contribute, issue a pull request from your Scrapy fork on GitHub.

Steven Almeroth

Nov 10, 2012, 2:01:17 PM
to scrapy...@googlegroups.com
@Nism, please start a new thread with this question (you are hijacking this thread).



Pablo Hoffman

Nov 12, 2012, 1:33:50 PM
to scrapy...@googlegroups.com
Can you paste the line in your project settings.py where you define the log level?



Vasili Reikh

Nov 12, 2012, 1:52:46 PM
to scrapy...@googlegroups.com
Hi!
In settings.py:
 
from scrapy import log
...
...
...
LOG_LEVEL = log.ERROR
////////////////////////////////////
 
These settings work fine when I execute the spider script from the command line,
but they don't work when I run the spider under the scrapyd server (it always logs at DEBUG level).

Pablo Hoffman

Nov 13, 2012, 12:00:43 AM
to scrapy...@googlegroups.com
It should be:

LOG_LEVEL = 'ERROR'
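Presumably the same goes for the scrapyd override, i.e. passing the plain level name (a sketch, untested here):

# settings.py -- LOG_LEVEL takes the level name as a string, no import needed
LOG_LEVEL = 'ERROR'

# and the equivalent override when scheduling through scrapyd:
#   curl http://localhost:6800/schedule.json -d project=myproject \
#        -d spider=somespider -d setting=LOG_LEVEL=ERROR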

Vasili Reikh

Nov 14, 2012, 5:01:01 PM
to scrapy...@googlegroups.com
Yes, it works!!!
Thank you!!!!!