Scrapy 0.14 - DOWNLOAD_DELAY doesn't seem to work?


Mimino

Oct 4, 2012, 3:56:59 PM
to scrapy...@googlegroups.com
Hi guys,
I need a little advice.

In my spider I first download the sitemap, which contains hundreds of links that I then yield as Requests. I don't want to behave badly towards the site, so I want to download those links one by one. But the problem is that even with the following settings:

DOWNLOAD_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1

the slowdown doesn't seem to work. It issues 16 requests in parallel, which means CONCURRENT_REQUESTS is in charge. And indeed, changing CONCURRENT_REQUESTS to 1 works (but there is still no delay between the requests). However, that also limits the total number of concurrent requests across all spiders, which is definitely not what I want. Can anyone please give me any advice on what is happening here? :-o

Note: I do NOT use AutoThrottle extension or any other non-default extension/middleware.
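For context, the spider is roughly like this (a simplified sketch; the class name, sitemap URL and callback body are placeholders, and I'm assuming a standard XML sitemap):

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector


class SitemapLinksSpider(BaseSpider):
    name = 'sitemap_links'
    start_urls = ['http://example.com/sitemap.xml']  # placeholder URL

    def parse(self, response):
        # Pull every <loc> URL out of the sitemap and yield it as a Request;
        # DOWNLOAD_DELAY / CONCURRENT_REQUESTS_PER_DOMAIN should pace these.
        xxs = XmlXPathSelector(response)
        xxs.register_namespace('s', 'http://www.sitemaps.org/schemas/sitemap/0.9')
        for url in xxs.select('//s:url/s:loc/text()').extract():
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        pass  # actual item extraction happens here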

Pablo Hoffman

Nov 4, 2012, 5:12:31 PM
to scrapy...@googlegroups.com
Hi Mimino,

Could you upgrade to 0.16 and tell us if you're still experiencing the same problem? What you're doing is fine. You could also consider setting CONCURRENT_REQUESTS to 1 to force global concurrency to 1, but that shouldn't be needed unless the domains of those requests differ.
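In other words, something along these lines in settings.py (just a sketch of what I mean, not copied from your project):

# Pace requests to the same domain
DOWNLOAD_DELAY = 10                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain

# Only if the sitemap links span several domains and you really want a
# single request in flight globally:
# CONCURRENT_REQUESTS = 1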

Pablo.




Mimino

Nov 7, 2012, 7:34:21 AM
to scrapy...@googlegroups.com
Hi Pablo,

DOWNLOAD_DELAY seems to work as expected now. I'm a little confused, though. I have the following downloader middleware:

from scrapy import log

class LogMiddleware(object):
    def process_request(self, request, spider):
        # log the outgoing request; returning None lets it continue downstream
        log.msg('Sending request: %s' % request, level=log.INFO)

    def process_response(self, request, response, spider):
        # log the incoming response; a response middleware must return it
        log.msg('Receiving response: %s' % response, level=log.INFO)
        return response


Its middleware order is 1000 (i.e. it is the last one to see outgoing requests). When I start crawling (assuming there are many requests coming from start_requests()) with:
CONCURRENT_REQUESTS = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY=10
I instantly receive five "Sending request: ..." messages. They are probably queued, because the responses come in 10-second intervals. But after reading http://doc.scrapy.org/en/0.16/topics/architecture.html, either the requests are queued inside the Downloader component or the flow is not exactly as depicted. Which one is correct?
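For completeness, the middleware is enabled like this (the module path is just my project's layout):

# settings.py -- a high order value places the middleware closest to the downloader,
# so it sees requests last and responses first
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.LogMiddleware': 1000,
}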


Anyway, one small bugfix plus a couple of suggestions:
- In scrapy/core/scraper.py, line 215, replace:
log.err(output, 'Error processing %(item)s', item=item, spider=spider)
with:
log.err(output, 'Error processing %s' % item, spider=spider)
- I would also suggest adding a validate argument to DjangoItem's save() that runs full_clean() on the model when validate is True (see the sketch after this list).
- And for the "shell" command, add a "--spider" option to select which spider to use in the shell.
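To make the DjangoItem suggestion concrete, this is roughly what I have in mind (a rough sketch only; the attribute names are from memory and may not match the real implementation exactly):

# Sketch of the proposed DjangoItem.save(validate=...) -- not the actual Scrapy code
def save(self, commit=True, validate=False):
    modelargs = dict((k, self.get(k)) for k in self._values)
    model = self.django_model(**modelargs)
    if validate:
        model.full_clean()   # let Django raise ValidationError on invalid data
    if commit:
        model.save()
    return model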

I have these things implemented in my local repository, so perhaps I could contribute them?

Regards,
Mimino

Vasili Reikh

Nov 7, 2012, 10:50:50 AM
to scrapy...@googlegroups.com
Hi everybody,
I'm a newbie with Scrapy and I have a problem with the scrapy.log settings.
I need logging, but not at DEBUG level; I'd prefer LOG_LEVEL set to ERROR.
It works fine if I run my spider from the command line (it respects the LOG_LEVEL setting in my_project.settings).
But when I schedule my spiders on the scrapyd server, it ignores the LOG_LEVEL setting in my_project.settings.

I've tried setting LOG_LEVEL = scrapy.log.ERROR in my_project.settings, but when it runs and I open localhost:6800/logs/my_project/XXX.log in the browser, it still logs at DEBUG level :(
Then I tried to schedule the spider with an inline LOG_LEVEL setting:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=LOG_LEVEL=scrapy.log.ERROR
The result did not change: it logs at DEBUG level every time (~2 GB per hour).


Please, help!
 

Steven Almeroth

Nov 10, 2012, 2:00:11 PM
to scrapy...@googlegroups.com
@Mimino, to contribute, issue a pull request from your Scrapy fork on GitHub.

Steven Almeroth

Nov 10, 2012, 2:01:17 PM
to scrapy...@googlegroups.com
@Nism, please start a new thread with this question (you are hijacking this thread).



Pablo Hoffman

Nov 12, 2012, 1:33:50 PM
to scrapy...@googlegroups.com
Can you paste the line in your project settings.py where you define the log level?



Vasili Reikh

Nov 12, 2012, 1:52:46 PM
to scrapy...@googlegroups.com
Hi!
In settings.py:
 
from scrapy import log
...
...
...
LOG_LEVEL = log.ERROR
////////////////////////////////////
 
These settings work fine when I execute the spider script from the command line,
but they don't work when I run the spider under the scrapyd server (it always logs at DEBUG level).

Pablo Hoffman

Nov 13, 2012, 12:00:43 AM
to scrapy...@googlegroups.com
It should be:

LOG_LEVEL = 'ERROR'
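Presumably the same goes for the scrapyd override, i.e. passing the plain level name (a sketch, untested here):

# settings.py -- LOG_LEVEL takes the level name as a string, no import needed
LOG_LEVEL = 'ERROR'

# and the equivalent override when scheduling through scrapyd:
#   curl http://localhost:6800/schedule.json -d project=myproject \
#        -d spider=somespider -d setting=LOG_LEVEL=ERROR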

Vasili Reikh

Nov 14, 2012, 5:01:01 PM
to scrapy...@googlegroups.com
Yes, it works!!!
Thank you!!!!!