Why isn't there a downloader middleware doing the real downloading?

Dan

Nov 10, 2013, 9:49:25 AM
to scrapy...@googlegroups.com
Hi,

The default downloader middlewares are
{
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}

They all seem to either prepare the request before the download or process the response after it, so where does the real downloading happen?

I ask because I wrote a downloader middleware:
DOWNLOADER_MIDDLEWARES = {
    'air_fare.middlewares.CasperjsDownloaderMiddleware': 543,
}

It just calls CasperJS to download the page and returns the response, and it works well.
But would the page be downloaded twice, once by my own downloader middleware and once by the default downloader middleware?

Any help will be highly appreciated!

Daniel Graña

Nov 10, 2013, 8:48:04 PM
to scrapy...@googlegroups.com
hi Dan,

There is no way to wait for a downloader middleware's process_request result without blocking the entire crawl. What you are trying to accomplish is better done with a custom download handler; see http://doc.scrapy.org/en/0.20/topics/settings.html#download-handlers

good luck,
Daniel
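
For illustration only, pointing Scrapy at a custom download handler is a settings change along these lines. This is a minimal sketch: `DOWNLOAD_HANDLERS` is the setting described at the link above and maps a URL scheme to the import path of a handler class, but the module path and class name below are hypothetical and do not come from this thread.

```python
# settings.py (sketch)
# Assumption: a handler class named CasperjsDownloadHandler exists in a
# hypothetical module air_fare/handlers.py. Neither name is from the thread.
DOWNLOAD_HANDLERS = {
    # Route plain and TLS requests through the custom handler.
    "http": "air_fare.handlers.CasperjsDownloadHandler",
    "https": "air_fare.handlers.CasperjsDownloadHandler",
}
```

The handler class itself still has to be written; the setting only tells Scrapy which class to use for each scheme.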

Dan

Nov 12, 2013, 5:13:55 AM
to scrapy...@googlegroups.com
Daniel, thanks for your important hint about download handlers.

After some more digging, I now have a better understanding of my original question.
1. The real download is done by a download handler, which sits at the end of the downloader middleware pipeline. It is the yellow "Downloader" box in the architecture chart; see http://doc.scrapy.org/en/latest/topics/architecture.html#architecture-overview. The source code is at \scrapy\core\downloader\handlers.
2. There are two ways to achieve my goal. The first is to write a downloader middleware and put it first in the downloader middleware pipeline. It just fetches the page, handling the JavaScript, and returns the HTML response. All other downloader middlewares and the downloader will be skipped. The second is to write a custom download handler to fetch the page. There are examples of both at https://github.com/scrapinghub/scrapyjs.
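
The short-circuit behaviour described in point 2 can be sketched with a toy model in plain Python. This is not Scrapy code: `Response`, `PassThroughMiddleware`, `RenderingMiddleware`, `CountingHandler` and `download` are all made-up names for illustration. The sketch only models the request side; in Scrapy itself, the `process_response` methods of the installed middlewares still run on a response returned this way, so "skipped" applies to the remaining `process_request` calls and to the download handler.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Response:
    url: str
    body: str
    source: str  # who produced the response: "middleware" or "handler"


class PassThroughMiddleware:
    """Stands in for middlewares that only prepare the request."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def process_request(self, url: str) -> Optional[Response]:
        self.seen.append(url)
        return None  # None means "continue down the chain"


class RenderingMiddleware:
    """Stands in for a middleware that fetches the page itself."""

    def process_request(self, url: str) -> Optional[Response]:
        return Response(url, "<html>rendered</html>", source="middleware")


class CountingHandler:
    """Stands in for the download handler at the end of the chain."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, url: str) -> Response:
        self.calls += 1
        return Response(url, "<html>raw</html>", source="handler")


def download(
    url: str,
    middlewares: List[Tuple[int, object]],
    handler: Callable[[str], Response],
) -> Response:
    # Lower order values run first, as with DOWNLOADER_MIDDLEWARES.
    for _, mw in sorted(middlewares, key=lambda pair: pair[0]):
        response = mw.process_request(url)
        if response is not None:
            return response  # short-circuit: the handler is never called
    return handler(url)


early = PassThroughMiddleware()
late = PassThroughMiddleware()
handler = CountingHandler()
chain = [(400, early), (543, RenderingMiddleware()), (550, late)]
result = download("http://example.com/", chain, handler)
print(result.source, handler.calls, early.seen, late.seen)
```

In this model the page is fetched once: the middleware at order 543 returns a response, so the middleware at 550 and the handler never see the request, while the one at 400 has already run. That is also why the middleware has to be placed first if the earlier ones should be bypassed as well.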