Hi Folks,
Summary
I want to modify what our Scrapy spider decides to download based upon the content of the HTTP headers. I've found snippets for controlling that behavior, but they do not seem to work with Scrapy 0.18. What is the best way to override default Scrapy behavior so I can trigger our filtering code once enough of the document has been downloaded to parse the HTTP headers?
Details
I'm a search engine architect evaluating Scrapy for use as an ad hoc/diagnostic site crawler to complement a commercial web crawler our client uses.
Many of the websites we crawl contain dynamic links to very large files whose content type is unknown until the HTTP headers arrive (the URLs carry no file extensions).
I want to modify our Scrapy spider to support the ability to only download documents if they are of content types:
text/html
text/plain
I further want to limit the size: if the Content-Length header reports more than 5 MB, we truncate or drop the response.
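To make the policy above concrete, here is the decision rule as a small standalone predicate (the names `should_download`, `ALLOWED_TYPES`, and `MAX_BYTES` are my own, not Scrapy API):

```python
# Sketch of the header-based filtering rule described above.
# Takes a dict of lower-cased header names -> string values.

ALLOWED_TYPES = {'text/html', 'text/plain'}
MAX_BYTES = 5 * 1024 * 1024  # 5 MB

def should_download(headers):
    """Decide from response headers alone whether to keep fetching the body."""
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
    ctype = headers.get('content-type', '').split(';')[0].strip().lower()
    if ctype not in ALLOWED_TYPES:
        return False
    # Content-Length is optional; when absent we cannot pre-filter on size.
    length = headers.get('content-length')
    if length is not None and int(length) > MAX_BYTES:
        return False
    return True
```

The open question is where to hook a check like this so it runs as soon as the headers are parsed, before the body is transferred.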
I've searched around extensively and found some snippets for accomplishing this behavior, like this one (
http://snipplr.com/view/66993/), but attempts to override DOWNLOADER_HTTPCLIENTFACTORY seem to have no effect (even putting a bogus module path there has no effect).
I believe what is happening is that Scrapy is using HTTP 1.1 (which is good), but overriding the HTTP client factory is only supported in scrapy.core.downloader.handlers.http10, not in scrapy.core.downloader.handlers.http11. I have no idea whether this is a 'bug' or a 'feature'.
I searched the source code and noticed that DOWNLOADER_HTTPCLIENTFACTORY is used in http10.py but not in http11.py, while DOWNLOADER_CLIENTCONTEXTFACTORY is referenced in both, as well as in default_settings.py:
~/tmp/scrapy/scrapy $ ack DOWNLOADER_HTTPCLIENTFACTORY
core/downloader/handlers/http10.py
10: self.HTTPClientFactory = load_object(settings['DOWNLOADER_HTTPCLIENTFACTORY'])
settings/default_settings.py
68:DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
~/tmp/scrapy/scrapy $ ack DOWNLOADER_CLIENTCONTEXTFACTORY
core/downloader/handlers/http10.py
11: self.ClientContextFactory = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
core/downloader/handlers/http11.py
28: self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
settings/default_settings.py
69:DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
For completeness, I added the following to our settings.py:
DOWNLOADER_HTTPCLIENTFACTORY = 'adhoc.downloader.LimitSizeHTTPClientFactory'
...And I made a factory that rejects everything (to confirm the override was taking effect):
MAX_RESPONSE_SIZE = 1  # set to 1 byte to reject everything while testing; real limit is 1048576 (1 MB)

from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

class LimitSizePageGetter(ScrapyHTTPPageGetter):

    def handleHeader(self, key, value):
        ScrapyHTTPPageGetter.handleHeader(self, key, value)
        self.connectionLost('oversized')  # unconditional, just to prove the hook fires
        # Intended logic once the override takes effect:
        #if key.lower() == 'content-length' and int(value) > MAX_RESPONSE_SIZE:
        #    self.connectionLost('oversized')

class LimitSizeHTTPClientFactory(ScrapyHTTPClientFactory):
    protocol = LimitSizePageGetter
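If my reading of the source is right, one workaround may be to force Scrapy onto the HTTP 1.0 download handler, since that is the one that honors DOWNLOADER_HTTPCLIENTFACTORY. A settings.py sketch (the handler class path is my assumption from browsing the 0.18 tree; the exact name may differ in your checkout, so verify it under scrapy/core/downloader/handlers/):

```python
# settings.py -- route http/https through the HTTP 1.0 handler so that
# DOWNLOADER_HTTPCLIENTFACTORY is actually consulted (per the ack output above).
# Handler path assumed from the 0.18 source; verify against your checkout.
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
DOWNLOADER_HTTPCLIENTFACTORY = 'adhoc.downloader.LimitSizeHTTPClientFactory'
```

The trade-off is losing whatever the HTTP 1.1 handler provides (persistent connections, etc.), which is why I'd prefer a supported hook if one exists.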
What is the best approach for adding the functionality I am looking for as a user without customizing Scrapy directly?
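For what it's worth, the closest user-level hook I have found so far is a plain downloader middleware. Here is a sketch (`ContentFilterMiddleware` is my own name, not a Scrapy class); the caveat is that process_response only fires after the body has already been downloaded, so it enforces the policy without saving any bandwidth, which is exactly why I am after the client-factory hook:

```python
# Downloader-middleware sketch of the filtering described above.
# Enforces the policy post-download; it does NOT abort the transfer early.

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # let the sketch run without Scrapy installed
    class IgnoreRequest(Exception):
        pass

ALLOWED_TYPES = ('text/html', 'text/plain')
MAX_BYTES = 5 * 1024 * 1024  # 5 MB

class ContentFilterMiddleware(object):

    def process_response(self, request, response, spider):
        # Strip any parameters, e.g. "text/html; charset=utf-8" -> "text/html".
        ctype = (response.headers.get('Content-Type') or '').split(';')[0].strip().lower()
        if ctype not in ALLOWED_TYPES:
            raise IgnoreRequest('disallowed content type: %s' % ctype)
        if len(response.body) > MAX_BYTES:
            raise IgnoreRequest('oversized response: %d bytes' % len(response.body))
        return response
```

Enabling it would be a matter of adding the class to DOWNLOADER_MIDDLEWARES in settings.py, but again, the download itself still completes before the check runs.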
Thank you for your time. -Michael