How to override HTTP Client Factory behavior in Scrapy 0.18?


Michael McIntosh

Oct 21, 2013, 4:24:12 PM10/21/13
to scrapy...@googlegroups.com
Hi Folks,

Summary

I want to modify what our Scrapy spider decides to download based upon the content of the HTTP headers. I've found snippets for controlling that behavior, but they do not seem to work with Scrapy 0.18. What is the best way to override default Scrapy behavior so I can trigger our filtering code once enough of the document has been downloaded to parse the HTTP headers? 

Details

I'm a search engine architect evaluating Scrapy for use as an ad hoc/diagnostic site crawler to complement a commercial web crawler our client uses.

Many of the websites we crawl contain dynamic links to very large files whose content type you do not know until you receive the HTTP headers (the URLs do not contain file extensions).

I want to modify our Scrapy spider so that it only downloads documents of these content types:

text/html
text/plain

I further want to limit the size, so that if the Content-Length is listed as more than 5 MB, we truncate or drop the response.
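In pseudocode, the header check I have in mind is roughly this (the function and header-dict shape here are illustrative, not actual Scrapy API):

```python
ALLOWED_TYPES = {'text/html', 'text/plain'}
MAX_RESPONSE_SIZE = 5 * 1024 * 1024  # 5 MB

def should_download(headers):
    """Decide from the response headers alone whether to fetch the body."""
    # Strip parameters like '; charset=utf-8' before comparing the main type.
    content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()
    if content_type not in ALLOWED_TYPES:
        return False
    content_length = int(headers.get('Content-Length', 0))
    return content_length <= MAX_RESPONSE_SIZE
```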

I've searched around extensively and found some snippets for accomplishing this, like this one (http://snipplr.com/view/66993/), but attempts to override DOWNLOADER_HTTPCLIENTFACTORY seem to have no effect (even putting a bogus module path there has no effect).

I think what is happening is that Scrapy is using HTTP 1.1 (which is good), but overriding the HTTP client factory is only supported in scrapy.core.downloader.handlers.http10, not in scrapy.core.downloader.handlers.http11. I have no idea if this is a 'bug' or a 'feature'.

I searched the source code and noticed DOWNLOADER_HTTPCLIENTFACTORY is used in http10.py but not http11.py, while DOWNLOADER_CLIENTCONTEXTFACTORY is referenced in both as well as default_settings.py:

~/tmp/scrapy/scrapy $ ack DOWNLOADER_HTTPCLIENTFACTORY
core/downloader/handlers/http10.py
10:        self.HTTPClientFactory = load_object(settings['DOWNLOADER_HTTPCLIENTFACTORY'])

settings/default_settings.py
68:DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'

~/tmp/scrapy/scrapy $ ack DOWNLOADER_CLIENTCONTEXTFACTORY
core/downloader/handlers/http10.py
11:        self.ClientContextFactory = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])

core/downloader/handlers/http11.py
28:        self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])

settings/default_settings.py
69:DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'

For completeness, I added the following to our settings.py:

DOWNLOADER_HTTPCLIENTFACTORY = 'adhoc.downloader.LimitSizeHTTPClientFactory'

...And I made a factory to reject anything (to confirm it was working):

MAX_RESPONSE_SIZE = 1  # set to 1 for testing; real value would be 1048576 (1 MB)

from scrapy.core.downloader.webclient import ScrapyHTTPClientFactory, ScrapyHTTPPageGetter

class LimitSizePageGetter(ScrapyHTTPPageGetter):

    def handleHeader(self, key, value):
        ScrapyHTTPPageGetter.handleHeader(self, key, value)
        self.connectionLost('oversized')  # reject everything, to confirm the factory is used
        #if key.lower() == 'content-length' and int(value) > MAX_RESPONSE_SIZE:
        #    self.connectionLost('oversized')

class LimitSizeHTTPClientFactory(ScrapyHTTPClientFactory):
    protocol = LimitSizePageGetter


What is the best approach for adding the functionality I am looking for as a user without customizing Scrapy directly?

Thank you for your time. -Michael 

Яков Штоколов

Oct 27, 2014, 10:39:25 AM10/27/14
to scrapy...@googlegroups.com
Maybe it's too late, but I think this still needs an answer.

To understand what's happening, you need to know one thing:
for handling HTTP 1.0 requests, Scrapy uses the twisted.web.http.HTTPClient class (http://twistedmatrix.com/documents/8.1.0/api/twisted.web.http.HTTPClient.html), but for HTTP 1.1 it uses a higher-level client, twisted.web.client.Agent (http://twistedmatrix.com/documents/13.1.0/api/twisted.web.client.Agent.html). That's why there is no HTTPClientFactory to override.

So, to extend the default functionality, you need to override the 'http' and 'https' handlers in the DOWNLOAD_HANDLERS setting.
Like this:
DOWNLOAD_HANDLERS = {
    'http': 'myproject.downloadhandlers.http11.MyHTTP11DownloadHandler',
    'https': 'myproject.downloadhandlers.http11.MyHTTP11DownloadHandler',
}

And your http11.py will look like this:
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent

MAX_RESPONSE_SIZE = 5 * 1024 * 1024  # 5 MB


class MyHTTP11DownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = MyScrapyAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class MyScrapyAgent(ScrapyAgent):

    def _cb_bodyready(self, txresponse, request):
        """Prevent body downloading if Content-Length is more than MAX_RESPONSE_SIZE."""
        content_length = int(txresponse.headers.getRawHeaders('content-length', [0])[0])

        if content_length > MAX_RESPONSE_SIZE:
            return txresponse, '', None

        return super(MyScrapyAgent, self)._cb_bodyready(txresponse, request)
That's all!
If you need to extend this in a more complex way, just look at scrapy/core/downloader/handlers/http11.py and read the Twisted documentation linked above.
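For example, to also skip responses whose type is not text/html or text/plain (as the original question asked), the check can be factored into a small helper that works on the raw header list returned by twisted's Headers.getRawHeaders (a sketch; the helper name and whitelist are mine):

```python
ALLOWED_TYPES = ('text/html', 'text/plain')

def is_allowed_content_type(raw_headers):
    """raw_headers: the list returned by
    txresponse.headers.getRawHeaders('content-type', []).
    Returns True only for whitelisted main types."""
    if not raw_headers:
        return False
    # Drop parameters like '; charset=utf-8' and normalize case.
    main_type = raw_headers[0].split(';')[0].strip().lower()
    return main_type in ALLOWED_TYPES
```

Inside MyScrapyAgent._cb_bodyready you would call it as `is_allowed_content_type(txresponse.headers.getRawHeaders('content-type', []))` and return `txresponse, '', None` when it is False, just like the size check.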