PhantomJS DOWNLOAD_HANDLER setup

David Fishburn

May 14, 2015, 10:11:47 AM
to scrapy...@googlegroups.com
Following up on my PhantomJS middleware post: I believe a DOWNLOAD_HANDLER is different from a middleware, which is why I am posting this separately.
I did find a GitHub project which provides a single Python file:

    scrapy_phantomjs/downloader/handler.py

Since I am new to both Python and Scrapy, I am having a hard time understanding where to put that file and how to reference it.

Assuming I have the following project structure.

/
    scrapy.cfg

    sapui5api/
        __init__.py
        items.py
        pipelines.py
        settings.py

        spiders/
            sapui5api_spiders.py


1.  Which directory should handler.py go in?
2.  How do I reference it (what name do I use)?
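
For what it is worth, here is my current (possibly wrong) understanding of how the naming works: Python maps the directory layout onto a dotted path, one package per directory, so a class is referenced as <package>.<module>.<ClassName>. A minimal sketch, assuming spiders/ also contains an __init__.py (which I believe Python 2 requires for a directory to be importable as a package):

    sapui5api/__init__.py            ->  package  sapui5api
    sapui5api/spiders/__init__.py    ->  package  sapui5api.spiders
    sapui5api/spiders/handler.py     ->  module   sapui5api.spiders.handler
    class PhantomJSDownloadHandler   ->  object   sapui5api.spiders.handler.PhantomJSDownloadHandler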


This is what I have tried so far.

I added:

    /sapui5api/spiders/handler.py


This file defines the following:


from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):
    # ... rest of the class as provided by the GitHub project ...

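From skimming Scrapy's built-in handlers (scrapy.core.downloader.handlers.http11 in 0.24), I believe the interface a download handler has to provide is a constructor that takes the settings object and a download_request(request, spider) method that returns a Twisted Deferred. A minimal sketch of that shape, just to check my understanding (this is my assumption, not the GitHub project's actual code):

from twisted.internet import defer


class MinimalDownloadHandler(object):
    """Bare-bones shape of a Scrapy 0.24 download handler (my assumption)."""

    def __init__(self, settings):
        # Scrapy instantiates the handler with the project settings
        self.settings = settings

    def download_request(self, request, spider):
        # Must return a Deferred that eventually fires with a Response
        # (or errbacks with a Failure); a real handler would drive
        # PhantomJS/Selenium here instead of failing.
        return defer.fail(NotImplementedError("sketch only"))
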
In /sapui5api/settings.py I added:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.PhantomJSDownloadHandler',
    'https': 'crawler.https.PhantomJSDownloadHandler',
}


I also tried:

DOWNLOAD_HANDLERS = {
    'http': 'sapui5api.spiders.PhantomJSDownloadHandler',
    'https': 'sapui5api.spiders.PhantomJSDownloadHandler',
}
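
If the dotted-path reading above is right, then with handler.py sitting under sapui5api/spiders/ I would expect the entry to need the module name as well, something like this (an untested guess on my part):

DOWNLOAD_HANDLERS = {
    'http': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
    'https': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
}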


I am really just guessing at this point. Here is the output I get:

D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\settings\deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2015-05-14 10:08:34-0400 [scrapy] INFO: Scrapy 0.24.6 started (bot: sapui5api)
2015-05-14 10:08:34-0400 [scrapy] INFO: Optional features available: ssl, http11
2015-05-14 10:08:34-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sapui5api.spiders', 'SPIDER_MODULES': ['sapui5api.spiders'], 'USER_AGENT': 'sapui5api/1.0', 'BOT_NAME': 'sapui5api'}
2015-05-14 10:08:35-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "d:\python27\scripts\scrapy-script.py", line 9, in <module>
    load_entry_point('scrapy==0.24.6', 'console_scripts', 'scrapy')()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\commands\crawl.py", line 60, in run
    self.crawler_process.start()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 92, in start
    if self.start_crawling():
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling
    return self._start_crawler() is not None
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler
    crawler.configure()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\engine.py", line 64, in __init__
    self.downloader = downloader_cls(crawler)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 22, in __init__
    cls = load_object(clspath)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\utils\misc.py", line 42, in load_object
    raise ImportError("Error loading object '%s': %s" % (path, e))
ImportError: Error loading object 'crawler.http.PhantomJSDownloadHandler': No module named crawler.http


I am not sure how the whole module naming scheme works between Python and Scrapy.

How do you know which directory to put handler.py in? The docs only talk about creating one; they never mention which directory these files have to go in, or how to reference them properly after you create them.
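
In case it helps whoever answers: I assume a quick way to check whether a given dotted path is importable at all is to run Python by hand from the project root (the directory containing scrapy.cfg, so that the sapui5api package is on the path) and try the import directly, e.g.:

    python -c "from sapui5api.spiders.handler import PhantomJSDownloadHandler; print(PhantomJSDownloadHandler)"

If that raises ImportError, I would expect Scrapy's load_object to fail the same way.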

Any help is greatly appreciated.

David

