PhantomJS DOWNLOAD_HANDLER setup

David Fishburn

May 14, 2015, 10:11:47 AM
to scrapy...@googlegroups.com
Following up on my PhantomJS middleware post: I believe a DOWNLOAD_HANDLER is different from a middleware, which is why I am posting this separately.
I did find a GitHub project which provides a single Python file:

    scrapy_phantomjs/downloader/handler.py

Since I am new to both Python and Scrapy, I am having a hard time understanding where to put that file and how to reference it.

Assuming I have the following project structure.

/
    scrapy.cfg

    sapui5api/
        __init__.py
        items.py
        pipelines.py
        settings.py

        spiders/
            sapui5api_spiders.py


1.  Which directory should handler.py go in?
2.  How do I reference it (what name do I use)?
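
For what it is worth, here is my current (possibly wrong) understanding of how the naming works: Python maps the directory layout onto a dotted path, one package per directory, so a class is referenced as <package>.<module>.<ClassName>. A minimal sketch, assuming spiders/ also contains an __init__.py (which I believe Python 2 requires for a directory to be importable as a package):

    sapui5api/__init__.py            ->  package  sapui5api
    sapui5api/spiders/__init__.py    ->  package  sapui5api.spiders
    sapui5api/spiders/handler.py     ->  module   sapui5api.spiders.handler
    class PhantomJSDownloadHandler   ->  object   sapui5api.spiders.handler.PhantomJSDownloadHandler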


This is what I have tried so far.

I added:

    /sapui5api/spiders/handler.py


This file defines the following:


from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):
    # ... rest of the class as provided by the GitHub project ...

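From skimming Scrapy's built-in handlers (scrapy.core.downloader.handlers.http11 in 0.24), I believe the interface a download handler has to provide is a constructor that takes the settings object and a download_request(request, spider) method that returns a Twisted Deferred. A minimal sketch of that shape, just to check my understanding (this is my assumption, not the GitHub project's actual code):

from twisted.internet import defer


class MinimalDownloadHandler(object):
    """Bare-bones shape of a Scrapy 0.24 download handler (my assumption)."""

    def __init__(self, settings):
        # Scrapy instantiates the handler with the project settings
        self.settings = settings

    def download_request(self, request, spider):
        # Must return a Deferred that eventually fires with a Response
        # (or errbacks with a Failure); a real handler would drive
        # PhantomJS/Selenium here instead of failing.
        return defer.fail(NotImplementedError("sketch only"))
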
In /sapui5api/settings.py I added:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.PhantomJSDownloadHandler',
    'https': 'crawler.https.PhantomJSDownloadHandler',
}


I also tried:

DOWNLOAD_HANDLERS = {
    'http': 'sapui5api.spiders.PhantomJSDownloadHandler',
    'https': 'sapui5api.spiders.PhantomJSDownloadHandler',
}
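
If the dotted-path reading above is right, then with handler.py sitting under sapui5api/spiders/ I would expect the entry to need the module name as well, something like this (an untested guess on my part):

DOWNLOAD_HANDLERS = {
    'http': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
    'https': 'sapui5api.spiders.handler.PhantomJSDownloadHandler',
}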


I am really just guessing at this point. Here is the output I get:

D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\settings\deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
2015-05-14 10:08:34-0400 [scrapy] INFO: Scrapy 0.24.6 started (bot: sapui5api)
2015-05-14 10:08:34-0400 [scrapy] INFO: Optional features available: ssl, http11
2015-05-14 10:08:34-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sapui5api.spiders', 'SPIDER_MODULES': ['sapui5api.spiders'], 'USER_AGENT': 'sapui5api/1.0', 'BOT_NAME': 'sapui5api'}
2015-05-14 10:08:35-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
  File "d:\python27\scripts\scrapy-script.py", line 9, in <module>
    load_entry_point('scrapy==0.24.6', 'console_scripts', 'scrapy')()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\commands\crawl.py", line 60, in run
    self.crawler_process.start()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 92, in start
    if self.start_crawling():
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 124, in start_crawling
    return self._start_crawler() is not None
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 139, in _start_crawler
    crawler.configure()
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\crawler.py", line 47, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\engine.py", line 64, in __init__
    self.downloader = downloader_cls(crawler)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\__init__.py", line 73, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\core\downloader\handlers\__init__.py", line 22, in __init__
    cls = load_object(clspath)
  File "D:\Python27\lib\site-packages\scrapy-0.24.6-py2.7.egg\scrapy\utils\misc.py", line 42, in load_object
    raise ImportError("Error loading object '%s': %s" % (path, e))
ImportError: Error loading object 'crawler.http.PhantomJSDownloadHandler': No module named crawler.http


I am not sure how the whole module naming scheme works between Python and Scrapy.

How do you know which directory to put handler.py in? The docs only talk about creating one; they never mention which directory these files have to go in, or how to reference them properly after you create them.
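
In case it helps whoever answers: I assume a quick way to check whether a given dotted path is importable at all is to run Python by hand from the project root (the directory containing scrapy.cfg, so that the sapui5api package is on the path) and try the import directly, e.g.:

    python -c "from sapui5api.spiders.handler import PhantomJSDownloadHandler; print(PhantomJSDownloadHandler)"

If that raises ImportError, I would expect Scrapy's load_object to fail the same way.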

Any help is greatly appreciated.

David

