This is my third or fourth post in the last 24 hours. I freely admit that I don't know what I am doing, and that for the last several hours on this particular issue I have been guessing, because I didn't know what Scrapy wanted from me and I couldn't find an answer.
Here are just a few lines from my log today; the full log runs over 100 pages when pasted into my word processor. I was just trying to get my pipeline working. It started with this error:
SavePipeline(item)
TypeError: object() takes no parameters
and never got better.
I read on SO that this was because my pipeline class did not have its own __init__ method, so Python was searching the parent object for one. That made sense to me, so I put an __init__ in there, and hell ensued. It was the usual 'how many arguments' problem, but when I tried giving it only self and leaving the body empty or with just 'pass', I got indentation errors.
So I tried putting in something innocuous like self.name = name, and we were back to the how-many-arguments error. I tried giving it process_item as an attribute, and after many go-rounds and variations that worked, but then it wouldn't take my call to the process_item method; back to the number-of-arguments error again. I imported my spider, and that helped, but the errors kept coming. It's been about six hours. I have Googled all over the place. I give up. I don't get it. I need help.
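If I understand the SO answer correctly, the first error can be reproduced without Scrapy at all. Here is a stripped-down sketch of what I think is going on (the class names are just placeholders I made up):

```python
# A class with no __init__ of its own: arguments fall through to
# object.__init__, which accepts none, hence the TypeError.
class NoInit(object):
    pass

try:
    NoInit("some item")
except TypeError as e:
    print(e)  # e.g. "object() takes no parameters" on Python 3.5

# Giving the class its own __init__ that accepts the argument fixes it.
class WithInit(object):
    def __init__(self, item):
        self.item = item

w = WithInit("some item")  # no error
```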
Here is one full traceback, typical of most but hardly the only one, followed by an abbreviated version of some others, including the last:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
    class SavePipeline(object):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
    SavePipeline(process_item)
NameError: name 'SavePipeline' is not defined
2017-05-28 02:43:30,386:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
    class SavePipeline(object):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
    SavePipeline(process_item)
NameError: name 'SavePipeline' is not defined
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
    SavePipeline(process_item)
NameError: name 'process_item' is not defined
2017-05-28 02:44:46,862:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
    SavePipeline(process_item)
NameError: name 'process_item' is not defined
2017-05-28 03:10:29,174:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100
    return cls(name = =crawler.settings.get('ITEM_PIPELINES'),)
                      ^
SyntaxError: invalid syntax
2017-05-28 03:10:51,021:middleware.py:53:from_settings:INFO:Enabled downloader middlewares:
2017-05-28 03:10:51,024:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
    return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
NameError: name 'crawler' is not defined
2017-05-28 03:10:51,026:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
  File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
    return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
NameError: name 'crawler' is not defined
PIPELINE.PY

from items import Acquire2Item
item = Acquire2Item()
from acquire2.spiders import testerapp2

class SavePipeline(object):
    def __init__(self, name):
        self.name = name

    def process_item(self, item, testerapp2):
        item.save()
        return

    process_item(self, item, testerapp2)

    @classmethod
    def from_crawler(cls, testerapp2):
        return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
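For contrast, here is what I think a version that follows the docs' signatures would look like. Am I on the right track? This is just my sketch; 'BOT_NAME' is standing in for whatever setting actually belongs there, and I'm assuming my item has a save() method:

```python
class SavePipeline(object):
    def __init__(self, name):
        self.name = name

    @classmethod
    def from_crawler(cls, crawler):
        # The parameter has to be named the same thing the body uses.
        return cls(name=crawler.settings.get('BOT_NAME'))

    def process_item(self, item, spider):
        # Scrapy passes the spider in; I don't import it myself.
        item.save()    # assumes the item has a save() method
        return item    # pipelines must return the item (or raise DropItem)
```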
I notice there is something in there about crawler settings. I read this http://mengyangyang.org/scrapy/topics/item-pipeline.html#from_crawler among many other things. Obviously I don't get it. Perhaps this is related to my other question about settings earlier today?
I just noticed that URL; it must be a Chinese mirror of the docs. I don't think that makes a difference here.
Any help at all will be appreciated.
From the item pipeline docs: Each item pipeline component (sometimes referred to as just an "Item Pipeline") is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
def process_item(self, item, spider):
    # do something with the item
    # item['processed'] = True
    return item
My version:

    @classmethod
    def from_crawler(cls, testerapp2):
        return cls(name = crawler.settings.get('ITEM_PIPELINES'),)

The docs' version:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(name = crawler.settings.get('somesetting'),)
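Is the difference just that the parameter name in the def line has to match the name used in the body? Here is a sketch of what I mean, outside Scrapy entirely (Demo and FakeCrawler are names I made up):

```python
class Demo(object):
    def __init__(self, name):
        self.name = name

    @classmethod
    def broken(cls, testerapp2):
        # Body refers to 'crawler', but the parameter is 'testerapp2',
        # so calling this raises NameError: name 'crawler' is not defined.
        return cls(name=crawler.setting)

    @classmethod
    def fixed(cls, crawler):
        # Parameter name and body agree, so this works.
        return cls(name=crawler.setting)

class FakeCrawler(object):
    setting = 'ITEM_PIPELINES'

print(Demo.fixed(FakeCrawler()).name)  # ITEM_PIPELINES
```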