init method on pipeline class


Malik Rumi

May 28, 2017, 12:03:03 AM
to scrapy-users

This is my third or fourth post in the last 24 hours. I freely admit that I don’t know what I am doing, and that over the last several hours for this particular issue I have been guessing, because I didn’t know what scrapy wanted from me and I couldn’t find an answer.


Here are just a few lines from my log today. It runs over 100 pages when pasted into my word processor. I was just trying to make this work with the pipeline. It started with this error:


SavePipeline(item)
TypeError: object() takes no parameters


and never got better.


I read on SO that this was because my pipeline class did not have its own __init__ method, and so python was searching in the parent object for one. I thought that made sense, so I put an __init__ in there, and hell ensued. It was the usual ‘how many arguments’ problem, but when I tried giving it only self, and leaving the rest blank or with ‘pass’, I got indentation errors.


So I tried putting something innocuous like self.name = name, and we were back to the how many arguments error. I tried giving it process_item as an attribute, and after many go rounds and variations, that worked, but then it wouldn’t take my call to the process_item method – back to the number of arguments again. I imported my spider, and that helped, but still the errors kept coming. It’s been about 6 hours. I have Googled all over the place. I give up. I don’t get it. I need help.


Here is one full traceback, typical of most but hardly the only one, followed by an abbreviated version of some others, including the last:


Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/home/malikarumi/Projects/sukayna/lib/python3.5/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 986, in _gcd_import
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 665, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
class SavePipeline(object):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
SavePipeline(process_item)
NameError: name 'SavePipeline' is not defined
2017-05-28 02:43:30,386:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 87, in <module>
class SavePipeline(object):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in SavePipeline
SavePipeline(process_item)
NameError: name 'SavePipeline' is not defined
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 02:44:46,861:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
SavePipeline(process_item)
NameError: name 'process_item' is not defined
2017-05-28 02:44:46,862:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 96, in <module>
SavePipeline(process_item)
NameError: name 'process_item' is not defined
2017-05-28 03:10:29,174:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100
return cls(name = =crawler.settings.get('ITEM_PIPELINES'),)
^
SyntaxError: invalid syntax
2017-05-28 03:10:51,021:middleware.py:53:from_settings:INFO:Enabled downloader middlewares:
2017-05-28 03:10:51,024:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:Unhandled error in Deferred:
2017-05-28 03:10:51,025:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
NameError: name 'crawler' is not defined
2017-05-28 03:10:51,026:_legacy.py:154:publishToNewObserver:CRITICAL:
Traceback (most recent call last):
File "/home/malikarumi/Projects/sukayna/acquire2/acquire2/pipeline.py", line 100, in from_crawler
return cls(name = crawler.settings.get('ITEM_PIPELINES'),)
NameError: name 'crawler' is not defined



PIPELINE.PY
from items import Acquire2Item
item = Acquire2Item()
from acquire2.spiders import testerapp2

class SavePipeline(object):
    def __init__(self, name):
        self.name = name

    def process_item(self, item, testerapp2):
        item.save()
        return
    process_item(self, item, testerapp2)

    @classmethod
    def from_crawler(cls, testerapp2):
        return cls(name = crawler.settings.get('ITEM_PIPELINES'),)


I notice there is something in there about crawler settings. I read this http://mengyangyang.org/scrapy/topics/item-pipeline.html#from_crawler among many other things. Obviously I don’t get it. Perhaps this is related to my other question about settings earlier today?


I just noticed that url. This must be a Chinese copy of the docs. Don’t think that makes a difference here.


Any help at all will be appreciated.  

Paul Tremberth

May 31, 2017, 11:07:42 AM
to scrapy-users
Hi Malik,


On Sunday, May 28, 2017 at 6:03:03 AM UTC+2, Malik Rumi wrote:

This is my third or fourth post in the last 24 hours. I freely admit that I don’t know what I am doing, and that over the last several hours for this particular issue I have been guessing, because I didn’t know what scrapy wanted from me and I couldn’t find an answer.



I also have a question: what do you want from Scrapy? What are you trying to achieve with this pipeline?

Try to always refer to the official docs and not copies.
It says:

Each item pipeline component (sometimes referred as just “Item Pipeline”) is a Python class that implements a simple method. They receive an item and perform an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.



Item pipelines need to implement one or more of the 4 methods described in https://docs.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline
(You don't need to implement them all.)
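
Besides `process_item`, the optional `open_spider` and `close_spider` hooks are called once when the spider starts and once when it finishes. A sketch of a pipeline using them to write items to a file (purely illustrative; the filename is made up):

    import json

    class JsonWriterPipeline(object):
        def open_spider(self, spider):
            # called once when the spider is opened
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            # called once when the spider is closed
            self.file.close()

        def process_item(self, item, spider):
            # serialize the item as one JSON line and keep it in the pipeline
            self.file.write(json.dumps(dict(item)) + "\n")
            return item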

The most important one is `process_item`;
the Scrapy framework will call this method on the pipeline instance it creates,
once for each item that your spider callbacks return.

This method MUST have the signature described there: it must expect 2 parameters, an item object and the running spider, in addition to the conventional "self" as the first argument.

process_item(self, item, spider) MUST either return an item or a dict (usually transformed from the input item) -- you can return the item unchanged, or raise DropItem to tell Scrapy to drop the item.
"self", "item" and "spider" are variable names, chosen by convention within the Scrapy framework, for you to use in your method implementation:
- "item" points to an instance of your Item or a dict, depending on what your callback returns
- "spider" points to your spider instance

You do not need to rename the parameters in your method signature to "testerapp2".
You can do that in theory, because a parameter is just a name given to the argument so you can work with it inside your method,
but I highly discourage it.

The most basic implementation of `process_item` would be returning the item as-is:

    def process_item(self, item, spider):
        # do something with the item
        # item['processed'] = True
        return item
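
A pipeline can also reject items: raising DropItem tells Scrapy to stop processing that item and not pass it to later pipelines. A minimal sketch, with a made-up condition:

    from scrapy.exceptions import DropItem

    class RequireTitlePipeline(object):
        def process_item(self, item, spider):
            # hypothetical rule: drop any item without a 'title' field
            if not item.get('title'):
                raise DropItem("Missing title in %s" % item)
            return item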



The second important method is the classmethod `from_crawler`, which is usually used to initialize the pipeline object with settings or other info from the crawler object.

You've written

    @classmethod
    def from_crawler(cls, testerapp2):
        return cls(name = crawler.settings.get('ITEM_PIPELINES'),)


You could use some setting to initialize the pipeline, but the ITEM_PIPELINES setting is a dict, so assigning a dict to the pipeline's name field does not make much sense.
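
(For reference, ITEM_PIPELINES lives in settings.py and maps pipeline class paths to an order number; the module path below is only a guess at your project layout:

    ITEM_PIPELINES = {
        'acquire2.pipeline.SavePipeline': 300,
    }

It enables pipelines; it is not something a pipeline would normally read back as its own name.)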
Also, you're seeing NameError: name 'crawler' is not defined.
That makes sense because your `from_crawler` signature uses "testerapp2" as the name of the 2nd argument,
so the name "crawler" inside the method body does not mean anything to the Python interpreter.

One usually writes this:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(name = crawler.settings.get('somesetting'),)
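
Putting those pieces together, a minimal version of your pipeline could look like the sketch below. 'SOME_SETTING' is a placeholder, and the item.save() call is taken from your code, not something Scrapy requires:

    class SavePipeline(object):
        def __init__(self, name):
            self.name = name

        @classmethod
        def from_crawler(cls, crawler):
            # the 2nd argument is named "crawler", so crawler.settings works here
            return cls(name=crawler.settings.get('SOME_SETTING'))

        def process_item(self, item, spider):
            item.save()   # assumes your Item class really has a save() method
            return item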


I think you'll find it useful to read further about Python classes and defining functions, especially regarding formal parameters and the convention of using "self" as the first argument of methods.
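
As a plain-Python illustration of why you first saw "TypeError: object() takes no parameters": a class with no __init__ inherits object's, which accepts no arguments, while a class with an __init__ gets "self" filled in automatically and only the remaining parameters come from the call site:

    class Greeter(object):
        def __init__(self, name):
            # "self" is supplied by Python; "name" is a formal parameter
            self.name = name

    g = Greeter('world')   # one argument -> bound to "name"

    class Empty(object):
        pass

    # Empty('world') would raise: TypeError: object() takes no parameters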


Regards,
/Paul.