Parse, callback, and Not Implemented Error following examples in the docs.

2,356 views

Skip to first unread message

Malik Rumi

unread,

Oct 12, 2015, 9:52:13 PM10/12/15

to scrapy-users

I posted a related question to Stack Overflow at http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback, but so far it has no answers.

I am not able to get a spider to crawl past the first page of any site I have tried, despite many iterations and many re-reads of the docs. I decided to test it against the example code from the docs.

The only change I made was to the name, so I could tell it apart.

'''
Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial
Run this, as is, on Dmoz.
'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "tutfollinks"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

And here is what I got:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)

When I googled the error, my first hit was:

http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site

The answer, according to the OP, was to change from BaseSpider to CrawlSpider. But, I repeat, this is copied verbatim from the example in the docs. Then how can it throw an error? In fact, the whole point of the example in the docs is to show how to crawl a site WITHOUT CrawlSpider, which is introduced for the first time in a note at the end of section 2.3.4

Another SO post had a similar issue, but in that case the original code was subclassed from CrawlSpider, and the OP was told he had accidentally overwritten parse(). But I see parse() being used in various examples in the docs, including this one. What, exactly, constitutes 'overwriting parse()'? Is it adding variables like the example in the docs do? How can that be?

Furthermore, the callback in this case is explicitly not parse, but parse_dir_contents.

What is going on here? Please, I'd like a why explanation as well as the hopefully simple answer. Thanks.

Malik Rumi

unread,

Oct 15, 2015, 10:17:02 PM10/15/15

to scrapy-users

Well, I figured it out, but now I have a different problem.

The original problem was that I had a hidden character in my code, probably from being lazy and cut and pasting it directly into my text editor. However, when I retyped the spider and ran it, scrapy ran a different spider, one I had NOT named in the command line. That was a head scratcher, and then I realized I may have may have copied things too closely; the docs say "The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique". Ok, fine, no problem. I changed the name. I also did cat on both files, and you can see the offending character at the start of the bad one, and it is not there in the new one. But when I ran runspider again, I still got the same result. I am at a loss as to why this is happening. Any expert explanations out there?

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat dmoz_debug2.py
��''' #this is the offending hidden character - by the way, 'delete' does not work to get rid of it

Copied from Scrapy 1.03 docs at pdf page 15, section 2.3, Scrapy Tutorial

Run this, as is, on Dmoz. It is dmoz_debug2, with the name of the spider 'dmoz'.
I changed this to iso-8859 per http://stackoverflow.com/questions/1067742/clean-source-code-files-of-invisible-characters.

'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):

name = "dmoz"

    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
        
def parse_dir_contents(self, response):
    for sel in response.xpath('//ul/li'):
        item = DmozItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        item['desc'] = sel.xpath('text()').extract()
        yield item

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ cat tutfollinksc.py
''' # this is my retyped copy, as you can see, without the offending character
This is tutfollinksc, the retyped spider in hopes of getting rid of hidden character
and not implemented error. It is in all respects identical to tutfollinks.

'''

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):

	name = "tutlinkC_dmoz"  # this is where i changed the spider name so it would not be identical to the spider in dmoz_debug2

	allowed_domains = ["dmoz.org"]
	start_urls = [
	    "http://www.dmoz.org/Computers/Programming/Languages/Python/"

	    ]
def parse(self, response):
	for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
		url = response.urljoin(href.extract())
		yield scrapy.Request(url, callback=self.parse_dir_contents)

def parse_dir_contents(self, response):
	for sel in response.xpath('//ul/li'):
		item = DmozItem()
		item['title'] = sel.xpath('a/text()').extract()
		item['link'] = sel.xpath('a/@href').extract()
		item['desc'] = sel.xpath('text()').extract()
		yield item

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider tutfollinksc.py -o tutfollinks_dmoz_c.json

Traceback (most recent call last):

  File "/usr/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.3.post6-g2d688cd', 'console_scripts', 'scrapy')()
  File "/usr/lib/pymodules/python2.7/scrapy/cmdline.py", line 142, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py", line 1
SyntaxError: Non-ASCII character '\xff' in file /home/malikarumi/Projects/tutorial/tutorial/spiders/dmoz_debug2.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Assuming the identical spider name was the issue, why does scrapy continue to run dmoz_debug2 despite the fact that this is NOT what I put in the command line? How do I fix it?

If that's not the issue, what is? Thanks

Malik Rumi

unread,

Oct 15, 2015, 10:41:22 PM10/15/15

to scrapy-users

I'd still like an answer to the previous questions, but as an update, I deleted dmoz_debug2 and then ran runspider on tutfollinksc again. It ran... but then I got a NotImplemented Error, so I am right back where I started.

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider tutfollinksc.py -o tutfollinks_dmoz_c.json

2015-10-15 21:27:42 [scrapy] INFO: Scrapy 1.0.3.post6+g2d688cd started (bot: tutorial)
2015-10-15 21:27:42 [scrapy] INFO: Optional features available: ssl, http11
2015-10-15 21:27:42 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'tutfollinks_dmoz_c.json', 'BOT_NAME': 'tutorial'}
2015-10-15 21:27:42 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-15 21:27:42 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-15 21:27:42 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-15 21:27:42 [scrapy] INFO: Enabled item pipelines: 
2015-10-15 21:27:42 [scrapy] INFO: Spider opened
2015-10-15 21:27:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-15 21:27:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-15 21:27:43 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)
2015-10-15 21:27:43 [scrapy] ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError

2015-10-15 21:27:43 [scrapy] INFO: Closing spider (finished)
2015-10-15 21:27:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7386,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 16, 2, 27, 43, 759336),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2015, 10, 16, 2, 27, 42, 970895)}
2015-10-15 21:27:43 [scrapy] INFO: Spider closed (finished)
malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$

Some enlightenment greatly appreciated

On Monday, October 12, 2015 at 8:52:13 PM UTC-5, Malik Rumi wrote:

Ravi Dhole

unread,

Dec 4, 2015, 9:06:36 AM12/4/15

to scrapy-users

Hi Rumi,

I was receiving the same error while going through the scrapy tutorials.

The solution to our problem is indentation. Please follow http://stackoverflow.com/questions/24727919/scrapy-throwing-up-traceback-when-trying-to-parse-tabulated-data

Reply all

Reply to author

Forward

0 new messages