Hello,
I've now spent two full days trying to find a solution to my problem:
When I start my Scrapy spider from the terminal, I get my results as a CSV file and can see them in the terminal.
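From the terminal I run it roughly like this (the -a argument stands in for the real parameters, whose names I'm leaving out), and the CSVPipeline below then writes GoogleScraper_v1_items.csv:

scrapy crawl GoogleScraper_v1 -a some_param=value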
I seem to be unable to do the same when I start the spider from within a script. What I want to do in the end is call Scrapy from a script (passing a few parameters) and get the results back as a list or a Pandas DataFrame.
I have googled and read all the discussion posts I could find, including the Scrapy docs on running Scrapy from a script (https://scrapy.readthedocs.org/en/latest/topics/practices.html).
Does anyone have a working set of scripts that allows calling Scrapy from within a script and collecting the results?
Does anyone see a flaw in my approach?
Thank you very much for your help! I'm happy to help you in return with whatever I can (probably not much...).
Pascal
My callScrapy script:

import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'Scraper.settings')

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

# Import the actual scraper
from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1


def stop_reactor():
    reactor.stop()


def start_scraper():
    # From: https://scrapy.readthedocs.org/en/latest/topics/practices.html
    spider = GoogleScraper_v1()
    settings = get_project_settings()
    print str(settings)
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')


if __name__ == "__main__":
    # Call method to start scraper
    start_scraper()
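For what it's worth, this is roughly what I imagine the list/DataFrame variant would look like: connect a handler to the item_scraped signal, append every item to a list, and turn the list into a DataFrame once the reactor has stopped. This is only a sketch built on the snippet above (items, collect_item and start_scraper_as_dataframe are my own names) and I have not gotten it to work:

import pandas as pd

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1

items = []  # every scraped item ends up in this list

def collect_item(item, response, spider):
    # called once per item via the item_scraped signal
    items.append(dict(item))

def start_scraper_as_dataframe():
    spider = GoogleScraper_v1()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()  # blocks until the spider is closed
    return pd.DataFrame(items)

Is something along these lines the right direction?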
My pipeline:

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


# Exports scraped results as a CSV file
# Source: http://stackoverflow.com/questions/20753358/how-can-i-use-the-fields-to-export-attribute-in-baseitemexporter-to-order-my-scr
class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['desc', 'link']  # <--- what goes here instead of 'desc'?
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class NormalPipeline(object):

    def process_item(self, item, spider):
        return item
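For completeness, this is how I assume the pipeline gets enabled in Scraper/settings.py (the module path Scraper.pipelines and the order value 300 are my guess at the usual layout):

# In Scraper/settings.py -- register the pipeline so Scrapy runs it on every crawl
ITEM_PIPELINES = {
    'Scraper.pipelines.CSVPipeline': 300,
}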
My console output:
2015-03-13 12:41:06+0000 [scrapy] INFO: Running reactor...
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Closing spider (finished)
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 113002,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 628538),
'item_scraped_count': 69,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 52977)}
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Spider closed (finished)
2015-03-13 12:41:06+0000 [scrapy] INFO: Reactor stopped.