Hello,
I've now spent two full days trying to find a solution to my problem:
When I start my Scrapy spider from the terminal, I get my results as a CSV file and can see them in the terminal.
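From the terminal I run it roughly like this (the -a argument stands in for the real parameters, whose names I'm leaving out), and the CSVPipeline below then writes GoogleScraper_v1_items.csv:

scrapy crawl GoogleScraper_v1 -a some_param=value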
I seem to be unable to do the same when I start the spider from within a script. What I want to do in the end is call Scrapy from a script (passing a few parameters) and get the results back as a list or a Pandas DataFrame.
I have googled and read all the discussion posts I could find, including the Scrapy docs on running Scrapy from a script (https://scrapy.readthedocs.org/en/latest/topics/practices.html).
Does anyone have a working set of scripts that allows calling Scrapy from within a script and collecting the results?
Does anyone see a flaw in my approach?
Thank you very much for your help! I'm happy to help you in return with whatever I can (probably not much...).
Pascal
My callScrapy script:

import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'Scraper.settings')

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

# Import the actual scraper
from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1


def stop_reactor():
    reactor.stop()


def start_scraper():
    # From: https://scrapy.readthedocs.org/en/latest/topics/practices.html
    spider = GoogleScraper_v1()
    settings = get_project_settings()
    print str(settings)
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Running reactor...')
    reactor.run()  # the script will block here until the spider is closed
    log.msg('Reactor stopped.')


if __name__ == "__main__":
    # Call method to start scraper
    start_scraper()
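For what it's worth, this is roughly what I imagine the list/DataFrame variant would look like: connect a handler to the item_scraped signal, append every item to a list, and turn the list into a DataFrame once the reactor has stopped. This is only a sketch built on the snippet above (items, collect_item and start_scraper_as_dataframe are my own names) and I have not gotten it to work:

import pandas as pd

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1

items = []  # every scraped item ends up in this list

def collect_item(item, response, spider):
    # called once per item via the item_scraped signal
    items.append(dict(item))

def start_scraper_as_dataframe():
    spider = GoogleScraper_v1()
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()  # blocks until the spider is closed
    return pd.DataFrame(items)

Is something along these lines the right direction?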
My pipeline:

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


# Exports scraped results as a CSV file
# Source: http://stackoverflow.com/questions/20753358/how-can-i-use-the-fields-to-export-attribute-in-baseitemexporter-to-order-my-scr
class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['desc', 'link']  # <--- what goes here instead of 'desc'?
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class NormalPipeline(object):

    def process_item(self, item, spider):
        return item
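For completeness, this is how I assume the pipeline gets enabled in Scraper/settings.py (the module path Scraper.pipelines and the order value 300 are my guess at the usual layout):

# In Scraper/settings.py -- register the pipeline so Scrapy runs it on every crawl
ITEM_PIPELINES = {
    'Scraper.pipelines.CSVPipeline': 300,
}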
My console output:
2015-03-13 12:41:06+0000 [scrapy] INFO: Running reactor...
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Closing spider (finished)
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 113002,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 628538),
'item_scraped_count': 69,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 52977)}
2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Spider closed (finished)
2015-03-13 12:41:06+0000 [scrapy] INFO: Reactor stopped.