Hi,
I'm trying to export my Scrapyd crawl's output to a JSON file. Deploying and scheduling the spider both seem to work:
{"status": "ok", "versions": []}
marco@pc:~/crawlscrape/urls_listing$ scrapyd-deploy urls_listing -p urls_listing
Packing version 1422294714
Server response (200):
{"status": "ok", "project": "urls_listing", "version": "1422294714", "spiders": 1}
{"status": "ok", "jobid": "0b4518bea58411e482bcc04a00090e80"}
And this is the log file:
2015-01-26 18:52:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: urls_listing)
2015-01-26 18:52:08+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-01-26 18:52:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'urls_listing.spiders', 'SPIDER_MODULES': ['urls_listing.spiders'], 'FEED_URI': '/var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl', 'LOG_FILE': '/var/log/scrapyd/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.log', 'BOT_NAME': 'urls_listing'}
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled item pipelines:
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider opened
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-26 18:52:08+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-26 18:52:08+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
......(omitted)
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Closing spider (finished)
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in: /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 434,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 51709,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 820513),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 612923)}
2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider closed (finished)
But there is no output.json:
marco@pc:~/crawlscrape/urls_listing$ ls -a
. .. build project.egg-info scrapy.cfg setup.py urls_listing
in ~/crawlscrape/urls_listing/urls_listing:
in items.py:
class UrlsListingItem(scrapy.Item):
    # define the fields for your item here like:
    #url = scrapy.Field()
    #url = scrapy.Field(serializer=UrlsListingJsonExporter)
    url = scrapy.Field(serializer=serialize_url)
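(serialize_url is just a small helper of mine; as I understand it, a Scrapy field serializer is any callable applied to the value on export. It's along these lines — a simplified sketch, not my exact code:)

```python
def serialize_url(value):
    # simplified sketch: trim whitespace and make sure there is a scheme
    value = value.strip()
    if not value.startswith(('http://', 'https://')):
        value = 'http://' + value
    return value

print(serialize_url(' example.com/page '))
```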
in pipelines.py I put:
class JsonExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
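One thing I'm not sure about: if I understand correctly, a custom pipeline only runs when it is registered in settings.py, which would look something like this (assuming my class path is right):

```python
# settings.py -- enable the custom pipeline (the number sets its order, 0-1000)
ITEM_PIPELINES = {
    'urls_listing.pipelines.JsonExportPipeline': 300,
}
```

The log above does say "Enabled item pipelines:" with nothing after it, so maybe that is related.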
in settings.py I put:
BOT_NAME = 'urls_listing'
SPIDER_MODULES = ['urls_listing.spiders']
NEWSPIDER_MODULE = 'urls_listing.spiders'
FEED_URI = 'file://home/marco/crawlscrape/urls_listing/output.json'
#FEED_URI = 'output.json'
FEED_FORMAT = 'jsonlines'
FEED_EXPORTERS = {
    'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
}
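While double-checking the settings, I also tried parsing my FEED_URI with urllib.parse (just a quick diagnostic on my side, not part of the project), and the two-slash form seems to treat "home" as a host name rather than as part of the path:

```python
from urllib.parse import urlparse  # Python 3; the urlparse module on Python 2

# the URI exactly as it appears in my settings.py
uri = urlparse('file://home/marco/crawlscrape/urls_listing/output.json')
print(uri.netloc)  # the part between 'file://' and the next slash
print(uri.path)

# the same URI with a third slash (empty host)
uri3 = urlparse('file:///home/marco/crawlscrape/urls_listing/output.json')
print(uri3.netloc)
print(uri3.path)
```

I'm not sure whether that matters here.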
What am I doing wrong?
Looking forward to your kind help.
Marco