I have a spider that keeps track of the URLs scraped from each page it visits. When the scraping is complete, I want to write those URLs to a file (i.e. when I receive the spider_closed signal).
Is it OK to connect to the spider_closed signal in the spider itself? Will it let the file write finish before freeing up the objects in the spider?
Please see the example below:
1) track_scraped_urls is a function that keeps accumulating all the scraped links in a set (scraped_urls)
2) In the 'parse_page' function I remove some urls from this set (the ones that have been visited)
3) Finally, the 'spider_closed' function gets called on the spider_closed signal and dumps the set to a file
4) I tried using the __del__ function of MySpider, but there's no guarantee that it will be called
In essence, I'm trying to ask: when the spider receives the spider_closed signal, does that signal imply 'finish up whatever you are doing and then I'll shut you down and free up the memory'?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )),
             callback='parse_page', process_links='track_scraped_urls'),
    )

    scraped_urls = set()

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # Connect the handler to the spider_closed signal from within the spider
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def track_scraped_urls(self, links):
        # Accumulate every extracted link before it is scheduled
        for link in links:
            self.scraped_urls.add(link.url)
        return links

    def spider_closed(self):
        # Dump the remaining urls to a file when the spider closes
        file = open('scraped_url.log', 'w')
        for url in list(self.scraped_urls):
            file.write(url + '\n')
        file.close()
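
For completeness, parse_page (point 2 above) looks roughly like this; I've stripped out the actual item extraction, and the exact removal call is simplified, but the relevant part is just that the visited url gets dropped from scraped_urls:

    def parse_page(self, response):
        # This url has now been visited, so drop it from the set
        # (real item extraction omitted here for brevity)
        self.scraped_urls.discard(response.url)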