Connecting to spider_closed signal inside the spider... is it safe?

Ensnap

Jun 1, 2011, 8:38:11 AM
to scrapy-users
I have a spider that keeps track of the URLs scraped from each page it visits. When the crawl is complete, I want to write those URLs to a file (i.e. when I receive the spider_closed signal).
Is it OK to connect to the spider_closed signal in the spider itself? Will the file write be allowed to finish before the spider's objects are freed?
Please see the example below:
1) track_scraped_urls is a function that accumulates all the scraped links in a set (scraped_urls).
2) In the parse_page function I remove some URLs from this set (the ones that have been visited).
3) Finally, the spider_closed function gets called on the spider_closed signal and dumps the set to a file.
4) I tried using the __del__ method of MySpider, but there is no guarantee that it will be called.
In essence, I'm asking: when the spider receives the spider_closed signal, does that imply "finish up whatever you are doing, and then I'll shut you down and free up the memory"?

from scrapy import signals
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.xlib.pydispatch import dispatcher


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )),
             callback='parse_page', process_links='track_scraped_urls'),
    )

    scraped_urls = set()

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def track_scraped_urls(self, links):
        # Accumulate every extracted link; parse_page (not shown) later
        # removes the URLs that were actually visited.
        for link in links:
            self.scraped_urls.add(link.url)
        return links

    def spider_closed(self):
        # Dump the remaining URLs to a file when the spider closes.
        with open('scraped_url.log', 'w') as f:
            for url in self.scraped_urls:
                f.write(url + '\n')

Pablo Hoffman

Jun 8, 2011, 1:26:47 PM
to scrapy...@googlegroups.com
Catching spider_closed on the spider itself should be safe, but make sure the spider is handling its own signal, not that of other spiders. You can do that with this code:

def spider_closed(self, spider):
    if spider is not self:
        return
    # ... rest of the code ...
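
For completeness, a rough (untested) sketch of how that guard could slot into the spider from the original post, keeping the same dispatcher/signals imports and the same log filename; the handler just declares a spider parameter so the signal machinery passes in the spider that actually closed:

from scrapy import signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher


class MySpider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules as in the original post ...
    scraped_urls = set()

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # Register for spider_closed; in a process running several spiders,
        # this handler fires once for each of them.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # Only react when it is this spider that closed.
        if spider is not self:
            return
        with open('scraped_url.log', 'w') as f:
            for url in self.scraped_urls:
                f.write(url + '\n')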
