Is it possible to limit the number of links crawled by CrawlSpider?


Chetan Motamarri

unread,
Sep 29, 2014, 3:06:31 AM9/29/14
to scrapy...@googlegroups.com
Hi,

I am new to CrawlSpider.

My problem is that I need to extract the data for the top 5 items at this link (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems). I specified the rules as:

rules = (
    Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)), callback='parse_items'),
)

Now it crawls every URL on the start_urls page that starts with "http://steamcommunity.com/sharedfiles/filedetails/".

What I need is for it to crawl only the first 5 URLs on the start_urls page that start with "http://steamcommunity.com/sharedfiles/filedetails/". Can this be done with a CrawlSpider restriction, or by any other means?

My code: 

class ScrapePriceSpider(CrawlSpider):

    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)), callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        item = ExtractitemsItem()

        item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        return item

lnxpgn

unread,
Sep 29, 2014, 5:20:09 AM9/29/14
to scrapy...@googlegroups.com

I haven't used SgmlLinkExtractor before, but I think you should use http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems as the start URL, and try the process_links callback in Rule() to filter the URLs down to the top 5 items.


Paul Tremberth

unread,
Sep 29, 2014, 7:14:13 AM9/29/14
to scrapy...@googlegroups.com
Hi,

You can use process_links for this:


Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
     process_links=lambda l: l[:5],
     callback='parse_items'),
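(For clarity, a pure-Python sketch of what that lambda does. process_links is called with the list of Link objects the extractor found on a page, and whatever it returns is what the CrawlSpider actually follows; the function name here is hypothetical:)

```python
# Hypothetical named equivalent of the lambda above: keep only the
# first five extracted links, in the order they were extracted.
def keep_first_five(links):
    return links[:5]
```

So `process_links=keep_first_five` is equivalent to `process_links=lambda l: l[:5]`.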

Chetan Motamarri

unread,
Sep 29, 2014, 1:01:58 PM9/29/14
to scrapy...@googlegroups.com
Hi Paul,

It worked, thank you very much, but it is not taking the first 5 URLs on the start page "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems"; instead it is crawling 5 seemingly random links that start with "http://steamcommunity.com/sharedfiles/filedetails/".

Can we restrict the CrawlSpider to crawl only the first 5 links on the page that start with the above URL?

Paul Tremberth

unread,
Sep 29, 2014, 1:33:23 PM9/29/14
to scrapy...@googlegroups.com
You can try with LinkExtractor and XPath:

LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',))
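(A rough pure-Python paraphrase of what that XPath expresses: of all anchors on the page, in document order, keep those whose href starts with the filedetails prefix, then take only the first five. The names here are illustrative, not part of the Scrapy API:)

```python
# Illustrative stand-in for the restrict_xpaths expression: filter all
# anchor hrefs (in document order) by the filedetails prefix, then keep
# only the first five matches.
PREFIX = "http://steamcommunity.com/sharedfiles/filedetails/"

def first_five_filedetails(hrefs):
    matching = [h for h in hrefs if h.startswith(PREFIX)]
    return matching[:5]
```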

Chetan Motamarri

unread,
Oct 1, 2014, 3:13:44 AM10/1/14
to scrapy...@googlegroups.com
Hi Paul,
Thanks again, bro. It is retrieving only the first 3 items, but I want the first 5. I don't know where it went wrong. Could you please help me? Here is my code:
class ScrapePriceSpider(CrawlSpider):

    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',)),
             process_links=lambda l: l[:5],
             callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        item = ExtractitemsItem()
        uniqueVisits = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        CurrentFavorites = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        itemname = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Item"] = str(itemname)[3:-2]
        item["UniqueVisits"] = str(uniqueVisits)[3:-2]
        item["CurrentFavorites"] = str(CurrentFavorites)[3:-2]
        return item