Is it possible to limit the number of links crawled by CrawlSpider?


Chetan Motamarri

unread,
Sep 29, 2014, 3:06:31 AM9/29/14
to scrapy...@googlegroups.com
Hi,

I am new to CrawlSpider.

My problem is that I need to extract the data for the top 5 items at this link (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems). I specified the rules as:

rules = (
    Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)), callback='parse_items'),
)

Now it crawls every URL on the start_urls page that starts with "http://steamcommunity.com/sharedfiles/filedetails/".

What I need is for it to crawl only the first 5 URLs on the start_urls page that start with "http://steamcommunity.com/sharedfiles/filedetails/". Can this be done with a CrawlSpider restriction, or by any other means?

My code: 

class ScrapePriceSpider(CrawlSpider):

    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)), callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        item = ExtractitemsItem()

        item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        return item

lnxpgn

unread,
Sep 29, 2014, 5:20:09 AM9/29/14
to scrapy...@googlegroups.com

I haven't used SgmlLinkExtractor before, but I think you should use http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems as the start URL, and try the process_links callback in Rule() to filter the URLs down to the top 5 items.


Paul Tremberth

unread,
Sep 29, 2014, 7:14:13 AM9/29/14
to scrapy...@googlegroups.com
Hi,

You can use process_links for this:


Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
     process_links=lambda l: l[:5],
     callback='parse_items'),
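(For clarity, a pure-Python sketch of what that lambda does. process_links is called with the list of Link objects the extractor found on a page, and whatever it returns is what the CrawlSpider actually follows; the function name here is hypothetical:)

```python
# Hypothetical named equivalent of the lambda above: keep only the
# first five extracted links, in the order they were extracted.
def keep_first_five(links):
    return links[:5]
```

So `process_links=keep_first_five` is equivalent to `process_links=lambda l: l[:5]`.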

Chetan Motamarri

unread,
Sep 29, 2014, 1:01:58 PM9/29/14
to scrapy...@googlegroups.com
Hi Paul,

It worked, thank you very much, but it is not taking the first 5 URLs on the start page "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems"; instead it is crawling 5 seemingly random links that start with "http://steamcommunity.com/sharedfiles/filedetails/".

Can we restrict the CrawlSpider to crawl only the first 5 links on the page that start with the above URL?

Paul Tremberth

unread,
Sep 29, 2014, 1:33:23 PM9/29/14
to scrapy...@googlegroups.com
You can try with LinkExtractor and XPath:

LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',))
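(A rough pure-Python paraphrase of what that XPath expresses: of all anchors on the page, in document order, keep those whose href starts with the filedetails prefix, then take only the first five. The names here are illustrative, not part of the Scrapy API:)

```python
# Illustrative stand-in for the restrict_xpaths expression: filter all
# anchor hrefs (in document order) by the filedetails prefix, then keep
# only the first five matches.
PREFIX = "http://steamcommunity.com/sharedfiles/filedetails/"

def first_five_filedetails(hrefs):
    matching = [h for h in hrefs if h.startswith(PREFIX)]
    return matching[:5]
```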

Chetan Motamarri

unread,
Oct 1, 2014, 3:13:44 AM10/1/14
to scrapy...@googlegroups.com
Hi Paul,
Thanks again, bro. It is retrieving only the first 3 items, but I want the first 5. I don't know where it went wrong. Could you please help me? Here is my code:
class ScrapePriceSpider(CrawlSpider):

    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',)),
             process_links=lambda l: l[:5],
             callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        item = ExtractitemsItem()
        uniqueVisits = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        CurrentFavorites = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        itemname = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Item"] = str(itemname)[3:-2]
        item["UniqueVisits"] = str(uniqueVisits)[3:-2]
        item["CurrentFavorites"] = str(CurrentFavorites)[3:-2]
        return item