Crawling Every Page of a Website

Tim Fitzhardinge

Oct 9, 2016, 7:19:06 AM

to scrapy-users

I'm new to web crawling. I successfully ran the main tutorial spider (below) as myspider.py. Now, how do I crawl every page of a website? I tried changing start_urls to the home page of another website, but it only crawled one page.

For example, say I want to crawl every page of the http://www.asx.com.au website. I believe there will be 10,000+ pages. Thank you.

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Felipe Ruhland

Oct 10, 2016, 5:46:10 AM

to scrapy...@googlegroups.com

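Hey, Tim. The selector in the tutorial spider (div.prev-post > a) only matches that blog's layout, so for another site you have to change your code and find that site's next-page selector.

You can use the Scrapy shell [1] to search for the next-page selector.

I hope this helps.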

Good luck.

[1] https://doc.scrapy.org/en/latest/topics/shell.html
