Crawling Every Page of a Website


Tim Fitzhardinge

Oct 9, 2016, 7:19:06 AM
to scrapy-users
Hi

I'm new to web crawling. I successfully ran the main tutorial as myspider.py. Now, how do I crawl every page of a website? I tried changing start_urls to a site's home page, but it only crawled one page.

For example, say I want to crawl every page of the http://www.asx.com.au website. I believe there will be 10,000+ pages. Thank you.

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield the title of each post listed on the current page.
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the previous page of posts, if there is one.
        next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)



Felipe Ruhland

Oct 10, 2016, 5:46:10 AM
to scrapy...@googlegroups.com
Hey, Tim. You have to change your code and find the selector for the next-page link on the site you want to crawl.
You can use scrapy shell[1] to search for the next-page selector.

I hope this helps.

Good luck.

