A little help for a new scrapy user?


Tina C

Nov 18, 2014, 2:54:55 PM
to scrapy...@googlegroups.com
There has to be something really simple that I'm missing. I'm trying to get my spider to crawl more than one page, and I'm using one section of the site as a starting point for testing. I can't get it to crawl anything beyond the index page. What am I doing wrong?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor

class DmozSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item


Travis Leleu

Nov 18, 2014, 4:36:46 PM
to scrapy...@googlegroups.com
Hi Tina!

Your code looks good, except it's missing logic that would give scrapy more pages to crawl.  (Scrapy won't grab links and crawl them by default; you have to indicate what you want to crawl.)

I use one of two primary mechanisms:

With the CrawlSpider, you can define a class variable called rules that tells scrapy which links to consider following. I usually build each Rule around a LinkExtractor object, which lets you specify things like callbacks (which method parses the responses for matching links), filters (you can modify the URL to remove session variables, etc.), and limits on which links to extract (the full gamut of css and xpath selectors is available). More information is at http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
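
Roughly, a minimal sketch of the rules approach looks like this (the spider name, domain, and URL pattern are made-up placeholders, not from your project):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # Follow every link whose URL matches /docs/ and hand each response
        # to parse_page. (Don't name the callback 'parse' on a CrawlSpider.)
        Rule(LinkExtractor(allow=r'/docs/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.log('Visited %s' % response.url)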

Sometimes the rule-based link following just doesn't cut it. (And if you're using the plain scrapy.Spider class, rules aren't implemented at all, so you have to do it this way.) If you yield a Request object from your parse callback, scrapy will add it to the queue to be scraped and processed.
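
And a minimal sketch of that second mechanism, yielding Request objects yourself from a plain scrapy.Spider (again, all names here are placeholders):

import urlparse

import scrapy
from scrapy.http import Request

class ManualFollowSpider(scrapy.Spider):
    name = "manual_follow"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Queue every link found on the page for crawling.
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        self.log('Item page: %s' % response.url)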

That make sense?


Tina C

Nov 19, 2014, 4:15:49 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
That's helpful, but I'm hung up on getting the spider to follow relative links. I've tried a lot of things, but I think that I'm really close with this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links', follow=True),)

    def parse_links(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

I can see from my logs that it is skipping over the hard-coded links from other domains (as it should). I thought this bit of code would cause the spider to recognize my relative links, but it does not.

Hopefully you can lend a hand and tell me what I'm doing wrong.

Tina C

Nov 19, 2014, 5:04:08 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
So, I have it crawling, but it doesn't crawl the correct area/site. If I use 'allowed_domains', it doesn't crawl anything. If I remove it, it crawls too many things. Here's the updated code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r'')), callback='parse_links', follow=True),)

    def parse_links(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            url = urlparse.urljoin(response.url, link)

Travis Leleu

Nov 19, 2014, 5:33:07 PM
to scrapy...@googlegroups.com
Are you trying to crawl every link on northwestern.edu that is in the /african-studies subdirectory? allowed_domains controls the domain name, not the path; to limit the crawl to the /african-studies subdirectory, you'd put that information into the "allow" named parameter of the LinkExtractor object.

Assuming that's what you're trying to accomplish, try this:


import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"

    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r'african-studies')), callback='parse_links', follow=True),)

Tina C

Nov 20, 2014, 11:23:40 AM
to scrapy...@googlegroups.com, m...@travisleleu.com
Thanks, that worked perfectly!

Tina C

Nov 20, 2014, 5:04:19 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
Actually, I was wrong; it's not working. It's still crawling pages outside of the subdirectory. Moreover, I'm not able to get anything in the /about/ subdirectory.

Tina C

Nov 21, 2014, 4:52:58 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
Just to update (and to serve as an archive for anyone searching for a similar answer): I was really close with the previous code snippets I listed. The problem was that the link-extraction code inside my callback was bypassing my rules. Here's my updated code (I'm only grabbing the URLs at this point), and it seems to work.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from africanstudies.items import AfricanstudiesItem

class MySpider(CrawlSpider):
    name = 'africanstudies'
    allowed_domains = ['northwestern.edu']
    start_urls = ['http://www.northwestern.edu/african-studies']

    rules = (
        Rule(LinkExtractor(allow='african-studies'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = AfricanstudiesItem()
        item['url'] = response.url
        return item

Erik Schafer

Dec 2, 2014, 1:30:22 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
Sorry to necro this / bump, but this thread was incredibly helpful in getting my first crawlspider running.

I'm really disappointed in the documentation for scrapy, because there are some serious errors.  

The documentation states that the allow value of the LinkExtractor object takes "a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links," which is simply false. It seems to take a regular expression that the url minus the fully qualified domain name must match, but I'm not sure, since I'm totally new to scrapy and can't trust the documentation.

I also still don't know why follow=True needs to be explicitly included (it did not work without follow=True for me), given that the documentation states that follow=None defaults to True.

I don't understand why callback=None will use the default parse method to recursively crawl matched urls, callback='mycallback' will not, but follow=True with callback='mycallback' (seems to) do both.

IMO this example of a simple recursive crawlspider should be in the documentation.

Finally, the only thing I have to add is that unless your allow pattern is written as a raw string (r'regex'), Python will treat the backslashes as escape characters before the regex ever sees them.
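
For example, a small sketch of what I mean about raw strings (the pattern is just an illustration):

# Without a raw string, Python consumes the backslashes before the regex
# module ever sees the pattern, so you have to double them.
allow_plain = '\\bafrican-studies\\b'
allow_raw = r'\bafrican-studies\b'

# Both end up as the same pattern; the raw string is just easier to get right.
assert allow_plain == allow_raw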

Nicolás Alejandro Ramírez Quiros

Dec 2, 2014, 2:28:20 PM
to scrapy...@googlegroups.com, m...@travisleleu.com
1. Usually sites use relative paths in the HTML; that is what you have to match.
2. It says "If callback is None follow defaults to True". This is because sometimes you don't want to follow links from the item page; an example would be when you want an item's position on the category page (see the sketch after this list).
3. The documentation has a big ass warning: "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
4. Example: http://doc.scrapy.org/en/master/topics/spiders.html#crawlspider-example
5. That isn't scrapy's fault; you can blame Guido :D and the regex module maintainers.
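
To illustrate point 2, here is a rough sketch of how the two kinds of rules are usually combined (the spider name and URL patterns are made up):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MixedRulesSpider(CrawlSpider):
    name = "mixed_rules"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # No callback: follow defaults to True, so category pages are only
        # used to discover more links.
        Rule(LinkExtractor(allow=r'/category/')),
        # With a callback, follow defaults to False: item pages get parsed,
        # but the links on them are not followed.
        Rule(LinkExtractor(allow=r'/item/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Item page: %s' % response.url)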