Scrapy experts - How can I recursively scrape RateMyProfessor? ($20 bounty to your favorite charity)


Jordan Rein

Sep 18, 2013, 5:45:58 PM
to scrapy...@googlegroups.com
Here's the complete post with my problems about scraping from RMP: http://www.reddit.com/r/learnpython/comments/1mmo72/any_scrapy_users_here_that_can_help_me_out/

Here's the tl;dr:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    rules = (Rule(SgmlLinkExtractor(allow=('&pageNo=\d',), restrict_xpaths=('//a[@id="next"]',)),
                  callback='parser', follow=True),)

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            profs.append(prof)
        return profs

Also, today I found a $20 bill on the street! To make things a little bit more interesting, the $20 will go to whatever charity the person who helps me wants it to go to. I'd like to offer more, but I'm still a starving student with a minimum wage job. (That's why I'm learning Scrapy/Python in the first place!) Anyways, thank you for reading, and I hope someone here will be able to fix this problem.

- Jordan

bruce

Sep 18, 2013, 6:40:33 PM
to scrapy-users
jordan...

hey.. I already wrote a scraper for this.. but it's not scrapy

let me know if you want the data...

get me your contact data and let's talk.

-bruce

Jordan Rein

Sep 18, 2013, 7:08:17 PM
to scrapy...@googlegroups.com
Hey Bruce,

Appreciate the offer, but this is more of a learning exercise for me than anything. 

I've been able to recursively scrape from other websites before, but RateMyProfessors is just proving to be a difficult challenge.

Paul Tremberth

Sep 18, 2013, 8:14:54 PM
to scrapy...@googlegroups.com
Jordan,
I think you may have found a bug.

When I fetch the 1st page in scrapy shell, the next link that SgmlLinkExtractor pulls out is fine, but from the 2nd page onwards the extracted URL is broken (the pageNo parameter ends up duplicated):

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311', text=u'c', fragment='', nofollow=False)]
>>>
2013-09-19 02:05:38+0200 [rmpspider] DEBUG: Crawled (200) <GET http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311> (referer: None)
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> 

But when I run the shell starting from the 2nd page directly, the next link is OK; the link extracted from the 3rd page is wrong again, though:

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> 
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&pageNo=4&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> 

In the meantime,
you can write an equivalent spider using BaseSpider and constructing the next page request "by hand",
with a little HtmlXPathSelector select() and urlparse.urljoin()

#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.spider import BaseSpider
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem
import urlparse

class MySpider(BaseSpider):
    name = "rmpspider"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    #rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | // div[@class='entry even vertical-center']")
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            yield prof

        # follow the "next" link (at most one), resolved against the current page URL
        for url in hxs.select('//a[@id="next"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url))
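
With start_urls and allowed_domains filled in (same values as in your CrawlSpider), it should run as-is; something like

scrapy crawl rmpspider -o profs.json -t json

should dump the professor names to a JSON file.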

Jordan Rein

Sep 18, 2013, 9:04:06 PM
to scrapy...@googlegroups.com
Just checked out your solution, Paul, and it works perfectly.

Thanks for taking the time to debug my program - it ends a day-and-a-half hunt through Stack Overflow and the Scrapy docs for the answer.

As for Scrapy's inability to read RMP's links, I suspect it has something to do with RMP's use of the ampersand symbol, but that's just conjecture. Hopefully someone can chime in and explain that quirk to me.

Anyways, do you have a charity you'd like for me to check out? If not, I'll just pass my money along to water.org.

Rolando Espinoza La Fuente

Sep 18, 2013, 9:29:11 PM
to scrapy...@googlegroups.com
Oh, I just posted my answer on SO: http://stackoverflow.com/questions/18862071/why-is-my-scrapy-scraper-only-returning-the-second-page-of-results/18884908#18884908

In summary: RMP doesn't handle the canonicalized URL well.
Solution: pass canonicalize=False to the SgmlLinkExtractor.
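
Applied to the rule from the first post, that's roughly (untested sketch, same allow/restrict_xpaths as before):

rules = (Rule(SgmlLinkExtractor(allow=('&pageNo=\d',),
                                restrict_xpaths=('//a[@id="next"]',),
                                canonicalize=False),
              callback='parser', follow=True),)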

Regards,
Rolando



Jordan Rein

Sep 18, 2013, 9:57:10 PM
to scrapy...@googlegroups.com
Nice fix! I was damn near certain the problem was a bug in SgmlLinkExtractor itself, not just one of its parameters. Thanks for pointing that out to me.

As for restrict_xpaths='//div[@id="pagination"]', while it does successfully grab all the professor names, they aren't in alphabetical order. And like you said, restrict_xpaths='//a[@id="next"]' doesn't crawl the first page, but I found a solution for that: http://stackoverflow.com/questions/15836062/scrapy-crawlspider-doesnt-crawl-the-first-landing-page
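
For reference, the trick there (if I'm reading that answer right) is to override parse_start_url on the CrawlSpider so the start URL also goes through the same callback, roughly:

    def parse_start_url(self, response):
        return self.parser(response)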

Anyways, time for some scraping. 

Jordan Rein

Sep 19, 2013, 3:09:08 AM
to scrapy...@googlegroups.com
Just passed on $40 to directrelief.com in honor of me getting this scraper to work properly.

Feels good, man.


Jordan Rein

Sep 19, 2013, 8:22:03 PM
to scrapy...@googlegroups.com
Quick question, Paul - why did you need to use a for loop to get the next URL? Since you're only getting one URL at a time, it makes sense to me that the code would look something like this:

url = hxs.select('//a[@id="next"]/@href').extract()
yield Request(urlparse.urljoin(response.url, url))


Paul Tremberth

Sep 21, 2013, 12:42:59 PM
to scrapy...@googlegroups.com
Hi Jordan,

well, you don't have to do it in a for loop, but hxs.select().extract() always returns a list - either an empty list [] or a list of values.
So instead of testing for an empty list before accessing the first element of the extract() result (to prevent an IndexError exception), like this:

urls = hxs.select('//a[@id="next"]/@href').extract()
if urls:
    yield Request(urlparse.urljoin(response.url, urls[0]))

I usually use a for loop instead, and sometimes add a break to stop after the first iteration if I'm not sure how many matches I'll get.
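
Roughly like this (same XPath and Request/urljoin as in the spider above, just with a break added):

for url in hxs.select('//a[@id="next"]/@href').extract():
    yield Request(urlparse.urljoin(response.url, url))
    break  # stop after the first match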

I certainly wouldn't consider this best practice though ;)

Cheers,
Paul.

Jordan Rein

Sep 22, 2013, 12:52:22 AM
to scrapy...@googlegroups.com
Thanks Paul - that makes perfect sense.