Crawl entire site, scrape only certain pages


ScrapyMan

Jun 21, 2010, 7:21:11 AM
to scrapy-users
Hi scrapy people,

I'm new to scrapy.

I'm trying to set up a crawler that will crawl my entire site and
scrape content only from certain URLs.

Example:

Crawl every page of example.com in order to find every product link
(example.com/apple/iphone-3g-34344).

I want to scrape content only from the product pages, but I need the
links to every product on the domain...

How can I do this?

I've tried to set up a spider with rules for this, but it only crawls
the start URL.

Here is my code:

rules = (
    Rule(SgmlLinkExtractor(allow='www.example.com/(.*)/(.*)-(.*)$'),
        'parse',
        follow=True,
    ),
)

I would appreciate your help, or some example code snippets.

Thanks a lot

Marc

Rishi Singh

Jun 21, 2010, 9:37:21 AM
to scrapy...@googlegroups.com
Hi Marc,

Change the name of your parse function to something like parse_item. CrawlSpider already defines a method named parse, and your rule's callback is overriding it. I had the same problem as you a while ago.

In fact, I've seen a number of people have this problem. Maybe it's worth putting a note in the tutorial about this?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FirstCrawlSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow='www.example.com/(.*)/(.*)-(.*)$'),
             'parse_item',
             follow=True),
    )

    def parse_item(self, response):
        # parse code here: build your item from the response
        return item  # placeholder
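
One caveat: with an allow= pattern that only matches product URLs, the category and listing pages themselves never get crawled, so products that are only linked from those pages can be missed. If that happens, one option (an untested sketch; the pattern is just a placeholder) is to add a second, catch-all rule that follows every link but has no callback:

    rules = (
        # Product pages: scrape them, and keep following links found on them.
        # Keep this rule first: links already matched by an earlier rule are
        # not handed to later rules for the same page.
        Rule(SgmlLinkExtractor(allow=r'www\.example\.com/(.*)/(.*)-(.*)$'),
             callback='parse_item',
             follow=True),
        # Everything else on the site (categories, listings, ...): just follow.
        Rule(SgmlLinkExtractor(), follow=True),
    )

The spider's allowed_domains setting keeps the catch-all rule from wandering off-site.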

Best,
Rishi




Pablo Hoffman

Jun 22, 2010, 8:45:19 PM
to scrapy...@googlegroups.com
On Mon, Jun 21, 2010 at 09:37:21AM -0400, Rishi Singh wrote:
> Hi Marc,
>
> Change the name of your parse function to something like parse_item.
> CrawlSpider already defines a method named parse, and your rule's callback
> is overriding it. I had the same problem as you a while ago.
>
> In fact, I've seen a number of people have this problem. Maybe it's worth
> putting a note in the tutorial about this?

Agreed. What do you consider the most suitable place to put this note? Could
you send a patch?

Pablo.

Rishi Singh

Jun 24, 2010, 12:15:14 PM
to scrapy...@googlegroups.com
Hi all,

This won't be a problem. 

I have a question for you all, though... When I first started using Scrapy, the features of CrawlSpider were what I was looking for. Does anybody here think CrawlSpider should be featured in the tutorial instead of BaseSpider as the first example? I think it would show Scrapy's power immediately to the newbie data miner. The downside is that it could be more complex and falsely scare the user into thinking Scrapy is difficult to use. Maybe a newer user of Python or Scrapy could chime in?


Best,
Rishi



scr...@asia.com

Jun 25, 2010, 4:15:16 AM
to scrapy...@googlegroups.com
Hi,

I've set up a CrawlSpider that works fine, but it seems not to crawl all the categories and subcategories of my website.

It should return about 2,500 products, but I only got 600.

Is there a way to get a list of every crawled URL?

Pablo Hoffman

Jun 25, 2010, 10:09:38 AM
to scrapy...@googlegroups.com
Hi Rishi,

I've had precisely that dilemma too about BaseSpider vs CrawlSpider in the
tutorial. I agree that CrawlSpider doesn't get the visibility it deserves
there. But maybe it would suffice to write a paragraph or two explaining what
it adds to BaseSpider and why you'd need it, along with a quick example, and
then just point to the CrawlSpider doc:

http://doc.scrapy.org/topics/spiders.html#crawlspider
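
For comparison, here is roughly what the same crawl looks like when written by
hand with BaseSpider. This is an untested sketch (the spider name, XPath, and
the crude link filter are placeholders), but it shows the link-following
boilerplate that CrawlSpider's rules take care of for you:

    import urlparse

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class ManualCrawlSpider(BaseSpider):
        name = 'manual-example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for href in hxs.select('//a/@href').extract():
                url = urlparse.urljoin(response.url, href)
                if '-' in url.rsplit('/', 1)[-1]:
                    # Crude stand-in for CrawlSpider's allow= regex:
                    # looks like a product page, so scrape it.
                    yield Request(url, callback=self.parse_product)
                else:
                    # Anything else: keep crawling.
                    yield Request(url, callback=self.parse)

        def parse_product(self, response):
            # build and return the item here
            pass

With CrawlSpider, the whole parse method above collapses into the rules tuple.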

Another thing to consider is that we're working on a second revision of
CrawlSpider, based more on pluggable components, so you would have a
link/request extractor, a callback rules dispatcher, a canonicalizer, and so
on, and you could combine them as you wish (and perhaps leave some of them
out). However, the basic idea is the same: being able to crawl and scrape a
site based on a set of rules for crawling and parsing. If the API changes, the
documentation can be quickly updated with the new API, but the
guide/introduction part should be mostly reusable.

Thanks for your interest in improving the docs. I hope I've answered your
questions,

Pablo.


Pablo Hoffman

Jun 25, 2010, 4:28:23 PM
to scrapy...@googlegroups.com
On Fri, Jun 25, 2010 at 04:15:16AM -0400, scr...@asia.com wrote:
> Hi,
>
> I've set up a CrawlSpider that works fine, but it seems not to crawl all the categories and subcategories of my website.
>
> It should return about 2,500 products, but I only got 600.
>
> Is there a way to get a list of every crawled URL?

You can grep the log for lines containing "Crawled". On Unix:

$ grep Crawled scrapy.log
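
If you'd rather collect the list from inside the spider, one simple (untested)
option is to append each crawled URL to a file from your callback; the
filename below is just a placeholder:

    def parse_item(self, response):
        # Record every product page actually reached, one URL per line.
        with open('crawled_urls.txt', 'a') as f:
            f.write(response.url + '\n')
        # ... your existing parsing code ...

Comparing that list against the site's category pages should make it easier
to spot which sections the spider is missing.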

Pablo.
