What is the best way to scrape multiple domains with Scrapy?


carrier24sg

Mar 31, 2011, 11:21:12 AM
to scrapy-users
Hi guys,

I posted this question on Stack Overflow, but thought it would be more
appropriate here.

I have around 10-odd sites that I wish to scrape. A couple of them
are WordPress blogs and they follow the same HTML structure, albeit
with different classes. The others are either forums or blogs of other
formats.

The information I'd like to scrape is common across all of them: the
post content, the timestamp, the author, the title, and the comments.

My question is, do I have to create one separate spider for each
domain? If not, how can I create a generic spider that allows me to
scrape by loading options from a configuration file or something
similar?

I figured I could load the XPath expressions from a file whose
location is passed via the command line, but there seems to be a
difficulty: scraping some domains requires a regex, i.e.
select(expression_here).re(regex), while others do not.
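
To make this concrete, here is a rough sketch of the kind of generic
spider I'm imagining, with the regex as an optional per-field key. The
config file layout and all of its keys are placeholders I made up:

import json

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class PostItem(Item):
    title = Field()
    author = Field()
    timestamp = Field()
    content = Field()
    comments = Field()


class ConfigSpider(BaseSpider):
    name = 'config'

    def __init__(self, config='site.json'):
        super(ConfigSpider, self).__init__()
        # hypothetical config layout:
        # {"start_urls": ["http://example.com/"],
        #  "allowed_domains": ["example.com"],
        #  "fields": {"title": {"xpath": "//title/text()"},
        #             "timestamp": {"xpath": "//span[@class='date']/text()",
        #                           "regex": "(\\d{4}-\\d{2}-\\d{2})"}}}
        with open(config) as f:
            self.site = json.load(f)
        self.start_urls = self.site['start_urls']
        self.allowed_domains = self.site['allowed_domains']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = PostItem()
        for field, spec in self.site['fields'].items():
            selected = hxs.select(spec['xpath'])
            # apply a regex only when the config supplies one
            if spec.get('regex'):
                item[field] = selected.re(spec['regex'])
            else:
                item[field] = selected.extract()
        return item

Is something along these lines the right approach, or is there a
better way?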

Pablo Hoffman

Apr 7, 2011, 12:20:59 AM
to scrapy...@googlegroups.com
On Thu, Mar 31, 2011 at 08:21:12AM -0700, carrier24sg wrote:
> My question is, do I have to create one separate spider for each
> domain? If not, how can I create a generic spider that allows me to
> scrape by loading options from a configuration file or something
> similar?

You could create a base spider and make every site-specific spider inherit from
it, setting only the attributes that change from site to site, like:

from myproject.base_spiders import BaseWordPressSpider


class TechcrunchSpider(BaseWordPressSpider):
    start_urls = ['http://techcrunch.com/']
    allowed_domains = ['techcrunch.com']


class OtherWordpressSpider(BaseWordPressSpider):
    start_urls = ['http://example.com/']
    allowed_domains = ['example.com']

# ...

All the common (reusable) logic would be implemented in the base spider
(BaseWordPressSpider).
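
For illustration, the base spider could look roughly like this; the
XPath expressions and the item fields below are invented, and any real
WordPress theme will need its own:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class WordPressPost(Item):
    title = Field()
    author = Field()
    content = Field()


class BaseWordPressSpider(BaseSpider):
    # subclasses can override any of these when a theme differs
    title_xpath = '//h1[@class="entry-title"]/text()'
    author_xpath = '//a[@rel="author"]/text()'
    content_xpath = '//div[@class="entry-content"]//text()'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = WordPressPost()
        item['title'] = hxs.select(self.title_xpath).extract()
        item['author'] = hxs.select(self.author_xpath).extract()
        item['content'] = hxs.select(self.content_xpath).extract()
        return item

Each site-specific spider then only sets start_urls and
allowed_domains, as above, and overrides one of the XPath attributes
(or parse itself, e.g. to apply a regex) when its markup differs.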

Hope this helps,
Pablo.
