Scrapy won't do what you're looking for on its own, but it could be useful if you were
to build such a crawler.
You don't need to have one spider per website with Scrapy. It's not
unusual to scrape hundreds of thousands of websites with a single
spider; obviously, you won't be authoring XPaths for each of them.
Perhaps you would have a single generic spider covering every site, or
maybe a default spider plus some hand-written ones for sites that are
important or don't work with the default.
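Concretely, that "default spider" could be a single generic spider that
takes its seed URLs as an argument and applies the same fallback
extraction everywhere. A minimal sketch, assuming a reasonably recent
Scrapy (the seeds argument and field names are just illustrative):

    import scrapy

    class DefaultSpider(scrapy.Spider):
        """One generic spider covering many sites; per-site spiders
        would subclass it or simply override parse()."""
        name = "default"

        def __init__(self, seeds="", *args, **kwargs):
            super(DefaultSpider, self).__init__(*args, **kwargs)
            # seeds: comma-separated start URLs, e.g. -a seeds=http://a.com,http://b.com
            self.start_urls = [u for u in seeds.split(",") if u]

        def parse(self, response):
            # Generic "good enough" extraction applied to every page.
            yield {
                "url": response.url,
                "title": response.xpath("//title/text()").extract_first(),
            }
            # Keep following links; allowed_domains or the offsite middleware
            # would normally keep the crawl within each seed's own site.
            for href in response.xpath("//a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)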
It's very easy to write a spider in Scrapy that follows all the
seed URLs and keeps extracting data as it goes. But you probably want to
limit the time spent on any given site. Scrapy has support for this
(stopping after a certain number of requests), but it's a bit blunt: you
probably want to prioritize certain pages (those nearer the home page,
those that lead to new items to extract, those not visited recently,
etc.) and implement some revisit policy.
This is likely to be something you want to consider no matter what
solution you go with.
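For what it's worth, the "stop after a certain number of requests"
behaviour comes from the built-in CloseSpider extension, and the
scheduler can be biased toward pages near the home page with the depth
settings. A settings.py sketch, assuming a reasonably recent Scrapy (the
numbers are arbitrary placeholders); a real revisit policy would still be
custom code:

    # Stop a spider after a fixed amount of work (CloseSpider extension).
    CLOSESPIDER_PAGECOUNT = 5000   # after this many responses...
    CLOSESPIDER_TIMEOUT = 3600     # ...or after an hour, whichever comes first

    # Prefer pages nearer the home page: a positive DEPTH_PRIORITY plus FIFO
    # queues gives a roughly breadth-first crawl order.
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"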
For longer-running crawls you'll need to consider server failure. Scrapy
has support for serializing its state, and for replacing local data
structures with ones you can read across the network (e.g. using Redis to
store the scraping queues). Depending on your crawl rate you may want to
use many machines; you'll certainly need many processes.
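To make that concrete: single-process persistence is built in via the
JOBDIR setting (run the crawl with -s JOBDIR=crawls/run-1 and it can be
stopped and resumed), while a network-shared queue means swapping the
scheduler, e.g. with the third-party scrapy-redis project. A settings.py
sketch of the latter (the Redis URL is an assumption):

    # Replace the local scheduler queues and dupe filter with Redis-backed ones
    # (scrapy-redis), so several processes/machines can share one crawl frontier.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True             # keep queue + dupe filter across restarts
    REDIS_URL = "redis://localhost:6379"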
Then, of course, you need to actually extract the data you are
interested in from these web pages. I am not aware of anything that you
can plug into scrapy for this (if anyone is, please let me know). There
are a number of techniques for this, but so far I have not seen good
open source implementations. If you're writing one, I'd be interested to
know.
Have you looked at Nutch? It has better support for larger crawls
where you are following many links across multiple websites in a
fault-tolerant way. Of course, you'll still need to handle the data
extraction yourself.
Cheers,
Shane
For handling 5000+ sites you will need to provision resources
accordingly. Scrapy can scale by using a shared central scheduling queue
and spreading crawls across multiple servers, and Scrapyd lets you
configure the number of crawl processes that can be spawned on each
server (see the sketch below).
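As a rough illustration of the fan-out (the server addresses and the
project/spider names are hypothetical), crawl jobs can be pushed to
several Scrapyd instances through their schedule.json API, with
max_proc / max_proc_per_cpu in each server's scrapyd.conf capping how
many processes it runs:

    import requests  # assumes the `requests` package is available

    # Hypothetical Scrapyd servers, each limiting its own process count.
    SCRAPYD_SERVERS = ["http://crawler1:6800", "http://crawler2:6800"]

    def schedule(spider, seeds, i):
        """Round-robin one crawl job onto one of the Scrapyd servers."""
        server = SCRAPYD_SERVERS[i % len(SCRAPYD_SERVERS)]
        resp = requests.post(server + "/schedule.json",
                             data={"project": "mycrawler",
                                   "spider": spider,
                                   # extra parameters become -a spider arguments
                                   "seeds": seeds})
        resp.raise_for_status()
        return resp.json()["jobid"]

    for i, site in enumerate(["http://example.com", "http://example.org"]):
        print(schedule("default", site, i))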
However, in my experience Scrapy processes tend to be more CPU-intensive,
so you can handle more sites with less hardware using Nutch than with
Scrapy.
The point in Scrapy's favour is that, in your case, you want to extract
structured information from the sites, which is much simpler and faster
to do with Scrapy than by developing site-specific plugins in Nutch.
Scrapy lets you configure and crawl each site more quickly (in terms of
development time), and the extraction can be XPath-based or regex-based.
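Something along these lines, with XPath for the structure and a regex on
top for the messy bits (the site, selectors and field names are invented
for illustration, assuming a reasonably recent Scrapy):

    import scrapy

    class ExampleProductsSpider(scrapy.Spider):
        name = "example_products"
        start_urls = ["http://example.com/products"]

        def parse(self, response):
            for product in response.xpath("//div[@class='product']"):
                yield {
                    "name": product.xpath(".//h2/text()").extract_first(),
                    # .re_first() applies a regular expression to the XPath match
                    "price": product.xpath(
                        ".//span[@class='price']/text()").re_first(r"[\d.,]+"),
                }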
Hope that helped!
Umar
"Then, of course, you need to actually extract the data you are
interested in from these web pages. I am not aware of anything that you
can plug into scrapy for this (if anyone is, please let me know). There
are a number of techniques for this, but so far I have not seen good
open source implementations. If you're writing one, I'd be interested to
know."