Scrapy won't do what you're looking for on its own, but it could be useful if you were
to build such a crawler.
You don't need to have one spider per website with Scrapy. It's not
unusual to scrape hundreds of thousands of websites with a single
spider; obviously, you won't be authoring XPaths for each of them.
Perhaps you would have a single generic spider covering every site, or
maybe a default spider plus some hand-written ones for sites that are
important or don't work with the default.
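Concretely, that "default spider" could be a single generic spider that
takes its seed URLs as an argument and applies the same fallback
extraction everywhere. A minimal sketch, assuming a reasonably recent
Scrapy (the seeds argument and field names are just illustrative):

    import scrapy

    class DefaultSpider(scrapy.Spider):
        """One generic spider covering many sites; per-site spiders
        would subclass it or simply override parse()."""
        name = "default"

        def __init__(self, seeds="", *args, **kwargs):
            super(DefaultSpider, self).__init__(*args, **kwargs)
            # seeds: comma-separated start URLs, e.g. -a seeds=http://a.com,http://b.com
            self.start_urls = [u for u in seeds.split(",") if u]

        def parse(self, response):
            # Generic "good enough" extraction applied to every page.
            yield {
                "url": response.url,
                "title": response.xpath("//title/text()").extract_first(),
            }
            # Keep following links; allowed_domains or the offsite middleware
            # would normally keep the crawl within each seed's own site.
            for href in response.xpath("//a/@href").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)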
It's very easy to write a spider in Scrapy that follows all the
seed URLs and keeps extracting data as it goes. But you probably want to
limit the time spent on any given site. Scrapy has support for this
(stopping after a certain number of requests), but it's a bit blunt: you
probably want to prioritize certain pages (those nearer the home page,
those that lead to new items to extract, those not visited recently,
etc.) and implement some revisit policy.
This is likely to be something you want to consider no matter what
solution you go with.
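For what it's worth, the "stop after a certain number of requests"
behaviour comes from the built-in CloseSpider extension, and the
scheduler can be biased toward pages near the home page with the depth
settings. A settings.py sketch, assuming a reasonably recent Scrapy (the
numbers are arbitrary placeholders); a real revisit policy would still be
custom code:

    # Stop a spider after a fixed amount of work (CloseSpider extension).
    CLOSESPIDER_PAGECOUNT = 5000   # after this many responses...
    CLOSESPIDER_TIMEOUT = 3600     # ...or after an hour, whichever comes first

    # Prefer pages nearer the home page: a positive DEPTH_PRIORITY plus FIFO
    # queues gives a roughly breadth-first crawl order.
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
    SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"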
For longer-running crawls you'll need to consider server failure. Scrapy
has support for serializing its state, and for replacing local data
structures with ones you can read across the network (e.g. using Redis to
store the scraping queues). Depending on your crawl rate you may want to
use many machines; you'll certainly need many processes.
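To make that concrete: single-process persistence is built in via the
JOBDIR setting (run the crawl with -s JOBDIR=crawls/run-1 and it can be
stopped and resumed), while a network-shared queue means swapping the
scheduler, e.g. with the third-party scrapy-redis project. A settings.py
sketch of the latter (the Redis URL is an assumption):

    # Replace the local scheduler queues and dupe filter with Redis-backed ones
    # (scrapy-redis), so several processes/machines can share one crawl frontier.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True             # keep queue + dupe filter across restarts
    REDIS_URL = "redis://localhost:6379"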
Then, of course, you need to actually extract the data you are
interested in from these web pages. I am not aware of anything that you
can plug into scrapy for this (if anyone is, please let me know). There
are a number of techniques for this, but so far I have not seen good
open source implementations. If you're writing one, I'd be interested to
know.
Have you looked at Nutch? It has better support for larger crawls
where you are following many links across multiple websites in a
fault-tolerant way. Of course, you'll still need to handle the data
extraction yourself.
Cheers,
Shane
For handling 5000+ sites you will need to provision resources
accordingly. Scrapy can scale by using a shared central scheduling queue
and spreading crawls across multiple servers, and Scrapyd lets you
configure the number of crawl processes that can be spawned on each
server (see the sketch below).
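As a rough illustration of the fan-out (the server addresses and the
project/spider names are hypothetical), crawl jobs can be pushed to
several Scrapyd instances through their schedule.json API, with
max_proc / max_proc_per_cpu in each server's scrapyd.conf capping how
many processes it runs:

    import requests  # assumes the `requests` package is available

    # Hypothetical Scrapyd servers, each limiting its own process count.
    SCRAPYD_SERVERS = ["http://crawler1:6800", "http://crawler2:6800"]

    def schedule(spider, seeds, i):
        """Round-robin one crawl job onto one of the Scrapyd servers."""
        server = SCRAPYD_SERVERS[i % len(SCRAPYD_SERVERS)]
        resp = requests.post(server + "/schedule.json",
                             data={"project": "mycrawler",
                                   "spider": spider,
                                   # extra parameters become -a spider arguments
                                   "seeds": seeds})
        resp.raise_for_status()
        return resp.json()["jobid"]

    for i, site in enumerate(["http://example.com", "http://example.org"]):
        print(schedule("default", site, i))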
However, in my experience Scrapy processes tend to be more CPU-intensive,
so you can handle more sites with less hardware using Nutch than with
Scrapy.
The point in Scrapy's favour is that, in your case, you want to extract
structured information from the sites, which is much simpler and faster
to do with Scrapy than by developing site-specific plugins in Nutch.
Scrapy lets you configure and crawl each site more quickly (in terms of
development time), and the extraction can be XPath-based or regex-based.
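Something along these lines, with XPath for the structure and a regex on
top for the messy bits (the site, selectors and field names are invented
for illustration, assuming a reasonably recent Scrapy):

    import scrapy

    class ExampleProductsSpider(scrapy.Spider):
        name = "example_products"
        start_urls = ["http://example.com/products"]

        def parse(self, response):
            for product in response.xpath("//div[@class='product']"):
                yield {
                    "name": product.xpath(".//h2/text()").extract_first(),
                    # .re_first() applies a regular expression to the XPath match
                    "price": product.xpath(
                        ".//span[@class='price']/text()").re_first(r"[\d.,]+"),
                }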
Hope that helped!
Umar
"Then, of course, you need to actually extract the data you are
interested in from these web pages. I am not aware of anything that you
can plug into scrapy for this (if anyone is, please let me know). There
are a number of techniques for this, but so far I have not seen good
open source implementations. If you're writing one, I'd be interested to
know."