distributed crawler architecture

bruce

Jul 29, 2010, 9:12:05 AM
to scrapy...@googlegroups.com
hey guys...

considering using scrapy for a crawling project.

however, scrapy really only deals with the initial problem of
fetching/parsing a page.

I'm looking at the 10,000-foot level of how to architect a distributed
system where I have a master server that distributes groups of URLs, to
be parsed, to the client servers in the network. Each client server runs
the "client apps" that parse the URLs it receives from the master server.
So the roles on the master/client sides of the process look like this:

Master Server/Side
-Distributes URLs to the clients in batches of 100-200, in FIFO order
-Tracks which client servers are currently processing which URLs
-Accepts URL parse results back from the client servers
-Updates the master database with the completed/error results from the
client servers/processes

Client Server/Side
-Gets the batch of URLs from the Master Server for processing/parsing
-Performs a dual run (two passes) against the targeted pages for each group of URLs
-The result of a run is the parsed data from each page, written to one file per
page, so the output dir contains all of the resulting parsed pages
-Each run (1st or 2nd) over the group of URLs is written to a separate dir
-Each dir is then sorted/hashed/compared to determine whether the two runs match
-The assumption is that if both resulting dirs match, the parsed data
is correct/complete and can be returned to the master server
-If the dirs don't match, there was an error; we throw out all the
results and restart the process on the master server side...

This approach is being evaluated because the project requires that the
parsed data be verified/validated.
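A minimal sketch of that dual-run comparison, assuming each run writes one file per parsed page; the directory names are made up:

import hashlib
import os


def dir_digest(path):
    """Return {filename: sha1 of contents} for every file directly under `path`."""
    digests = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full, "rb") as fh:
                digests[name] = hashlib.sha1(fh.read()).hexdigest()
    return digests


def runs_match(run_dir_a, run_dir_b):
    """True if both runs produced byte-identical parsed output."""
    return dir_digest(run_dir_a) == dir_digest(run_dir_b)


# if runs_match("batch_42/run1", "batch_42/run2"):
#     ... return the parsed results to the master ...
# else:
#     ... discard both runs and ask the master to requeue the batch ...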

I've been searching the net, trying to find research docs on the
architecture of aggregation crawlers to see how other applications
handle this, and as far as I can tell, this issue isn't dealt with.

Thoughts, comments, pointers, names/contacts of people within
companies who are building these kinds of large scale apps would be
helpful!!

Thanks

David Koblas

Jul 30, 2010, 10:44:51 AM
to scrapy...@googlegroups.com, bruce
Bruce,

I'll just chime in and say that when I initially looked at Scrapy it wasn't really targeted at that problem, but there is a lot of infrastructure there that could really assist.  Probably the best place to look for what you're talking about is over in Apache land, specifically the Droids project ( http://incubator.apache.org/droids/ ), though of course it's not in Python.

A while ago (9 months) I started with Scrapy and ended up building something new; the general idea looked something like this:

Fetcher
  • Responsible for HTTP fetching of websites, getting and validating robots.txt for any website and request.
  • Can do multiple parallel requests (fully async twisted core) to multiple sites simultaneously 
  • Direct fetch interface (thrift RPC) to facilitate debugging
  • Contacts the scheduler service to get a batch of URLs to work on, notifies when URLs are fetched
  • When a page is fetched -- no processing happens -- it's queued down to the processor for further handling
Scheduler
  • Responsible for managing this universe of URLs to fetch and constructing batches of URLs to be fetched
  • All about priority queues, databases etc.etc.etc.
Processor
  • Takes pages in from the fetcher -- and does whatever task is necessary
    • Extract URLs and schedule future crawls
    • Parse HTML and extract meaningful content
    • Store in databases
All of this was built on RabbitMQ (AMQP) queuing between the components, with Thrift messages as the queued objects.
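A rough sketch of the fetcher-to-processor hand-off over AMQP with the pika client (1.x API); the queue name is made up, and JSON stands in for the Thrift serialization described above just to keep the example self-contained:

import json

import pika


def publish_fetched_page(url, body_bytes):
    """Fetcher side: push one fetched page onto the processor queue."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="fetched_pages", durable=True)
    message = json.dumps({"url": url, "body": body_bytes.decode("utf-8", "replace")})
    channel.basic_publish(exchange="", routing_key="fetched_pages", body=message)
    conn.close()


def run_processor():
    """Processor side: consume pages, extract URLs/content, store results."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="fetched_pages", durable=True)

    def handle(ch, method, properties, body):
        page = json.loads(body)
        # ... parse page["body"], extract links to reschedule, write to the database ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="fetched_pages", on_message_callback=handle)
    channel.start_consuming()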

While I'm not going to say it was 100% finished (nothing ever is), it did/does work.

--koblas

Christian Hochfilzer

Aug 20, 2013, 12:08:56 PM
to scrapy...@googlegroups.com, bruce
That sounds very interesting... Bruce, did you ever complete this since your original post? Could you share the source? CH


Mingfeng Yang

Aug 20, 2013, 1:45:31 PM
to scrapy...@googlegroups.com, bruce
I just recently finished a distributed, live crawling system based on Scrapy.  It aims to
1.  crawl and collect thousands of discussion threads from the internet
2.  gather all discussion posts under each thread, and collect newly updated posts as they appear.

Here is how I designed it with Scrapy, Django, and Redis.

1. I wrote Scrapy spiders to parse different websites, find related URLs, and extract all discussion posts.
2. The URLs are saved into a PostgreSQL database.
3. Django is used to manage all the URLs; each URL carries a few properties, like the last time it was scraped, the hash of the latest post, the publish time of the latest post, which spider should be used to fetch and parse the page, etc.
4. A cron job fetches the URLs that need to be scraped and puts the related property data into Redis.
5. A script built on the Scrapy core pulls URLs from the Redis server and does the real crawling and parsing. The script can be launched from different servers to leverage the distributed power (a rough sketch of steps 4-5 follows below).
6. Inside the spider I implemented some checks: if an already-scraped item is seen, or an item with a publish time older than the latest one I collected before, I decide there are no more new items and the spider is stopped.
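A minimal sketch of steps 4-5 under these assumptions: the cron job pushes JSON job records onto a Redis list, and a worker pops a batch and crawls it all in one Scrapy process, so the Twisted reactor starts once per batch rather than once per URL. The queue key, job fields, and ThreadSpider are hypothetical:

import json

import redis
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.threads import ThreadSpider  # hypothetical spider


def pop_batch(conn, queue="crawl:jobs", size=50):
    """Pop up to `size` JSON job records ({"url": ..., "last_hash": ...}) from a Redis list."""
    jobs = []
    for _ in range(size):
        raw = conn.lpop(queue)
        if raw is None:
            break
        jobs.append(json.loads(raw))
    return jobs


if __name__ == "__main__":
    jobs = pop_batch(redis.Redis(host="localhost", port=6379))
    if jobs:
        process = CrawlerProcess(get_project_settings())
        # One crawl per job; they all share a single reactor start-up.
        for job in jobs:
            process.crawl(ThreadSpider,
                          start_urls=[job["url"]],
                          last_hash=job.get("last_hash"))
        process.start()  # blocks until every queued crawl has finished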

The pro of this system is that it gets the work done; the cons are that the log output from Scrapy is scattered over multiple servers, and that for each URL I have to re-launch the Twisted reactor to run the spider, which wastes a lot of CPU cycles.

But so far it works pretty well. 

Ming-






Jordi Llonch

Aug 20, 2013, 2:33:07 PM
to scrapy...@googlegroups.com
Looks like some of us are working on a similar effort.

Ming, a central logging utility can help you: Graylog2 or Logstash... Also, scrapy-sentry (https://pypi.python.org/pypi/scrapy-sentry) can help you track errors.

My solution in the end, after some performance tests, was RabbitMQ as the queue, supervisord as the process manager, Scrapy, scrapy-sentry, and BigTable storage, all on automated infrastructure.

I found it informative to keep track of the following KPIs: network usage, disk throughput, CPU load and memory usage.

Twisted and the reactor are becoming a challenge for Scrapy. Wasn't someone working on a PyPy version? How is it going?

Regards,



bruce

Aug 20, 2013, 2:59:26 PM
to scrapy-users
I've had some in-depth conversations/IM chats with Pablo (great guy - behind scrapy!!) on what they're doing.

To be upfront, I have no interest/desire in building anything to compete with them. I'm more than willing to work with them where we might overlap.

But the project behind scrapy/scrapyhub is all about building a generalizable crawling architecture/plumbing which is world class.

My need is to have a distributed (cheap) process to crawl the targeted sites for the data I'm going to need.

As some have mentioned, it seems we're working on similar parts of the same puzzle, and we are.

So why don't we discuss and get to the end goal sooner.

-bruce






On Tue, Aug 20, 2013 at 2:49 PM, bruce <bado...@gmail.com> wrote:
Hey Jordi,

Are you in the US?

Regarding your management process, did you develop your own web app? Do you have a web-based job/batch/scheduler process?

Mingfeng Yang

Aug 20, 2013, 3:07:12 PM
to scrapy...@googlegroups.com
Hi Jordi,

Thanks for suggesting those tools, which look very useful.  Do you know if there is any way to replace/change the default scrapy.log so that all logs get written to Graylog or other outside services?  This is on my to-do list, but I haven't looked into it yet.

Thanks,
Ming-

Jordi Llonch

Aug 20, 2013, 3:40:08 PM
to scrapy...@googlegroups.com
Bruce, I'm based in Australia. My skype is jllonch



Jordi Llonch

Aug 20, 2013, 3:43:22 PM
to scrapy...@googlegroups.com
Hi Ming,

Search online for how to log using the default Python logging module and then redirect that to syslog. Easy, simple and straightforward.
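For example, something along these lines attaches a SysLogHandler to the root Python logger, and syslog can then forward to Graylog2/Logstash. Whether Scrapy's own log output flows through the stdlib logging module depends on the Scrapy version, and the handler address and format below are illustrative:

import logging
from logging.handlers import SysLogHandler

handler = SysLogHandler(address="/dev/log")  # or ("loghost", 514) for a remote collector
handler.setFormatter(logging.Formatter("scrapy-worker: %(levelname)s %(name)s %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

logging.getLogger("scrapy").info("spider started")  # shows up in syslog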

Regards,



Christian Hochfilzer

Aug 20, 2013, 7:57:21 PM
to scrapy...@googlegroups.com

That's great... I am really interested in solutions like this... Let me ask you: if you were to do it all over again, is there anything you would change... or, for that matter, did you run into any strange issues?

My interest is in establishing something with the flexibility to work in different usage situations.

I would be interested to know your take on using something like Celery as a means to manage (or, for that matter, extend) the capabilities.
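For what it's worth, one hedged way Celery could fit in is a task per crawl job that shells out to scrapy crawl, so each job gets a fresh Twisted reactor and retries come for free. The broker URL, project directory, and spider name below are assumptions:

import subprocess

from celery import Celery

app = Celery("crawler", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3)
def crawl_url(self, spider_name, url):
    """Run one spider against one URL; retry on a non-zero exit code."""
    try:
        subprocess.check_call(
            ["scrapy", "crawl", spider_name, "-a", "start_url=%s" % url],
            cwd="/srv/crawler",  # hypothetical Scrapy project directory
        )
    except subprocess.CalledProcessError as exc:
        raise self.retry(exc=exc, countdown=60)

# callers just enqueue work, e.g.:
# crawl_url.delay("thread_spider", "http://example.com/forum/thread/123")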

Contact me directly if you're interested. (My first name (@) netconstructor (com))

Chris


Jordi Llonch

Aug 20, 2013, 9:00:48 PM
to scrapy...@googlegroups.com
Christian,

Obviously, Scrapy is a very well designed framework that allows you to scale up with little or no hassle.

One of the biggest issues is understanding scalability and having to test everything.

I discarded some storage engines after evaluation/staging, like MongoDB and its lovely write lock. A real distributed database "will help". An interesting read on the subject: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Balancing node performance after reducing communication/network overhead also improved cluster performance by 7x. That's more in the distributed-computing domain than crawling itself.

Automating the platform is another domain; I have used Opscode Chef. Wow, what a great product! Wow, what a huge learning curve!

After that, you'll have to deal with "big data" (silly term) analytics and management, but that's another story belonging to a completely different tale.

Cheers,


