Considering using Scrapy for a crawling project. However, Scrapy really only deals with the initial problem of fetching/parsing a page.

I'm looking at the 10,000-foot level of how to architect a distributed system where I have a master server that distributes groups of URLs to the client servers in the network to be parsed. Each client server runs the "client apps" that parse the URLs it receives from the master server. So you have roughly the following roles for the master/client sides of the process:
Master Server/Side
- Distributes URLs to the clients in batches of 100-200, in a FIFO process
- Tracks which client servers are currently processing which URLs
- Accepts URL parse results from the child client servers
- Updates the master database with the completed/error results from the client servers (see the sketch just below)
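
As a rough illustration only (every name here is hypothetical, and I'm assuming a Redis list as the FIFO queue), the dispatch/tracking side might look something like:

    # Master-side sketch: FIFO batch dispatch plus in-flight tracking.
    # "url_queue", "in_flight", and dispatch_batch() are made-up names.
    import json
    import uuid

    import redis  # pip install redis

    BATCH_SIZE = 150  # somewhere in the 100-200 range above
    r = redis.Redis()

    def dispatch_batch(client_id):
        """Pop the next FIFO batch of URLs and record which client has it."""
        pipe = r.pipeline()
        for _ in range(BATCH_SIZE):
            pipe.lpop("url_queue")
        urls = [u.decode("utf-8") for u in pipe.execute() if u]
        if not urls:
            return None
        batch_id = str(uuid.uuid4())
        # Track which client server is processing which URLs.
        r.hset("in_flight", batch_id,
               json.dumps({"client": client_id, "urls": urls}))
        return batch_id, urls

    def accept_results(batch_id, results):
        """Accept parse results from a client and clear the tracking entry."""
        r.hdel("in_flight", batch_id)
        # ...update the master database with completed/error results here...
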
Client Server/Side
- Gets a batch of URLs from the master server for processing/parsing
- Runs the parse twice (a dual run) against the targeted pages for each group of URLs
- The result of a run is the parsed data from each page, written to one file per page, so the output dir consists of all the resulting parsed pages
- Each run (1st or 2nd) for the group of URLs is written to a separate dir
- The two dirs are then sorted/hashed/compared to determine if the runs match
- The assumption is that if both resulting dirs match, the parsed data is correct/complete and can be returned to the master server
- If the dirs don't match, there was an error, so we throw out all the results and restart the process on the master server side... (a sketch of this check follows below)
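
For what it's worth, the dual-run compare step might look roughly like this (a minimal sketch: dir_digest() and verify_batch() are made-up names, and parse_urls() stands in for whatever client app actually writes one file per page):

    # Client-side sketch of the dual-run verification step.
    import hashlib
    import os

    def dir_digest(path):
        """Hash every output file (in sorted name order) into one digest."""
        h = hashlib.sha256()
        for name in sorted(os.listdir(path)):
            h.update(name.encode("utf-8"))
            with open(os.path.join(path, name), "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def verify_batch(urls, parse_urls):
        """Run the same batch twice into separate dirs, then compare."""
        parse_urls(urls, out_dir="run1")  # 1st run
        parse_urls(urls, out_dir="run2")  # 2nd run
        if dir_digest("run1") == dir_digest("run2"):
            return True   # dirs match: data assumed correct/complete
        return False      # mismatch: throw out both runs, master re-queues
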
This approach is being evaluated because the project requires that the parsed data be verified/validated.
I've been searching the net, trying to find research docs on the architecture of aggregation crawlers to see how other applications handle this, and as far as I can tell, this issue isn't dealt with.
Thoughts, comments, pointers, and names/contacts of people within companies who are building these kinds of large-scale apps would be helpful!!
Thanks
-bruce


So why don't we discuss and get to the end goal sooner? As some have mentioned, it seems we're working on similar parts of the same puzzle, and we are.

My need is to have a distributed (cheap) process to crawl the targeted sites for the data I'm going to need. But the project behind Scrapy/Scrapinghub is all about building a generalizable crawling architecture/plumbing which is world class. I've had some in-depth conversations/IM chats with Pablo (great guy, behind Scrapy!!) on what they're doing.

To be upfront, I have no interest/desire in building anything to compete with them. I'm more than willing to work with them where we might overlap.
On Tue, Aug 20, 2013 at 2:49 PM, bruce <bado...@gmail.com> wrote:
> Regarding your management process, did you develop your own web app?
> Do you have a web-based job/batch/scheduler process?
>
> Hey Jordi,
> Are you in the US?
That's great... I am really interested in solutions like this... Let me ask you: if you were to do it all over again, is there anything you would change... or, for that matter, did you run into any strange issues?

My interests relate to establishing something with the flexibility to work in different usage situations.

I would be interested to know your take on using something like Celery as a means to manage this (or, for that matter, to extend out the capabilities).
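
Roughly what I have in mind, purely as a sketch (the broker URL, task name, and the fetch_and_parse() helper are all placeholders, not anything real):

    # Hypothetical Celery task for farming out URL batches.
    from celery import Celery
    import requests

    app = Celery("crawler", broker="redis://localhost:6379/0")

    def fetch_and_parse(url):
        # Placeholder: in practice this would invoke the real client app.
        resp = requests.get(url, timeout=30)
        return {"url": url, "length": len(resp.text)}

    @app.task(bind=True, max_retries=3)
    def process_batch(self, urls):
        """Fetch/parse a batch; retry the whole batch on failure."""
        try:
            return [fetch_and_parse(u) for u in urls]
        except Exception as exc:
            raise self.retry(exc=exc, countdown=60)

    # The master side would then just enqueue batches:
    #   process_batch.delay(urls[0:150])
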
Contact me directly if you're interested. (My first name (@) netconstructor (com))
Chris