Hi there,
My first crawler works perfectly after some experimentation, but I
cannot return items for further pipelining due to the structure I
created.
What I want to do is roughly the following:
-> crawl a list of similar websites (same domain), but I want to
provide the links dynamically (i.e. from an external application, so I
think I cannot use start_urls)
-> these similar sites have to be crawled recursively to get all the
data I want
-> after crawling I want to process the data (i.e. use it in a pipeline)
The last part is what does not work in my current design.
So what I ended up with is a for loop over the URLs, where each
iteration starts a recursive crawling chain. The crawler does the
following:
----
from scrapy.conf import settings   # project settings; URL_FILE is a custom setting
from scrapy.http import Request

def start_requests(self):
    # open a url which contains a list of urls
    request = Request(settings['URL_FILE'],
                      callback=self.parse_url_list)
    yield request

def parse_url_list(self, response):
    # (... code to extract the urls from the HTML ...)
    for url in url_list:
        request = Request(url, callback=self.parse)
        yield request

def parse(self, response):
    # (... this is where I do the real extraction of data ...)
    # (... due to the nature of the site I do recursive requests
    #      to get all the data ...)
    request = Request(next_page, callback=self.parse)
    yield request
--------------
But how can I ever return the items in this design? I guess my design
is seriously flawed, but I cannot come up with a different one that
works. My guess is that all these yields make it impossible to get the
data back with a return statement, but my understanding of yield and
Python is not sufficient to come up with a working solution to this
problem.
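To make the question a bit more concrete: what I imagine wanting to do
in parse is roughly the sketch below, but I don't know whether mixing
item yields and request yields like this is even valid. ExampleItem,
the 'title' field and next_page are just placeholders, not my real code:

----
from scrapy.http import Request
from scrapy.item import Item, Field

class ExampleItem(Item):
    # placeholder item, not my real fields
    title = Field()

# inside my spider, instead of only yielding Requests:
def parse(self, response):
    # (... the real extraction of data from the response goes here ...)
    item = ExampleItem()
    item['title'] = 'extracted value'   # stands in for the real data
    yield item                          # does this ever reach the pipeline?

    # (... then continue the recursive crawl as before ...)
    next_page = '...'                   # placeholder; found in the response
    yield Request(next_page, callback=self.parse)
----

If something like that is allowed, I assume the item would be picked up
by the pipeline while the request keeps the crawl going, but I haven't
been able to confirm it.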
Any ideas?