I have a list of URLs to process; some point to web pages, others to files on disk:

    self.urls = [
        'http://www.xyz.com/data.html',
        'file:///var/log/datafiles/data.html',
    ]
My spider needs to log in first, and after that I have a bit of code that yields the list of requests:

    while len(self.urls) > 0:
        next_url = self.urls.pop(0)
        yield Request(next_url, callback=self.parse, dont_filter=True)
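One thing I can do to rule out the URLs themselves is to check, before yielding each request, that a file:// URL actually resolves to an existing file. This is just a standard-library sketch (the helper name is mine, and it assumes POSIX-style paths), not Scrapy-specific:

```python
import os
from urllib.parse import urlparse
from urllib.request import url2pathname

def is_reachable_file_url(url):
    """Return True if a file:// URL points at an existing local file."""
    parts = urlparse(url)
    if parts.scheme != 'file':
        return False
    return os.path.isfile(url2pathname(parts.path))

if __name__ == '__main__':
    # A missing file should report False; a real one, True.
    print(is_reachable_file_url('file:///no/such/file.html'))
```

If this returns True for my file URLs but the spider still skips them, the problem is inside the crawl pipeline rather than the paths.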
The problem is that this used to work (I think), but it no longer does. The files on disk aren't getting crawled. The web pages are fine:

    DEBUG: Crawled (200) <GET http:// ...

but I see nothing happening for the local files: no debug messages like "<GET file:///var/...", and self.parse is not getting called for those files.

Any ideas on how to debug or solve this?
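For reference, my understanding (which may be wrong, so worth verifying against my installed version) is that Scrapy routes each URL scheme through a download handler, and the file scheme is handled by a default entry roughly like this in the settings:

```python
# My understanding of the relevant default (an assumption to verify):
# Scrapy picks a handler per URL scheme, so if the 'file' entry is
# missing or overridden in my project settings, file:// requests
# would be dropped silently.
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    # ... handlers for http, https, etc.
}
```

So one thing I plan to check is whether my project settings override DOWNLOAD_HANDLERS in a way that loses the file entry.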