Requesting and parsing both HTTP and local files


TB

Sep 10, 2011, 3:05:46 PM9/10/11
to scrapy-users
I have a list of URLs to process; some point to webpages,
others to files on disk:

self.urls = [
    'http://www.xyz.com/data.html',
    'file:///var/log/datafiles/data.html',
]

My spider needs to log in first, and after that I have a bit of code
that yields the list of requests:

while len(self.urls) > 0:
    next_url = self.urls.pop(0)
    yield Request(next_url, callback=self.parse, dont_filter=True)
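For context, here is the same pattern stripped of the Scrapy specifics, as a runnable sketch — the `Request` class below is only a stand-in for `scrapy.http.Request`, and `after_login` is a hypothetical callback name, not something Scrapy defines:

```python
# Minimal stand-in for scrapy.http.Request, just enough to show the flow.
class Request:
    def __init__(self, url, callback=None, dont_filter=False):
        self.url = url
        self.callback = callback
        self.dont_filter = dont_filter

class SpiderSketch:
    def __init__(self, urls):
        self.urls = list(urls)

    def parse(self, response):
        pass  # real parsing would happen here

    def after_login(self, response):
        # Same drain-and-yield pattern as the snippet above:
        # pop each URL and yield one request for it.
        while len(self.urls) > 0:
            next_url = self.urls.pop(0)
            yield Request(next_url, callback=self.parse, dont_filter=True)

spider = SpiderSketch(['http://www.xyz.com/data.html',
                       'file:///var/log/datafiles/data.html'])
requests = list(spider.after_login(None))
```

After the generator is drained, `spider.urls` is empty and one request object exists per original URL, each with `dont_filter=True`.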

The problem is that this used to work (I think), but it no longer
does. The files on disk aren't getting crawled. The webpages are
doing fine:

DEBUG: Crawled (200) <GET http:// ...

but I see nothing happening for the local files: no debug messages like
"<GET file:///var/...", and self.parse is not getting called for
those files.

Any ideas on how to debug or solve this?

TB

Sep 11, 2011, 3:17:26 AM9/11/11
to scrapy-users
I explicitly set the file download handler, because I read somewhere
that the "file" handler might pose a security risk and could be
disabled by default... but no luck:

DOWNLOAD_HANDLERS = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
}

Is there an example somewhere of how to request a local file using a
Request object?

TB

Sep 12, 2011, 2:06:37 PM9/12/11
to scrapy-users
I found a solution and thought it might be good to share:

I ended up reading the files myself, storing the content in an
HtmlResponse object, and passing that to the parse function:

with open(filename, 'rb') as f:
    filecontent = f.read()
response = HtmlResponse(url='file://' + filename, body=filecontent,
                        encoding='iso-8859-15')
self.parse(response)
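One caveat with this workaround — an assumption on my part, and only relevant if parse is a generator (i.e. it uses yield, as spider callbacks usually do): calling self.parse(response) only creates the generator object; none of the body runs, and no items come out, until something iterates it. A minimal illustration:

```python
def parse(response):
    # Stand-in for a spider callback that yields items.
    yield {'url': response}

result = parse('file:///tmp/data.html')  # nothing has executed yet
items = list(result)                     # iterating actually runs the body
```

So when parse is a generator, the caller has to consume the result (e.g. with list() or a for loop) for the parsing to happen at all.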

Brice T

Sep 30, 2011, 9:43:39 AM9/30/11
to scrapy-users
Hi,

I've been digging through the same trouble and was about to give up
when I found an explanation and an elegant solution.

For me, this was due to "allowed_domains".

What I did was add "127.0.0.1" to "allowed_domains" and access
local files with URLs beginning with "file://127.0.0.1/".
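A minimal sketch of that URL rewrite — local_file_url is a hypothetical helper name of mine, not a Scrapy API. The idea is that file URLs accept an optional host between the second and third slash, so putting 127.0.0.1 there gives the offsite check a domain it can match against allowed_domains:

```python
def local_file_url(path):
    # Hypothetical helper: absolute local path -> file URL carrying the
    # 127.0.0.1 host, so it matches "127.0.0.1" in allowed_domains.
    if not path.startswith('/'):
        raise ValueError('expected an absolute path')
    return 'file://127.0.0.1' + path

url = local_file_url('/var/log/datafiles/data.html')
```

Requests built from such URLs should then pass the allowed_domains filter instead of being silently dropped.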

I hope your trouble was the same as mine, and that this saves other
users some time...