how to download and save a file with scrapy


Ana Carolina Assis Jesus

Sep 17, 2013, 5:50:05 AM9/17/13
to scrapy...@googlegroups.com
Hi!

I am trying to download a CSV file with Scrapy.
I can crawl through the site and get to the form I need, where I find two buttons to click.
One will list the transactions, while the second one will download a XXX.csv file.

How do I save this file within scrapy?

I mean, if I choose to list the transactions, I get another webpage, and that one I can see.
But what if I choose the download action? I guess I should not return self.parse_dosomething but do something else to save the file it should give me (?)

Or should the download start by itself?
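
Something like this is what I have in mind, just a rough sketch (the form field and callback names are made up, and I am not sure the callback part is right):

# rough sketch of the spider callbacks I am trying -- field names are made up
from scrapy.http import FormRequest

def parse_form(self, response):
    # submit the form as if the "download" button had been clicked
    return FormRequest.from_response(
        response,
        formdata={'action': 'download'},   # hypothetical button/field name
        callback=self.parse_download,
    )

def parse_download(self, response):
    # if the server answers with the CSV itself, response.body is the file content
    with open('transactions.csv', 'wb') as f:
        f.write(response.body)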

Thanks,
Ana

Paul Tremberth

Sep 17, 2013, 6:01:56 AM9/17/13
to scrapy...@googlegroups.com
Hi Ana,
To download files, you should have a look at the new FilesPipeline.

It's in the master branch though, not in a tagged Scrapy release, so you'll have to install Scrapy from source.

Paul.

Ana Carolina Assis Jesus

Sep 17, 2013, 6:04:31 AM9/17/13
to scrapy...@googlegroups.com
Hi Paul.

What do you mean by installing Scrapy from source?
Do I need a newer version of it?

Ana Carolina Assis Jesus

Sep 17, 2013, 6:07:40 AM9/17/13
to scrapy...@googlegroups.com
I mean... I did install it quite recently, and I see that there is pipeline code in it... though I didn't use it and don't know how to (?)

Ana Carolina Assis Jesus

Sep 17, 2013, 6:12:56 AM9/17/13
to scrapy...@googlegroups.com
Another thing I don't understand about the pipeline, from reading about it, is that it looks like it is supposed to download certain fields.
What I want is to save a whole file.
And what I don't understand is why this file doesn't download automatically once I ask for it. If I am requesting the right action, why do I still see a page listing instead of the downloaded file?

Thanks!
Ana


Paul Tremberth

Sep 17, 2013, 6:14:08 AM9/17/13
to scrapy...@googlegroups.com
Well, the FilesPipeline is a module inside scrapy.contrib.pipeline (scrapy.contrib.pipeline.files).
It was committed less than two weeks ago (Scrapy is being improved all the time by the community).

It depends when and how you installed Scrapy:
- if you installed a tagged version using pip or easy_install (as recommended: http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy),
you won't have the pipeline and you'll have to add it yourself

- if you installed from source less than two weeks ago (git clone g...@github.com:scrapy/scrapy.git; cd scrapy; sudo python setup.py install),
you should be good (but Scrapy from the latest source code might be unstable and not fully tested)
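
If you're not sure which one you have, a quick way to check is to try the import from a Python shell (a minimal sketch, using the module path above):

# quick check: does the installed Scrapy already ship the FilesPipeline?
try:
    from scrapy.contrib.pipeline.files import FilesPipeline
    print "FilesPipeline is available"
except ImportError:
    print "this Scrapy version does not have the FilesPipeline yet"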

Ana Carolina Assis Jesus

Sep 17, 2013, 6:19:52 AM9/17/13
to scrapy...@googlegroups.com
Well, I installed about two weeks ago, but a tagged version... so maybe I don't have it...
But do I really need the pipeline? In principle, at least, the button should just download a file; that is what it does manually... (?)

Thanks!


Ana Carolina Assis Jesus

Sep 17, 2013, 7:46:15 AM9/17/13
to scrapy...@googlegroups.com
Hi Paul,

Could you give me an example on how to use the pipeline, please?

Thanks,
Ana

Ana Carolina Assis Jesus

Sep 18, 2013, 8:12:14 AM9/18/13
to scrapy...@googlegroups.com
Hi Paul!

As I told you earlier, I was able to see the desired output in CSV format when I print response.body; it looks more or less like this:

79;BUK-159;12/9/2013;0;Incomplete or invalid;;;none/;;;
79/D;0000000000000000;12/9/2013;;;145.65;;;;
80;BUK-160;12/9/2013;0;Incomplete or invalid;;;none/;;;
80/D;0000000000000000;12/9/2013;;;150.75;;;;

So then I wanted to read this output, but if I just hand response.body to the csv reader it doesn't work; it reads every letter/number as a separate line :-/
Then I decided to save it to a file first and THEN open that file as CSV and read it, like below:

def parse_download(self, response):

    f1 = open('Transactions.csv', 'w')
    print >> f1, response.body

    data_set = csv.reader(open("Transactions.csv", "rb"))
    for row in data_set:
        print row

My problem is that my original file (the Transactions.csv I just saved) has 1011 rows, BUT when I read it back, I only see 1000 rows!

Do you know why that is?
Is there a way for me to save the response.body as a CSV file and read all of it?
Or is there a way to read the response.body as CSV directly?

Thanks!
Ana

Ana Carolina Assis Jesus

Sep 18, 2013, 8:39:04 AM9/18/13
to scrapy...@googlegroups.com
I even tried to put the reader part in a separate script, but I still only read 1000 entries, even though the file has 1011 rows!

I really can't understand this (?)

def parse_download(self, response):

    f1 = open('pymTransactions.csv', 'w')
    print >> f1, response.body

    reader.getData()

and then:

import csv
import unicodedata

import atexit
import os
import string
import rlcompleter
import io

import time
import datetime
from datetime import date


def getData():

    data_set = csv.reader(open("pymTransactions.csv", "rb"))

    #payId = []

    count = 0
    for row in data_set:

        #payId.append(row[0])
        print row

        if count % 1000 == 0:
            print 'count: ', count

        count = count + 1

But I keep having the same result :-S

Ana Carolina Assis Jesus

Sep 18, 2013, 8:46:24 AM9/18/13
to scrapy...@googlegroups.com
In fact, it is also breaking the last line!
I just noticed that the reader not only stops at row 1000, but that row 1000 itself is broken and not all of its fields are read!

:-S

I am really puzzled!?
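
For reference, a minimal sketch of working on the body in memory instead, assuming Python 2 and the semicolon-delimited output shown above. Two things may be relevant here: csv.reader walks a plain string character by character, so it needs an iterable of lines, and a file that is written and immediately re-read without being closed can come back truncated, which would match the missing rows:

import csv

def parse_download(self, response):
    # option A: parse the body in memory, no intermediate file needed;
    # splitlines() gives csv.reader the per-line iterable it expects
    for row in csv.reader(response.body.splitlines(), delimiter=';'):
        print row

    # option B: if the file on disk is wanted anyway, close it before re-reading,
    # otherwise the buffered tail of the data is not yet written out
    with open('Transactions.csv', 'wb') as f1:
        f1.write(response.body)
    for row in csv.reader(open('Transactions.csv', 'rb'), delimiter=';'):
        print row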


Paul Tremberth

Sep 21, 2013, 1:03:09 PM9/21/13
to scrapy...@googlegroups.com
Hi Ana,

if you want to use the FilesPipeline before it lands in an official Scrapy release,
here's one way to do it:

1) grab the FilesPipeline code (files.py) from the Scrapy master branch
and save it somewhere in your Scrapy project,
let's say at the root of your project (but that's not the best location...):
yourproject/files.py

2) then, enable this pipeline by adding this to your settings.py

ITEM_PIPELINES = [
    'yourproject.files.FilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create it beforehand)

3) add 2 special fields to your item definition
    file_urls = Field()
    files = Field()

4) in your spider, when you have a URL for a file to download,
add it to your Item instance before returning it:

...
    myitem = YourProjectItem()
    ...
    myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
    yield myitem

5) run your spider and you should see files in the FILES_STORE folder

Here's an example that downloads a few files from the IETF website.

The Scrapy project is called "filedownload".

items.py looks like this:

from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()
 

this is the code for the spider:

from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
        )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfc-editor.org/rfc/rfc2616.ps',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )

When you run the spider, at the end, you should see in the console something like this:

2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
{'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
               'http://www.ietf.org/rfc/rfc2616.txt',
               'http://www.rfc-editor.org/rfc/rfc2616.ps',
               'http://www.rfc-editor.org/rfc/rfc2616.pdf',
               'http://tools.ietf.org/html/rfc2616.html'],
 'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
            'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
            'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
           {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
            'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
            'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
           {'checksum': '5f0dc88aced3b0678d702fb26454e851',
            'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
            'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
           {'checksum': '2d555310626966c3521cda04ae2fe76f',
            'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
            'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
           {'checksum': '735820b4f0f4df7048b288ba36612295',
            'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
            'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)

which tells you what files were downloaded, and where they were stored.

Hope this helps.

Ana Carolina Assis Jesus

Sep 24, 2013, 8:00:19 AM9/24/13
to scrapy...@googlegroups.com
Hi Paul!

Thanks a lot.
I will definitely be trying this! :-)

Cheers!
Ana


papis w

Dec 3, 2013, 2:06:22 PM12/3/13
to scrapy...@googlegroups.com
Hi
I am trying to download PDF files, so I tried to follow the files.py you posted and followed your example to download from ietf.org.
I didn't get any files back in the path I specified.
This is what I got when I ran scrapy crawl ietf:
2013-12-03 13:04:00-0600 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: loginTest)
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Enabled item pipelines: FilesPipeline
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-03 13:04:00-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-03 13:04:00-0600 [ietf] INFO: Spider opened
2013-12-03 13:04:00-0600 [ietf] DEBUG: Crawled (200) <GET http://www.ietf.org/> (referer: None)
2013-12-03 13:04:00-0600 [ietf] DEBUG: Scraped LogintestItem(file_urls=['http://www.ietf.org/images/ietflogotrans.gif', 'http://www.ietf.org/rfc/rfc2616.txt', 'http://www.rfc-editor.org/rfc/rfc2616.ps', 'http://www.rfc-editor.org/rfc/rfc2616.pdf', 'http://tools.ietf.org/html/rfc2616.html']) in <http://www.ietf.org/>
2013-12-03 13:04:00-0600 [ietf] DEBUG: Crawled (200) <GET http://www.ietf.org/images/ietflogotrans.gif> (referer: None)
2013-12-03 13:04:02-0600 [ietf] DEBUG: Crawled (200) <GET http://www.ietf.org/rfc/rfc2616.txt> (referer: None)
2013-12-03 13:04:02-0600 [ietf] DEBUG: Crawled (200) <GET http://tools.ietf.org/html/rfc2616.html> (referer: None)
2013-12-03 13:04:02-0600 [ietf] DEBUG: Crawled (200) <GET http://www.rfc-editor.org/rfc/rfc2616.pdf> (referer: None)
2013-12-03 13:04:02-0600 [ietf] DEBUG: Crawled (200) <GET http://www.rfc-editor.org/rfc/rfc2616.ps> (referer: None)
2013-12-03 13:04:02-0600 [ietf] INFO: Passed LogintestItem(files=[], file_urls=['http://www.ietf.org/images/ietflogotrans.gif', 'http://www.ietf.org/rfc/rfc2616.txt', 'http://www.rfc-editor.org/rfc/rfc2616.ps', 'http://www.rfc-editor.org/rfc/rfc2616.pdf', 'http://tools.ietf.org/html/rfc2616.html'])
2013-12-03 13:04:02-0600 [ietf] INFO: Closing spider (finished)
2013-12-03 13:04:02-0600 [ietf] INFO: Spider closed (finished)

Could you please tell me what I did wrong here? I guess I don't need to get the FilesPipeline code itself?
Thanks in advance,
Papis

Paul Tremberth

Dec 3, 2013, 4:20:49 PM12/3/13
to scrapy...@googlegroups.com
Hi,
I see you're running a rather "old" Scrapy release, 0.12, which I haven't tested with.
Can you upgrade your Scrapy release?
Also, have you enabled the files pipeline in your settings.py?

/Paul.

Matt Cialini

Feb 21, 2014, 12:44:20 AM2/21/14
to scrapy...@googlegroups.com
Hello Paul!

I'm Matt. I know this is a somewhat old thread now, but I found your advice about the FilesPipeline and it works great. I had one question, though: do you know of an easy way to pass in a file_name field for each URL so that the FilesPipeline will save each file with the correct name?

Thanks!

Paul Tremberth

Feb 25, 2014, 4:28:15 AM2/25/14
to scrapy...@googlegroups.com
Hi Matt,

one way to do that is to play with the FilesPipeline's get_media_requests(),
passing additional data through the meta dict,
and then using a custom file_path() method.

Below, each entry in file_urls is a dict rather than a plain URL string, so that I can pass a URL together with a custom file_name.

Using the same IETF example I used above in the thread:

A simple spider downloading some files from IETF.org

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.item import Item, Field


class IetfItem(Item):
    files = Field()
    file_urls = Field()


class IETFSpider(Spider):
    name = 'ietfpipe'
    allowed_domains = ['ietf.org']
    start_urls = ['http://www.ietf.org']
    file_urls = [
        'http://www.ietf.org/rfc/rfc2616.txt',
        'http://www.rfc-editor.org/rfc/rfc2616.ps',
        'http://tools.ietf.org/html/rfc2616.html',
    ]

    def parse(self, response):
        for cnt, furl in enumerate(self.file_urls, start=1):
            yield IetfItem(file_urls=[{"file_url": furl, "file_name": "file_%03d" % cnt}])



Custom FilesPipeline

from scrapy.contrib.pipeline.files import FilesPipeline
from scrapy.http import Request

class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        for file_spec in item['file_urls']:
            yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})

    def file_path(self, request, response=None, info=None):
        return request.meta["file_spec"]["file_name"]
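
To wire this in, settings.py then points at the subclass instead of the stock pipeline; a sketch, assuming the class above is saved as yourproject/pipelines.py:

# settings.py (sketch -- adjust the module path to wherever MyFilesPipeline lives)
ITEM_PIPELINES = [
    'yourproject.pipelines.MyFilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'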



Hope this helps

/Paul.

Matt Cialini

Feb 25, 2014, 10:03:20 AM2/25/14
to scrapy...@googlegroups.com
Hi Paul,

Thanks for the suggestion. I'm trying to implement it now but the files aren't being written to disk correctly. What function in files.py handles the actual saving of the file?

Every item I pass into files.py is eventually a FileDownloadItem: {'file_urls': [a list of dict objects like {file_url: url, file_name: name}]}

I'll attach my code to this if you have time to look it over. Basically I think something is not being passed in correctly in files.py, but it's hard to search through and determine where.

Thanks so much Paul!

- Matt C
doi_spider.py
files.py
items.py
settings.py

Casey Klimkowsky

Mar 23, 2014, 10:00:27 PM3/23/14
to scrapy...@googlegroups.com
Hi Matt,

I was wondering whether you ever figured out your problem. I am also looking to use the FilesPipeline with custom file names. I was able to edit FilesPipeline itself to achieve this, but obviously it would be better practice to extend FilesPipeline and override the necessary methods instead. When I use a solution similar to Paul's, my files are not downloaded to my hard drive.

Thank you!

Matt Cialini

Mar 25, 2014, 1:34:31 AM3/25/14
to scrapy...@googlegroups.com
Hi Casey,

I ended up using Paul's suggestion and expanded on it to fit my needs. Basically, my spider creates a single FileDownloadItem instance: {'file_urls': [a list of dict objects {file_url: url, file_name: name}]}. Each dict has file_url set to the web URL and file_name set to the title to save under. The spider yields the item to the FilesPipeline, in which I edited a few functions to better match the item structure I passed in.

def _get_filesystem_path(self, path):
    return self.basedir + path[0]

def file_path(self, request, response=None, info=None):
    def _warn():
        #print "_warn"
        from scrapy.exceptions import ScrapyDeprecationWarning
        import warnings
        warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                      'file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

    # check if called from file_key with url as first argument
    if not isinstance(request, Request):
        _warn()
        url = request
    else:
        url = request.url

    # detect if file_key() method has been overridden
    if not hasattr(self.file_key, '_base'):
        _warn()
        return self.file_key(url)

    media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
    ret = request.meta["file_spec"]["file_name"]
    return ret[0] + media_ext




Matt Cialini

Mar 25, 2014, 8:52:08 AM3/25/14
to scrapy...@googlegroups.com
I actually sent you the old code I had. The new one edits the same functions, but instead of path[0] and ret[0] it uses just path and ret.
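
Putting that correction together, the two methods end up looking roughly like this (a sketch based on the description above, not the attached files; it assumes os is already imported at the top of files.py):

def _get_filesystem_path(self, path):
    # path is already the relative file name returned by file_path()
    return self.basedir + path

def file_path(self, request, response=None, info=None):
    # keep the extension from the URL, use the name passed in through meta
    media_ext = os.path.splitext(request.url)[1]
    return request.meta["file_spec"]["file_name"] + media_ext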