Streaming uploads to temp files

Ivan Sagalaev

Mar 10, 2006, 3:43:46 AM
to django-d...@googlegroups.com
There is a ticket for making Django not eat an entire file upload into
memory: http://code.djangoproject.com/ticket/1484

I have it basically working using cgi.FieldStorage, which stores POSTed
files in temp files. request.FILES items are now dict subclasses where
file['file'] returns a file-like object and file['content'] is filled
with the file's content on first access.
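A rough sketch of that lazy-'content' behavior in modern Python (the class and attribute names here are illustrative, not the actual patch):

```python
import tempfile

class UploadedFile(dict):
    """Dict-like FILES item: file['file'] returns the file-like temp object,
    while file['content'] loads the data into memory only on first access."""

    def __init__(self, filename, fileobj):
        super().__init__(filename=filename)
        self._fileobj = fileobj

    def __getitem__(self, key):
        if key == 'file':
            return self._fileobj
        if key == 'content' and not dict.__contains__(self, 'content'):
            # Lazy load: read the temp file only when 'content' is asked for.
            self._fileobj.seek(0)
            dict.__setitem__(self, 'content', self._fileobj.read())
        return dict.__getitem__(self, key)
```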

To finish the patch I'd like to have your opinions:

* Temp files are read from stdin in 8K chunks, which seems too
conservative (and slow) to me. What would be a reasonable amount? I
would say 32K, but that's based only on some dusty habits :-)
* I don't plan to keep the 'read everything into memory' behavior for
file uploads. However, if it makes sense to process file uploads in
memory, I could make a setting to switch between the two
behaviors.
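For reference, the chunked copy described in the first bullet might look something like this (a sketch with a made-up function name, using the 32K figure from above):

```python
import tempfile

CHUNK_SIZE = 32 * 1024  # the 32K suggested above; purely a tuning knob

def stream_to_tempfile(stream, length):
    """Copy `length` bytes from an input stream into a temp file,
    reading fixed-size chunks so memory use stays flat."""
    tmp = tempfile.TemporaryFile()
    remaining = length
    while remaining > 0:
        chunk = stream.read(min(CHUNK_SIZE, remaining))
        if not chunk:  # premature end of stream
            break
        tmp.write(chunk)
        remaining -= len(chunk)
    tmp.seek(0)
    return tmp
```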


Malcolm Tredinnick

Mar 10, 2006, 4:17:58 AM
to django-d...@googlegroups.com
Hi Ivan,

Some initial thoughts...

On Fri, 2006-03-10 at 11:43 +0300, Ivan Sagalaev wrote:
> There is a ticket for making Django not eat an entire file upload into
> memory: http://code.djangoproject.com/ticket/1484
>
> I have it basically working using cgi.FieldStorage, which stores POSTed
> files in temp files. request.FILES items are now dict subclasses where
> file['file'] returns a file-like object and file['content'] is filled
> with the file's content on first access.
>
> To finish the patch I'd like to have your opinions:
>
> * Temp files are read from stdin in 8K chunks, which seems too
> conservative (and slow) to me. What would be a reasonable amount? I
> would say 32K, but that's based only on some dusty habits :-)

Some number of megabytes per chunk (10, maybe). You are going to execute
a lot of loops and consequent data copying if you are doing multiple
passes per megabyte of data.

> * I don't plan to keep the 'read everything into memory' behavior for
> file uploads. However, if it makes sense to process file uploads in
> memory, I could make a setting to switch between the two
> behaviors.

What is the performance impact of this change? And does it vary
depending upon whether you use the mod_python or wsgi backend?

Writing to disk is obviously a lot slower than writing to memory and,
despite the way the initial ticket was written, this change is really
only necessary for very large files or large numbers of simultaneous
uploads of medium-sized files. For a production-level web server setup
that wants to be able to handle the occasional upload of a couple of
hundred megabytes, in-memory will still suffice, for example (and it's
fairly rare to allow multi-tens-of-megabytes uploads from Joe Public, so
you will normally have some idea of how frequent the large file uploads
are going to be, since your audience will be "under control").

If the performance hit is non-trivial, I think we need to retain the
option of in-memory usage (and I would even prefer that to be the
default), since that is likely to be the common case. I like the idea of
having an option to write temporary files to disk for the cases that
need it, since when you really need it there are often no good
alternatives. But I have concerns about performance in the common case.

Malcolm

Ivan Sagalaev

Mar 10, 2006, 5:11:29 AM
to django-d...@googlegroups.com
Malcolm Tredinnick wrote:

>Some number of megabytes per chunk (10, maybe). You are going to execute
>a lot of loops and consequent data copying if you are doing multiple
>passes per megabyte of data.
>
>

This is the whole point -- to do it in small chunks (but not very small
ones) to conserve memory. 2-3 megabyte files already take too much.

>What is the performance impact of this change? And does it vary
>depending upon whether you use the mod_python or wsgi backend?
>
>

Well... On my development machine (1.3 GHz Celeron, 256 MB RAM) this is
the difference between "works" and "doesn't work" :-). If I try to upload
a 14 MB file without temp files it takes *minutes* and swaps heavily (in
fact I never managed to wait for it to end). With streaming to temp files
it takes about a second. This holds both for the development server (which
is WSGI) and for Apache with mod_python. On a machine with more memory the
current variant would struggle less, but that's still only one request.

The thing is that those megabytes get multiplied:
1. sitting in stdin
2. copied into the raw_post_data string
3. copied in the MIME parser
4. copied into request.FILES

>Writing to disk is obviously a lot slower than writing to memory and,
>despite the way the initial ticket was written, this change is really
>only necessary for very large files or large numbers of simultaneous
>uploads of medium-sized files.
>

Yep. Especially for cases with many simultaneous requests. But I see
your point that memory may be cheaper than disk writes. I've just got the
impression that problems with memory are more common. Think of a photo
service, or a music exchange service that accepts zipped albums.

Then I'll make it with two settings (though I'm known to invent
not-very-good setting names :-) ):

STREAM_UPLOAD_ON_DISK = False # or True by default?
STREAM_UPLOAD_CHUNK_SIZE = 32 * 1024 * 1024

Ivan Sagalaev

Mar 10, 2006, 5:50:07 AM
to django-d...@googlegroups.com
Ivan Sagalaev wrote:

>With streaming to temp files it takes about a second.
>

Looks like I was wrong. It appears that cgi.FieldStorage stores content
in temp files only when the uploaded parts have Content-Length set. But
the browser doesn't set it, so FieldStorage reads the data with readline()
and stores it in memory anyway.

So all these disk vs. memory considerations are irrelevant :-). Now I
wonder whether it makes sense to deal with temp files at all, rather than
just switching Django's MIME parser from email.Message to FieldStorage...

Ivan Sagalaev

Mar 10, 2006, 6:07:42 AM
to django-d...@googlegroups.com
Ivan Sagalaev wrote:

>Looks like I was wrong.
>

And again :-(

> It appears that cgi.FieldStorage stores content
>in temp files only when the uploaded parts have Content-Length set. But
>the browser doesn't set it, so FieldStorage reads the data with readline()
>and stores it in memory anyway.
>
>

It does use readline() when the length is unknown, but it keeps the data
in memory only until the content exceeds 1000 bytes (hard-coded). Then it
creates a temp file and dumps the data there.
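That spill-to-disk-after-a-threshold behavior can be sketched like this (the class is illustrative, not FieldStorage's actual code; the 1000-byte figure is the hard-coded limit mentioned above):

```python
import tempfile
from io import BytesIO

MEMORY_THRESHOLD = 1000  # cgi.FieldStorage's hard-coded limit

class SpoolingBuffer:
    """Keep written data in memory until it exceeds the threshold, then
    move everything to a real temp file, the way FieldStorage handles
    parts of unknown length."""

    def __init__(self):
        self._buf = BytesIO()
        self.rolled = False  # True once the data has been spilled to disk

    def write(self, data):
        self._buf.write(data)
        if not self.rolled and self._buf.tell() > MEMORY_THRESHOLD:
            tmp = tempfile.TemporaryFile()
            tmp.write(self._buf.getvalue())
            self._buf = tmp
            self.rolled = True
```

The standard library later grew tempfile.SpooledTemporaryFile, which implements exactly this pattern.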

>So all these disk vs. memory considerations are irrelevant :-). Now I
>wonder whether it makes sense to deal with temp files at all, rather than
>just switching Django's MIME parser from email.Message to FieldStorage...
>
>

Then it's not so irrelevant after all :-). I'll make a setting for
choosing the type of storage, but not for the chunk size.

Ivan Sagalaev

Mar 11, 2006, 1:23:12 AM
to django-d...@googlegroups.com
Ivan Sagalaev wrote:

>Then it's not so irrelevant after all :-). I'll make a setting for
>choosing the type of storage, but not for the chunk size.
>
>

I've submitted a patch. It looks faster than email.Message even
without disk storage. Can anyone please test it under magic-removal?

luca

Mar 11, 2006, 4:03:44 AM
to Django developers
Hi!
This is a great improvement; I will try it asap.
I'm sure you did, but just for the record, take a look at CherryPy's
approach to this: http://www.cherrypy.org/wiki/FileUpload
After a simple recipe there is another one that works very well. I've
tested it on a local network with several ISO images without a problem;
RAM and CPU usage were always low. Awesome.
I know that CherryPy is a different beast, but you can find some useful
optimization advice there.

Hope this helps,
luca

Ivan Sagalaev

Mar 11, 2006, 5:20:32 AM
to django-d...@googlegroups.com
luca wrote:

>I'm sure you did, but just for the record, take a look at CherryPy's
>approach to this: http://www.cherrypy.org/wiki/FileUpload
>
>

They do exactly the same thing: they parse the uploaded data with
cgi.FieldStorage, which streams it to a temp file.

In fact there is still a problem with this approach that I mentioned
earlier. cgi.FieldStorage uses readline() to read an uploaded file from
the input stream. This means that the size of the chunks is unpredictable
(it depends on the occurrence of line endings in a binary file). So,
theoretically, if the file doesn't contain line endings it will be read
into memory entirely.
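That worst case is easy to demonstrate (a minimal illustration, not the Django or CherryPy code):

```python
from io import BytesIO

# A binary payload with no b"\n" anywhere: a single readline() call
# returns the entire payload, so a readline()-based parser ends up
# holding the whole upload in memory at once.
payload = b"\x00" * 1000000
chunk = BytesIO(payload).readline()
assert len(chunk) == len(payload)
```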

It's rather hard to parse a MIME stream in predictable chunks while
keeping track of part boundaries. I did it a long time ago in my webmail
system written in C++. Maybe I'll port it to Django some day, when I stop
shuddering whenever I think of it :-)
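The tricky part is that a boundary can straddle two reads. One way to handle it, sketched here with made-up names (this is not the webmail code or any Django patch), is to hold back a tail of len(boundary) - 1 bytes between reads:

```python
def split_until_boundary(stream, boundary, chunk_size=8192):
    """Yield the data that precedes `boundary`, reading the stream in
    fixed-size chunks. A partial boundary at the end of one read is kept
    in a tail buffer so it can be matched against the next read."""
    buf = b""
    while True:
        piece = stream.read(chunk_size)
        if not piece:
            if buf:
                yield buf  # stream ended without a boundary
            return
        buf += piece
        idx = buf.find(boundary)
        if idx != -1:
            yield buf[:idx]
            return
        # Flush all but the bytes that could start a split boundary.
        keep = len(boundary) - 1
        if len(buf) > keep:
            yield buf[:-keep]
            buf = buf[-keep:]
```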

Istvan Albert

Mar 26, 2006, 3:42:18 PM
to Django developers
> But I have concerns about performance in the common case.

File upload is inherently a slow process, strictly limited by the
connection speed, so it is not clear that any kind of performance
problems would be noticeable at all (even if there were such problems).

Loading files into memory is *extremely* limiting; there are a great
many applications (photo or music albums, rich web applications) that
need file upload. The last thing one wants is to have to worry about
ballooning the memory footprint because of something as trivial as
uploading a file.

No sane application slurps up files of unknown size, so why would a
generic web framework do so?

> rare to allow multi-tens-of-megabytes uploads from Joe Public

The question, rather, is why make Django *incapable* of dealing with
files without worrying about what a user might choose to upload, when
the solution is rather trivial?

And by the way, thanks for the nice patch, Ivan. I'm (re)writing an
application in Django and would never have done so had there not been
the hope that this issue would be settled properly.

Istvan

Ivan Sagalaev

Mar 26, 2006, 11:55:36 PM
to django-d...@googlegroups.com
Istvan Albert wrote:

>And by the way, thanks for the nice patch, Ivan. I'm (re)writing an
>application in Django and would never have done so had there not been
>the hope that this issue would be settled properly.
>
>

Then you will like another one :-). This week I'll try to implement a
solution for the paired problem: streaming output of large files. Right
now Django treats the response as a string in memory, which is not
acceptable for my project (I need to serve users 70 MB zipped mp3 albums
while controlling speed and download success).
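The usual shape of such a fix is to let the response body be an iterator instead of a string; a minimal sketch (the function name is made up, not the eventual patch):

```python
def stream_file(path, chunk_size=64 * 1024):
    """Yield a file's contents in fixed-size chunks, so a response can be
    written out piece by piece instead of built as one string in memory."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

A framework can then iterate over stream_file(...) and write each chunk to the client, optionally pausing between chunks to throttle speed.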
