Streaming Uploads Discussion

103 views
Skip to first unread message

Mike Axiak

unread,
Mar 19, 2008, 12:30:55 AM3/19/08
to Django developers
Hello,

It's become clear to me that the API for 2070 [1] (which is currently
a mess :) has never really been discussed by the developers, and I've
been thinking about ways of organizing it such that one can add
features to file uploading without modifying the core.
Here's an overview of what I had in mind.

(Any comments/suggestions/angry cries are greatly appreciated.)

Extremely Visible Changes
================

request.set_upload_handler(<upload_handler>)
--------------------------------------------------------------
This new function will allow one to register a new FileUploadHandler
object to deal with the incoming data.
The API for the FileUploadHandler (and the default child
TemporaryFileUploadHandler) is discussed below.

request.FILES
-----------------
This is no longer a MultiValueDict of raw content, but a
MultiValueDict of UploadedFile objects.
This will probably hurt the most, as there are probably applications
assuming that they can take the content from this dict.
The API of the UploadedFile (and the default child
TemporaryUploadedFile) is discussed below.


APIs for New Objects
=============

FileUploadHandler
-----------------------
Objects of this type must define the following methods:
1. new_file(filename, content_length) -- A signal that a new file
has been reached by the parser.
2. receive_data_chunk(raw_data, start, stop) -- Some data has been
received by the parser. A non-nonzero return value indicates
that
the file upload should cease.
3. file_complete(file_size) -- A signal that the current file has
completed (no more than one file will ever be dealt with).
The expected return value is an object of UploadedFile.
4. upload_complete() -- A signal that all uploads have finished
successfully.
5. get_chunk_size() -- Returns a number that represents the maximum
number
of bytes read into memory for each chunk.

By adding a set_upload_handler() method to request, anyone can
override the default upload handler. However, this must be done before
the POST was accessed, and it is probably recommended we raise an
error if someone tries to set a new upload handler after the FILES
MultiValueDict is populated.

The default handler (TemporaryFileUploadHandler) will stream the data
to a temporary file for each part of the upload, and return a
corresponding TemporaryUploadedFile object that represents that data.
In addition, the TemporaryFileUploadHandler can know to stream it to
memory if the content_size is small enough (and start streaming it
into disk if the header lied for some grace period).

UploadedFile
----------------
Objects of this class represent files that were just uploaded and
handled by some handler above. I'm assuming that this will be the most
difficult class to nail, since it has to balance a few different
aspects:
1. Encapsulating what it means to be a uploaded file.
2. Making it easy for a developer to send it to a FileBackend and
have it Do The Right Thing.
3. Making it easy for a developer to get the content and do
something with it.

Methods of the UploadedFile:
1. open() -- Open the file for reading.
2. read([num_bytes]) -- Read a number of bytes from the file.
3. close() -- Round out the standard read file operations.
4. chunk([chunk_size]) -- Generates chunks from the file (if the
content is in memory already, it should ignore chunk_size).
5. filename() -- The filename from the content-disposition header.
6. multiple_chunks([chunk_size]) -- Whether or not you can expect
multiple chunks from chunk() (Boolean). Useful if you want to
do
something particularly when you know you can put it all into
memory at once.
7. file_size() -- The size in bytes of the uploaded file.

Added method of a TemporaryUploadedFile:
8. temporary_file_path() -- The whole path (directory and basename)
of the temporary file on disk. A FileBackend may decide to move
a file (using the OS move operations) if it has a non-empty
temporary_file_path.


Notes: After talking with Marty online, it seems that this will work
with the FileBackends, provided that they know how to deal with it.
For instance, if the upload is going to a temporary file, and the
FileBackend is a standard file, then the FileBackend for standard
files should know from temporary_file_path() that it can move, and
perform an OS move operation. Otherwise, the FileBackends should use
the chunk() iterator and do whatever it needs in chunks.

It's interesting to note that with this framework a lot of interesting
possibilities open up. I will not write any of the code to do anything
but the basic temporary disk storage, but here are a few interesting
examples of what can happen:
- Gzipping data on the fly [GZipFileUploadHandler +
GZipFileBackend].
- Saving file to another FTP Server on the fly
[FTPFileUploadHandler +
NoOpFileBackend].
- Having Cool Ajax-y file uploads [AjaxProgressUploadHandler + Any
Backend].
- Having user-based quotas [QuotaUploadHandler + Any Backend].
...

Anyway, the list immediately above isn't what's important right now,
but I'd like to get the API
close enough so we can break things once.

Cheers,
Michael Axiak

Mike Axiak

unread,
Mar 19, 2008, 1:34:22 AM3/19/08
to Django developers
One slight modification so far...

On Mar 19, 12:30 am, Mike Axiak <mcax...@gmail.com> wrote:
> Hello,
>
> [...]
>
> FileUploadHandler
> -----------------------
> Objects of this type must define the following methods:
> 1. new_file(filename, content_length) -- A signal that a new file
> has been reached by the parser.
> 2. receive_data_chunk(raw_data, start, stop) -- Some data has been
> received by the parser. A non-nonzero return value indicates
> that
> the file upload should cease.

Scratch that return value. You will have to raise StopUpload and
SkipFile to stop the entire upload and skip the file respectively.

>
>
> Cheers,
> Michael Axiak

-Mike

Casper Jensen

unread,
Mar 19, 2008, 7:06:04 AM3/19/08
to django-d...@googlegroups.com
The filename returned by UploadedFile, will that be an unicode object
or a string? Currently request.FILES uses string objects for the
filenames, which results in some problems with uploaded files with
non-latin filenames. See #6009.

Ivan Sagalaev

unread,
Mar 19, 2008, 7:31:45 AM3/19/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> Extremely Visible Changes
> ================
>
> request.set_upload_handler(<upload_handler>)
> --------------------------------------------------------------
> This new function will allow one to register a new FileUploadHandler
> object to deal with the incoming data.
> The API for the FileUploadHandler (and the default child
> TemporaryFileUploadHandler) is discussed below.

I remember long time ago I had a bit different idea.

There are actually two aspects of what happens to an uploaded file:

- its data should be stored somewhere
- a user might want to do some additional processing (count bytes, unzip
it on the fly, resend it to a remote machine)

And I thought that for the first part -- storing it somewhere -- there
shouldn't be actually any handlers, Django should just store it in temp
files indefinitely. However this your proposal may be better. Because,
given my example of unzipping files on the fly, user might not want
event store original file as it is. What do you think about it?

> request.FILES
> -----------------
> This is no longer a MultiValueDict of raw content, but a
> MultiValueDict of UploadedFile objects.
> This will probably hurt the most, as there are probably applications
> assuming that they can take the content from this dict.

I believe this can be made backwards compatible. In my patch[1] to
ticket 1484 (which was duplicated by 2070 long ago) I had a FileDict
class that was lazily fetching file content upon accessing
request.FILES['some_file']['content']. Have a look at it.

> 2. receive_data_chunk(raw_data, start, stop) -- Some data has been
> received by the parser.

Am I right thinking that raw_data is not a raw data from socket but is
already decoded from whatever content-transfer-encoding it might be in
(i.e. base64)?

> 5. get_chunk_size()

Why not just an attribute chunk_size? It's shorter and it's almost
always will be just a plain class attribute.

> By adding a set_upload_handler() method to request, anyone can
> override the default upload handler. However, this must be done before
> the POST was accessed, and it is probably recommended we raise an
> error if someone tries to set a new upload handler after the FILES
> MultiValueDict is populated.

Instead of having users figure out where to stick a call to
set_upload_handler we could steal the design from middleware and just
have UPLOAD_HANDLERS setting... It might not be such a good idea if
people will often want different logic per view. However I think a
single global setting is needed for the most common use case: store
everything in temp files.

> It's interesting to note that with this framework a lot of interesting
> possibilities open up. I will not write any of the code to do anything
> but the basic temporary disk storage, but here are a few interesting
> examples of what can happen:
> - Gzipping data on the fly [GZipFileUploadHandler +
> GZipFileBackend].
> - Saving file to another FTP Server on the fly
> [FTPFileUploadHandler +
> NoOpFileBackend].
> - Having Cool Ajax-y file uploads [AjaxProgressUploadHandler + Any
> Backend].
> - Having user-based quotas [QuotaUploadHandler + Any Backend].

Heh :-). I was inventing my use cases before I get to this point :-).

[1]: http://code.djangoproject.com/attachment/ticket/1484/1484.m-r.6.diff

Jacob Kaplan-Moss

unread,
Mar 19, 2008, 8:47:02 AM3/19/08
to django-d...@googlegroups.com
Hi Mike --

*Very* well done -- I'm in agreement with nearly every aspect of your
proposal. Major props for taking on such a sticky issue -- walking
into #2070 is a bit like exploring an overgrown jungle :) A few
comments inline below, but in general I quite like your API and would
like to see your code.

On Tue, Mar 18, 2008 at 11:30 PM, Mike Axiak <mca...@gmail.com> wrote:
> request.set_upload_handler(<upload_handler>)

I especially like this -- it neatly encapsulates the fact that
different folks are going to want quite different file upload
behavior. A few things to think about:

* Do you think it's worth firing a signal
(file_upload_started/file_upload_finished) from request or from the
base upload handler? This would let decentralized processing of file
uploads, but could get confusing.
* Do you think we should allow multiple upload handlers (which makes
this call into something like request.add_upload_handler(handler))?
The other option would be to just make folks compose upload handlers
with a wrapper class, which is hardly a hardship.

I don't have answers to either of these questions; something to think
about, at least.

> request.FILES
> -----------------
> This is no longer a MultiValueDict of raw content, but a
> MultiValueDict of UploadedFile objects.
> This will probably hurt the most, as there are probably applications
> assuming that they can take the content from this dict.

It seems to me that you could pretty easily provide a
backwards-compatible API by defining __getitem__ on UploadedFile;
raise a DeprecationWarning there but provide the data in the "old"
style.

> 5. get_chunk_size() -

Just make this FileUploadHandler.chunk_size -- no reason for getters/setters.

Again, thanks -- this is good stuff.

Jacob

Ivan Sagalaev

unread,
Mar 19, 2008, 9:19:56 AM3/19/08
to django-d...@googlegroups.com
Jacob Kaplan-Moss wrote:
>> request.set_upload_handler(<upload_handler>)
>
> I especially like this -- it neatly encapsulates the fact that
> different folks are going to want quite different file upload
> behavior.

But what about default behaviour? There should be some place to say "all
file uploads should go on disk".

P.S. In my other response we seem to agree on __getitem__ for
request.FILES and an attribute .chunk_size almost exactly :-).

Mike Axiak

unread,
Mar 19, 2008, 12:08:59 PM3/19/08
to Django developers
Good question. The issue of encoding is a little murky. As of now, my
(almost working) patch will pretty much try to encode everything
(using force_to_unicode) *except* data in a file upload.
I'm not exactly sure we'll ever support encoding data in file uploads,
but if we do it's definitely a tricky problem.
The issue is two-fold:
- What files should we encode and which shouldn't we? (do we do
matching on the mimetype? that doesn't sound nice)
- Dealing with multibyte boundaries in the strings while streaming
the data. (Not a question, just a fun night of work.)

Anyway, I'll go back to ignoring that issue for now unless I get angry
cries for it.

-Mike

Mike Axiak

unread,
Mar 19, 2008, 12:18:19 PM3/19/08
to Django developers


On Mar 19, 7:31 am, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:
> [...]
> Am I right thinking that raw_data is not a raw data from socket but is
> already decoded from whatever content-transfer-encoding it might be in
> (i.e. base64)?

Yes. The idea is to abstract away much of the HTTP work and just get
the raw data after it's been pulled from whatever casing it's in.

> Instead of having users figure out where to stick a call to
> set_upload_handler we could steal the design from middleware and just
> have UPLOAD_HANDLERS setting... It might not be such a good idea if
> people will often want different logic per view. However I think a
> single global setting is needed for the most common use case: store
> everything in temp files.

The idea is that you can set this anywhere, so long as you haven't
accessed POST and FILES yet. So a view will be a place you can do
this.
The TemporaryFileUploadHandler will be default, and is what will
happen "out-of-the-box".
(Well...there's the idea of using the MemoryFileUploadHandler if the
Content-Length exists and is small...but that's beside the point. :)

-Mike

Jacob Kaplan-Moss

unread,
Mar 19, 2008, 12:40:45 PM3/19/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 8:19 AM, Ivan Sagalaev
<man...@softwaremaniacs.org> wrote:
> But what about default behaviour? There should be some place to say "all
> file uploads should go on disk".

I think the default behavior doesn't need to be a setting:
upload-to-disk for all files over N bytes (1M, maybe?), and
upload-to-RAM for anything smaller.

Jacob

Ivan Sagalaev

unread,
Mar 19, 2008, 12:41:41 PM3/19/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> The idea is that you can set this anywhere, so long as you haven't
> accessed POST and FILES yet. So a view will be a place you can do
> this.
> The TemporaryFileUploadHandler will be default, and is what will
> happen "out-of-the-box".

My concern is for this "out-of-the-box" default not to be hard-coded.
There should be some way for a user to set its own global (i.e. non
per-view) handler. I understand that everyone can write a one-line
middleware that sets the handler but while we're at designing stage I
thought we could make it a setting instead of another FAQ :-)

Ivan Sagalaev

unread,
Mar 19, 2008, 12:47:24 PM3/19/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> Good question. The issue of encoding is a little murky. As of now, my
> (almost working) patch will pretty much try to encode everything
> (using force_to_unicode) *except* data in a file upload.

This is the right way. Uploaded file is almost never will be treated as
text and unicode doesn't make sense for binary data. It's also
especially useless for the default behaviour of storing uploaded data to
disk: you'll have to encode just decoded data back to a stream of bytes
to store it in a file.

Ivan Sagalaev

unread,
Mar 19, 2008, 12:53:00 PM3/19/08
to django-d...@googlegroups.com
Jacob Kaplan-Moss wrote:
> I think the default behavior doesn't need to be a setting:
> upload-to-disk for all files over N bytes (1M, maybe?), and
> upload-to-RAM for anything smaller.

Sorry, I didn't mean "default" in a sense of absent user settings. I
want to have a way to set a user-specific handler not only on per-view
basis but as a "default" for all upload requests.


To think of a use-case... Imagine some system where /tmp directory is on
a very small partition. An admin would want to direct all uploads
(potentially big) to some other device. And he wants to do it only for
this Django site, not the whole system. Something like this...

Mikhail Gusarov

unread,
Mar 19, 2008, 12:55:51 PM3/19/08
to django-d...@googlegroups.com
Twas brillig at 19:53:00 19.03.2008 UTC+03 when Ivan Sagalaev did gyre and gimble:

IS> An admin would want to direct all uploads (potentially big) to some
IS> other device. And he wants to do it only for this Django site, not
IS> the whole system.

Security also comes in mind: it may be undesirable to use same directory
for different sites to prevent information leakage (even the fact of
file upload may be security-sensitive).

--

Mike Axiak

unread,
Mar 19, 2008, 1:46:51 PM3/19/08
to Django developers

On Mar 19, 12:53 pm, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:
> [...]
> To think of a use-case... Imagine some system where /tmp directory is on
> a very small partition. An admin would want to direct all uploads
> (potentially big) to some other device. And he wants to do it only for
> this Django site, not the whole system. Something like this...

Since this is part of the request, the user is welcome to write a
middleware class that sets this before any view gets started in the
entire project.
Does this not do what you want?

-Mike

Ivan Sagalaev

unread,
Mar 19, 2008, 1:49:14 PM3/19/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> Since this is part of the request, the user is welcome to write a
> middleware class that sets this before any view gets started in the
> entire project.
> Does this not do what you want?

It does. But I was thinking of a setting. However I don't insist on it.
Since you and Jacob seem to not feel that it's necessary let it be just
a method on request. I was merely raising a concern.

Mike Axiak

unread,
Mar 19, 2008, 1:54:31 PM3/19/08
to Django developers
The way I see it is that there's no reason we can't add a contrib
middleware later that adds a setting for the default request handler.

-Mike

Chad Maine

unread,
Mar 19, 2008, 1:56:55 PM3/19/08
to django-d...@googlegroups.com
This looks great.  I expect it will make proper integration with something like Tramline (www.infrae.com/products/tramline) very easy to do but perhaps no longer necessary.

Chad

Mike Axiak

unread,
Mar 24, 2008, 2:02:23 PM3/24/08
to Django developers
Now that we actually have a working patch [1], there are a few details
I'd like to raise here.

Major Issues
=======

Supporting dictionaries in form code
------------------------------------
While we can have file objects that support the dictionary lookup to
make those backwards compatible, there's still the API of the forms
code.
As an example, there are a lot of tests that look like::

TextFileForm(data={'description': u'Assistance'}, files={'file':
{'filename': 'test1.txt', 'content': 'hello world'}})

This is an example of something using a dictionary instead of a
UploadedFile instance when using the forms. However, the forms should
*probably* be using the new fileobject API (use .chunks()/.read()
rather than ['content']), so that would mean they would break if fed a
dictionary like the above example. I see three options to deal with
this:

1. Leave forms code alone, causing (a) DeprecationWarnings and (b)
the files to be read into memory when saving.
2. Modify the form code to access the attributes of the
UploadedFile, but on AttributeError use the old dict-style interface
and emit a DeprecationWarning.
3. Modify form code to use the new UploadedFile object, but just
break on the old dictionary access as it's largely an internal API
anyway.

Without thinking about it, I wrote the patch for (3) and modified the
tests appropriately. I'm now thinking people will probably want (2)
more. The other issue is what we should do with the tests. Should we
leave them? Should we copy them and create a copy of them for the new
style? Should we replace them with the new style?


Having upload_handlers be a list object
---------------------------------------
I know that .set_upload_handler() was really cool, but I've discarded
it to support multiple handlers (see below for details). So now I had
to think of what to replace it with. At first I thought:
.append_upload_handler(), but then I realized that I may want
.insert_upload_handler(), and .delete_upload_handler(). So before long
I realized that I just want people to have a list interface.
Therefore, I decided to just leave it as a plain old list. Thus, to
add an upload handler, you'd just write::

request.upload_handlers.append(some_upload_handler)

And to replace them all::

request.upload_handlers = [some_upload_handler]

I've made a few efforts to ensure that it will raise an error if the
upload has already been handled. I know this isn't as simple as the
.set_upload_handler() interface, but I think it's the simplest way we
can support the list modification/replacement in a useful fashion.
What do people think about this?


(Mis)Handling Content-Length
----------------------------
There's probably not much room for argument here, but it's worth
asking a larger group. Currently when you try uploading Large files
(~2GB and greater), you will get a weird Content-Length header (less
than zero, overflowing). The problem is two-fold:
1. We can't easily exhaust the request data without knowing its
length.
2. If we don't exhaust the request data, we get a sad Connect Reset
error from the browsers.
This means that even if we wanted to display a useful error page to
people telling them why this happened, it won't happen because we'll
just get Connection Reset errors. We can just ignore it, or we can try
modifying the request streams to set timeouts and exhaust the input.
My perusal of Apache docs tells me that there's discard_request_body,
but unfortunately even this seems to depend on Content-Length. Should
I/we just ignore this and expect people to be sane and upload
reasonable data?

Revised API
===========

I was about to write a huge document that described the API in here.
But then I realized I might as well write it in restructuredtext. Then
I realized I might as well make it formatted like a Django document.
And well...you can see the outcome at:
http://orodruin.axiak.net/upload_handling.html

Let me know if you have any comments on the API. (Things like how it
deals with multiple handlers could be up for discussion.)

Addendum
========

One last note: In order to test the API and everything, I created a
google code project 'django-uploadutils' [2] which I hope will become
a collection of upload handlers for doing common tasks. If anyone is
interested in helping out with this, I'd be happy to add you as a
contributor.


Regards,
Mike


1: http://code.djangoproject.com/ticket/2070
2: http://code.google.com/p/django-uploadutils/

Ivan Sagalaev

unread,
Mar 25, 2008, 6:04:58 AM3/25/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> 1. Leave forms code alone, causing (a) DeprecationWarnings and (b)
> the files to be read into memory when saving.
> 2. Modify the form code to access the attributes of the
> UploadedFile, but on AttributeError use the old dict-style interface
> and emit a DeprecationWarning.
> 3. Modify form code to use the new UploadedFile object, but just
> break on the old dictionary access as it's largely an internal API
> anyway.

I don't like #1 because there's no point to keep deprecated code in
Django where we can fix it. And #3 is indeed broken because there is
much code in the wild that uses request.FILES directly. It's a public
API after all.

#2 looks reasonable to me.

> I realized that I just want people to have a list interface.
> Therefore, I decided to just leave it as a plain old list. Thus, to
> add an upload handler, you'd just write::
>
> request.upload_handlers.append(some_upload_handler)
>
> And to replace them all::
>
> request.upload_handlers = [some_upload_handler]
>
> I've made a few efforts to ensure that it will raise an error if the
> upload has already been handled. I know this isn't as simple as the
> .set_upload_handler() interface, but I think it's the simplest way we
> can support the list modification/replacement in a useful fashion.
> What do people think about this?

It would be good to invent a declarative setting for this but since it's
a per-view thing it would be hard. So may be indeed a list would be enough.

> Currently when you try uploading Large files
> (~2GB and greater), you will get a weird Content-Length header (less
> than zero, overflowing).

> ...


> Should
> I/we just ignore this and expect people to be sane and upload
> reasonable data?

If Content-Length header is crewed then neither we, nor users can do
anything about it. When these file volumes become more widespread I
believe browsers will fix themselves to send correct Content-Length.

Mike Axiak

unread,
Mar 25, 2008, 11:25:31 AM3/25/08
to Django developers
Hey,

On Mar 25, 6:04 am, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:
> I don't like #1 because there's no point to keep deprecated code in
> Django where we can fix it. And #3 is indeed broken because there is
> much code in the wild that uses request.FILES directly. It's a public
> API after all.
I thought this would be confused. Currently when you want to upload
something to a form, you would do something along the lines of::

f = SomeForm(request.POST, request.FILES)

This actually invokes two separate interfaces: the objects inside
request.FILES, and the way the form deals with it. Currently, the form
assumes request.FILES contains dictionaries, and so (like the
documentation) you can feed it dictionaries of data and it will work.
How the form treats this data is the specific interface I'm talking
about, not what request.FILES contains.

Anyway, I'll modify the patch to do #2 anyway. Thanks for the input.

-Mike

Malcolm Tredinnick

unread,
Mar 25, 2008, 3:32:21 PM3/25/08
to django-d...@googlegroups.com

On Mon, 2008-03-24 at 11:02 -0700, Mike Axiak wrote:
> Now that we actually have a working patch [1], there are a few details
> I'd like to raise here.

I haven't had time to sit down and devote to reading through the whole
patch, but I have a possibly very easy question that I can't answer from
the docs at the moment.

I'm a simple guy. I like simple stuff. So if I'm deploying a Django
system and it will handle file uploads from some forms and all I want to
do is ensure that large file uploads aren't held in memory, how do I do
that? In other words, I want to avoid the problem that originally
prompted this ticket and the related one.

Does this Just Work(tm) out of the box?

Malcolm

--
Remember that you are unique. Just like everyone else.
http://www.pointy-stick.com/blog/

Mike Axiak

unread,
Mar 25, 2008, 3:38:32 PM3/25/08
to Django developers
Hey,

On Mar 25, 3:32 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> Does this Just Work(tm) out of the box?
Yes. :-)

-Mike

Mike Axiak

unread,
Mar 26, 2008, 1:59:19 AM3/26/08
to Django developers
On Mar 25, 3:38 pm, Mike Axiak <mcax...@gmail.com> wrote:
> Yes. :-)
Doh. I didn't realize you actually wanted me to do more work.

The newly updated doc can be found at the same URL:
http://orodruin.axiak.net/upload_handling.html

Be sure to refresh as the browser caching is high on that page.

-Mike

Malcolm Tredinnick

unread,
Mar 26, 2008, 4:45:37 AM3/26/08
to django-d...@googlegroups.com

On Tue, 2008-03-25 at 22:59 -0700, Mike Axiak wrote:
> On Mar 25, 3:38 pm, Mike Axiak <mcax...@gmail.com> wrote:
> > Yes. :-)
> Doh. I didn't realize you actually wanted me to do more work.

I didn't either. I thought I may have just been missing something
obvious. But now that you've written the extra bits at the top of the
document, it makes more sense to me as a user. Thanks.

Okay, so this all has to go on the review pile now I guess. That should
be fun, for some value of "fun". Nice work, Mike.

Regards,
Malcolm

--
No one is listening until you make a mistake.
http://www.pointy-stick.com/blog/

Jacob Kaplan-Moss

unread,
Apr 4, 2008, 12:28:32 PM4/4/08
to django-d...@googlegroups.com
On Mon, Mar 24, 2008 at 1:02 PM, Mike Axiak <mca...@gmail.com> wrote:
> Now that we actually have a working patch [1], there are a few details
> I'd like to raise here.

Woo! Thanks for your hard work. My thoughts on your questions follow inline:

> Supporting dictionaries in form code
> ------------------------------------

> [...]


> TextFileForm(data={'description': u'Assistance'}, files={'file':
> {'filename': 'test1.txt', 'content': 'hello world'}})

What would an equivalent line look like under the new system? That is, what do
folks need to change their tests to?

> I see three options to deal with this:

> [...]


> 2. Modify the form code to access the attributes of the
> UploadedFile, but on AttributeError use the old dict-style interface
> and emit a DeprecationWarning.

Yes, this is correct approach. Something along these lines:

if isinstance(uploadedfile, dict):
warn(...)
uploadedfile = uploadedfile_from_dict(uploadedfile)

Option #3 is unacceptable: if at all possible we want people's tests to
not break. Warnings are fine; breakage if avoidable is a bad idea.

> The other issue is what we should do with the tests. Should we
> leave them? Should we copy them and create a copy of them for the new
> style? Should we replace them with the new style?

The latter -- fix all the tests to use the new syntax as a demo of how it's
supposed to be done.

> Having upload_handlers be a list object
> ---------------------------------------

> [...]


> Therefore, I decided to just leave it as a plain old list. Thus, to
> add an upload handler, you'd just write::
>
> request.upload_handlers.append(some_upload_handler)
>
> And to replace them all::
>
> request.upload_handlers = [some_upload_handler]

If we do indeed need this -- see below -- then this is the right way to do it.

> What do people think about this?

I'm thinking YAGNI here. Why would I need multiple upload handlers? I think you
need to talk me through your thinking here, because at first glance this smacks
of overegineering. Remember that multiple handlers can always be accomplished by
composition anyway, so unless there's a good reason I think set_upload_handler()
is just much cleaner.

Similarly, I'm a bit suspicious of FILE_UPLOAD_HANDLERS. Couldn't you just write
a simple middleware to do the same thing? As a general principle, you should try
as hard as you can to avoid introducing new settings; if there's another way
just do it that way. In this case, I'd just document that if you want to use a
custom upload handler globally that you should write a middleware class.

> (Mis)Handling Content-Length
> ----------------------------
> [...]


> There's probably not much room for argument here, but it's worth
> asking a larger group. Currently when you try uploading Large files
> (~2GB and greater), you will get a weird Content-Length header (less
> than zero, overflowing).

Personally, I can't see any reason to care too much about people trying
to upload 2GB files through the web, anyway. Anyone allowing uploads should
have a sane upload size limit set at the web server level. Anyone who's allowing
uploads of over 2GB is just asking to get DOSed.

I think this is a place where we just add a note to the docs about setting the
Apache/lighttpd/etc. upload limit and move on.

> Revised API
> ===========
> [...]


> Let me know if you have any comments on the API. (Things like how it
> deals with multiple handlers could be up for discussion.)

I quite like the API. I'm not sure why you'd need to return a custom
UploadedFile from an upload handler, but props for documenting the interface
anyway :)

Thanks again for the hard work -- this looks very good!

Jacob

Mike Axiak

unread,
Apr 4, 2008, 2:04:11 PM4/4/08
to Django developers
Thanks for the review!

On Apr 4, 12:28 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
wrote:
> >    TextFileForm(data={'description': u'Assistance'}, files={'file':
> >  {'filename': 'test1.txt', 'content': 'hello world'}})
>
> What would an equivalent line look like under the new system? That is, what do
> folks need to change their tests to?
The patch does this for all the tests in tests/, but here's an
example:

TextFileForm(data={'description': u'Assistance'}, files={'file':
SimpleUploadedFile('test1.txt', 'hello world')})

Note the SimpleUploadedFile, which takes the filename and the content.

> Yes, this is correct approach. Something along these lines:
>
>    if isinstance(uploadedfile, dict):
>         warn(...)
>         uploadedfile = uploadedfile_from_dict(uploadedfile)

I like this way better than special-casing it. I'll update this in the
patch.

> I'm thinking YAGNI here. Why would I need multiple upload handlers? I think you
> need to talk me through your thinking here, because at first glance this smacks
> of overegineering. Remember that multiple handlers can always be accomplished by
> composition anyway, so unless there's a good reason I think set_upload_handler()
> is just much cleaner.

Composition gets a little tricky. I realized this when I wrote the
stuff that handles the composition.
Right now we have (InMemoryUploadHandler, TemporaryFileUploadHandler)
in a sequence, and now the memory handler doesn't have to worry about
what happens nor does the TemporaryFileUploadHandler need to worry
about what's before.
Basically every upload handler I can think of writing benefits from
this. For example, the UploadProgressUploadHandler would be wedged
between:

(InMemoryUploadHandler, UploadProgressUploadHandler,
TemporaryFileUploadHandler)

So that file progress only gets computed for uploads that go into the
file system. Avoiding the overhead for small uploads.

> Similarly, I'm a bit suspicious of FILE_UPLOAD_HANDLERS. Couldn't you just write
> a simple middleware to do the same thing? As a general principle, you should try
> as hard as you can to avoid introducing new settings; if there's another way
> just do it that way. In this case, I'd just document that if you want to use a
> custom upload handler globally that you should write a middleware class.

This took a while for me to get sold on. I can only think of 5 or 6
useful upload handlers, but that's still 1956 possible orderings. It'd
be tough to make a handler easy to install when it has to be aware of
where it belongs. This functionality could be replicated in a
middleware which reads the settings, but I'm not sure that's the best
place for something like that.

> I quite like the API. I'm not sure why you'd need to return a custom
> UploadedFile from an upload handler, but props for documenting the interface
> anyway :)

If the upload handlers don't alter the content or represent any
storage change then sure they wouldn't need an additional UploadedFile
class.
However, the instant they pass the file to some remote location
(think: S3UploadHandler) or alter the content (think:
GZipUploadHandler) they will need their own way of defining what is
content and how it should be "saved".

Thanks for the thorough review.

Cheers,
Mike

Jacob Kaplan-Moss

unread,
Apr 4, 2008, 2:46:16 PM4/4/08
to django-d...@googlegroups.com
On Fri, Apr 4, 2008 at 1:04 PM, Mike Axiak <mca...@gmail.com> wrote:
> Composition gets a little tricky. I realized this when I wrote the
> stuff that handles the composition.
> Right now we have (InMemoryUploadHandler, TemporaryFileUploadHandler)
> in a sequence, and now the memory handler doesn't have to worry about
> what happens nor does the TemporaryFileUploadHandler need to worry
> about what's before.
> Basically every upload handler I can think of writing benefits from
> this. For example, the UploadProgressUploadHandler would be wedged
> between:
>
> (InMemoryUploadHandler, UploadProgressUploadHandler,
> TemporaryFileUploadHandler)
>
> So that file progress only gets computed for uploads that go into the
> file system. Avoiding the overhead for small uploads.

OK, I think I understand; thanks. It's hard sometimes figuring out a
thought process from its final result. I think you're probably right
to make this a list, then.

> This [FILE_UPLOAD_HANDLERS] took a while for me to get sold on. I can


> only think of 5 or 6 useful upload handlers, but that's still 1956
> possible orderings. It'd be tough to make a handler easy to install
> when it has to be aware of where it belongs. This functionality could
> be replicated in a middleware which reads the settings, but I'm not
> sure that's the best place for something like that.

Hrm, good point. I'll chew a bit more, but I can't think of a good way
to avoid the extra setting (as much as I dislike creeping settings).

Jacob

Mike Axiak

unread,
Apr 4, 2008, 2:56:44 PM4/4/08
to Django developers
On Apr 4, 2:46 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
wrote:
> Hrm, good point. I'll chew a bit more, but I can't think of a good way
> to avoid the extra setting (as much as I dislike creeping settings).

I didn't want to use the extra setting either, but I finally caved in
after working with it (and discussing with Ivan). I will certainly
explore any other possibilities.

>
> Jacob

Marty Alchin

unread,
Apr 4, 2008, 3:29:18 PM4/4/08
to django-d...@googlegroups.com
On Fri, Apr 4, 2008 at 2:04 PM, Mike Axiak <mca...@gmail.com> wrote:
> However, the instant they pass the file to some remote location
> (think: S3UploadHandler) or alter the content (think:
> GZipUploadHandler) they will need their own way of defining what is
> content and how it should be "saved".

Maybe I'm missing something obvious, but why would there ever be an
S3UploadHandler? Shouldn't that be handled by a file storage backend?

As for GZipUploadHandler, if you're talking about gzipping it as part
of the save process, shouldn't that also be a file storage backend? Of
course, if you're talking about *un*gzipping it during upload, so it
can be processed by Python, I withdraw the question.

I admit I haven't been following this terribly closely, but now that
both #5361 and #2070 are nearing completion, I'm trying to get a good
handle on all of this in case there are any interactions between the
two that I can help with.

-Gul

Jacob Kaplan-Moss

unread,
Apr 4, 2008, 3:43:49 PM4/4/08
to django-d...@googlegroups.com
On Fri, Apr 4, 2008 at 2:29 PM, Marty Alchin <gulo...@gamemusic.org> wrote:
> Maybe I'm missing something obvious, but why would there ever be an
> S3UploadHandler? Shouldn't that be handled by a file storage backend?

Yeah, one thing we'll need to figure out PDQ is what's appropriate for
an upload handler, and what's appropriate for a storage backend.
Hopefully the two of you can work out the breakdown.

FYI, I plan to merge 2070 first (sorry, Marty!) since I think it works
a bit better that way from a dependancy POV.

Jacob

Marty Alchin

unread,
Apr 4, 2008, 3:53:42 PM4/4/08
to django-d...@googlegroups.com
On Fri, Apr 4, 2008 at 3:43 PM, Jacob Kaplan-Moss
<jacob.ka...@gmail.com> wrote:
> Yeah, one thing we'll need to figure out PDQ is what's appropriate for
> an upload handler, and what's appropriate for a storage backend.
> Hopefully the two of you can work out the breakdown.

I'll read over the patch and the docs and see if I can get a better
handle on how it works, so I can be of more use there. Also, Mike and
I put our heads together in IRC sometimes, so we should be able to get
it sorted out soon.

> FYI, I plan to merge 2070 first (sorry, Marty!) since I think it works
> a bit better that way from a dependancy POV.

No worries. I still have another of David's suggestions to integrate
before I can have it "finished" anyway. Plus, there's a 3291-ticket
age difference. The youngest always gets the shaft. :) I can live with
that.

-Gul

Mike Axiak

unread,
Apr 4, 2008, 4:23:25 PM4/4/08
to Django developers
On Apr 4, 3:29 pm, "Marty Alchin" <gulop...@gamemusic.org> wrote:
> I admit I haven't been following this terribly closely, but now that
> both #5361 and #2070 are nearing completion, I'm trying to get a good
> handle on all of this in case there are any interactions between the
> two that I can help with.
Yeah, it's an interesting interaction. We can talk about this more on
IRC, but let me try to outline a few ideas here.
I see both 2070 (one half of it) and 5361 both doing the same thing:
representing files. However, they are representing files in different
stages of life.
One is tied to the database (#5361), and one can potentially be
half-written (#2070). I forsee a time when we actually merge these
into one, but I think should do this incrementally.

Let's look at the S3 uploading process. There are two ways we can
handle the upload and send to S3:

1. Stream the data directly to S3.
2. Stream the data to disk, then send to S3.

I think a lot of people might opt to do option (1). In this case, the
upload handler will need to stream it to the S3 directly, but it needs
to give request.FILES *something* that the user/forms/db layer can
interact with. In this case, I gave the example of an S3UploadedFile.
It has a size, a way to read data from it, and a filename. In light of
#5361, if your backend is S3FileBackend, I would imagine it would
notice the S3UploadedFile and say "Oh, saving this is trivial. Let me
just save attribute ____". But that doesn't mean you can't save a
normal file in the S3FileBackend, does it?

Again, this is a little tricky, and to get right you and I will have
to work out some more of the details. But I do see them as usefully
somewhat separate identities.

Cheers,
Mike

Malcolm Tredinnick

unread,
Apr 4, 2008, 7:26:36 PM4/4/08
to django-d...@googlegroups.com

On Fri, 2008-04-04 at 13:23 -0700, Mike Axiak wrote:
[...]

> Let's look at the S3 uploading process. There are two ways we can
> handle the upload and send to S3:
>
> 1. Stream the data directly to S3.
> 2. Stream the data to disk, then send to S3.
>
> I think a lot of people might opt to do option (1).

Then those people deserve to be beaten heavily about the head and
shoulders. S3 is NOT a reliable upload endpoint. They (Amazon) say
there'll be approximately a 1% failure rate for attempted uploads. As 37
signals have noted in the past (they use it for their whiteboard file
storage), that doesn't mean that when you try again it will work,
either. It means, in practice, that periodically for 1% of the time, all
your uploads will fail. Then, a little later, they'll start to work. But
there could be minutes on end where you cannot upload. So unless your
web application is designed to intentionally lose data (in which case I
can think of a big optimisation on the saving end to lose it even faster
and more reliably), you must save to disk before trying to upload to S3.

In short, I can't see any reason the file upload handler should care
about storage systems like this.

Regards,
Malcolm

--
Experience is something you don't get until just after you need it.
http://www.pointy-stick.com/blog/

Ivan Sagalaev

unread,
Apr 5, 2008, 4:27:29 AM4/5/08
to django-d...@googlegroups.com
Mike Axiak wrote:
> I didn't want to use the extra setting either, but I finally caved in
> after working with it (and discussing with Ivan). I will certainly
> explore any other possibilities.

My original reason for settings was that a single upload handler can't
possibly know the semantics of other upload handlers and can't decide
where to put itself in the list. This should be a decision of a
developer who compiles a project from third-party apps with upload
handlers how to arrange them. It's very similar to middleware: the order
is decided in settings. The only exception would be when a handler
really depends on some other handler which it can require be raising an
Exception somewhere.

Ivan Sagalaev

unread,
Apr 5, 2008, 4:36:01 AM4/5/08
to django-d...@googlegroups.com
Malcolm Tredinnick wrote:
> Then those people deserve to be beaten heavily about the head and
> shoulders. S3 is NOT a reliable upload endpoint. They (Amazon) say
> there'll be approximately a 1% failure rate for attempted uploads. As 37
> signals have noted in the past (they use it for their whiteboard file
> storage), that doesn't mean that when you try again it will work,
> either. It means, in practice, that periodically for 1% of the time, all
> your uploads will fail.

Nobody stops a developer from doing both things in parallel: storing on
disk and streaming to S3. Then when S3 fails a stored file can be
scheduled for repeating uploading.

The reason for not doing it only over-the-disk way is speed. Since 99%
of uploads do succeed they will gain heavily from not doing writes and
reads of the whole file on local disk.

Anyway S3 is just an example. It could be some local media-server
instead. I think the main reason between an upload handler and a file
backend is that the latter is generally simpler to write and the former
is generally more flexible.

Ivan Sagalaev

unread,
Apr 5, 2008, 4:40:44 AM4/5/08
to django-d...@googlegroups.com
Ivan Sagalaev wrote:
> Nobody stops a developer from doing both things in parallel: storing on
> disk and streaming to S3.
<skip>

> they will gain heavily from not doing writes and
> reads of the whole file on local disk.

Looks like I'm contradicting myself a bit here :-). But it's not the
point actually :-)

Reply all
Reply to author
Forward
0 new messages