Google Groups Home
Help | Sign in
Streaming Uploads Discussion
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 36 - Collapse all   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
Mike Axiak  
View profile
(1 user)  More options Mar 19, 12:30 am
From: Mike Axiak <mcax...@gmail.com>
Date: Tue, 18 Mar 2008 21:30:55 -0700 (PDT)
Local: Wed, Mar 19 2008 12:30 am
Subject: Streaming Uploads Discussion
Hello,

It's become clear to me that the API for 2070 [1] (which is currently
a mess :) has never really been discussed by the developers, and I've
been thinking about ways of organizing it such that one can add
features to file uploading without modifying the core.
Here's an overview of what I had in mind.

(Any comments/suggestions/angry cries are greatly appreciated.)

Extremely Visible Changes
================

request.set_upload_handler(<upload_handler>)
--------------------------------------------------------------
This new function will allow one to register a new FileUploadHandler
object to deal with the incoming data.
The API for the FileUploadHandler (and the default child
TemporaryFileUploadHandler) is discussed below.

request.FILES
-----------------
This is no longer a MultiValueDict of raw content, but a
MultiValueDict of UploadedFile objects.
This will probably hurt the most, as there are probably applications
assuming that they can take the content from this dict.
The API of the UploadedFile (and the default child
TemporaryUploadedFile) is discussed below.

APIs for New Objects
=============

FileUploadHandler
-----------------------
Objects of this type must define the following methods:
   1. new_file(filename, content_length) -- A signal that a new file
       has been reached by the parser.
   2. receive_data_chunk(raw_data, start, stop) -- Some data has been
       received by the parser. A non-nonzero return value indicates
that
       the file upload should cease.
   3. file_complete(file_size) -- A signal that the current file has
       completed (no more than one file will ever be dealt with).
       The expected return value is an object of UploadedFile.
   4. upload_complete() -- A signal that all uploads have finished
successfully.
   5. get_chunk_size() -- Returns a number that represents the maximum
number
       of bytes read into memory for each chunk.

By adding a set_upload_handler() method to request, anyone can
override the default upload handler. However, this must be done before
the POST was accessed, and it is probably recommended we raise an
error if someone tries to set a new upload handler after the FILES
MultiValueDict is populated.

The default handler (TemporaryFileUploadHandler) will stream the data
to a temporary file for each part of the upload, and return a
corresponding TemporaryUploadedFile object that represents that data.
In addition, the TemporaryFileUploadHandler can know to stream it to
memory if the content_size is small enough (and start streaming it
into disk if the header lied for some grace period).

UploadedFile
----------------
Objects of this class represent files that were just uploaded and
handled by some handler above. I'm assuming that this will be the most
difficult class to nail, since it has to balance a few different
aspects:
   1. Encapsulating what it means to be a uploaded file.
   2. Making it easy for a developer to send it to a FileBackend and
       have it Do The Right Thing.
   3. Making it easy for a developer to get the content and do
       something with it.

Methods of the UploadedFile:
   1. open() -- Open the file for reading.
   2. read([num_bytes]) -- Read a number of bytes from the file.
   3. close() -- Round out the standard read file operations.
   4. chunk([chunk_size]) -- Generates chunks from the file (if the
       content is in memory already, it should ignore chunk_size).
   5. filename() -- The filename from the content-disposition header.
   6. multiple_chunks([chunk_size]) -- Whether or not you can expect
       multiple chunks from chunk() (Boolean). Useful if you want to
do
       something particularly when you know you can put it all into
       memory at once.
   7. file_size() -- The size in bytes of the uploaded file.

Added method of a TemporaryUploadedFile:
   8. temporary_file_path() -- The whole path (directory and basename)
       of the temporary file on disk. A FileBackend may decide to move
       a file (using the OS move operations) if it has a non-empty
       temporary_file_path.

Notes: After talking with Marty online, it seems that this will work
with the FileBackends, provided that they know how to deal with it.
For instance, if the upload is going to a temporary file, and the
FileBackend is a standard file, then the FileBackend for standard
files should know from temporary_file_path() that it can move, and
perform an OS move operation. Otherwise, the FileBackends should use
the chunk() iterator and do whatever it needs in chunks.

It's interesting to note that with this framework a lot of interesting
possibilities open up. I will not write any of the code to do anything
but the basic temporary disk storage, but here are a few interesting
examples of what can happen:
   - Gzipping data on the fly [GZipFileUploadHandler +
GZipFileBackend].
   - Saving file to another FTP Server on the fly
                                            [FTPFileUploadHandler +
NoOpFileBackend].
   - Having Cool Ajax-y file uploads [AjaxProgressUploadHandler + Any
Backend].
   - Having user-based quotas [QuotaUploadHandler + Any Backend].
   ...

Anyway, the list immediately above isn't what's important right now,
but I'd like to get the API
close enough so we can break things once.

Cheers,
Michael Axiak


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Axiak  
View profile
 More options Mar 19, 1:34 am
From: Mike Axiak <mcax...@gmail.com>
Date: Tue, 18 Mar 2008 22:34:22 -0700 (PDT)
Local: Wed, Mar 19 2008 1:34 am
Subject: Re: Streaming Uploads Discussion
One slight modification so far...

On Mar 19, 12:30 am, Mike Axiak <mcax...@gmail.com> wrote:

> Hello,

> [...]

> FileUploadHandler
> -----------------------
> Objects of this type must define the following methods:
>    1. new_file(filename, content_length) -- A signal that a new file
>        has been reached by the parser.
>    2. receive_data_chunk(raw_data, start, stop) -- Some data has been
>        received by the parser. A non-nonzero return value indicates
> that
>        the file upload should cease.

Scratch that return value. You will have to raise StopUpload and
SkipFile to stop the entire upload and skip the file respectively.

> Cheers,
> Michael Axiak

-Mike

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Casper Jensen  
View profile
 More options Mar 19, 7:06 am
From: "Casper Jensen" <t...@sema.dk>
Date: Wed, 19 Mar 2008 12:06:04 +0100
Local: Wed, Mar 19 2008 7:06 am
Subject: Re: Streaming Uploads Discussion
The filename returned by UploadedFile, will that be an unicode object
or a string? Currently request.FILES uses string objects for the
filenames, which results in some problems with uploaded files with
non-latin filenames. See #6009.

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ivan Sagalaev  
View profile
 More options Mar 19, 7:31 am
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Wed, 19 Mar 2008 14:31:45 +0300
Local: Wed, Mar 19 2008 7:31 am
Subject: Re: Streaming Uploads Discussion

Mike Axiak wrote:
> Extremely Visible Changes
> ================

> request.set_upload_handler(<upload_handler>)
> --------------------------------------------------------------
> This new function will allow one to register a new FileUploadHandler
> object to deal with the incoming data.
> The API for the FileUploadHandler (and the default child
> TemporaryFileUploadHandler) is discussed below.

I remember long time ago I had a bit different idea.

There are actually two aspects of what happens to an uploaded file:

- its data should be stored somewhere
- a user might want to do some additional processing (count bytes, unzip
it on the fly, resend it to a remote machine)

And I thought that for the first part -- storing it somewhere -- there
shouldn't be actually any handlers, Django should just store it in temp
files indefinitely. However this your proposal may be better. Because,
given my example of unzipping files on the fly, user might not want
event store original file as it is. What do you think about it?

> request.FILES
> -----------------
> This is no longer a MultiValueDict of raw content, but a
> MultiValueDict of UploadedFile objects.
> This will probably hurt the most, as there are probably applications
> assuming that they can take the content from this dict.

I believe this can be made backwards compatible. In my patch[1] to
ticket 1484 (which was duplicated by 2070 long ago) I had a FileDict
class that was lazily fetching file content upon accessing
request.FILES['some_file']['content']. Have a look at it.

>    2. receive_data_chunk(raw_data, start, stop) -- Some data has been
>        received by the parser.

Am I right thinking that raw_data is not a raw data from socket but is
already decoded from whatever content-transfer-encoding it might be in
(i.e. base64)?

>    5. get_chunk_size()

Why not just an attribute chunk_size? It's shorter and it's almost
always will be just a plain class attribute.

> By adding a set_upload_handler() method to request, anyone can
> override the default upload handler. However, this must be done before
> the POST was accessed, and it is probably recommended we raise an
> error if someone tries to set a new upload handler after the FILES
> MultiValueDict is populated.

Instead of having users figure out where to stick a call to
set_upload_handler we could steal the design from middleware and just
have UPLOAD_HANDLERS setting... It might not be such a good idea if
people will often want different logic per view. However I think a
single global setting is needed for the most common use case: store
everything in temp files.

> It's interesting to note that with this framework a lot of interesting
> possibilities open up. I will not write any of the code to do anything
> but the basic temporary disk storage, but here are a few interesting
> examples of what can happen:
>    - Gzipping data on the fly [GZipFileUploadHandler +
> GZipFileBackend].
>    - Saving file to another FTP Server on the fly
>                                             [FTPFileUploadHandler +
> NoOpFileBackend].
>    - Having Cool Ajax-y file uploads [AjaxProgressUploadHandler + Any
> Backend].
>    - Having user-based quotas [QuotaUploadHandler + Any Backend].

Heh :-). I was inventing my use cases before I get to this point :-).

[1]: http://code.djangoproject.com/attachment/ticket/1484/1484.m-r.6.diff


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jacob Kaplan-Moss  
View profile
 More options Mar 19, 8:47 am
From: "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
Date: Wed, 19 Mar 2008 07:47:02 -0500
Local: Wed, Mar 19 2008 8:47 am
Subject: Re: Streaming Uploads Discussion
Hi Mike --

*Very* well done -- I'm in agreement with nearly every aspect of your
proposal. Major props for taking on such a sticky issue -- walking
into #2070 is a bit like exploring an overgrown jungle :) A few
comments inline below, but in general I quite like your API and would
like to see your code.

On Tue, Mar 18, 2008 at 11:30 PM, Mike Axiak <mcax...@gmail.com> wrote:
>  request.set_upload_handler(<upload_handler>)

I especially like this -- it neatly encapsulates the fact that
different folks are going to want quite different file upload
behavior. A few things to think about:

* Do you think it's worth firing a signal
(file_upload_started/file_upload_finished) from request or from the
base upload handler? This would let decentralized processing of file
uploads, but could get confusing.
* Do you think we should allow multiple upload handlers (which makes
this call into something like request.add_upload_handler(handler))?
The other option would be to just make folks compose upload handlers
with a wrapper class, which is hardly a hardship.

I don't have answers to either of these questions; something to think
about, at least.

>  request.FILES
>  -----------------
>  This is no longer a MultiValueDict of raw content, but a
>  MultiValueDict of UploadedFile objects.
>  This will probably hurt the most, as there are probably applications
>  assuming that they can take the content from this dict.

It seems to me that you could pretty easily provide a
backwards-compatible API by defining __getitem__ on UploadedFile;
raise a DeprecationWarning there but provide the data in the "old"
style.

>    5. get_chunk_size() -

Just make this FileUploadHandler.chunk_size -- no reason for getters/setters.

Again, thanks -- this is good stuff.

Jacob


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ivan Sagalaev  
View profile
 More options Mar 19, 9:19 am
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Wed, 19 Mar 2008 16:19:56 +0300
Local: Wed, Mar 19 2008 9:19 am
Subject: Re: Streaming Uploads Discussion

Jacob Kaplan-Moss wrote:
>>  request.set_upload_handler(<upload_handler>)

> I especially like this -- it neatly encapsulates the fact that
> different folks are going to want quite different file upload
> behavior.

But what about default behaviour? There should be some place to say "all
file uploads should go on disk".

P.S. In my other response we seem to agree on __getitem__ for
request.FILES and an attribute .chunk_size almost exactly :-).


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Axiak  
View profile
 More options Mar 19, 12:08 pm
From: Mike Axiak <mcax...@gmail.com>
Date: Wed, 19 Mar 2008 09:08:59 -0700 (PDT)
Local: Wed, Mar 19 2008 12:08 pm
Subject: Re: Streaming Uploads Discussion
Good question. The issue of encoding is a little murky. As of now, my
(almost working) patch will pretty much try to encode everything
(using force_to_unicode) *except* data in a file upload.
I'm not exactly sure we'll ever support encoding data in file uploads,
but if we do it's definitely a tricky problem.
The issue is two-fold:
   - What files should we encode and which shouldn't we? (do we do
matching on the mimetype? that doesn't sound nice)
   - Dealing with multibyte boundaries in the strings while streaming
the data. (Not a question, just a fun night of work.)

Anyway, I'll go back to ignoring that issue for now unless I get angry
cries for it.

-Mike

On Mar 19, 7:06 am, "Casper Jensen" <t...@sema.dk> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Axiak  
View profile
 More options Mar 19, 12:18 pm
From: Mike Axiak <mcax...@gmail.com>
Date: Wed, 19 Mar 2008 09:18:19 -0700 (PDT)
Local: Wed, Mar 19 2008 12:18 pm
Subject: Re: Streaming Uploads Discussion

On Mar 19, 7:31 am, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:

> [...]
> Am I right thinking that raw_data is not a raw data from socket but is
> already decoded from whatever content-transfer-encoding it might be in
> (i.e. base64)?

Yes. The idea is to abstract away much of the HTTP work and just get
the raw data after it's been pulled from whatever casing it's in.

> Instead of having users figure out where to stick a call to
> set_upload_handler we could steal the design from middleware and just
> have UPLOAD_HANDLERS setting... It might not be such a good idea if
> people will often want different logic per view. However I think a
> single global setting is needed for the most common use case: store
> everything in temp files.

The idea is that you can set this anywhere, so long as you haven't
accessed POST and FILES yet. So a view will be a place you can do
this.
The TemporaryFileUploadHandler will be default, and is what will
happen "out-of-the-box".
(Well...there's the idea of using the MemoryFileUploadHandler if the
Content-Length exists and is small...but that's beside the point. :)

-Mike


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Jacob Kaplan-Moss  
View profile
(1 user)  More options Mar 19, 12:40 pm
From: "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
Date: Wed, 19 Mar 2008 11:40:45 -0500
Local: Wed, Mar 19 2008 12:40 pm
Subject: Re: Streaming Uploads Discussion
On Wed, Mar 19, 2008 at 8:19 AM, Ivan Sagalaev

<man...@softwaremaniacs.org> wrote:
>  But what about default behaviour? There should be some place to say "all
>  file uploads should go on disk".

I think the default behavior doesn't need to be a setting:
upload-to-disk for all files over N bytes (1M, maybe?), and
upload-to-RAM for anything smaller.

Jacob


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ivan Sagalaev  
View profile
 More options Mar 19, 12:41 pm
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Wed, 19 Mar 2008 19:41:41 +0300
Local: Wed, Mar 19 2008 12:41 pm
Subject: Re: Streaming Uploads Discussion

Mike Axiak wrote:
> The idea is that you can set this anywhere, so long as you haven't
> accessed POST and FILES yet. So a view will be a place you can do
> this.
> The TemporaryFileUploadHandler will be default, and is what will
> happen "out-of-the-box".

My concern is for this "out-of-the-box" default not to be hard-coded.
There should be some way for a user to set its own global (i.e. non
per-view) handler. I understand that everyone can write a one-line
middleware that sets the handler but while we're at designing stage I
thought we could make it a setting instead of another FAQ :-)

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Ivan Sagalaev  
View profile
 More options Mar 19, 12:47 pm
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Wed, 19 Mar 2008 19:47:24 +0300
Local: Wed, Mar 19 2008 12:47 pm
Subject: Re: Streaming Uploads Discussion

Mike Axiak wrote:
> Good question. The issue of encoding is a little murky. As of now, my
> (almost working) patch will pretty much try to encode everything
> (using force_to_unicode) *except* data in a file upload.

This is the right way. Uploaded file is almost never will be treated as
text and unicode doesn't make sense for binary data. It's also
especially useless for the default behaviour of storing uploaded data to
disk: you'll have to encode just decoded data back to a stream of bytes
to store it in a file.

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.