Streaming big files POST-ed


László Nagy

unread,
Nov 13, 2015, 7:56:16 AM11/13/15
to Tornado Web Server

  Hello,

It has been a while since I checked this. About a year ago, around Tornado version 3, there was a thread about this. bdarnell had a pull request that allowed streaming POST requests:

https://github.com/tornadoweb/tornado/pull/1021

This allows the user to use the data_received method, but it won't parse the streamed data in any way. I then wrote some code that parses multipart form data on the fly and posted it to Stack Overflow:

http://stackoverflow.com/questions/25529804/tornado-mime-type-of-the-stream-request-body-output/26025961#26025961

A year has passed, and as far as I know, nothing similar has been implemented in Tornado. Tornado already parses request data into GET and POST parameters, so it would be logical to let it parse multipart form data too.

I wonder if this would be something to add? My particular implementation is probably not the best, but I think there should be a standard way of posting huge files to a tornado server.

Please let me know your thoughts.

Thanks,

   Laszlo

Ben Darnell

unread,
Nov 14, 2015, 12:29:43 PM11/14/15
to Tornado Mailing List
I think a streaming multipart/form-data parser would be useful, but it's not obvious what the right interface is. Writing everything to disk to be read later (as you've done) is one good approach, but I deliberately avoided any direct disk access in the @stream_request_body interface so that applications could e.g. pass an uploaded file directly into s3 without touching local disk (and enforce appropriate flow control). I think any interface built in to Tornado itself should have similar properties to the existing @stream_request_body, even if there is a higher-level alternative to buffer everything on disk. 


László Nagy

unread,
Nov 16, 2015, 6:05:55 AM11/16/15
to Tornado Web Server, b...@bendarnell.com



I think a streaming multipart/form-data parser would be useful, but it's not obvious what the right interface is. Writing everything to disk to be read later (as you've done) is one good approach, but I deliberately avoided any direct disk access in the @stream_request_body interface so that applications could e.g. pass an uploaded file directly into s3 without touching local disk (and enforce appropriate flow control). I think any interface built in to Tornado itself should have similar properties to the existing @stream_request_body, even if there is a higher-level alternative to buffer everything on disk. 


   Hi Ben,

Thank you for your reply. I'm almost ready with a new version of the multipart/form-data streamer. It is modular, and it can be used to feed data from different parts into different file-like objects: Popen instances, httpclient instances, etc. I do not want to release this version yet, because I see a potential security problem. Most web servers will not use streaming for most of their request handlers; there will only be a few handlers that stream data into files or processes. The problem I see is that the max_buffer_size parameter for the HTTPServer instance is global. If I want users to be able to upload 4GB files, then I have to specify max_buffer_size=4*1024**3 - but that also affects all the other request handlers. It means that anyone would be able to POST huge amounts of data to any request handler, and it would all be loaded into memory. It would be almost trivial to mount an attack that consumes all available memory on the server.

There should be a way to set a default max_buffer_size for the server globally, and increase max_buffer_size for individual request handlers.

Is it possible?

Thanks,

   Laszlo

László Nagy

unread,
Nov 16, 2015, 6:31:29 AM11/16/15
to Tornado Web Server, b...@bendarnell.com

There should be a way to set a default max_buffer_size for the server globally, and increase max_buffer_size for individual request handlers.

Is it possible?

I can see that max_buffer_size is passed to TCPServer and max_body_size is passed to HTTP1ConnectionParameters. I guess there is no way to set a different max_buffer_size per request handler, because there is a single TCPServer that accepts the incoming requests. By the time Tornado decides which request handler to use, the HTTP1ConnectionParameters have already been applied, right?

Suggestion: Normally, RequestHandler.prepare is called when the entire request body is available. If the stream_request_body decorator is given, then prepare is called when the request headers are available. In order to limit the maximum body size for request handlers individually, we should have a base method in HTTPRequest that is called when the request headers are available, even if the request handler was not decorated with stream_request_body. In that base method, it should be possible to set a maximum request size. The server should terminate the connection if Content-Length is higher than the maximum allowed, and also when the size of the (partially) loaded request exceeds the given limit.

Right now I do not see an API in Tornado that could be used to safely upload large files and handle "normal" pages with small requests at the same time, because:

  • if we limit max_buffer_size and max_body_size, then large files cannot be posted to the server
  • if we set max_buffer_size and max_body_size to a large value, then huge files can be posted to the server, and it is not possible to prevent clients from posting large amounts of data and forcing the server to load it into memory

Please let me know your thoughts.

Best,

  Laszlo

László Nagy

unread,
Nov 16, 2015, 6:34:25 AM11/16/15
to Tornado Web Server, b...@bendarnell.com


  • if we limit max_buffer_size and max_body_size, then large files cannot be posted to the server
  • if we set max_buffer_size and max_body_size to a large value, then huge files can be posted to the server, and it is not possible to prevent clients from posting large amounts of data and forcing the server to load it into memory
Oh, and because user authentication happens in the request handlers, it is of course possible to POST large amounts of data without authenticating, so any server that increases max_buffer_size has a vulnerability, right?

Ben Darnell

unread,
Nov 16, 2015, 8:02:40 PM11/16/15
to László Nagy, Tornado Web Server
On Mon, Nov 16, 2015 at 6:31 AM, László Nagy <nag...@gmail.com> wrote:

There should be a way to set a default max_buffer_size for the server globally, and increase max_buffer_size for individual request handlers.

Is it possible?

I can see that max_buffer_size is passed to TCPServer and max_body_size is passed to HTTP1ConnectionParameters. I guess there is no way to set a different max_buffer_size per request handler, because there is a single TCPServer that accepts the incoming requests. By the time Tornado decides which request handler to use, the HTTP1ConnectionParameters have already been applied, right?

Suggestion: Normally, RequestHandler.prepare is called when the entire request body is available. If the stream_request_body decorator is given, then prepare is called when the request headers are available. In order to limit the maximum body size for request handlers individually, we should have a base method in HTTPRequest that is called when the request headers are available, even if the request handler was not decorated with stream_request_body. In that base method, it should be possible to set a maximum request size. The server should terminate the connection if Content-Length is higher than the maximum allowed, and also when the size of the (partially) loaded request exceeds the given limit.

We already have an API like this, although it was omitted from the documentation. `self.request.connection.set_max_body_size()` may be called in `prepare()` and works as you describe:


This way you can set max_buffer_size and the default max_body_size to relatively small values (the size of the largest body you will accept without @stream_request_body), then increase max_body_size in prepare() for those handlers that accept larger files (and because it's streaming, you don't need to change max_buffer_size).
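
A minimal sketch of that pattern (the handler name, file paths and size limits below are illustrative, not from this thread):

from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from tornado.web import Application, RequestHandler, stream_request_body

@stream_request_body
class BigUploadHandler(RequestHandler):
    def prepare(self):
        # prepare() runs after the headers are parsed and before the body is
        # streamed; raise the per-request body limit above the small default.
        self.request.connection.set_max_body_size(4 * 1024 ** 3)  # 4 GB
        self.sink = open("/tmp/upload.bin", "wb")

    def data_received(self, chunk):
        # chunk is a bytes object; write it out instead of buffering in memory.
        self.sink.write(chunk)

    def post(self):
        self.sink.close()
        self.write("upload complete\n")

if __name__ == "__main__":
    app = Application([(r"/upload", BigUploadHandler)])
    # Keep the global limits small (10 MB here); only handlers that call
    # set_max_body_size() in prepare() will accept larger bodies.
    server = HTTPServer(app, max_buffer_size=10 * 1024 * 1024,
                        max_body_size=10 * 1024 * 1024)
    server.listen(8888)
    IOLoop.current().start()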

-Ben

László Nagy

unread,
Nov 17, 2015, 7:11:18 AM11/17/15
to Tornado Web Server, nag...@gmail.com, b...@bendarnell.com
 
We already have an API like this, although it was omitted from the documentation. `self.request.connection.set_max_body_size()` may be called in `prepare()` and works as you describe:


This way you can set max_buffer_size and the default max_body_size to relatively small values (the size of the largest body you will accept without @stream_request_body), then increase max_body_size in prepare() for those handlers that accept larger files (and because it's streaming, you don't need to change max_buffer_size).

That is perfect, thank you! Maybe mentioning this in the documentation of prepare() would be useful for others:

http://www.tornadoweb.org/en/stable/web.html#tornado.web.RequestHandler.prepare

Something like:

"When the stream_request_body decorator was used on the RequestHandler, then prepare() is called  after headers are parsed, but before the body is streamed. Inside prepare, you can call self.request.connection.set_max_body_size() to override the maximum allowed body size for an individual request."

I'm a happy camper now!

   Laszlo

Cong Wang

unread,
Feb 9, 2017, 7:49:06 AM2/9/17
to Tornado Web Server
Hi,
I would like to upload a big file without a form. How could I use the stream_request_body decorator?

I would like to transfer some data from the client to the server.
On the client side, I would like to select the file in code, not in a form, so how could I use stream_request_body to do that?
When you run the code, you can use these commands:
> client:
> g++ -shared -fPIC -o libuploadclient.so uploadclient.cpp
> -I/root/Python-3.5.0/Include/ -lpython3.5m -L/usr/local/lib/
>
> g++ main.cpp -o main -I/root/Python-3.5.0/Include/ -lpython3.5m
> ./libuploadclient.so
>
> ./main
>
> server:
> python3 server.py
>
> I am so eager to know the answer,thanks!
>
> best regards,
> cong

On Friday, November 13, 2015 at 8:56:16 PM UTC+8, László Nagy wrote:

Les

unread,
Feb 9, 2017, 8:53:12 AM2/9/17
to python-...@googlegroups.com
On the server side, you can use the tornadostreamform package. ( https://pythonhosted.org/tornadostreamform/ )

On the client side, you need to post the file data in "multipart/form-data" format. A browser can do this for you. It is also possible to stream it with an AsyncHTTPClient instance, but if you cannot read the whole file into memory, then you need to implement on-the-fly conversion from the file's binary data into multipart/form-data. To do that, use AsyncHTTPClient and an HTTPRequest object with a body_producer argument.

Here is an example of body_producer - but it uses chunked encoding instead of multipart/form-data: https://gist.github.com/bdarnell/5bb1bd04a443c4e06ccd

If you can wait until next week, I can create a module that does the hard part of the on-the-fly conversion (e.g. a method that allows you to stream a local file into an async PUT or POST multipart/form-data request). I will do that anyway, but I won't have the time until next week.
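
In the meantime, here is a minimal sketch of the body_producer approach (the URL, PUT method and chunk size are illustrative; a real multipart/form-data upload would also need the boundary framing on top of this):

from tornado.httpclient import AsyncHTTPClient, HTTPRequest

async def upload_file(path, url):
    async def body_producer(write):
        # Read the local file in chunks and push each chunk to the request.
        # Plain file reads block the IOLoop; fine for a sketch, but use a
        # thread or executor for truly non-blocking reads.
        with open(path, "rb") as f:
            while True:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                await write(chunk)

    request = HTTPRequest(url, method="PUT",
                          body_producer=body_producer,
                          request_timeout=0)  # no timeout for big uploads
    return await AsyncHTTPClient().fetch(request)

# Usage (inside a running IOLoop / event loop):
#   response = await upload_file("big.bin", "http://localhost:8888/upload")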




Cong Wang

unread,
Feb 10, 2017, 2:48:41 AM2/10/17
to python-...@googlegroups.com, nag...@gmail.com, Ben Darnell
OK, thanks a lot.

I am looking forward to your on-the-fly conversion module; the sooner the better.

Actually, I have read the code (https://gist.github.com/bdarnell/5bb1bd04a443c4e06ccd) on GitHub, and I am a little confused.
Does it mean that the server and the client are both implemented in the same streaming.py file? Shouldn't they be separated?
class UploadHandler and class ProxyHandler are the request handlers of the server,
but I don't really understand what def client means. I have checked the API documentation of IOLoop.spawn_callback(callback, *args, **kwargs). Does it mean that we use another IOLoop to run the client? Could you explain it for me?

Thanks!
best regards

László Nagy

unread,
Feb 10, 2017, 7:09:46 AM2/10/17
to Tornado Web Server, nag...@gmail.com, b...@bendarnell.com


On Friday, February 10, 2017 at 8:48:41 AM UTC+1, Cong Wang wrote:
OK, thanks a lot.

I am looking forward to your on-the-fly conversion module; the sooner the better.

Actually, I have read the code (https://gist.github.com/bdarnell/5bb1bd04a443c4e06ccd) on GitHub, and I am a little confused.
Does it mean that the server and the client are both implemented in the same streaming.py file?
Actually, that code contains a client, a server that acts as a proxy, and a final server. The client sends data to the proxy, and the proxy forwards it to the server. It demonstrates the usage of body_producer, but it uses chunked encoding, which is not very useful for most servers. Also, the client produces some garbage data; in the real world, you would read data from a file or from another process. (And at that point, you face the problem that you cannot do a blocking file read without blocking the main IOLoop thread, unless you create a separate thread for reading...)
Shouldn't they be separated?
This is just an example that demonstrates the streaming capabilities. It was put into a single file because it is a minimal example, designed to be compact and easy to understand. Of course, for a real problem you won't run the client, the proxy and the server on the same computer; there would be no point. But for an example, this is fine.
 
class UploadHandler and class ProxyHandler are the request handlers of the server,
but I don't really understand what def client means. I have checked the API documentation of IOLoop.spawn_callback(callback, *args, **kwargs). Does it mean that we use another IOLoop to run the client? Could you explain it for me?

It schedules the given callback (in this case, client()) to be executed.

László Nagy

unread,
Feb 10, 2017, 1:55:22 PM2/10/17
to Tornado Web Server
It is scheduled on the "current" IOLoop. In the streaming.py example there is a single IOLoop; in most async applications there is only one loop. spawn_callback does not create a new loop.
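
For illustration (client here stands for any coroutine, like the one in the gist):

from tornado.ioloop import IOLoop

async def client():
    # ... issue the streaming upload request here ...
    pass

# Schedule client() on the already-running loop; no new IOLoop is created.
IOLoop.current().spawn_callback(client)
IOLoop.current().start()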

Cong Wang

unread,
Feb 11, 2017, 2:36:19 AM2/11/17
to python-...@googlegroups.com, László Nagy
Hi,
I have tried to run this example, and I am a little confused by some statements:
1. logging.info('ProxyHandler.data_received(%d bytes: %r)',
                len(chunk), chunk[:9])
   What does chunk[:9] mean? Is it a regular expression?
2. Also, I am not sure what chunk = ('chunk %02d ' % i) * 10000 means.
3. So when you finish the module that transforms a local file into "multipart/form-data" format in the request, and I replace the chunk part in the example with your module, that would be the best way to transfer big files in my system, right?

best regards,
cong

2017-02-11 2:55 GMT+08:00 László Nagy <nag...@gmail.com>:
It is scheduled on the "current" IOLoop. In the streaming.py example there is a single IOLoop; in most async applications there is only one loop. spawn_callback does not create a new loop.

László Nagy

unread,
Feb 14, 2017, 1:56:27 AM2/14/17
to Tornado Web Server, nag...@gmail.com


On Saturday, February 11, 2017 at 8:36:19 AM UTC+1, Cong Wang wrote:
Hi,
I have tried to run this example, and I am a little confused by some statements:
1. logging.info('ProxyHandler.data_received(%d bytes: %r)',
                len(chunk), chunk[:9])
   What does chunk[:9] mean? Is it a regular expression?

That is a string slice; it is basic Python code. Before you learn Tornado, you must learn Python. Slices are explained at the beginning of any Python tutorial.

2. Also, I am not sure what chunk = ('chunk %02d ' % i) * 10000 means.

It creates a very long string; strings can be multiplied by integers. But again, that is basic Python. You need to learn the language before you learn Tornado.
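
For illustration, this is what the two expressions do in plain Python (the value of i is arbitrary here):

chunk = ('chunk %02d ' % 3) * 10000  # 'chunk 03 chunk 03 ...' repeated 10000 times
print(len(chunk))                    # 90000
print(chunk[:9])                     # a slice: the first nine characters, 'chunk 03 '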
 

Cong Wang

unread,
Feb 14, 2017, 2:16:27 AM2/14/17
to python-...@googlegroups.com, László Nagy
OK, thanks.
So when will you finish the file transform module?
I am looking forward to using it.
Thanks a lot.

best regards,
cong

Cong Wang

unread,
Feb 21, 2017, 1:41:55 AM2/21/17
to python-...@googlegroups.com, László Nagy
Hi,
How is the transfer module coming along? I am eager to use it, thanks!

Best regards,
cong

2017-02-09 21:53 GMT+08:00 Les <nag...@gmail.com>:
Message has been deleted

PLSD plogic

unread,
Jan 23, 2020, 1:52:18 AM1/23/20
to Tornado Web Server

from tornado.web import RequestHandler, stream_request_body

@stream_request_body
class StreamHandler(RequestHandler):
    def prepare(self):
        # Allow request bodies up to 4 GB for this handler only.
        max_buffer_size = 4 * 1024**3  # 4GB
        self.request.connection.set_max_body_size(max_buffer_size)
        # Open in binary mode, because data_received gets bytes chunks.
        self.temp_file = open("test.txt", "wb")

    def data_received(self, chunk):
        self.temp_file.write(chunk)

    def post(self):
        self.temp_file.close()

With the above code I am able to upload the file, but in raw form, as shown below:

-----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="dest"

csv -----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="carrier"

bandwidth -----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="file1"; filename="test.csv" Content-Type: text/csv

And the content of uploaded file follows here.

How do I get the request parameters parsed and separate out the data of the uploaded file? Is there any other way to upload large files (around 2 GB) in Tornado?



Phyo Arkar

unread,
Jan 23, 2020, 6:11:23 AM1/23/20
to Tornado Mailing List
Necromancer spotted.


Ben Darnell

unread,
Jan 23, 2020, 1:18:31 PM1/23/20
to Tornado Mailing List
On Thu, Jan 23, 2020 at 1:52 AM PLSD plogic <pls...@gmail.com> wrote:

from tornado.web import RequestHandler, stream_request_body

@stream_request_body
class StreamHandler(RequestHandler):
    def prepare(self):
        # Allow request bodies up to 4 GB for this handler only.
        max_buffer_size = 4 * 1024**3  # 4GB
        self.request.connection.set_max_body_size(max_buffer_size)
        # Open in binary mode, because data_received gets bytes chunks.
        self.temp_file = open("test.txt", "wb")

    def data_received(self, chunk):
        self.temp_file.write(chunk)

    def post(self):
        self.temp_file.close()

With the above code I am able to upload the file, but in raw form, as shown below:

-----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="dest"

csv -----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="carrier"

bandwidth -----------------------------6552719992117258671800152707 Content-Disposition: form-data; name="file1"; filename="test.csv" Content-Type: text/csv

And the content of uploaded file follows here.


This is the multipart protocol used by HTML forms. Tornado can currently only parse this if it sees all the data at once, not in a streaming upload. There's a third-party library that should be able to handle this: https://github.com/siddhantgoel/streaming-form-data. See this issue for more: https://github.com/tornadoweb/tornado/issues/1842
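
A rough sketch of wiring that library into a streaming handler (field names taken from the form dump above; check the library's documentation for the exact API):

from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget, ValueTarget
from tornado.web import RequestHandler, stream_request_body

@stream_request_body
class FormUploadHandler(RequestHandler):
    def prepare(self):
        self.request.connection.set_max_body_size(4 * 1024 ** 3)
        # The parser reads the multipart boundary from the request headers.
        self.parser = StreamingFormDataParser(headers=self.request.headers)
        self.dest = ValueTarget()
        self.carrier = ValueTarget()
        self.parser.register("dest", self.dest)
        self.parser.register("carrier", self.carrier)
        self.parser.register("file1", FileTarget("/tmp/uploaded.csv"))

    def data_received(self, chunk):
        # Feed each raw chunk to the parser; it splits the parts on the fly,
        # so the file contents never have to sit in memory all at once.
        self.parser.data_received(chunk)

    def post(self):
        self.write("dest=%s carrier=%s; file written to /tmp/uploaded.csv\n"
                   % (self.dest.value.decode(), self.carrier.value.decode()))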
 

How do I get the request parameters parsed and separate out the data of the uploaded file? Is there any other way to upload large files (around 2 GB) in Tornado?


If you control the client, it may be simpler to use a plain HTTP PUT instead of the HTML multipart form protocol. This doesn't require any special handling on the server side.

-Ben

László Nagy

unread,
Jan 24, 2020, 8:33:25 AM1/24/20
to Tornado Web Server


On Thursday, January 23, 2020 at 7:18:31 PM UTC+1, Ben Darnell wrote:
This is the multipart protocol used by HTML forms. Tornado can currently only parse this if it sees all the data at once, not in a streaming upload. There's a third-party library that should be able to handle this: https://github.com/siddhantgoel/streaming-form-data. See this issue for more: https://github.com/tornadoweb/tornado/issues/1842

Hmm, interesting, I did not know about this lib. I can see some problems: it cannot handle multiple parts with the same name, and you must know all field names in advance (e.g. you cannot just accept a POST and examine its contents later). Otherwise, it seems easier to use than tornadostreamform, and also more Pythonic.

