[BUG] Low performance when writing big chunks


Andrew Grigorev

May 23, 2012, 5:37:49 PM
to Tornado Mailing List
Hello!

I found that Tornado gives terribly low performance on large
responses: it produces about 1 MB/s of output in IOStream.write if you
pass it a 512 MB buffer. This is caused by the tons of heavy memcpy
operations in [1].

It can be reproduced with:

    import tornado.ioloop
    import tornado.web

    class Handler(tornado.web.RequestHandler):
        def get(self):
            self.write("0" * 512 * 1024 * 1024)

    if __name__ == "__main__":
        application = tornado.web.Application([
            (r"/", Handler),
        ])
        application.listen(8888)
        tornado.ioloop.IOLoop.instance().start()

You can see the output stream speed using:

    $ curl http://127.0.0.1:8888/ >/dev/null
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      1  512M    1 5519k    0     0   730k      0  0:11:58  0:00:07  0:11:51  914k

For now I fixed it as shown in [2], using BytesIO:

    $ curl http://127.0.0.1:8888/ >/dev/null
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  512M  100  512M    0     0   277M      0  0:00:01  0:00:01  --:--:--  277M

But it is bad for long-lived IOStreams: the memory would not be freed
until the IOStream is destroyed. As far as I can guess from the commit
messages related to the deque-based code, this is the approach the code
used before it was changed to the deque-based one. I will try to provide
a more accurate solution tomorrow, using an offset variable alongside
the deque elements. I just wonder whether anyone else has spotted/fixed
this bug too, or has a better idea of how to do this?

[1]
https://github.com/facebook/tornado/blob/351c3223f52124dadb135b629818cd5cec8a4065/tornado/iostream.py#L730
[2]
https://github.com/ei-grad/tornado/commit/d89e3e4a2184398f36e5b488711c6cf9655f7c56
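For context, a minimal sketch of the underlying cost (the function names here are mine, not Tornado's): re-slicing the remainder copies the whole tail on every pass, while tracking an offset into a memoryview touches each byte only once.

```python
def drain_by_slicing(buf, chunk_size=128 * 1024):
    # Simulates consuming a write buffer by re-slicing the remainder:
    # each `buf = buf[chunk_size:]` copies the whole tail, O(n^2) total.
    sent = 0
    while buf:
        sent += len(buf[:chunk_size])
        buf = buf[chunk_size:]  # full memcpy of everything left
    return sent


def drain_by_offset(buf, chunk_size=128 * 1024):
    # Same consumption, but a memoryview plus an offset avoids the
    # copies entirely: O(n) bytes touched in total.
    view = memoryview(buf)
    sent = 0
    while sent < len(buf):
        sent += len(view[sent:sent + chunk_size])
    return sent
```

With a 512 MB input and 128 KiB chunks, the slicing version performs about 4000 tail copies averaging 256 MB each, on the order of a terabyte of memory traffic, which is consistent with the throughput collapse above.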

--
Andrew

kzahel

May 24, 2012, 12:36:14 AM
to Tornado Web Server
My guess is that most people are not producing large responses, so the
constant slicing is not a problem. I see no reason why you shouldn't be
able to pass in an option (for responses you know will be large) to let
the response buffer use BytesIO, or perhaps just store some indexes for
where it's currently writing, to prevent the excessive copying.


Ben Darnell

May 24, 2012, 1:06:02 AM
to python-...@googlegroups.com
If you've got such a huge string, you're already sort of in trouble
from memory fragmentation, so ideally whatever is generating this very
large output would split it into reasonably-sized chunks. If that's not
feasible, it might be reasonable for IOStream.write to split up its
input before adding it to _write_buffer. That wouldn't be optimally
efficient, since there would still be some resizes/copies, but it would
be pretty simple and would avoid the worst-case behavior (without
getting into tricky questions of frozen write buffers, etc.).
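A sketch of that splitting approach (the helper name and chunk size are illustrative, not Tornado's actual code): cap the size of each element appended to the write deque, so later prefix handling never has to slice a huge chunk.

```python
from collections import deque

WRITE_CHUNK_SIZE = 128 * 1024  # illustrative cap, not a Tornado constant


def append_split(write_buffer, data, chunk_size=WRITE_CHUNK_SIZE):
    # Append `data` in bounded pieces: one pass of copies up front
    # instead of repeated large copies while draining the buffer.
    for i in range(0, len(data), chunk_size):
        write_buffer.append(data[i:i + chunk_size])


buf = deque()
append_split(buf, b"0" * (512 * 1024))  # lands as four 128 KiB chunks
```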

-Ben

Andrew Grigorev

May 24, 2012, 8:49:30 AM
to python-...@googlegroups.com
Yeah, of course there are some problems in my application caused by
processing such huge pieces of data; at the very least it blocks the
main thread for several tenths of a second on any synchronous operation
over that much data. But the output is actually a large JSON block
containing some statistics used for maintenance, and the problem here
is not that a large amount of data is dumped to JSON, but that Tornado
fails to write it to the socket in an acceptable amount of time
(without splitting the output in my application).

Splitting the buffer in IOStream.write looks reasonable for now and
works well. Should I make a pull request, or will you commit it
yourself?

About the resizes/copies: we have them anyway when using _merge_prefix.
Also, I don't understand: is there any benefit in joining a group of
little chunks into a large one? I would guess it is right to delegate
such work to the kernel; am I wrong? Maybe we could do better by
passing the little chunks to socket.send without modification and using
memoryview on the large chunks, to avoid memcpy operations in Python
entirely? Would that be acceptable, or should I not even try to
implement it? Are there any benchmarks I should use to test it? This
might be an ideal solution.
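A rough sketch of that idea (the function name and threshold are mine; a real IOStream would do this incrementally from its IOLoop write handler rather than looping): small chunks go to socket.send unchanged, and large chunks are walked with a memoryview so the unsent tail is never copied in Python.

```python
def send_chunk(sock, chunk, threshold=64 * 1024):
    # Small chunks: hand to the kernel as-is.
    if len(chunk) <= threshold:
        return sock.send(chunk)
    # Large chunks: slice a memoryview, which shares the underlying
    # memory instead of copying the remaining bytes on each iteration.
    view = memoryview(chunk)
    total = 0
    while total < len(chunk):
        n = sock.send(view[total:])
        if n == 0:
            break
        total += n
    return total
```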

--
Andrew


On 24.05.2012 09:06, Ben Darnell wrote:

Ben Darnell

May 25, 2012, 12:24:32 AM
to python-...@googlegroups.com
On Thu, May 24, 2012 at 5:49 AM, Andrew Grigorev <and...@ei-grad.ru> wrote:
> Yeah, of course there are some problems in my application caused by
> processing such huge pieces of data; at the very least it blocks the
> main thread for several tenths of a second on any synchronous operation
> over that much data. But the output is actually a large JSON block
> containing some statistics used for maintenance, and the problem here
> is not that a large amount of data is dumped to JSON, but that Tornado
> fails to write it to the socket in an acceptable amount of time
> (without splitting the output in my application).
>
> Splitting the buffer in IOStream.write looks reasonable for now and
> works well. Should I make a pull request, or will you commit it
> yourself?

If you'd like to put together a pull request, feel free - you may get
to it before I do.

>
> About the resizes/copies: we have them anyway when using _merge_prefix.
> Also, I don't understand: is there any benefit in joining a group of
> little chunks into a large one? I would guess it is right to delegate
> such work to the kernel; am I wrong? Maybe we could do better by
> passing the little chunks to socket.send without modification and using
> memoryview on the large chunks, to avoid memcpy operations in Python
> entirely? Would that be acceptable, or should I not even try to
> implement it? Are there any benchmarks I should use to test it? This
> might be an ideal solution.

That's a good question. Some buffering is nice if the application is
sending data a few bytes at a time, but since people don't generally do
that, it may not be worth it. Maybe the _merge_prefix call in
_handle_write should only be used to shrink the buffer, and never to
grow it.
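One way to read that (a sketch with my own helper name, not Tornado's code): let the head of the write deque be split when it is too large, but never merge small chunks into a bigger one.

```python
from collections import deque


def first_chunk_at_most(write_buffer, max_bytes):
    # Shrink-only: split an oversized head chunk in place, but never
    # coalesce small chunks (leave that batching to the kernel).
    head = write_buffer[0]
    if len(head) > max_bytes:
        write_buffer.popleft()
        write_buffer.appendleft(head[max_bytes:])
        write_buffer.appendleft(head[:max_bytes])
    return write_buffer[0]
```

The caller would then pass the returned head to socket.send and pop it (or re-split the remainder) based on how many bytes the kernel accepted.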

-Ben