Memory leak in 3.0.1?


Yuexuan Chen

May 16, 2013, 3:19:40 AM
to python-tornado
import time
import tornado.gen
import tornado.ioloop

def do_something(callback):
  # Schedule the callback to run ~1ms from now on the IOLoop.
  tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 0.001,
      callback)

@tornado.gen.engine
def run_test():
  yield tornado.gen.Task(do_something)
  run_test()  # "tail-recursive" re-invocation once the task completes

for i in range(100):
  run_test()
tornado.ioloop.IOLoop.instance().start()


The above code runs fine under Tornado 2.4.1, but it seems to have a memory leak under Tornado 3.0.1: memory consumption keeps getting larger and larger.
(By the way, if I remember correctly, a similar issue existed in Tornado 2.2?)

--
Best Regards,
yuexuan

Ben Darnell

May 17, 2013, 12:06:25 AM
to Tornado Mailing List
Hmm, you're right.  In fact, I think we had one leak introduced in 3.0 and another introduced since then.  I tried to add a unittest for the leak fixed back in the 2.x series, but the test was too narrow and didn't catch this new issue.  I wonder if there's a more principled way to detect these kinds of growing reference chains.

-Ben



Ben Darnell

May 17, 2013, 12:42:13 AM
to Tornado Mailing List, Serge Koval
On Fri, May 17, 2013 at 12:06 AM, Ben Darnell <b...@bendarnell.com> wrote:
Hmm, you're right.  In fact, I think we had one leak introduced in 3.0 and another introduced since then.  I tried to add a unittest for the leak fixed back in the 2.x series, but the test was too narrow and didn't catch this new issue.  I wonder if there's a more principled way to detect these kinds of growing reference chains.

I've committed a fix in branch3.0 for the leak in 3.0.1.  That still leaves a separate leak in the master branch.  Serge, I think this has to do with the deactivation callbacks you removed in your StackContext refactoring.  I thought we could get by without them, but maybe not.  

-Ben

Serge S. Koval

May 17, 2013, 2:30:20 AM
to Ben Darnell, Tornado Mailing List
Question: 3.0.1 does not include the new StackContext implementation, only master does?

I will take a look and see if I can reproduce the issue.

Serge S. Koval

May 17, 2013, 2:47:45 AM
to Ben Darnell, Tornado Mailing List
Yes, I see the issue.

While I can argue that it is, in fact, endless recursion and that's why the stack grows - it won't really help :-)

The stack deactivation approach is quite ugly - it adds a performance penalty for every asynchronous call and breaks the stack paradigm. I'll try to think of another way to do this. If there's none, I'll bring stack deactivation back.

Serge.

Antoine Pitrou

May 17, 2013, 4:54:15 AM
to python-...@googlegroups.com
On Fri, 17 May 2013 00:06:25 -0400,
Ben Darnell <b...@bendarnell.com> wrote:
> Hmm, you're right. In fact, I think we had one leak introduced in
> 3.0 and another introduced since then. I tried to add a unittest for
> the leak fixed back in the 2.x series, but the test was too narrow
> and didn't catch this new issue. I wonder if there's a more
> principled way to detect these kinds of growing reference chains.

You can do what we do in CPython (regrtest -R): use a debug build, run
your tests N times in a loop (with N somewhere around 5), and watch
sys.gettotalrefcount() after each iteration.

You have to be careful to clean up Python's internal caches after every
iteration, which is why I'd suggest re-using CPython's code (the
function is named dash_R, I think).
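
For reference, a minimal sketch of that loop (the helper name here is made up for illustration, and the cache cleanup is much cruder than what regrtest's dash_R actually does):

import gc
import sys

# Requires a --with-pydebug build of CPython: sys.gettotalrefcount()
# only exists in debug builds.
def check_refleaks(test_func, warmups=2, repeats=5):
    counts = []
    for _ in range(warmups + repeats):
        test_func()
        gc.collect()  # drop pending garbage before sampling the counter
        counts.append(sys.gettotalrefcount())
    # Ignore the warm-up runs; steady growth afterwards suggests a leak.
    deltas = [counts[i + 1] - counts[i] for i in range(warmups, len(counts) - 1)]
    if any(d > 0 for d in deltas):
        print("possible refcount leak, per-run deltas: %s" % deltas)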

Regards

Antoine.


Ben Darnell

May 17, 2013, 11:07:16 AM
to Serge S. Koval, Tornado Mailing List
On Fri, May 17, 2013 at 2:47 AM, Serge S. Koval <serge...@gmail.com> wrote:
Yes, I see the issue.

While I can argue that it is, in fact, endless recursion and that's why the stack grows - it won't really help :-)

Aha, you're right!  The deactivation callback was, in effect, providing a kind of tail-call optimization.  If I rewrite Yuexuan's example to use a loop instead of tail recursion, the leak goes away.
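
Something along these lines (a sketch of the loop version, not necessarily the exact rewrite Ben used):

import time
import tornado.gen
import tornado.ioloop

def do_something(callback):
    tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 0.001, callback)

@tornado.gen.engine
def run_test():
    # Loop inside a single generator instead of "recursing": each iteration
    # resumes the same frame, so no new StackContext chain is built up.
    while True:
        yield tornado.gen.Task(do_something)

for i in range(100):
    run_test()
tornado.ioloop.IOLoop.instance().start()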

 

The stack deactivation approach is quite ugly - it adds a performance penalty for every asynchronous call and breaks the stack paradigm. I'll try to think of another way to do this. If there's none, I'll bring stack deactivation back.

Asynchronous functions already break the stack paradigm:  even though the second call to run_test starts inside the first, it finishes after the first call to run_test has already left the stack since there was no yield.  I'm not sure that allowing this style of fire-and-forget call is worth the ugliness of stack context deactivation callbacks, though.  I would generally recommend rewriting code like this to use a loop, or if you do want to fire-and-forget without waiting for the result, use an explicit NullContext to break the exception-handling chain.  
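
The fire-and-forget variant might look something like this (a sketch only, reusing the do_something helper from the original example):

import tornado.gen
import tornado.ioloop
from tornado.stack_context import NullContext

@tornado.gen.engine
def run_test():
    yield tornado.gen.Task(do_something)
    # Schedule the next "recursive" call from inside a NullContext so it does
    # not inherit (and keep alive) the current StackContext chain.
    with NullContext():
        tornado.ioloop.IOLoop.instance().add_callback(run_test)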

-Ben

Serge S. Koval

May 17, 2013, 12:19:43 PM
to Ben Darnell, Tornado Mailing List
On Fri, May 17, 2013 at 6:07 PM, Ben Darnell <b...@bendarnell.com> wrote:
The stack deactivation approach is quite ugly - it adds a performance penalty for every asynchronous call and breaks the stack paradigm. I'll try to think of another way to do this. If there's none, I'll bring stack deactivation back.
Asynchronous functions already break the stack paradigm:  even though the second call to run_test starts inside the first, it finishes after the first call to run_test has already left the stack since there was no yield.  I'm not sure that allowing this style of fire-and-forget call is worth the ugliness of stack context deactivation callbacks, though.  I would generally recommend rewriting code like this to use a loop, or if you do want to fire-and-forget without waiting for the result, use an explicit NullContext to break the exception-handling chain.  
There's a chance that existing code will break - if a wrapped method uses add_callback/add_timeout to reschedule its own execution, the stack will continue to grow. 

It is hard to explain the magic behind stack contexts (especially when they're implicit in gen.coroutine), why recursive calls should be wrapped with NullContext, etc.

So, maybe, it is better to play it safe?

I will try to add the stack deactivation callback back over the weekend - there's a way to make it slightly faster than it was before, but it will still require traversing the whole chain. On the plus side, there's no need to call anything, and the cleanup function can be moved to a C speedup module later.

Serge.

Ben Darnell

May 18, 2013, 3:07:20 PM
to Serge S. Koval, Tornado Mailing List
I'd also consider a solution specific to gen.engine/coroutine, since I believe those are the only places the deactivation callback was ever used.  Stack contexts are kind of a poor fit for the coroutine error handling case, and before you made nested stack contexts less expensive I was planning to look into using them less.  When you yield a Future you don't need a stack context at all because the Future encapsulates its error status.  Tasks could start their own stack context instead of relying on a context from the surrounding coroutine.  The one thing I'm not sure of is the Callback/Wait pattern - people could be relying on a stack context being present when they call the thing that they pass the Callback to.  We could perhaps have the gen.Runner decide on a segment-by-segment basis whether a stack context is needed (based on whether there are outstanding Callbacks), but that seems complicated.  Maybe as Futures become more widespread we can phase out Callback/Wait.

-Ben
 


Serge S. Koval

May 20, 2013, 7:55:33 AM
to Ben Darnell, Tornado Mailing List
Submitted pull request.

Still, I'm not convinced it is necessary - this situation can only happen in recursive calls, and Python developers know that Python does not do any tail-call optimization.

Anyway, performance-wise it is not that bad. Two tests with 50 nested StackContexts and ExceptionStackContexts:

Current 3.0:

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -n 10000 -s "import bench; import old_stack as stack_context" "bench.stacked(50, stack_context)"
10000 loops, best of 3: 359 usec per loop

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -n 10000 -s "import bench; import old_stack as stack_context" "bench.stacked_exc(50, stack_context)"
10000 loops, best of 3: 368 usec per loop


Current master (without stack deactivation):

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -s "import bench; import new_stack as stack_context" "bench.stacked(50, stack_context)"
10000 loops, best of 3: 128 usec per loop

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -n 10000 -s "import bench; import new_stack as stack_context" "bench.stacked_exc(50, stack_context)"
10000 loops, best of 3: 131 usec per loop


New with stack deactivation:

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -s "import bench; from tornado import stack_context" "bench.stacked(50, stack_context)"
10000 loops, best of 3: 148 usec per loop

(tornado)serge@Cheetah ~/Work/tornado/prj> python -m timeit -n 10000 -s "import bench; import tornado.stack_context as stack_context" "bench.stacked_exc(50, stack_context)"
10000 loops, best of 3: 151 usec per loop

Ben Darnell

May 20, 2013, 11:57:37 PM
to Serge S. Koval, Tornado Mailing List
On Mon, May 20, 2013 at 7:55 AM, Serge S. Koval <serge...@gmail.com> wrote:
Submitted pull request.

Great, thanks for putting this together.
 

Still, I'm not convinced it is necessary - this situation can only happen in recursive calls, and Python developers know that Python does not do any tail-call optimization.

True, but do people think of it in this way?  I go back and forth on it - sometimes it looks like tail recursion, sometimes add_callback looks like a kind of tail-call optimization itself.  This "tail recursive" function can run forever without blowing out the stack; should adding @gen.engine change the behavior of this non-generator idiom?
  def f(x):
    print x
    IOLoop.current().add_callback(f, x+1)

I think we have to do this, even though deactivation callbacks are an ugly hack, just for backwards compatibility.  The pseudo-tail-recursive pattern is difficult to grep for, and few projects have enough testing infrastructure to figure out whether they will have a memory problem before they deploy 3.1 and it starts blowing up in production.
 

Anyway, performance-wise it is not that bad. Two tests with 50 nested StackContexts and ExceptionStackContexts:

Good, this still leaves us with a substantial gain over 3.0.  Is this still the same benchmark from https://gist.github.com/mrjoes/5374044, though?  I'm not sure that tests the right things here.  It calls enter and exit 50 times each, but only wraps and calls one function.  The expense of deactivation is paid when calling a wrapped function, so we need more of those.  Try something like this:

def stacked_wrapped(depth, mod, func=test, ctx=ctx_exception):
    with ctx(mod):
        if depth > 0:
            fn = mod.wrap(functools.partial(stacked_wrapped, depth - 1, mod))
        else:
            fn = mod.wrap(functools.partial(func, 123, 456))
    # outside the with statement
    fn()

This should show a larger difference between the old and new.  

I think we have some options to speed things up here.  If we clear self.exception_handler in _deactivate() and check the active flag in _handle_exception, we don't have to walk the exception chain every time.  We just need to make sure that the chain doesn't grow without bound and we eventually clean up the husks of inactive ExceptionStackContexts, but if we're not hanging on to expensive user objects we can just do that periodically (e.g. keep a depth counter on each ExceptionStackContext and run cleanup when the depth hits a multiple of N).  
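
Roughly, that idea might look like the following (an illustrative sketch only, not Tornado's actual stack_context code; the class, attribute names, and value of N are invented):

# Sketch of "cheap deactivation + periodic cleanup" for an exception-handler chain.
PRUNE_EVERY_N = 100

class ExcContextNode(object):
    def __init__(self, exception_handler, parent=None):
        self.exception_handler = exception_handler
        self.parent = parent
        self.active = True
        self.depth = parent.depth + 1 if parent is not None else 0
        if self.depth and self.depth % PRUNE_EVERY_N == 0:
            self._prune()

    def deactivate(self):
        # Cheap: drop the handler now, leave the husk in the chain for later.
        self.active = False
        self.exception_handler = None

    def handle_exception(self, typ, value, tb):
        # Walk up to the nearest still-active handler instead of scanning the
        # whole chain on every deactivation.
        node = self
        while node is not None and not node.active:
            node = node.parent
        return node.exception_handler(typ, value, tb) if node else False

    def _prune(self):
        # Occasionally splice inactive nodes out so the chain stays bounded.
        node = self
        while node is not None:
            while node.parent is not None and not node.parent.active:
                node.parent = node.parent.parent
            node = node.parent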

There's an analogous change that could be made for non-exception StackContexts, but it's not obvious to me that we can beat the list comprehension there.  (The only known use of the deactivation callback is for ExceptionStackContext, so maybe we don't even need to support deactivation in other cases)

-Ben

Ben Darnell

May 26, 2013, 2:20:04 PM
to Serge S. Koval, Tornado Mailing List
OK, I've written some new benchmarks.  The enter_exit benchmark below is similar to yours and primarily tests the cost of the nested "with" statements; the call_wrapped benchmark wraps a callback and runs it on a clean stack instead of going straight to the next level of nesting.  The deactivation callback had its biggest impact on the call_wrapped benchmark for ExceptionStackContext:  calling a wrapped function once again has a cost proportional to the depth of the stack.  The numbers aren't too bad, though.  For a depth of 500, this benchmark took 681ms in 3.0.1, 4.1ms with your changes and without deactivation, and 41ms with deactivation.  


For version 3.0.1:
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 660 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 14.1 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.16 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 1.46 sec per loop
ExceptionBenchmark().enter_exit(50)
1000 loops, best of 3: 311 usec per loop
ExceptionBenchmark().call_wrapped(50)
100 loops, best of 3: 9.53 msec per loop
ExceptionBenchmark().enter_exit(500)
100 loops, best of 3: 3.41 msec per loop
ExceptionBenchmark().call_wrapped(500)
10 loops, best of 3: 681 msec per loop

Optimized version without deactivation (commit f55614a64):
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 779 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 10.6 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.28 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 877 msec per loop
ExceptionBenchmark().enter_exit(50)
1000 loops, best of 3: 168 usec per loop
ExceptionBenchmark().call_wrapped(50)
1000 loops, best of 3: 415 usec per loop
ExceptionBenchmark().enter_exit(500)
1000 loops, best of 3: 1.9 msec per loop
ExceptionBenchmark().call_wrapped(500)
100 loops, best of 3: 4.15 msec per loop

With deactivation (commit 4b88839):
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 731 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 11.1 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.32 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 857 msec per loop
ExceptionBenchmark().enter_exit(50)
10000 loops, best of 3: 158 usec per loop
ExceptionBenchmark().call_wrapped(50)
1000 loops, best of 3: 889 usec per loop
ExceptionBenchmark().enter_exit(500)
100 loops, best of 3: 1.74 msec per loop
ExceptionBenchmark().call_wrapped(500)
10 loops, best of 3: 41.7 msec per loop

-Ben

Serge S. Koval

May 26, 2013, 5:41:04 PM
to Ben Darnell, Tornado Mailing List
I found two related bugs with the existing implementation:
1. If the top stack context gets deactivated, it'll crash. I'm working on this one.
2. Stack deactivation messes up leaked-context verification in some cases. I'm figuring out how to fix it.

Serge.

Serge S. Koval

May 26, 2013, 5:55:44 PM
to Ben Darnell, Tornado Mailing List
Submitted pull request.

Unfortunately, I didn't have time to play with performance last week, but there are a few things I hope to finish:
1. Do some refactoring of the new StackContext implementation.

I was thinking of storing the parent in the stack, so there's no need to do a property lookup on an object when traversing the chain for exception handling and stack deactivation. Also, as you suggested, there's no need to do stack deactivation cleanup every time, so it is also a good idea to store the depth of the chain as well, etc.

2. I was doing some profiling and it looks like the HTTP header parser can be slightly improved. For simple benchmarks, most of the time is spent there. I'm not sure it really helps for real applications, but for some simple API services the optimization makes sense.

Serge.

Ben Darnell

May 26, 2013, 10:13:16 PM
to Serge S. Koval, Tornado Mailing List
On Sun, May 26, 2013 at 5:55 PM, Serge S. Koval <serge...@gmail.com> wrote:
Submitted pull request.

Thanks.
 

Unfortunately, I didn't have time to play with performance last week, but there are a few things I hope to finish:
1. Do some refactoring of the new StackContext implementation.

I was thinking of storing the parent in the stack, so there's no need to do a property lookup on an object when traversing the chain for exception handling and stack deactivation. Also, as you suggested, there's no need to do stack deactivation cleanup every time, so it is also a good idea to store the depth of the chain as well, etc.

2. I was doing some profiling and it looks like the HTTP header parser can be slightly improved. For simple benchmarks, most of the time is spent there. I'm not sure it really helps for real applications, but for some simple API services the optimization makes sense.

The HTTPHeaders code is already fairly well-tuned.  I'm open to suggestions, but I wouldn't advise spending too much time exploring it.  I suspect that the only way to get improvements that aren't "slight" will be to rewrite in C (or maybe use cython).
 
I found two related bugs with the existing implementation:
1. If the top stack context gets deactivated, it'll crash. I'm working on this one.
2. Stack deactivation messes up leaked-context verification in some cases. I'm figuring out how to fix it.

Is this second problem still happening, or did the fix for deactivating the top context fix it?  If it's still a problem, can you describe how to reproduce it?

-Ben

Serge S. Koval

May 27, 2013, 3:06:20 AM
to Ben Darnell, Tornado Mailing List
On Mon, May 27, 2013 at 5:13 AM, Ben Darnell <b...@bendarnell.com> wrote: 
Is this second problem still happening, or did the fix for deactivating the top context fix it?  If it's still a problem, can you describe how to reproduce it?
Yes, it was fixed.
 
Serge.