OK, I've written some new benchmarks. The enter_exit benchmark below is similar to yours and primarily tests the cost of the nested "with" statements; the call_wrapped benchmark wraps a callback and runs it on a clean stack instead of going straight to the next level of nesting. The deactivation callback had its biggest impact on the call_wrapped benchmark for ExceptionStackContext: with it, calling a wrapped function once again has a cost proportional to the depth of the stack. The numbers aren't too bad, though. For a depth of 500, this benchmark took 681 msec in 3.0.1, 4.15 msec with your changes and without deactivation, and 41.7 msec with deactivation.
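For anyone who wants to see the shape of the two benchmarks without reading the real code, here is a rough, self-contained sketch. It is a toy model and not Tornado's stack_context: the ToyBenchmark class, its make_context and wrap methods, and the entered counter are all hypothetical names made up for illustration. The toy wrap deliberately re-enters every captured context on each call, which is an assumption chosen to model the "cost proportional to the depth of the stack" case described above.

```python
import collections
import contextlib
import functools


class ToyBenchmark:
    """Toy model of the two benchmark shapes; not Tornado's stack_context."""

    def __init__(self):
        self.active = []   # contexts currently entered, outermost first
        self.entered = 0   # total number of context entries performed

    @contextlib.contextmanager
    def make_context(self):
        self.entered += 1
        self.active.append(object())
        try:
            yield
        finally:
            self.active.pop()

    def wrap(self, fn):
        # Assumption: a wrapped call re-enters every context that was active
        # when wrap() was called, so each call costs O(depth).
        depth = len(self.active)

        def wrapped():
            # Run on a clean stack, then restore whatever was active before.
            saved, self.active = self.active, []
            try:
                with contextlib.ExitStack() as stack:
                    for _ in range(depth):
                        stack.enter_context(self.make_context())
                    fn()
            finally:
                self.active = saved
        return wrapped

    def enter_exit(self, count):
        # Plain recursion through nested "with" blocks.
        if count < 0:
            return
        with self.make_context():
            self.enter_exit(count - 1)

    def call_wrapped(self, count):
        # The deque stands in for an event loop's callback queue, so each
        # wrapped callback runs after the "with" that created it has exited.
        queue = collections.deque()
        self._call_wrapped_inner(queue, count)
        while queue:
            queue.popleft()()

    def _call_wrapped_inner(self, queue, count):
        if count < 0:
            return
        with self.make_context():
            queue.append(self.wrap(
                functools.partial(self._call_wrapped_inner, queue, count - 1)))


if __name__ == "__main__":
    a = ToyBenchmark()
    a.enter_exit(50)
    print(a.entered)   # 51: one entry per nesting level

    b = ToyBenchmark()
    b.call_wrapped(50)
    print(b.entered)   # 1377: grows quadratically with depth
```

In this toy model enter_exit does work linear in the depth, while call_wrapped does quadratic work overall because every wrapped call pays for the whole captured stack. An implementation that avoids re-entering contexts on each call would not show that growth.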

3.0.1:
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 660 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 14.1 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.16 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 1.46 sec per loop
ExceptionBenchmark().enter_exit(50)
1000 loops, best of 3: 311 usec per loop
ExceptionBenchmark().call_wrapped(50)
100 loops, best of 3: 9.53 msec per loop
ExceptionBenchmark().enter_exit(500)
100 loops, best of 3: 3.41 msec per loop
ExceptionBenchmark().call_wrapped(500)
10 loops, best of 3: 681 msec per loop

Optimized version without deactivation (commit f55614a64):
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 779 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 10.6 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.28 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 877 msec per loop
ExceptionBenchmark().enter_exit(50)
1000 loops, best of 3: 168 usec per loop
ExceptionBenchmark().call_wrapped(50)
1000 loops, best of 3: 415 usec per loop
ExceptionBenchmark().enter_exit(500)
1000 loops, best of 3: 1.9 msec per loop
ExceptionBenchmark().call_wrapped(500)
100 loops, best of 3: 4.15 msec per loop

With deactivation (commit 4b88839):
StackBenchmark().enter_exit(50)
1000 loops, best of 3: 731 usec per loop
StackBenchmark().call_wrapped(50)
100 loops, best of 3: 11.1 msec per loop
StackBenchmark().enter_exit(500)
100 loops, best of 3: 6.32 msec per loop
StackBenchmark().call_wrapped(500)
10 loops, best of 3: 857 msec per loop
ExceptionBenchmark().enter_exit(50)
10000 loops, best of 3: 158 usec per loop
ExceptionBenchmark().call_wrapped(50)
1000 loops, best of 3: 889 usec per loop
ExceptionBenchmark().enter_exit(500)
100 loops, best of 3: 1.74 msec per loop
ExceptionBenchmark().call_wrapped(500)
10 loops, best of 3: 41.7 msec per loop
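
To put the three ExceptionBenchmark().call_wrapped(500) results side by side, a quick sketch of the arithmetic follows. The timings are copied from the output above; the dictionary labels and variable names are only illustrative.

```python
# ExceptionBenchmark().call_wrapped(500) timings from the runs above, in msec.
timings_ms = {
    "3.0.1": 681.0,
    "optimized, no deactivation": 4.15,
    "with deactivation": 41.7,
}

baseline = timings_ms["3.0.1"]

# Speedup of each optimized variant relative to 3.0.1.
speedups = {
    name: baseline / t
    for name, t in timings_ms.items()
    if name != "3.0.1"
}

# Slowdown that deactivation adds on top of the optimized version.
deactivation_cost = (
    timings_ms["with deactivation"] / timings_ms["optimized, no deactivation"]
)

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster than 3.0.1")
print(f"deactivation: {deactivation_cost:.1f}x slower than without it")
```

So at depth 500 deactivation costs about a factor of 10 relative to the optimized version, but is still about 16 times faster than 3.0.1.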
-Ben