epoll_wait, clock_gettime getting called excessively


jmclar...@gmail.com

Apr 6, 2017, 9:11:59 AM
to gevent: coroutine-based Python network library
Hi All,

I'm using gevent 1.2.1 with gunicorn. I have specific endpoints in my app that cause whichever worker handles them to call epoll_wait and clock_gettime excessively, per the strace below. Every time I hit one of these endpoints the frequency of the calls below increases, until the worker becomes unresponsive (100s - 1000s). The problematic endpoints reach out to different database technologies, but the business logic completes successfully.

We're using Python 2.7. Any advice would be appreciated.

12:07:21.495656 epoll_wait(11, {}, 64, 1) = 0
12:07:21.496823 clock_gettime(CLOCK_MONOTONIC, {2135990, 965514473}) = 0
12:07:21.496950 clock_gettime(CLOCK_MONOTONIC, {2135990, 965639451}) = 0
12:07:21.497056 clock_gettime(CLOCK_MONOTONIC, {2135990, 965745152}) = 0
12:07:21.497140 epoll_wait(11, {}, 64, 31) = 0
12:07:21.528272 clock_gettime(CLOCK_MONOTONIC, {2135990, 996949207}) = 0
12:07:21.528314 clock_gettime(CLOCK_MONOTONIC, {2135990, 996987787}) = 0
12:07:21.528347 epoll_wait(11, {}, 64, 1) = 0
12:07:21.529457 clock_gettime(CLOCK_MONOTONIC, {2135990, 998133082}) = 0
12:07:21.529558 clock_gettime(CLOCK_MONOTONIC, {2135990, 998236146}) = 0
12:07:21.529655 clock_gettime(CLOCK_MONOTONIC, {2135990, 998333894}) = 0
12:07:21.529694 epoll_wait(11, {}, 64, 1) = 0
12:07:21.530804 clock_gettime(CLOCK_MONOTONIC, {2135990, 999480246}) = 0
12:07:21.530877 clock_gettime(CLOCK_MONOTONIC, {2135990, 999619690}) = 0
12:07:21.531010 clock_gettime(CLOCK_MONOTONIC, {2135990, 999688676}) = 0
12:07:21.531048 epoll_wait(11, {}, 64, 1) = 0
12:07:21.532178 clock_gettime(CLOCK_MONOTONIC, {2135991, 854724}) = 0
12:07:21.532218 clock_gettime(CLOCK_MONOTONIC, {2135991, 891969}) = 0
12:07:21.532303 epoll_wait(11, {}, 64, 1) = 0
12:07:21.533422 clock_gettime(CLOCK_MONOTONIC, {2135991, 2098535}) = 0
12:07:21.533519 clock_gettime(CLOCK_MONOTONIC, {2135991, 2196492}) = 0
12:07:21.533582 clock_gettime(CLOCK_MONOTONIC, {2135991, 2271461}) = 0
12:07:21.533632 epoll_wait(11, {}, 64, 3) = 0
12:07:21.536764 clock_gettime(CLOCK_MONOTONIC, {2135991, 5440205}) = 0
12:07:21.536803 clock_gettime(CLOCK_MONOTONIC, {2135991, 5476636}) = 0
12:07:21.536835 epoll_wait(11, {}, 64, 1) = 0
12:07:21.537957 clock_gettime(CLOCK_MONOTONIC, {2135991, 6633807}) = 0
12:07:21.538032 clock_gettime(CLOCK_MONOTONIC, {2135991, 6708823}) = 0
12:07:21.538093 clock_gettime(CLOCK_MONOTONIC, {2135991, 6769151}) = 0
12:07:21.538129 epoll_wait(11, {}, 64, 7) = 0
12:07:21.545296 clock_gettime(CLOCK_MONOTONIC, {2135991, 14003710}) = 0
12:07:21.545495 clock_gettime(CLOCK_MONOTONIC, {2135991, 14187142}) = 0
12:07:21.545588 epoll_wait(11, {}, 64, 1) = 0
12:07:21.546749 clock_gettime(CLOCK_MONOTONIC, {2135991, 15435360}) = 0
12:07:21.546870 clock_gettime(CLOCK_MONOTONIC, {2135991, 15556867}) = 0
12:07:21.546972 clock_gettime(CLOCK_MONOTONIC, {2135991, 15658323}) = 0
12:07:21.547051 epoll_wait(11, {}, 64, 15) = 0
12:07:21.562371 clock_gettime(CLOCK_MONOTONIC, {2135991, 31084069}) = 0
12:07:21.562575 clock_gettime(CLOCK_MONOTONIC, {2135991, 31266250}) = 0
12:07:21.562664 epoll_wait(11, {}, 64, 1) = 0
12:07:21.563834 clock_gettime(CLOCK_MONOTONIC, {2135991, 32521719}) = 0
12:07:21.563955 clock_gettime(CLOCK_MONOTONIC, {2135991, 32662752}) = 0
12:07:21.564087 clock_gettime(CLOCK_MONOTONIC, {2135991, 32775530}) = 0

Jason Madden

Apr 6, 2017, 9:30:48 AM
to gev...@googlegroups.com


> On Apr 3, 2017, at 12:12, jmclar...@gmail.com wrote:
>
> Hi All,
>
> I'm using gevent 1.2.1 with gunicorn. I have specific endpoints in my app that cause whichever worker handles them to call epoll_wait and clock_gettime excessively, per the strace below. Every time I hit one of these endpoints the frequency of the calls below increases, until the worker becomes unresponsive (100s - 1000s). The problematic endpoints reach out to different database technologies, but the business logic completes successfully.
>
> We're using Python 2.7. Any advice would be appreciated.
>
> [strace output snipped]

`epoll_wait` is the core of the event loop. The last argument to it is the timeout in milliseconds, which here is often 1. So what we're seeing is the event loop spinning just about as fast as it can, without really waiting for any IO to become ready. The first argument is the epoll file descriptor, which is constant (11) here. Since each native thread has its own event loop, and each event loop has its own epoll file descriptor, we can assume this is all happening on a single native thread and a single event loop.

Putting that together, what that suggests to me is that this particular endpoint is leaking a resource each time it's hit, possibly a greenlet, and most likely one that's using a timer in a loop, or requesting IO with a short timeout, again in a loop. (Although I think just leaking started timer instances could also cause this, if there are enough of them.) The more of these that leak, the shorter each wait in the event loop becomes: even if the individual timeouts are relatively large, their staggered expirations mean that at every iteration one of them is nearly ready to fire, which drives the epoll_wait timeout toward zero.
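To make that concrete, here's a minimal sketch of the failure mode I mean; the `poll_for_result`/`handle_request` names and the 1ms interval are made up for illustration, not taken from your app:

    import gevent

    def job_is_done(job_id):
        # Stub standing in for a real completion check against a database.
        return False

    def poll_for_result(job_id):
        # Loops on a 1ms sleep; each sleep arms another short timer
        # in the hub's event loop.
        while not job_is_done(job_id):
            gevent.sleep(0.001)

    def handle_request(job_id):
        # BUG: the poller greenlet is never killed or joined, so every
        # request to this endpoint leaves one more 1ms timer spinning.
        gevent.spawn(poll_for_result, job_id)
        return 'ok'  # the request itself still succeeds, masking the leak

With enough of those accumulated, epoll_wait is always within a millisecond or so of some timer expiring, which is exactly the pattern in your strace.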

Be sure that greenlets are killed if you're finished with them prematurely, and be sure that timers get cancel()'d when you're done with them, whether or not they expire; always use a try/finally, `with_timeout`, or just a `with` block.
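As a rough sketch of those cleanup patterns (the `fetch` call and the 5-second value are just placeholders):

    from gevent import Timeout, with_timeout

    def fetch(sock):
        # Placeholder for whatever blocking call you're guarding.
        return sock.recv(1024)

    def read_with_manual_timeout(sock):
        # Manual start()/cancel(): the finally block guarantees the timer
        # is removed from the event loop even if fetch() raises.
        timeout = Timeout(5)
        timeout.start()
        try:
            return fetch(sock)
        finally:
            timeout.cancel()

    def read_with_context_manager(sock):
        # The `with` form cancels the timer automatically on exit; passing
        # False suppresses the Timeout exception and just abandons the block.
        with Timeout(5, False):
            return fetch(sock)
        return None

    def read_with_helper(sock):
        # with_timeout() wraps the call and returns timeout_value if the
        # timer fires, cleaning up its own timer either way.
        return with_timeout(5, fetch, sock, timeout_value=None)

    def stop_worker(worker_greenlet):
        # If you spawned a background greenlet you no longer need, kill it
        # so its timers and watchers get torn down instead of leaking.
        worker_greenlet.kill()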

If it's not obvious from the code what is leaking, you can try using one of the greenlet trace functions you can find on the web to watch them switch. Alternatively, you could hook a function up to a signal handler to dump all running threads and greenlets (found with the `gc` module). objgraph can be useful to determine whether Timeout objects are leaking.
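Something along these lines is what I have in mind for the signal-handler approach; the choice of SIGUSR1 and the objgraph call are just examples, not anything specific to your setup:

    import gc
    import signal
    import traceback

    import greenlet
    import objgraph  # optional, only used for the count below

    def dump_greenlets(signum, frame):
        # Walk everything the gc knows about, print the stack of each live
        # greenlet, then count Timeout instances to see if they keep growing.
        for obj in gc.get_objects():
            if isinstance(obj, greenlet.greenlet) and not obj.dead:
                print(obj)
                if obj.gr_frame is not None:
                    traceback.print_stack(obj.gr_frame)
        print('Timeout objects alive: %d' % len(objgraph.by_type('Timeout')))

    signal.signal(signal.SIGUSR1, dump_greenlets)

Hit the worker with `kill -USR1 <pid>` after a few requests to the problem endpoint and compare the dumps; whatever keeps growing is your leak.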

Jason