timeouts triggering too soon.

431 views
Skip to first unread message

Shaun Lindsay

unread,
Sep 19, 2011, 5:41:32 PM9/19/11
to gev...@googlegroups.com
Hello,

We've been noticing timeouts on connections triggering before they're supposed to.  I just submitted a bug report here.  The root of the issue is that libev (libevent as well) caches the time at the start of a dispatch loop and uses that cached time when setting timeouts.  Effectively, the real value of a timeout is the requested time minus the time spent handling callbacks since the start of that dispatch loop.  For shorter timeouts (e.g. a 100ms timeout on a connect() call), this can lead to instant timeout failures and very hard to debug error conditions masquerading as network/remote system failures.  

One possible solution is to litter time.sleep(0)'s throughout the code to force a yield and, thus, a fresh run through libev's dispatch loop with a refreshed cache time.  This seems really bad, since it requires a user to understand the internals of both gevent and libev as well as have a reasonable estimate about how long sections of their code take to run.  Under high concurrency, that approach probably wouldn't work anyway.  

As I mention in the bug report, I don't really know how best to solve this.  The only way to really be sure that the timeouts aren't going to expire too soon is to set them relative to a clock_gettime() called for every timeout.  Under high concurrency, this could result in a lot of time wasted in system calls, but with the guarantee that your timeouts would take the requested time.  It would also require a patch to libev (specifically, not using mh_now when checking timer expiration).  If we really want to avoid system calls per timeout check, we could try one of the non-portable userspace time guessing tricks (like RDTSC), although that's a pretty poor solution as well.  

Anyway, a long term fix for this issue would be cool, but I'd settle for some sort of reliable mitigation we can use in the interim.  Has anyone else run in to this and come up with a solution?

Creotiv

unread,
Sep 20, 2011, 4:17:20 PM9/20/11
to gev...@googlegroups.com
When you use "while time.time() < end: pass" you totaly block event loop, and because of this you get time problem.

Denis Bilenko

unread,
Oct 7, 2011, 4:50:59 AM10/7/11
to gev...@googlegroups.com


Hi Shaun,

Thanks for taking time to report this issue. I've also bumped into it
but haven't had time to look closely. I've got some comments though.

> The root of the issue is that libev (libevent as well) caches the time

I've made a simple test (attached) to reproduce this issue and it
shows that gevent-0.13/libevent does not actually have this problem.
Are you sure about libevent having this problem as well?

> One possible solution is to litter time.sleep(0)'s throughout the code to
> force a yield and, thus, a fresh run through libev's dispatch loop with a
> refreshed cache time. This seems really bad

This is unnecessary. Libev has a special function to update cached
time: ev_now_update:
http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#The_special_problem_of_time_updates

It is exposed by gevent as gevent.get_hub().loop.update()

So you can use it instead of sleep(0).

One idea on how to use it is to set up an alarm handler to
periodically call loop.update()

loop = gevent.get_hub().loop
signal.signal(signal.SIGALRM, lambda *args: loop.update())
signal.setitimer(signal.ITIMER_REAL, 0.1, 0.1)

This will update cached time every 100ms. Which means the problem is
reduced from "timeouts expire earlier" to "timeouts expire earlier but
not earlier than by 100ms".

You can of course tune the parameter to suit your application. The
socket connect timeouts are likely to be bigger than 100ms so it
should fix the problem.

I would also prefer libev to defer timer starts to the end of the
event dispatch cycle. I wonder if the current behaviour is accidental
or is it intentional and have some desired properties. You mentioned
that you plan to ask libev developer about this - have you done so?

Regards,
Denis.

Denis Bilenko

unread,
Oct 7, 2011, 5:06:52 AM10/7/11
to gev...@googlegroups.com
The script I used to test it.
timer_expire_early.py

Shaun Lindsay

unread,
Oct 7, 2011, 6:04:41 PM10/7/11
to gev...@googlegroups.com
Hi Denis,

Thanks for taking time to report this issue. I've also bumped into it
but haven't had time to look closely. I've got some comments though.

> The root of the issue is that libev (libevent as well) caches the time

I've made a simple test (attached) to reproduce this issue and it
shows that gevent-0.13/libevent does not actually have this problem.
Are you sure about libevent having this problem as well?

We first encountered the problem using gevent-0.13 and libevent's dispatch cycle uses the same timestamp caching pattern.  We upgraded to the libev-based gevent in hopes that it would fix the issue (I did a cursory review of the libev code, looking for cached timestamps and didn't find them, not realizing that the timers were entirely separate from the fd events themselves).  I'm certain libevent has the same issue. 
 
> One possible solution is to litter time.sleep(0)'s throughout the code to
> force a yield and, thus, a fresh run through libev's dispatch loop with a
> refreshed cache time.  This seems really bad

This is unnecessary. Libev has a special function to update cached
time: ev_now_update:
http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#The_special_problem_of_time_updates

It is exposed by gevent as gevent.get_hub().loop.update()

So you can use it instead of sleep(0).

One idea on how to use it is to set up an alarm handler to
periodically call loop.update()

loop = gevent.get_hub().loop
signal.signal(signal.SIGALRM, lambda *args: loop.update())
signal.setitimer(signal.ITIMER_REAL, 0.1, 0.1)

This will update cached time every 100ms. Which means the problem is
reduced from "timeouts expire earlier" to "timeouts expire earlier but
not earlier than by 100ms".

You can of course tune the parameter to suit your application. The
socket connect timeouts are likely to be bigger than 100ms so it
should fix the problem.

Ah, this is a much better solution than time.sleep(0).  In our case, we actually are using a connection timeout of 100ms:  we're connecting to datacenter local cassandra nodes, so if they haven't connected in 100ms they're probably never going to connect and thus we should try the next node in the cassandra pool. 

Actually, does the signal handler actually interrupt the currently running coroutine?  I think libev uses signalfd, which should make the signal callbacks run as part of the usual event dispatch cycle.  If that's the case, the problem would remain.  I'll combine my original test script with the SIGALRM update() call and see what happens.
 
I would also prefer libev to defer timer starts to the end of the
event dispatch cycle. I wonder if the current behaviour is accidental
or is it intentional and have some desired properties. You mentioned
that you plan to ask libev developer about this - have you done so?

I haven't.  I couldn't find a place to submit a bug report, moreover I wasn't sure it was actually a bug.  Reading the link about 'The special problem of time updates' makes me think they're very aware of the issue.  I don't know if they've considered the deferred timeout solution, though.  I also wanted to have a working patch for the deferred timer starts before submitting.  The libev code is fairly complicated, so I've mainly just produced a very broken libev instead of the efficient solution I'd envisioned. :)  I suppose I should just suck it up and ask on the libev mailing list without a working patch.  I hate not having a patch.  

Anyway, thanks for responding.  I was beginning to think that we were the only ones seeing this behavior.  

--Shaun

Shaun Lindsay

unread,
Oct 7, 2011, 6:30:09 PM10/7/11
to gev...@googlegroups.com
Just looked a bit closer and noticed that you were using an actual signal, not the monkey-patched gevent signal, thus avoiding my concerns about signalfd/fd callbacks in general.  You'd need to set this up before monkey-patching then.  I'll play around with it and see how that will work.

Denis Bilenko

unread,
Oct 8, 2011, 6:41:00 AM10/8/11
to gev...@googlegroups.com
> Actually, does the signal handler actually interrupt the currently running coroutine?

If the signal is set up with signal module, then yes, it does.

On Sat, Oct 8, 2011 at 5:30 AM, Shaun Lindsay <srli...@gmail.com> wrote:
> Just looked a bit closer and noticed that you were using an actual signal,
> not the monkey-patched gevent signal, thus avoiding my concerns about
> signalfd/fd callbacks in general.  You'd need to set this up before
> monkey-patching then.

No, you don't, because signal module is not monkey patched.

Shaun Lindsay

unread,
Oct 11, 2011, 7:34:47 PM10/11/11
to gev...@googlegroups.com
I just submitted a patch for this issue on the bug tracker (available here).  Whenever ev_start_timer is called, the timer is pushed on to a list.  After event dispatch, but before the evaluation of timers, the queued timers are then kicked off and added to the heap.  This fixes the timeout behavior.  I'm planning on also submitting this to the libev mailing list as well.

Denis Bilenko

unread,
Dec 20, 2011, 12:08:19 PM12/20/11
to gev...@googlegroups.com

Vitaly

unread,
May 11, 2012, 4:54:45 AM5/11/12
to gev...@googlegroups.com
Hi Denis,

I just ran into this problem with gevent/libevent.  In my script, I only
monkey-patch the 'time' module (can't patch socket because it causes 
strange errors with pymysql/DBUtils).  In one case, I have a very long
blocking (built-in) socket wait (about 40 seconds); because it's a built-in
socket (not gevent.socket), the cached time is way behind when the call
finally returns.  After that, any gevent.Timeout() calls with timeout value
of less than 40 seconds trigger too soon (immediately).

If I call gevent.sleep(0) after the non-gevent-aware, long-blocking call
returns, then the timeouts begin to work properly.  Are there any plans
for a fix along the lines of your recent libev fix?  Sorry, I don't have a patch.

Here is my gevent/libevent version info:

gevent.core.get_version()
'2.0.17-stable'

gevent.__version__
 '0.13.6'

Denis Bilenko

unread,
May 11, 2012, 7:32:12 AM5/11/12
to gev...@googlegroups.com
On Fri, May 11, 2012 at 12:54 PM, Vitaly <vital...@gmail.com> wrote:
> Are there any plans for a fix along the lines of your recent libev fix?

No, I don't have any plans for 0.13.x. I have not used it myself for a
long time and rather spend time on 1.0 which is what I use. Why don't
you try 1.0beta?

Vitaly

unread,
May 11, 2012, 12:19:06 PM5/11/12
to gev...@googlegroups.com
On Friday, May 11, 2012 4:32:12 AM UTC-7, Denis Bilenko wrote:


No, I don't have any plans for 0.13.x. I have not used it myself for a
long time and rather spend time on 1.0 which is what I use. Why don't
you try 1.0beta?

 I tried v1.0 a couple of months ago, but somehow couldn't get it to work.  I remember not being able to access things like gevent.greenlet.LinkedExited, which broke my code (forgot what else didn't work).  I assumed it was an early work-in-progress v1.0 commit, and went back to v0.13.6.  Perhaps I grabbed an earlier/wrong cut of the software than what was already available at the time.  I would love to give it another try -- is the beta stable enough for production use?  We are developing/running on MacOS and Linux and we're presently on Python 2.6.7 -- will gevent-1.0b2.tar.gz work for those environments?

Thank you,
Vitaly

Vitaly

unread,
May 11, 2012, 12:33:21 PM5/11/12
to gev...@googlegroups.com
On Friday, May 11, 2012 4:32:12 AM UTC-7, Denis Bilenko wrote:

No, I don't have any plans for 0.13.x. I have not used it myself for a
long time and rather spend time on 1.0 which is what I use. Why don't
you try 1.0beta?


Denis -- could you please delete your post that has my email address in the quote.  It must have snuck in there from your email reply.

Many thanks,
Vitaly 

Denis Bilenko

unread,
May 12, 2012, 7:34:14 AM5/12/12
to gev...@googlegroups.com
On Fri, May 11, 2012 at 8:33 PM, Vitaly <vital...@gmail.com> wrote:
> Denis -- could you please delete your post that has my email address in the
> quote.  It must have snuck in there from your email reply.

Why? Google groups web interface does not show it without asking for a captcha.

Vitaly Krug

unread,
May 12, 2012, 5:51:19 PM5/12/12
to gev...@googlegroups.com


On Sat, May 12, 2012 at 4:34 AM, Denis Bilenko <denis....@gmail.com> wrote:

On Fri, May 11, 2012 at 8:33 PM, Vitaly <> wrote:
> Denis -- could you please delete your post that has my email address in the
> quote.  It must have snuck in there from your email reply.

Why? Google groups web interface does not show it without asking for a captcha.


>>

I could see it completely, so thought others could, too.

Best,
Vitaly

Vitaly

unread,
May 23, 2012, 4:57:48 PM5/23/12
to gev...@googlegroups.com
On Friday, May 11, 2012 4:32:12 AM UTC-7, Denis Bilenko wrote:
On Fri, May 11, 2012 at 12:54 PM, Vitaly <> wrote:
> Are there any plans for a fix along the lines of your recent libev fix?

No, I don't have any plans for 0.13.x. I have not used it myself for a
long time and rather spend time on 1.0 which is what I use. Why don't
you try 1.0beta?


I switched to gevent v1.0 and preliminary results look good with respect to the timeout error.  In case someone needs a way to reproduce this problem on gevent.core.get_version='2.0.17-stable'/gevent.__version__= '0.13.6', the attached script should do the trick.

Vitaly

unread,
May 23, 2012, 4:59:40 PM5/23/12
to gev...@googlegroups.com
Here is the attachment.
gevent_timeout.py

Vitaly

unread,
May 24, 2012, 1:05:55 AM5/24/12
to gev...@googlegroups.com
Attaching a much simpler version of the script to reproduce the  "timeouts triggering too soon" problem when used with gevent v0.13.6.
gevent_timeout_too_soon.py

François-Xavier Bourlet

unread,
May 24, 2012, 2:33:46 PM5/24/12
to gev...@googlegroups.com
Is this bug come from libevent or gevent?

If I could get ride of it without yet switching to gevent >= 1.0 i
would love it. (you know, migration, maintenance, time to do it...)
--
François-Xavier Bourlet

Vitaly

unread,
May 24, 2012, 2:47:10 PM5/24/12
to gev...@googlegroups.com


On Thursday, May 24, 2012 11:33:46 AM UTC-7, bombela wrote:
Is this bug come from libevent or gevent?

If I could get ride of it without yet switching to gevent >= 1.0 i
would love it. (you know, migration, maintenance, time to do it...)

 
François-Xavier Bourlet

The details of the problem and possible workarounds are discussed in the earlier messages on this thread.  Before switching to gevent v1.0b2, I successfully (?) used the "sprinkle gevent.sleep(0) calls" workaround discussed on this thread, but that was not a good long-term solution.

François-Xavier Bourlet

unread,
May 24, 2012, 2:49:37 PM5/24/12
to gev...@googlegroups.com
oop, I should have read the whole thread before spamming, my bad.

thanks for the details. I will try the workaround and I hope to switch
to gevent>1 in few weeks.

Thanks for time.
--
François-Xavier Bourlet
Reply all
Reply to author
Forward
0 new messages