High resolution epoll timers for libuv

Andrius Bentkus

unread,

Apr 20, 2015, 10:01:09 AM4/20/15

to li...@googlegroups.com

While toying around with libuv I have found out that the epoll timer has a resolution of only a second which creates a drift.

If we do something like

```js

var count = 0;

var last = count;

setInterval(function () {

count++;

}, 1); // every millisecond

setInterval(function () {

console.log(count - last);

last = count;

}, 1000); // after a second

```

With libuv (this is nodejs code, but the same would hold true for C/libuv code) we get < 1000 events per second. On my machine its around ~923 events.

Now this is impossible to solve with epoll having a timeout based on milliseconds and not nanoseconds, but I just wanted to post this: https://github.com/famz/linux/tree/epoll_pwait1

This patch introduces an epoll api which works on nano second resolution and would allow us to get the thing running sufficiently fine graind. The author wants that feature to get into the kernel for qemu.

Here is the entire discussion on the linux kernel mailing list http://thread.gmane.org/gmane.linux.kernel.api/8250

I just wanted to post this so we would keep a look out for when the feature hits the kernel

Andrius Bentkus

unread,

Apr 20, 2015, 1:40:00 PM4/20/15

to li...@googlegroups.com

I meant to say of only a milli second.

Saúl Ibarra Corretgé

unread,

Apr 21, 2015, 3:38:24 AM4/21/15

to li...@googlegroups.com

Nice!

When drafting the uv_timeout API someone (Ben IIRC) told me that we
could make the timeout a double, and thus support sub-millisecond
precision, where possible. AFAIK we could have sub-millisecond precision
if we use a timerfd instead of passing the timeout to epoll_wait/pwait.
We'd arm a single timerfd per loop, witht he value of the closest timer,
and add it to the epoll set, and then we'd call epoll_wait/pwait with a
NULL timeout.

I'm not sure if there are limitations I'm not aware of here, but that
could be a start. The new epoll API could be supported as well, but
before that the user facing APIs need to support specifying
sub-millisecond precision.

--
Saúl Ibarra Corretgé
bettercallsaghul.com

signature.asc

Andrius Bentkus

unread,

Apr 21, 2015, 7:01:39 AM4/21/15

to li...@googlegroups.com

Nice!

When drafting the uv_timeout API someone (Ben IIRC) told me that we
could make the timeout a double, and thus support sub-millisecond
precision, where possible. AFAIK we could have sub-millisecond precision
if we use a timerfd instead of passing the timeout to epoll_wait/pwait.
We'd arm a single timerfd per loop, witht he value of the closest timer,
and add it to the epoll set, and then we'd call epoll_wait/pwait with a
NULL timeout.

This is indeed possible and probably the best way! I thought about

implementing a timerfd for every timer would be too much since file

descriptors are a precious resource (a lot of systems set them to

small values). However this would mean only one additional fd,

which makes it a reasonable trade off for real time precision.

A downside is that we have to call an additional system call

after every event loop iteration for resetting the timerfd and

calculating when the next timer is to be used.

That could be mitigated though in the future once that patch hits

mainline linux.

I'm not sure if there are limitations I'm not aware of here, but that
could be a start. The new epoll API could be supported as well, but
before that the user facing APIs need to support specifying
sub-millisecond precision.

I actually now wonder why the timerfd was not used before?
I mean, obviously we have now a suboptimal behaviour

since my machine can handle 1k loop iterations in one second

but I only get 920.

The only downside I see now is that windows only supports milliseconds

on the IOCP main loop as well, i wonder if we can wake it with a timer

like on epoll.

Ben Noordhuis

unread,

Apr 21, 2015, 7:20:19 AM4/21/15

to li...@googlegroups.com

That was me, yes. :-)

There is a minor complication in that the linux kernel has a concept
of 'timer slack'; to increase throughput, the kernel tries to coalesce
timer events by postponing process wake-up. The default slack is 50
microseconds, or 1/20,000th of a second; probably fast enough for most
applications.

You can disable it on a per-thread basis with
`prctl(PR_SET_TIMERSLACK, 1)`. Zero resets it.

However, the slack also affects the granularity of the futex() system
call, in turn affecting library functions that are built on top of it,
like pthread_cond_timedwait() and sem_timedwait(). Depending on your
viewpoint, that may or may not be a good thing.

Ben Noordhuis

unread,

Apr 21, 2015, 7:32:01 AM4/21/15

to li...@googlegroups.com

On Tue, Apr 21, 2015 at 1:01 PM, Andrius Bentkus <toxed...@gmail.com> wrote:
>> Nice!
>>
>> When drafting the uv_timeout API someone (Ben IIRC) told me that we
>> could make the timeout a double, and thus support sub-millisecond
>> precision, where possible. AFAIK we could have sub-millisecond precision
>> if we use a timerfd instead of passing the timeout to epoll_wait/pwait.
>> We'd arm a single timerfd per loop, witht he value of the closest timer,
>> and add it to the epoll set, and then we'd call epoll_wait/pwait with a
>> NULL timeout.
>
>
> This is indeed possible and probably the best way! I thought about
> implementing a timerfd for every timer would be too much since file
> descriptors are a precious resource (a lot of systems set them to
> small values). However this would mean only one additional fd,
> which makes it a reasonable trade off for real time precision.
>
> A downside is that we have to call an additional system call
> after every event loop iteration for resetting the timerfd and
> calculating when the next timer is to be used.

That doesn't have to be the case. Libuv would only use the timerfd
when sub-millisecond precision is needed, and only in one-shot mode.

A potential drawback is that libuv, upon returning from epoll_wait(),
may have to scan the events list for the timerfd first. If a number
of normal callbacks run first, a high-resolution timer isn't much
good.

In most cases when the timerfd expires, the list will only contain the
timerfd and nothing else, so it probably won't be too bad.

Perhaps it's an idea to batch events first instead of dispatching
callbacks straight from uv__io_poll(). That would allow for more
optimizations than just avoiding the scan; for example, it would make
it possible to coalesce a read and write event for the same file
descriptor into a single callback.

> That could be mitigated though in the future once that patch hits
> mainline linux.
>
>>
>>
>> I'm not sure if there are limitations I'm not aware of here, but that
>> could be a start. The new epoll API could be supported as well, but
>> before that the user facing APIs need to support specifying
>> sub-millisecond precision.
>
>
> I actually now wonder why the timerfd was not used before?
> I mean, obviously we have now a suboptimal behaviour
> since my machine can handle 1k loop iterations in one second
> but I only get 920.

I can answer that question: because it was more work and < 1 ms
precision wasn't needed at the time. :-)

Andrius Bentkus

unread,

Apr 21, 2015, 8:53:42 AM4/21/15

to li...@googlegroups.com

That doesn't have to be the case. Libuv would only use the timerfd
when sub-millisecond precision is needed, and only in one-shot mode.

The thing now is that if I create a timer with 1ms precision it will get called

~930 times and not 1k(+-1) times. I think the nano second precision is needed

even if we only want precise millisecond precision. (look at the node script i provided

in the beginning of the discussion). The drift really flat lines only at 5-6 ms (the

% off is then to small to calculate based on milli seconds)

I haven't run into an issue yet where I really need those 1k calls a second ... but it

would nice to have?

A potential drawback is that libuv, upon returning from epoll_wait(),
may have to scan the events list for the timerfd first. If a number
of normal callbacks run first, a high-resolution timer isn't much
good.

We feed the timerfd as the first into the epoll_event array and then it

is immediately always the first one when we are checking for the events?

Doesn't save us from scanning the entire event list?

A satured event loop (more events then manageable in a small timerframe)

on the other hand is a problem (webservers under high load) but maybe sometimes

a user has not many events in the event loop and deliberately wants that

1000fps?

In most cases when the timerfd expires, the list will only contain the
timerfd and nothing else, so it probably won't be too bad.

Perhaps it's an idea to batch events first instead of dispatching
callbacks straight from uv__io_poll(). That would allow for more
optimizations than just avoiding the scan; for example, it would make
it possible to coalesce a read and write event for the same file
descriptor into a single callback.

I can answer that question: because it was more work and < 1 ms
precision wasn't needed at the time. :-)

Can you come up with some scenarios where such precision would be useful?

Would an actual implementation downgrade the performance? (since we need

to use nano second based timers and a while ago we a discussion about

Performance Counters)

Ben Noordhuis

unread,

Apr 21, 2015, 12:57:02 PM4/21/15

to li...@googlegroups.com

On Tue, Apr 21, 2015 at 2:53 PM, Andrius Bentkus <toxed...@gmail.com> wrote:
>> That doesn't have to be the case. Libuv would only use the timerfd
>> when sub-millisecond precision is needed, and only in one-shot mode.
>
> The thing now is that if I create a timer with 1ms precision it will get
> called ~930 times and not 1k(+-1) times. I think the nano second precision
> is needed even if we only want precise millisecond precision. (look at the
> node script i provided in the beginning of the discussion). The drift really
> flat lines only at 5-6 ms (the % off is then to small to calculate based on milli
> seconds)
>
> I haven't run into an issue yet where I really need those 1k calls a second
> ... but it would nice to have?

If I understand you correctly, you'd still get that drift with
high-precision timers as long as you don't compensate for callback
overhead. The timer fires every millisecond but wall clock time spent
inside libuv and the callback causes it to shift. Without
compensation, you are limited to 1000 / (1+overhead) events/second.

Also, circling back to timer slack: with ~930 events, you stand to
accumulate up to 930 * 50 / 1000 = 46.5 ms of slack.

>> A potential drawback is that libuv, upon returning from epoll_wait(),
>> may have to scan the events list for the timerfd first. If a number
>> of normal callbacks run first, a high-resolution timer isn't much
>> good.
>
> We feed the timerfd as the first into the epoll_event array and then it
> is immediately always the first one when we are checking for the events?
> Doesn't save us from scanning the entire event list?

There is no way to control where in the ready list the kernel puts the timerfd.

Reply all

Reply to author

Forward