futex high sys cpu utilization

Jaydeep Chovatia

unread,

Jul 17, 2012, 1:20:20 AM7/17/12

to

Hi,

In my c++ (Linux) application I am seeing high CPU utilization (almost 90%) and it that sys-cpu:user-cpu ratio is 8:2. I then tried to run “strace” command and found that around 80% of time is being spent in “futex” system call, please see snippet of strace here:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
80.66 100.007743 1567 63806 22761 futex
9.15 11.340274 96925 117 4 restart_syscall
5.60 6.939598 4401 1577 poll

Initially I though this is because of pthread_lock, unlock, condition timedwait, condition wait, etc. system calls I am using in the application, to prove that I have overridden pthread_lock, unlock, condition timedwait, condition wait, unlock, etc. using LD_PRELOAD functionality and concluded that not many threads are waiting here (mutex->__data.__nusers remains less than 5 most of the time), so it doesn’t seem that these sys calls are part of the problem.

Now I am running out of clue about how to find out code location/system call which is causing this futex high utilization, any help on this would be appreciated.

Thank you,
Jaydeep

Nicolas George

unread,

Jul 17, 2012, 4:03:29 AM7/17/12

to

Jaydeep Chovatia , dans le message
<f87be638-dccd-4a9d...@googlegroups.com>, a écrit :

> In my c++ (Linux) application I am seeing high CPU utilization (almost
> 90%) and it that sys-cpu:user-cpu ratio is 8:2. I then tried to run
> “strace” command and found that around 80% of time is being spent in
> “futex” system call, please see snippet of strace here:

If your uptime is more than 17 days, try:

sudo date -s "$(date)"

(or equivalent) and see if it fixes things. This is not a joke: there was a
leap second at the end of last month, and a strange bug in the NTP handling
causes futexes to expire prematurely. This exact command prevented Iceweasel
from eating all my CPU.

Rainer Weikusat

unread,

Jul 17, 2012, 6:55:41 AM7/17/12

to

Jaydeep Chovatia <chovatia...@gmail.com> writes:
> In my c++ (Linux) application I am seeing high CPU utilization
> (almost 90%) and it that sys-cpu:user-cpu ratio is 8:2. I then tried
> to run “strace” command and found that around 80% of time is being
> spent in “futex” system call, please see snippet of strace here:
>
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 80.66 100.007743 1567 63806 22761 futex
> 9.15 11.340274 96925 117 4 restart_syscall
> 5.60 6.939598 4401 1577 poll
>
> Initially I though this is because of pthread_lock, unlock,
> condition timedwait, condition wait, etc. system calls I am using in
> the application, to prove that I have overridden pthread_lock,
> unlock, condition timedwait, condition wait, unlock, etc. using
> LD_PRELOAD functionality and concluded that not many threads are
> waiting here (mutex->__data.__nusers remains less than 5 most of the
> time), so it doesn’t seem that these sys calls are part of the
> problem.

Wild guess: Try running date -s "`date`". If this helps, you are
probably the last person on earth to notice that the Linux leap second
handling code is/was broken ...

Jaydeep Chovatia

unread,

Jul 17, 2012, 11:46:58 AM7/17/12

to

On Tuesday, 17 July 2012 03:55:41 UTC-7, Rainer Weikusat wrote:
> Jaydeep Chovatia <chovatia...@gmail.com> writes:
> > In my c++ (Linux) application I am seeing high CPU utilization
> > (almost 90%) and it that sys-cpu:user-cpu ratio is 8:2. I then tried
> > to run “strace” command and found that around 80% of time is being
> > spent in “futex” system call, please see snippet of strace here:
> >
> > % time seconds usecs/call calls errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> > 80.66 100.007743 1567 63806 22761 futex
> > 9.15 11.340274 96925 117 4 restart_syscall
> > 5.60 6.939598 4401 1577 poll
> >
> > Initially I though this is because of pthread_lock, unlock,
> > condition timedwait, condition wait, etc. system calls I am using in
> > the application, to prove that I have overridden pthread_lock,
> > unlock, condition timedwait, condition wait, unlock, etc. using
> > LD_PRELOAD functionality and concluded that not many threads are
> > waiting here (mutex->__data.__nusers remains less than 5 most of the
> > time), so it doesn’t seem that these sys calls are part of the
> > problem.
>
> Wild guess: Try running date -s "`date`". If this helps, you are

> probably the last person on earth to notice that the Linux leap second
> handling code is/was broken ...

Thanks for your prompt response.
I had already tried the leap second solution but after that also i see this problem.

Thank you,
Jaydeep

William Ahern

unread,

Jul 17, 2012, 2:09:51 PM7/17/12

to

IIRC, the bug only causes this particular behavior when using sub-second
timeouts on a futex. The OP might also want to reconsider why he's using
sub-second timeouts in a loop.

Rainer Weikusat

unread,

Jul 17, 2012, 2:30:55 PM7/17/12

to

This affects anything using a subsecond timeout together with
CLOCK_REALTIME hrtimer and the reason is these hrtimers maintain a
redundant copy of 'the current time' which is not updated from the
leap second insertion code because Thomas Gleixner killed the call
that was supposed to do this in 2007
(746976a301ac9c9aa10d7d42454f8d6cdad8ff2b).

In case this is of interest to someone: I could offer a working patch
for that (used in production) for anything which predates the
6b43ae8a619d17c4935c3320d2ef9e92bdeed05d kernel/time/ntp.c commit
(Thu Mar 15 13:04:03 2012 -0700, meaning, it uses the hrtimer-based
leap-second handling before John Stultz introduced a locking order
problem into that by splitting a certain lock into two locks).

The patch itself is against 2.6.36.4. It my introduce blasphemous
whitespaces. It was tested with the 'John Stultz' test case program.

Index: linux-2-6/kernel/time/ntp.c
===================================================================
RCS file: /sysdata/cvs/linux-2-6/kernel/time/ntp.c,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- linux-2-6/kernel/time/ntp.c 19 Aug 2010 17:04:51 -0000 1.6
+++ linux-2-6/kernel/time/ntp.c 4 Jul 2012 15:35:33 -0000 1.7
@@ -184,10 +184,21 @@
* day, the system clock is set back one second; if in leap-delete
* state, the system clock is set ahead one second.
*/
+static void call_clock_was_set(unsigned long unused)
+{
+ (void)unused;
+ clock_was_set();
+}
+
static enum hrtimer_restart ntp_leap_second(struct hrtimer *timer)
{
+ static struct timer_list
+ call_clock_was_set_timer = TIMER_INITIALIZER(call_clock_was_set, 0, 0);
+
enum hrtimer_restart res = HRTIMER_NORESTART;
+ int leap;

+ leap = 0;
write_seqlock(&xtime_lock);

switch (time_state) {
@@ -196,17 +207,17 @@
case TIME_INS:
timekeeping_leap_insert(-1);
time_state = TIME_OOP;
- printk(KERN_NOTICE
- "Clock: inserting leap second 23:59:60 UTC\n");
hrtimer_add_expires_ns(&leap_timer, NSEC_PER_SEC);
res = HRTIMER_RESTART;
+
+ leap = -1;
break;
case TIME_DEL:
timekeeping_leap_insert(1);
time_tai--;
time_state = TIME_WAIT;
- printk(KERN_NOTICE
- "Clock: deleting leap second 23:59:59 UTC\n");
+
+ leap = 1;
break;
case TIME_OOP:
time_tai++;
@@ -219,6 +230,15 @@
}

write_sequnlock(&xtime_lock);
+
+ if (leap) {
+ mod_timer(&call_clock_was_set_timer, jiffies);
+
+ printk(KERN_NOTICE
+ "Clock: %s leap second 23:59:%u UTC\n",
+ leap == -1 ? "inserting" : "deleting",
+ leap == -1 ? 60 : 59);
+ }

return res;
}

Ersek, Laszlo

unread,

Jul 17, 2012, 5:08:55 PM7/17/12

to

On Mon, 16 Jul 2012, Jaydeep Chovatia wrote:

> Initially I though this is because of pthread_lock, unlock, condition
> timedwait, condition wait, etc. system calls I am using in the
> application, to prove that I have overridden pthread_lock, unlock,
> condition timedwait, condition wait, unlock, etc. using LD_PRELOAD
> functionality and concluded that not many threads are waiting here
> (mutex->__data.__nusers remains less than 5 most of the time), so it

> doesn?t seem that these sys calls are part of the problem.

I think what you describe might amount to high futex contention. ISTR
NPTL/futexen are optimized for the uncontended case (= no one to wait for
when acquiring, no one to wake when releasing), and they are purposely (or
at least knowingly) heavy-weight in the contended case.

... The references I can manage now are
<http://man7.org/linux/man-pages/man7/futex.7.html> and
<http://kerneltrap.org/node/422> (search the latter for "the main
application cost is the lost paralellism" for the argument).

I assume __nusers is the number of blocked threads. "Less than 5 most of
the time" accomodates "more than 3 most of the time", which could mean the
pthread mutex operations take a trip to the kernel too very frequently.

This could be a good question for comp.programming.threads. Maybe look
around in <http://www.1024cores.net/home/lock-free-algorithms>.

Laszlo

Ersek, Laszlo

unread,

Jul 17, 2012, 5:16:48 PM7/17/12

to

On Tue, 17 Jul 2012, Rainer Weikusat wrote:

> This affects anything using a subsecond timeout together with
> CLOCK_REALTIME hrtimer and the reason is these hrtimers maintain a
> redundant copy of 'the current time' which is not updated from the leap
> second insertion code because Thomas Gleixner killed the call that was
> supposed to do this in 2007 (746976a301ac9c9aa10d7d42454f8d6cdad8ff2b).

Thanks for the reference.

<troll>

"Quite a few people consider me to be one of the Grumpy Old Men. That's
related to my age and the age-related unwillingness to cope with crap."

https://www.linux.com/news/special-feature/linux-developers/591039-30-linux-kernel-developers-in-30-weeks-thomas-gleixner

If he thinks he needs an excuse, he's gonna need a better one.

</troll>

Laszlo

Rainer Weikusat

unread,

Jul 18, 2012, 4:18:45 PM7/18/12

to

"Ersek, Laszlo" <la...@caesar.elte.hu> writes:
> On Tue, 17 Jul 2012, Rainer Weikusat wrote:
>> This affects anything using a subsecond timeout together with
>> CLOCK_REALTIME hrtimer and the reason is these hrtimers maintain a
>> redundant copy of 'the current time' which is not updated from the
>> leap second insertion code because Thomas Gleixner killed the call
>> that was supposed to do this in 2007
>> (746976a301ac9c9aa10d7d42454f8d6cdad8ff2b).

[...]

> "Quite a few people consider me to be one of the Grumpy Old
> Men. That's related to my age and the age-related unwillingness to
> cope with crap."
>
> https://www.linux.com/news/special-feature/linux-developers/591039-30-linux-kernel-developers-in-30-weeks-thomas-gleixner
>
> If he thinks he needs an excuse, he's gonna need a better one.

I don't think so. Even the smartest guy will occasionally make silly
mistakes. Because of this, there should be some kind of 'code review
process' by 'other smart guys' to catch these. And the problem is that
this doesn't exist, not the least because critcizing "I'm the core
maintainer of this, that and something else" guys is a really unwise
thing to do for anyone lower in the hierarchy with any ambitions
beyond "My problem is fixed. I'm happy".

William Ahern

unread,

Jul 18, 2012, 5:41:37 PM7/18/12

to

Code review probably would never have caught this. Catching this bug would
have required a thorough understanding of the timer infrastructure
(especially the niche cases of tick adjustments). I mean, the reviewer would
have had to foresee this kind of error (a missing line in some far flung
module) and know to look for it. Very few other developers would have had
this expertise. It's an unintended consequence of specialization; nobody
around to fix the other guy's mistakes.

IME code review is usually more a waste of time than anything. You usually
need someone as experienced or more experienced than the author wrt to the
subject code. But that means the cost of the review is quite high; higher
than most people are willing to regularly accept. Afterall, there's a reason
the other experienced engineer is off doing something else.

I'm not saying it can't be done systematically. It just requires a
considerable investment in time. Maybe the best way to handle code reviews
is to pair programmers together. Each generally goes about his own business,
but part of his responsibility is to keep track of what the other guy is
doing (keeping his opinions to himself, lest egos get in the way). When he's
finally tasked with reviewing some particuarly hairy bit of code he can come
up to speed immediately. Sort of like a President and Vice-President
situation. The VP keeps his mouth shut, and in return the President keeps
him apprised of everything that's going on so the VP can immediately step
into his shoes when required.

Rainer Weikusat

unread,

Jul 19, 2012, 12:12:24 PM7/19/12

to

Not at all. Employing some common sense had been entirely sufficient:
If clock_was_set is really useless when the clock is being set because
of a leap second, why isn't it also useless when the clock is being
set via settimeofday?

And these are the kind of questions someone who is not immersed in the
details of a specific problem is likely to ask: Granted, this is a
tree. Coming to think of it, a lot of trees seem to be all around
here. Could we be in a forest?

Fritz Wuehler

unread,

Jul 22, 2012, 8:22:29 PM7/22/12

to

William Ahern <wil...@wilbur.25thandClement.com> wrote:

> Rainer Weikusat <rwei...@mssgmbh.com> wrote:
> > "Ersek, Laszlo" <la...@caesar.elte.hu> writes:
> > > On Tue, 17 Jul 2012, Rainer Weikusat wrote:

> Code review probably would never have caught this. Catching this bug would
> have required a thorough understanding of the timer infrastructure
> (especially the niche cases of tick adjustments). I mean, the reviewer would
> have had to foresee this kind of error (a missing line in some far flung
> module) and know to look for it. Very few other developers would have had
> this expertise. It's an unintended consequence of specialization; nobody
> around to fix the other guy's mistakes.

I like this last sentence. It's an important concept. I'm not sure there's
anything you can do about it but recognizing it is worth something.

> IME code review is usually more a waste of time than anything. You usually
> need someone as experienced or more experienced than the author wrt to the
> subject code. But that means the cost of the review is quite high; higher
> than most people are willing to regularly accept. Afterall, there's a reason
> the other experienced engineer is off doing something else.

I don't like code reviews either, what I ask my guys to do (and what I do)
is to run it past at least one other developer and explain to that person
what the problem is and how they solved it. More often than not, the person
doing the explaining will do the necessary face-palm when he realizes his
mistake, without the person listening having to do or say anything at
all. Code reviews create an adversarial environment at worst (not that that
is fundamentally unproductive, but in practice it usually is) and at best a
who gives a shit, where do I sign so I can get back to work attitude that
makes the whole review worthless. At the end of the day, code reviews are
just another "tool" for incompetent managers to CYA with, damn them.

> I'm not saying it can't be done systematically. It just requires a
> considerable investment in time. Maybe the best way to handle code reviews
> is to pair programmers together.

Yes!

> Each generally goes about his own business, but part of his responsibility
> is to keep track of what the other guy is doing (keeping his opinions to
> himself, lest egos get in the way).

No!

It's too expensive and it's just a fact of life in small to mid size teams
there aren't enough people for more than one person to own a piece of
code. And code ownership is crucial to code quality. I mean sure you can
say two people are responsible for knowing something but in practice it's
almost never realistic. It's better to ask people to discuss things with one
other developer. That alone solves 95% of the problems.

> When he's finally tasked with reviewing some particuarly hairy bit of code
> he can come up to speed immediately.

On the code I have seen and worked on, that isn't realistic. We have a lot
of code and becoming a subject matter expert takes many years. We have never
had enough money to hire enough guys to have more than one person really
know any one thing and in practice I don't even know if it's possible for
more than one person to be truly expert on a piece of code just because of
the head banging that would have to happen to get to that point. I mean what
do you do, give every other problem to one of the guys? Just because they
worked on half the problems doesn't mean they understand things at the same
level. Certainly splitting the work down the middle is one way to give
people the best chance possible to know a piece of code, but in practice
what usually happens is people gravitate towards what kinds of problems they
solve, and they split the work themselves according to thematic lines rather
than simple work load management. And all that does is make the module you
assigned them both to be responsible for into two, smaller modules.

> Sort of like a President and Vice-President situation. The VP keeps his
> mouth shut, and in return the President keeps him apprised of everything
> that's going on so the VP can immediately step into his shoes when
> required.

I think that is an extreme oversimplification. It is much simpler for
corporate types to exchange information than it is for technical people,
because of what is being discussed. It doesn't take 5 or 10 years to learn
this quarter's business plan, obviously, but it can take that long to really
understand a few hundred thousand lines of source code in a multi-million
line system.