As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not
precise enough for many drivers (yes, sleep precision is an unfair notion,
but consistently sleeping for ~an order of magnitude greater than requested
is worth fixing). This patch adds a usleep API so that udelay does not have
to be used. Obviously not every udelay can be replaced (those in atomic
contexts or being used for simple bitbanging come to mind), but there are
many, many examples of

    mydriver_write(...);
    /* Wait for hardware to latch */
    udelay(100);

in various drivers where a busy-wait loop is neither beneficial nor
necessary, but msleep simply does not provide enough precision and people
are using a busy-wait loop instead.
*** SOME QUANTIFIABLE (?) NUMBERS ***
My focus is on Android, so I started by replacing the udelays in
drivers/i2c/busses/i2c-msm.c:
267: udelay(100) --> usleep_range(100, 200)
283: udelay(100) --> usleep_range(100, 200)
333: udelay(20) --> usleep(20)
and measured wakeups after Android was completely booted and stable
across 100 trials (throwing away the first) like so:
for i in {1..100}; do
echo "=== Trial $i" >> test.txt;
echo 1 > /proc/timer_stats; sleep 10; echo 0 > /proc/timer_stats;
cat /proc/timer_stats >> test.txt;
sleep 2s;
done
then averaged the results to see if there was any benefit:
=== ORIGINAL (99 samples) ========================================= ORIGINAL ===
Avg: 188.760000 wakeups in 9.911010 secs (19.045486 wkups/sec) [18876 total]
Wakeups: Min - 179, Max - 208, Mean - 190.666667, Stdev - 6.601194
=== USLEEP (99 samples) ============================================= USLEEP ===
Avg: 188.200000 wakeups in 9.911230 secs (18.988561 wkups/sec) [18820 total]
Wakeups: Min - 181, Max - 213, Mean - 190.101010, Stdev - 6.950757
While not particularly rigorous, the results seem to indicate that there may be
some benefit from pursuing this.
*** HOW MUCH BENEFIT? ***
Somewhat arbitrarily choosing 100 as a cut-off for udelay VS usleep:
git grep 'udelay([[:digit:]]\+)' |
perl -F"[\(\)]" -anl -e 'print if $F[1] >= 100' | wc -l
yields 1093 on Linus's tree. There are 313 instances of >= 1000 and still
another 53 >= 10000us of busy wait! (If AVOID_POPS is configured in, the
es18xx driver will udelay(100000) or *0.1 seconds of busy wait*)
*** SUMMARY ***
I believe the usleep functions provide a tangible benefit, but would like
some input before I go for a more thorough udelay removal. Also, at what
point is a reasonable cutoff between udelay and usleep? I found two dated
(2007) papers discussing the overhead of a context switch:
http://www.cs.rochester.edu/u/cli/research/switch.pdf
IBM eServer, dual 2.0GHz Pentium Xeon; 512 KB L2, cache line 128B
Linux 2.6.17, RHEL 9, gcc 3.2.2 (-O0)
3.8 us / context switch
http://delivery.acm.org/10.1145/1290000/1281703/a3-david.pdf
ARMv5, ARM926EJ-S on an OMAP1610 (set to 120MHz clock)
Linux 2.6.20-rc5-omap1
48 us / context switch
However, there is more to consider than just context switching; is there
anyone who knows an appropriate cut-off, or an appropriate way to measure
and find one?
Finally, to address any potential questions of why this isn't built on
top of do_nanosleep: the function usleep_range seems very valuable for
power applications. Many of the delays are simply waiting for something
to complete, so I would prefer that they not themselves instigate
a wake-up. Also, do_nanosleep seems like it is built to be an interface
for the user-space nanosleep function - it did not seem like a good fit.
-Pat
From 26193064936016e3f679c911b4e988a3de97c531 Mon Sep 17 00:00:00 2001
From: Patrick Pannuto <ppan...@codeaurora.org>
Date: Tue, 22 Jun 2010 10:08:08 -0700
Subject: [PATCH] timer: Added usleep[_range][_interruptible] timer
usleep[_range][_interruptible] are finer-precision implementations
of msleep[_interruptible] and are designed to be drop-in
replacements for udelay where a precise sleep / busy-wait is
unnecessary. They also allow an easy interface to specify slack
when a precise(ish) wakeup is unnecessary, to help minimize wakeups.
Change-Id: I277737744ca58061323837609b121a0fc9d27f33
Change-Id: I088f14e905fc569c0a728fff5dc61ef25f49bb1e
Signed-off-by: Patrick Pannuto <ppan...@codeaurora.org>
---
include/linux/delay.h | 12 ++++++++++++
kernel/timer.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+), 0 deletions(-)
diff --git a/include/linux/delay.h b/include/linux/delay.h
index fd832c6..13f5378 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -45,6 +45,18 @@ extern unsigned long lpj_fine;
void calibrate_delay(void);
void msleep(unsigned int msecs);
unsigned long msleep_interruptible(unsigned int msecs);
+void usleep_range(unsigned long min, unsigned long max);
+unsigned long usleep_range_interruptible(unsigned long min, unsigned long max);
+
+static inline void usleep(unsigned long usecs)
+{
+ usleep_range(usecs, usecs);
+}
+
+static inline unsigned long usleep_interruptible(unsigned long usecs)
+{
+ return usleep_range_interruptible(usecs, usecs);
+}
static inline void ssleep(unsigned int seconds)
{
diff --git a/kernel/timer.c b/kernel/timer.c
index 5db5a8d..1587dad 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1684,3 +1684,47 @@ unsigned long msleep_interruptible(unsigned int msecs)
}
EXPORT_SYMBOL(msleep_interruptible);
+
+static int __sched do_usleep_range(unsigned long min, unsigned long max)
+{
+ ktime_t kmin;
+ unsigned long delta;
+
+ kmin = ktime_set(0, min * NSEC_PER_USEC);
+ delta = max - min;
+ return schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+}
+
+/**
+ * usleep_range - Drop in replacement for udelay where wakeup is flexible
+ * @min: Minimum time in usecs to sleep
+ * @max: Maximum time in usecs to sleep
+ */
+void usleep_range(unsigned long min, unsigned long max)
+{
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ do_usleep_range(min, max);
+}
+EXPORT_SYMBOL(usleep_range);
+
+/**
+ * usleep_range_interruptible - sleep waiting for signals
+ * @min: Minimum time in usecs to sleep
+ * @max: Maximum time in usecs to sleep
+ */
+unsigned long usleep_range_interruptible(unsigned long min, unsigned long max)
+{
+ int err;
+ ktime_t start;
+
+ start = ktime_get();
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ err = do_usleep_range(min, max);
+
+ if (err == -EINTR)
+ return ktime_us_delta(ktime_get(), start);
+ else
+ return 0;
+}
+EXPORT_SYMBOL(usleep_range_interruptible);
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
I think one thing for you to answer would be, why do you think udelay is
a problem? I don't honestly see that many udelay()'s around, and
especially not in important code paths .. Instead of adding a new API
like this you might just rework the problem areas.
Are you approaching this from performance? or battery life? or what?
> *** SOME QUANTIFIABLE (?) NUMBERS ***
>
> then averaged the results to see if there was any benefit:
>
> === ORIGINAL (99 samples) ========================================= ORIGINAL ===
> Avg: 188.760000 wakeups in 9.911010 secs (19.045486 wkups/sec) [18876 total]
> Wakeups: Min - 179, Max - 208, Mean - 190.666667, Stdev - 6.601194
>
> === USLEEP (99 samples) ============================================= USLEEP ===
> Avg: 188.200000 wakeups in 9.911230 secs (18.988561 wkups/sec) [18820 total]
> Wakeups: Min - 181, Max - 213, Mean - 190.101010, Stdev - 6.950757
>
> While not particularly rigorous, the results seem to indicate that there may be
> some benefit from pursuing this.
This is sort of ambiguous .. I don't think you replaced enough of these
for it to have much of an impact. It's actually counterintuitive
because your changes add more timers, yet they reduced average wakeups
by a tiny amount .. Why do you think that is?
> *** HOW MUCH BENEFIT? ***
>
> Somewhat arbitrarily choosing 100 as a cut-off for udelay VS usleep:
>
> git grep 'udelay([[:digit:]]\+)' |
> perl -F"[\(\)]" -anl -e 'print if $F[1] >= 100' | wc -l
>
> yields 1093 on Linus's tree. There are 313 instances of >= 1000 and still
> another 53 >= 10000us of busy wait! (If AVOID_POPS is configured in, the
> es18xx driver will udelay(100000) or *0.1 seconds of busy wait*)
I'd say a better question is how often do they run?
Another thing is that your usleep() can't replace udelay() in critical
sections. However, if you're doing udelay() in non-critical areas, I don't
think there is anything stopping preemption during the udelay() .. So
udelay() doesn't really cut off the whole system when it runs unless it
_is_ in a critical section.
Although it looks like you've spent a good deal of time on this write
up, the reasoning for these changes is still elusive (at least to me)..
Daniel
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
First and foremost: power. If switching from udelay to usleep lets the processor
go to a lower C-state once in a while, then I would consider this a win.
>
>> *** SOME QUANTIFIABLE (?) NUMBERS ***
>>
>
>> then averaged the results to see if there was any benefit:
>>
>> === ORIGINAL (99 samples) ========================================= ORIGINAL ===
>> Avg: 188.760000 wakeups in 9.911010 secs (19.045486 wkups/sec) [18876 total]
>> Wakeups: Min - 179, Max - 208, Mean - 190.666667, Stdev - 6.601194
>>
>> === USLEEP (99 samples) ============================================= USLEEP ===
>> Avg: 188.200000 wakeups in 9.911230 secs (18.988561 wkups/sec) [18820 total]
>> Wakeups: Min - 181, Max - 213, Mean - 190.101010, Stdev - 6.950757
>>
>> While not particularly rigorous, the results seem to indicate that there may be
>> some benefit from pursuing this.
>
> This is sort of ambiguous .. I don't think you replaced enough of these
> for it to have much of an impact. It's actually counter intuitive
> because your changes add more timers, yet they reduced average wakeups
> by a tiny amount .. Why do you think that is ?
>
Yes, this test was leftover from a different project that involved refactoring
timers, so it was available and easy. My guess for the reduction in number of
wakeups is that the processor is able to do other work during the 100us it was
previously busy-waiting, and thus had to wake up less often.
I don't know a good way to test this, if you do, please advise and I will
happily pursue it.
>> *** HOW MUCH BENEFIT? ***
>>
>> Somewhat arbitrarily choosing 100 as a cut-off for udelay VS usleep:
>>
>> git grep 'udelay([[:digit:]]\+)' |
>> perl -F"[\(\)]" -anl -e 'print if $F[1] >= 100' | wc -l
>>
> >> yields 1093 on Linus's tree. There are 313 instances of >= 1000 and still
>> another 53 >= 10000us of busy wait! (If AVOID_POPS is configured in, the
>> es18xx driver will udelay(100000) or *0.1 seconds of busy wait*)
>
> I'd say a better question is how often do they run?
The i2c guys will get hit any time there is contention / heavy traffic on the
i2c bus (they're in the i2c_poll_notbusy path, also the i2c_poll_writeready),
so any time there is a lot of peripheral traffic (e.g. the phone is probably
doing a lot of stuff), then there are long (ish) busy-wait loops that are
unnecessary.
I haven't researched extensively, but I imagine there are a fair number of
other code paths like this: udelays polling until devices aren't busy - and
devices are generally only busy under some degree of load, which is not a good
time to busy-wait if you don't have to, IMHO.
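A sketch of the poll-until-ready pattern in question, as a userspace analogue (device_busy and wait_until_ready are hypothetical names; nanosleep stands in for the proposed usleep_range, and the hardware is faked with a counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical status poll; stands in for reading a "not busy" bit
 * the way i2c_poll_notbusy does.  Here the fake device reports busy
 * twice and then ready, so the loop below iterates twice. */
static int polls_remaining = 3;

static bool device_busy(void)
{
	return --polls_remaining > 0;
}

/* The polling loop.  In-kernel, the proposal is effectively:
 *
 *     while (device_busy())
 *             usleep_range(100, 200);
 *
 * instead of udelay(100) in the loop body.  nanosleep(100us) is the
 * userspace stand-in: the CPU is free between polls instead of
 * spinning.  Returns the number of polls that found the device busy. */
static int wait_until_ready(void)
{
	struct timespec poll_gap = { .tv_sec = 0, .tv_nsec = 100 * 1000 };
	int polls = 0;

	while (device_busy()) {
		nanosleep(&poll_gap, NULL);
		polls++;
	}
	return polls;
}
```

The range (100, 200) is the interesting part for power: the extra 100us of slack lets the scheduler coalesce this wakeup with others rather than programming a precise timer per poll.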
>
> Another thing is that your usleep() can't replace udelay() in critical
> sections. However, if you're doing udelay() in non-critical areas, I don't
> think there is anything stopping preemption during the udelay() .. So
> udelay() doesn't really cut off the whole system when it runs unless it
> _is_ in a critical section.
>
I mentioned elsewhere that this can't replace all udelays; as for those that
can be pre-empted, it seems like only a win to give up your time slice to
something that will do real work (or sleep at a lower c-state and use less
power) than to sit and loop. Yes, you *could* be pre-empted from doing
absolutely nothing, but I don't think you should *have* to be for the
system to make a more productive use of those cycles.
> Although it looks like you've spent a good deal of time on this write
> up, the reasoning for these changes is still elusive (at least to me)..
>
> Daniel
--
It's not clear if your changes would actually do that though .. Since you're
adding lots of tiny, short timers instead of super long timers .. You want
more long timers to get into a lower power state.
> >
> >> *** SOME QUANTIFIABLE (?) NUMBERS ***
> >>
> >
> >> then averaged the results to see if there was any benefit:
> >>
> >> === ORIGINAL (99 samples) ========================================= ORIGINAL ===
> >> Avg: 188.760000 wakeups in 9.911010 secs (19.045486 wkups/sec) [18876 total]
> >> Wakeups: Min - 179, Max - 208, Mean - 190.666667, Stdev - 6.601194
> >>
> >> === USLEEP (99 samples) ============================================= USLEEP ===
> >> Avg: 188.200000 wakeups in 9.911230 secs (18.988561 wkups/sec) [18820 total]
> >> Wakeups: Min - 181, Max - 213, Mean - 190.101010, Stdev - 6.950757
> >>
> >> While not particularly rigorous, the results seem to indicate that there may be
> >> some benefit from pursuing this.
> >
> > This is sort of ambiguous .. I don't think you replaced enough of these
> > for it to have much of an impact. It's actually counter intuitive
> > because your changes add more timers, yet they reduced average wakeups
> > by a tiny amount .. Why do you think that is ?
> >
>
> Yes, this test was leftover from a different project that involved refactoring
> timers, so it was available and easy. My guess for the reduction in number of
> wakeups is that the processor is able to do other work during the 100us it was
> previously busy-waiting, and thus had to wake up less often.
As I said in the prior email the udelay()'s don't preclude other types
of work since you can get preempted.
I think your results are just showing noise ..
> I don't know a good way to test this, if you do, please advise and I will
> happily pursue it.
You could test residency in specific power states .. Since you want to
test if you're reducing power consumption .. However, I'd replace a ton
more of these udelay()'s; I don't think you'll get decent results
without that.
> >> *** HOW MUCH BENEFIT? ***
> >>
> >> Somewhat arbitrarily choosing 100 as a cut-off for udelay VS usleep:
> >>
> >> git grep 'udelay([[:digit:]]\+)' |
> >> perl -F"[\(\)]" -anl -e 'print if $F[1] >= 100' | wc -l
> >>
> >> yields 1093 on Linus's tree. There are 313 instances of >= 1000 and still
> >> another 53 >= 10000us of busy wait! (If AVOID_POPS is configured in, the
> >> es18xx driver will udelay(100000) or *0.1 seconds of busy wait*)
> >
> > I'd say a better question is how often do they run?
>
> The i2c guys will get hit any time there is contention / heavy traffic on the
> i2c bus (they're in the i2c_poll_notbusy path, also the i2c_poll_writeready),
> so any time there is a lot of peripheral traffic (e.g. the phone is probably
> doing a lot of stuff), then there are long (ish) busy-wait loops that are
> unnecessary.
If the phone is "doing a lot of stuff" then there's a pretty good chance power
saving is at a minimum anyway. Try to be more specific .. For example,
is there some specific app that causes power problems, and maybe that
app eventually gets into those i2c calls.
> I haven't researched extensively, but I imagine there are a fair number of
> other code paths like this; udelays polling until devices aren't busy - and
> devices are generally only busy under some degree of load, not a good time
> to busy wait if you don't have to IMHO
The busy waits you're replacing are small in length on average .. If you
have timers that trigger at small intervals then you're not going to
increase residency in any _lower_ power states .. It's possible that you
could increase residency in the top-level power state, but it seems like
it would be really marginal .. You need to show that udelay()'s have an
outside-of-noise impact on something ..
> >
> > Another thing is that your usleep() can't replace udelay() in critical
> > sections. However, if you're doing udelay() in non-critical areas, I don't
> > think there is anything stopping preemption during the udelay() .. So
> > udelay() doesn't really cut off the whole system when it runs unless it
> > _is_ in a critical section.
> >
>
> I mentioned elsewhere that this can't replace all udelays; as for those that
> can be pre-empted, it seems like only a win to give up your time slice to
> something that will do real work (or sleep at a lower c-state and use less
> power) than to sit and loop. Yes, you *could* be pre-empted from doing
> absolutely nothing, but I don't think you should *have* to be for the
> system to make a more productive use of those cycles.
I think you need to do some more research on what you're actually doing to
the system. From what you're showing us one could make a lot of different
arguments as to what this change will actually do. You really need some
sort of test that doesn't leave a lot of room for argument.
Daniel
--
Sent by a consultant of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
--
I think the underlying issue he's having is that the timer APIs are simply
not well adapted; they're awkward to use.
From a driver POV, there really isn't much that you'd really CARE ABOUT
when entering any delay.
All you care about is to get a reliable delay, with the following characteristics:
- requested delay value
- wakeup spread (do I need this with hawk-eye precision, or is it ok if
wakeup is in the next century)
- something else? (perhaps "I need a warm/cold cache"?)
Whether this is preemptable, yieldable, power-managementable or entirely switch-offable
is ENTIRELY FRIGGIN' UNIMPORTANT to a driver, in most cases - it DOES NOT CARE about it.
The driver tells the OS what kind of delay characteristics it needs,
and it's the _OSes_ job to always do the most of that, be that a correspondingly deep
power management idle mode or whatever (one could argue that it should even know
on its own whether a critical section has to be obeyed or not, i.e. whether it's
preemptable or not).
This is just what a _minimal_, perfectly _adapted_ function interface should be.
And I'm afraid the kernel is somewhat off in that regard (mdelay, msleep, udelay,
.. OH MY), which likely is why such discussions come up.
And if someone then says "but udelay is a tiny optimized function which is much faster
than some generic interface which would first need to execute a half-dozen branches
to figure out what mode exactly to choose", I say "to hell with it",
let's do the precisely right thing as fast as possible and not the sometimes right thing
perfectly fast (not to mention that always entering via the same central function
might have additional icache benefits, too).
Whether a particular environment is able to support useful power management quantities
in ms, us or even ns should never be a driver's job to worry about,
it should simply pass its requirements to the kernel and that's it.
Such orders of magnitude easily change over time given hardware's progress -
a well-designed, minimal kernel interface however probably won't need to.
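As a sketch of what such a minimal interface might look like (entirely hypothetical: the struct, the names, and the 10us threshold are illustrative, not proposed kernel API), the driver states only its requirement and the dispatcher picks the mechanism:

```c
#include <assert.h>

/* Hypothetical "minimal, adapted" delay request: the driver says only
 * how long it must wait and how much wakeup slack it can tolerate. */
struct delay_request {
	unsigned long min_ns;   /* must wait at least this long   */
	unsigned long slack_ns; /* acceptable extra wakeup latency */
};

enum delay_mechanism { DELAY_SPIN, DELAY_SLEEP };

/* Toy dispatcher: below some threshold spinning is cheaper than the
 * context-switch overhead of sleeping; above it, sleep (and let the
 * idle governor pick a power state).  The 10us cutoff is purely
 * illustrative - finding the real one is exactly the open question
 * in this thread. */
static enum delay_mechanism pick_mechanism(const struct delay_request *req)
{
	return req->min_ns + req->slack_ns < 10 * 1000 ? DELAY_SPIN
						       : DELAY_SLEEP;
}
```

The point of the sketch is that the spin-vs-sleep decision lives in one place in the OS, not in every driver, so it can evolve with the hardware.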
Frankly this is just my feeling, I don't have any precise insight into these APIs,
thus I might be way off as to the complications of this and be talking out of my a... :)
Andreas Mohr
Overall it seems like a good improvement.
> +
> +static inline void usleep(unsigned long usecs)
> +{
> + usleep_range(usecs, usecs);
> +}
> +
> +static inline unsigned long usleep_interruptible(unsigned long usecs)
Is the interruptible case even needed? I assume most drivers won't
bother with that and not being interruptible for a few usecs is not a
big issue.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Honestly, I don't think so, but I was mirroring the msleep API when
I wrote it so I included it for completeness. I can't think of a use
case where it is necessary / useful. I will remove it unless anyone
can think of an application for it?
> > Yes, this test was leftover from a different project that involved refactoring
> > timers, so it was available and easy. My guess for the reduction in number of
> > wakeups is that the processor is able to do other work during the 100us it was
> > previously busy-waiting, and thus had to wake up less often.
>
> As I said in the prior email the udelay()'s don't preclude other types
> of work since you can get preempted.
Yes, you can get preempted, but you'll still spin in the tight loop
counting...
So it does not preclude other tasks, but then you'll spin
unnecessarily.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> Well here we go then.
Great.
> *** RESULTS ***
>
> Complete results are attached (please forgive the ugly stats.py, it
> evolved; it's functional for the task at hand, but certainly not pretty).
>
> --> Long results also at end of email
>
> From the final results, after 240s:
>
> TIME STATE #TIMES CALLED TIME IN STATE | DELTA FROM ORG
>
> ////////////////////////////////////////////////////////////////////////////////
> === origin (10 samples) =================================== RUNTIME: 241.535 ===
What are the 10 samples here?
> 240 idle-spin -- 2.3 1.45001e-05
> 240 not-idle -- 88368.2 67.8311703321
> 240 idle-request -- 88367.2 23337.6322655
> 240 idle-wfi -- 88364.9 172.296383696
I'm going to assume you did ten in "origin" (i.e. control runs) and
averaged. However, you need to explicitly say that ..
What units is "time in state" in, and what interface does the timing? Are
these states ordered from least power saving to most power saving?
> === over1000 (10 samples) ====== (delta: -0.00999999998604) RUNTIME: 241.525 ===
> 240 idle-spin -- 3.1 1.91666e-05 | 4.6665e-06
> 240 not-idle -- 88723.9 65.6361809172 | -2.1949894149
> 240 idle-request -- 88722.9 23311.9855603 | -25.6467051707
> 240 idle-wfi -- 88719.8 174.493487111 | 2.1971034149
> === over500 (10 samples) ======== (delta: -0.0339999999851) RUNTIME: 241.501 ===
> 240 idle-spin -- 2.3 1.88334e-05 | 4.3333e-06
> 240 not-idle -- 88636.3 67.0242803241 | -0.806890008
> 240 idle-request -- 88635.3 23280.1632631 | -57.469002442
> 240 idle-wfi -- 88633.0 173.077055869 | 0.7806721723
> === over100 (10 samples) ======== (delta: -0.0539999999921) RUNTIME: 241.481 ===
> 240 idle-spin -- 0.9 6.6666e-06 | -7.8335e-06
> 240 not-idle -- 88599.0 67.190273164 | -0.6408971681
> 240 idle-request -- 88597.9 23253.4797828 | -84.1524827638
> 240 idle-wfi -- 88597.0 172.884529866 | 0.5881461694
> === equal100 (10 samples) ================== (delta: -0.025) RUNTIME: 241.51 ===
> 240 idle-spin -- 1.4 9.5002e-06 | -4.9999e-06
> 240 not-idle -- 88685.9 66.5067348407 | -1.3244354914
> 240 idle-request -- 88684.9 23294.4341497 | -43.1981158192
> 240 idle-wfi -- 88683.5 173.60379269 | 1.3074089936
> === equal50 (10 samples) ======== (delta: -0.0289999999804) RUNTIME: 241.506 ===
> 240 idle-spin -- 2.0 9.6664e-06 | -4.8337e-06
> 240 not-idle -- 88537.4 65.8619214556 | -1.9692488765
> 240 idle-request -- 88536.4 22979.3665406 | -358.265724952
> 240 idle-wfi -- 88534.4 174.247270576 | 1.9508868794
>
>
> There appears to be very little change outside of noise from origin through
> equal100, however, once equal50 is reached, there is a noticeable change.
it all looks pretty noisy still, but it could be that you're just
providing too much information ...
Ok, so the idle-request state had residency of 22979us (guessing units)
in equal50, and 23338us in the control. Which is 1.5% less. I would
think you want more residency rather than less, right?
Replacing 50us udelays with timers will cause more timers to trigger at
a high frequency, which would cause you to be running timer code more
often rather than doing something else (maybe being in a sleep state).
So the numbers make some sense.
Does over100 include everything from 100 to 1000?
Daniel