Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[PATCH] timers: Fix usleep_range() in the context of wake_up_process()

463 views
Skip to first unread message

Douglas Anderson

unread,
Oct 10, 2016, 3:00:05 PM10/10/16
to
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in any of the code
ensures this. Specifically:

usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
schedule_hrtimeout_range_clock() just ends up calling schedule() with an
appropriate timeout set using the hrtimer. If someone else happens to
wake up our task then we'll happily return from usleep_range() early.

msleep() already has code to handle this case since it will loop as long
as there was still time left. usleep_range() had no such loop.

The problem is is easily demonstrated with a small bit of test code:

static int usleep_test_task(void *data)
{
atomic_t *done = data;
ktime_t start, end;

start = ktime_get();
usleep_range(50000, 100000);
end = ktime_get();
pr_info("Requested 50000 - 100000 us. Actually slept for %llu us\n",
(unsigned long long)ktime_to_us(ktime_sub(end, start)));
atomic_set(done, 1);

return 0;
}

static void run_usleep_test(void)
{
struct task_struct *t;
atomic_t done;

atomic_set(&done, 0);

t = kthread_run(usleep_test_task, &done, "usleep_test_task");
while (!atomic_read(&done)) {
wake_up_process(t);
udelay(1000);
}
kthread_stop(t);
}

If you run the above code without this patch you get things like:
Requested 50000 - 100000 us. Actually slept for 967 us

If you run the above code _with_ this patch, you get:
Requested 50000 - 100000 us. Actually slept for 50001 us

Presumably this problem was not detected before because:
- It's not terribly common to use wake_up_process() directly.
- Other ways for processes to wake up are not typically mixed with
usleep_range().
- There aren't lotsof places that use usleep_range(), since many people
call either msleep() or udelay().

Reported-by: Tao Huang <huan...@rock-chips.com>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
kernel/time/timer.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..ab03c7e403a4 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);

static void __sched do_usleep_range(unsigned long min, unsigned long max)
{
+ ktime_t start, end;
ktime_t kmin;
u64 delta;
+ unsigned long elapsed = 0;
+ int ret;
+
+ start = ktime_get();
+ do {
+ kmin = ktime_set(0, min * NSEC_PER_USEC);
+ delta = (u64)(max - min) * NSEC_PER_USEC;
+ ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+
+ /*
+ * If schedule_hrtimeout_range() returns 0 then we actually
+ * hit the timeout. If not then we need to re-calculate the
+ * new timeout ourselves.
+ */
+ if (ret == 0)
+ break;

- kmin = ktime_set(0, min * NSEC_PER_USEC);
- delta = (u64)(max - min) * NSEC_PER_USEC;
- schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+ end = ktime_get();
+ elapsed = ktime_to_us(ktime_sub(end, start));
+ } while (elapsed < min);
}

/**
--
2.8.0.rc3.226.g39d4020

Andreas Mohr

unread,
Oct 10, 2016, 4:00:05 PM10/10/16
to
Hi,

On Mon, Oct 10, 2016 at 11:47:57AM -0700, Douglas Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter. However, nothing in any of the code
> ensures this. Specifically:

Ouch (aka: good catch! - by the reporter, though... ;)
Some thoughts:
- max >= min pre-condition is validated somewhere, I'd hope?
(somewhere in outer API frame?)
> + delta = (u64)(max - min) * NSEC_PER_USEC;

- there is a domain transition in there, **within** the repeated loop:
> + elapsed = ktime_to_us(ktime_sub(end, start));
-->
> + } while (elapsed < min);
should likely be made to be able to
directly compare a *non-converted* "elapsed" value
(IIRC there's an API such as ktime_before()).
One could argue that the "repeat" case is the non-likely case
(thus the conversion should possibly not be done before
actually being required),
however I guess it's exactly hitting the non-likely cases
which will contribute towards
non-predictable, non-deterministic latency behaviour of
a code path
(think RT requirements)

Thanks,

Andreas Mohr

Brian Norris

unread,
Oct 10, 2016, 4:10:06 PM10/10/16
to
Hi Doug,

On Mon, Oct 10, 2016 at 11:47:57AM -0700, Doug Anderson wrote:
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 32bf6f75a8fe..ab03c7e403a4 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);
>
> static void __sched do_usleep_range(unsigned long min, unsigned long max)
> {
> + ktime_t start, end;
> ktime_t kmin;
> u64 delta;
> + unsigned long elapsed = 0;
> + int ret;
> +
> + start = ktime_get();
> + do {
> + kmin = ktime_set(0, min * NSEC_PER_USEC);

I believe 'min' is unmodified throughout, and therefore 'kmin' is
computed to be the same minimum timeout in each loop. Shouldn't this be
decreasing on each iteration of the loop? (i.e., either your compute
'kmin' differently here, or you recompute 'min' based on the elapsed
time?)

> + delta = (u64)(max - min) * NSEC_PER_USEC;
> + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +
> + /*
> + * If schedule_hrtimeout_range() returns 0 then we actually
> + * hit the timeout. If not then we need to re-calculate the
> + * new timeout ourselves.
> + */
> + if (ret == 0)
> + break;
>
> - kmin = ktime_set(0, min * NSEC_PER_USEC);
> - delta = (u64)(max - min) * NSEC_PER_USEC;
> - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> + end = ktime_get();
> + elapsed = ktime_to_us(ktime_sub(end, start));
> + } while (elapsed < min);

I think Andreas had similar comments, but it seemed to me like
ktime_before() might be nicer. But (as Andreas did) you might get into
(premature?) micro-optimizations down that path, and I'm certainly not
an expert there.

> }
>
> /**

Brian

Doug Anderson

unread,
Oct 10, 2016, 4:20:05 PM10/10/16
to
Hi,

On Mon, Oct 10, 2016 at 1:04 PM, Brian Norris <brian...@chromium.org> wrote:
> Hi Doug,
>
> On Mon, Oct 10, 2016 at 11:47:57AM -0700, Doug Anderson wrote:
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 32bf6f75a8fe..ab03c7e403a4 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);
>>
>> static void __sched do_usleep_range(unsigned long min, unsigned long max)
>> {
>> + ktime_t start, end;
>> ktime_t kmin;
>> u64 delta;
>> + unsigned long elapsed = 0;
>> + int ret;
>> +
>> + start = ktime_get();
>> + do {
>> + kmin = ktime_set(0, min * NSEC_PER_USEC);
>
> I believe 'min' is unmodified throughout, and therefore 'kmin' is
> computed to be the same minimum timeout in each loop. Shouldn't this be
> decreasing on each iteration of the loop? (i.e., either your compute
> 'kmin' differently here, or you recompute 'min' based on the elapsed
> time?)

Yes, I stupidly changed something at the last second and then didn't
test again after my stupid change. Fix coming soon with all comments
addressed. Sorry for posting broken code. :( :( :(

-Doug

Andreas Mohr

unread,
Oct 10, 2016, 4:50:06 PM10/10/16
to
Hi,
I'd justify it this way:
orientation/type of input data is rather irrelevant -
all that matters is how you want to treat things in your *inner handling* -
and if that happens to be ktime-based
(as is usually - if not pretty much always - the case within kernel code),
then it ought to be that way,
*consistently* (to avoid conversion transition overhead).

Andreas Mohr

Andreas Mohr

unread,
Oct 10, 2016, 4:50:08 PM10/10/16
to
With a loop style that is actively re-calculating things,
such implementations should then not fall into the trap of
basing the "next" value on "current" time,
thereby bogusly accumulating scheduling-based delays
with each new loop iteration etc.
(i.e., things should still be based on hard, precise termination according to
an *initially* calculated, *absolute*, *minimum* expiry time).

Andreas Mohr

--
GNU/Linux. It's not the software that's free, it's you.

Douglas Anderson

unread,
Oct 10, 2016, 5:20:05 PM10/10/16
to
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in any of the code
ensures this. Specifically:

- There aren't lots of places that use usleep_range(), since many people
call either msleep() or udelay().

Reported-by: Tao Huang <huan...@rock-chips.com>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
Changes in v2:
- Fixed stupid bug that snuck in before posting
- Use ktime_before
- Remove delta from the loop

NOTE: Tested against 4.4 tree w/ backports. I'm trying to get myself
up and running with mainline again to test there now but it might be a
little while. Hopefully this time I don't shoot myself in the foot.

kernel/time/timer.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..219439efd56a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible);

static void __sched do_usleep_range(unsigned long min, unsigned long max)
{
+ ktime_t now, end;
ktime_t kmin;
u64 delta;
+ int ret;

- kmin = ktime_set(0, min * NSEC_PER_USEC);
+ now = ktime_get();
+ end = ktime_add_us(now, min);
delta = (u64)(max - min) * NSEC_PER_USEC;
- schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+ do {
+ kmin = ktime_sub(end, now);
+ ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+
+ /*
+ * If schedule_hrtimeout_range() returns 0 then we actually
+ * hit the timeout. If not then we need to re-calculate the
+ * new timeout ourselves.
+ */
+ if (ret == 0)
+ break;
+
+ now = ktime_get();
+ } while (ktime_before(now, end));
}

/**
--
2.8.0.rc3.226.g39d4020

Doug Anderson

unread,
Oct 10, 2016, 5:20:05 PM10/10/16
to
Hi,

On Mon, Oct 10, 2016 at 12:53 PM, Andreas Mohr <an...@lisas.de> wrote:
> Some thoughts:
> - max >= min pre-condition is validated somewhere, I'd hope?
> (somewhere in outer API frame?)

I don't think this is validated, but it's not a new problem. My patch
doesn't attempt to solve this. A future patch could, though. You can
speculate on whether a WARN_ON would be better or some type of
friendly warning.

-Doug

Doug Anderson

unread,
Oct 10, 2016, 6:40:06 PM10/10/16
to
Hi,
OK, I finally resolved all my issues and have officially tested this
on mainline (not just in my private kernel) linux 4.8 on an
rk3288-veyron-jerry based device. Problem still exists there and fix
still works.

Sorry for the noise earlier. Hopefully this version of the patch
addresses all of Andreas's review feedback (and thank you very much
for the review!)

-Doug

Thomas Gleixner

unread,
Oct 11, 2016, 3:20:06 AM10/11/16
to
On Mon, 10 Oct 2016, Douglas Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter. However, nothing in any of the code
> ensures this. Specifically:

There is no such guarantee for that interface and never has been, so how
did you make sure that none of the existing users is relying on this?

You can't just can't just declare that all all of the users expect that and
be done with it.

Thanks,

tglx

Doug Anderson

unread,
Oct 11, 2016, 12:40:07 PM10/11/16
to
Thomas,

+clock and regulator folks since part of my arguments below involve
the regulator / clock core. If you're not interested in this topic,
feel free to ignore. Original patch can be found on LKML or at
<https://patchwork.kernel.org/patch/9369963/>
You're right that I can't guarantee that no callers are relying on the
existing behavior of a wake_up_process() causing usleep_range() to
return early. I would say, however, that all callers I've seen are
absolutely relying on the min delay being enforced and I've never
personally seen a caller relying on being woken up from
usleep_range(). All the users relying on the min delay being enforced
have never had a problem because it's not common for a task that's in
usleep_range() to be woken up, but if you happen to call some generic
code in a context where a wake up is possible you'll get a very
unexpected bug (like I just did).

I can try to look through some callers to usleep_range(), but there
are 1620 of them as per my "git grep", so I can't look at them all.
:(

...in any case, below are my arguments about why we should change
this. If we can't agree to change this, then IMHO the way forward is
to deprecate usleep_range() and start the transition of all callers to
one of two functions: usleep_atlest() and usleep_wakeable(). As
argued below I think that usleep_range() name implies that it will at
least sleep the minimum so I would really like to avoid keeping the
name usleep_range() and also keeping the existing behavior.

--

1. I believe msleep() and usleep_range() should be parallel functions.
msleep() has this guarantee of not ending early. I believe users will
expect that usleep() will also.

2. The "timers-howto.txt" does not have any information about the fact
that usleep_range() might end early. This document implies that
(assuming you're not in atomic) udelay() / usleep_range() should be
chosen between simply based on how long the expected delay is. There
is no note that you shouldn't use usleep_range() if you can't handle
ending early.

3. It seems to me that if someone wants to wait to be woken up but
they want a timeout, they wouldn't reach for usleep_range(). They
would instead think to use schedule_hrtimeout_range() directly. Said
another way: we already have a function for scheduling a delay that
can be interrupted with a wakeup, so why do we need usleep_range() to
also behave this way?

4. We need to change a lot of places if we want to handle the fact
that usleep_range() might wake up early. Every instance of
usleep_range() that I've ever seen is not expecting to end early and
is expecting at least the minimum delay for correctness. These
functions have all worked correctly because they didn't happen to be
called in a context where someone might wake up the calling task.
...but if suddenly one of these functions is called on a task that
might be woken up then all their correctness will go out the door.

In other words, I see a lot of usleep_range() calls that look like this:

/* After 5 ms we know reset is done */
assert_reset()
usleep_range(5000, 7000)

If that returns early then we'll get badness.

--

Here's a (perhaps more realistic) example of the problem. Imagine
that we are trying to do some type of DVF (Dynamic Voltage Frequency)
switch. That involves changing both a regulator and a clock. It's
possible that we might want to try to do these two things in parallel
on two tasks, so we could imagine doing:

Task #1 (voltage)
A) Call into regulator core to change voltage.
A1) Regulator core will call into arbitrary regulator driver.
A2) Regulator core won't return until regulator is at the right level.
A3) Delay waiting for regulator might be done with usleep_range().
B) Signal Task #2 that we're done.
C) Wait on Task #2 to finish.

Task #2 (clock)
A) Call into clock core to change clock.
A1) Clock core will call into arbitrary clock driver.
A2) Clock core won't return until clock is stable.
A3) Delay waiting for clock might be done with usleep_range().
B) Signal Task #1 that we're done.
C) Wait on Task #1 to finish.

Depending on the delays and the mechanism used to signal and wait, it
is possible that the "Signal" step above will end up waking up the
other task. If it does this while the other task is in the
usleep_range() code then we'll have serious problems: we'll either run
with an clock or a regulator that isn't all the way ready.

Now perhaps you will say that we should re-design the signaling
mechanism above so that it doesn't cause a wake up of the other task.
We certainly could try to do that if need be (we'd have to validate
that there is really NEVER an unexpected wakeup), but I'd expect a lot
of push back since I don't think folks would think that waking up a
task would cause such incorrect behavior.

Now perhaps you will say that we should re-design all clock and
regulator drivers to not call usleep_range() (or to add a loop). This
implies that we should add a new function so they don't all need to
get this loop right.


NOTE: In the particular failure case I'm running into, I'm on a system
using the (out of tree) "interactive" governor. I haven't followed
through the exact code path, but I see that it is using
wake_up_process() in at least several places to wake up the
"speedchange_task". This is the same task that might be calling into
cpufreq to change the frequency and might be calling into the
regulator core to do the delay. We were specifically seeing cases
where the usleep_range() in PWM regulator was returning early.

--

OK, I know that the above is pretty long. Hopefully it made sense.
Feel free to grab me on IRC if it helps. I can be found on freenode
as dianders and can join any channel where it's helpful to hash this
out.

-Doug

Brian Norris

unread,
Oct 11, 2016, 3:00:06 PM10/11/16
to
Hi Doug,

On Mon, Oct 10, 2016 at 02:04:02PM -0700, Doug Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter. However, nothing in any of the code
> ensures this.

Like you and Andreas, I also don't understand Thomas's objection to your
above claim on what users of this function expect. I believe you have
clearly laid out why the current behavior needs to be changed somehow;
IMO the only remaining question is "how." Your follow up covers all this
plenty well for me.

Either we need a fix along the lines of what you've proposed, or else we
need to completely rethink almost all uses of usleep_range().

...

> Reported-by: Tao Huang <huan...@rock-chips.com>
> Signed-off-by: Douglas Anderson <dian...@chromium.org>
> ---
> Changes in v2:
> - Fixed stupid bug that snuck in before posting
> - Use ktime_before
> - Remove delta from the loop
>
> NOTE: Tested against 4.4 tree w/ backports. I'm trying to get myself
> up and running with mainline again to test there now but it might be a
> little while. Hopefully this time I don't shoot myself in the foot.
>
> kernel/time/timer.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)

I've reviewed the logic here to the best of my ability, and it looks
good to me now. I'll admit that I'm not really a timekeeping or
scheduler expert, but FWIW:

Reviewed-by: Brian Norris <brian...@chromium.org>

Andreas Mohr

unread,
Oct 11, 2016, 3:00:09 PM10/11/16
to
Hmm, somehow I don't manage to follow these thoughts.

https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html
(as a hopefully sufficiently authoritative source of documentation)
clearly specifies min to be
"Minimum time in usecs to sleep"
, which is what one would expect a two-param interface here to be
(minimum-maximum),
i.e. what would be the *natural* protocol I'd think.

Also, [finally...] starting to enforce the minimum time
is an additional *constraint* on the protocol,
i.e. it's not at all like we are getting more *liberal* here
(since usually getting more liberal in certain protocols
is what will cause trouble, I'd think).

Not to mention that
desiring a delay in processing most certainly is
what caused users of this API to decide to invoke it in the first place
(else they would just have chosen to carry on with delay-less processing
and be done with it).
And those users then surely wouldn't want to experience a behaviour
where the delay may be ended at any time,
however short that may end up being.


A related topic probably is
premature wakeups (e.g. signal-induced) of select() etc. protocol.

Greetings,

Andreas Mohr

Andreas Mohr

unread,
Oct 11, 2016, 3:50:05 PM10/11/16
to
Hi,

decided to write a review now, slightly delayed, sorry.
This loop implementation relies on negative kmin (result of ktime_sub())
getting internally handled by schedule_hrtimeout_range() as a 0 result.
If that ain't the case, then the loop itself will need to handle
negative values.
OK, scratch that, due to guaranteed-unchanged values of now and end,
result of directly subsequent ktime_sub() is guaranteed to
not deviate from what ktime_before() figured.
However, this is somewhat differing handling of these two APIs.


Which brings me to my second point:
somehow doing both ktime_before() and ktime_sub() feels redundant,
since they are both about arguments now and end,
i.e. they are *both* evaluating a "distance".
(this could simply calculate the distance, and then
1. use that calculated distance value, and
2. check that calculated distance value against being negative).
So, most likely it ought to be achievable to have just *one* of these
two active (which would likely be ktime_sub(),
simply since we need its result as schedule_hrtimeout_range() input...).
And that way we would save big (hohumm) on instruction cache footprint :)

Hmm, but ktime API does not have sufficiently open-coded handling of the
ktime_sub()-based distance value - the possibly only way to do an
"is it negative?" check would be by doing
ktime_compare(dist, ktime_null_value),
which might be pointless
since it's a comparably large effort
vs. the current ktime_before(), ktime_sub() case.

BTW, another argument for loop rework might be that
the result of ktime_sub() might already be improper
(due to being negative!) for
subsequent invocation of schedule_hrtimeout_range(),
i.e. there's an argument to be made that
the loop tail check instead ought to be done as a negative-value check
directly prior to schedule_hrtimeout_range() invocation.


Hmm, if schedule_hrtimeout_range() can be relied on to
internally properly be doing the negative check,
then one could simply say that
the annoyingly extra calculation via ktime_before() call
should simply be removed completely.
Which would be a good step since it would centralize protocol behaviour
within the inner handling of the helper
rather than bloating user-side instruction cache footprint.


</micro_optimization> ?

Thanks,

Andreas Mohr

Doug Anderson

unread,
Oct 11, 2016, 4:10:05 PM10/11/16
to
Andreas,
I read this as: no bug, feel free to ignore.


> Which brings me to my second point:
> somehow doing both ktime_before() and ktime_sub() feels redundant,
> since they are both about arguments now and end,
> i.e. they are *both* evaluating a "distance".
> (this could simply calculate the distance, and then
> 1. use that calculated distance value, and
> 2. check that calculated distance value against being negative).
> So, most likely it ought to be achievable to have just *one* of these
> two active (which would likely be ktime_sub(),
> simply since we need its result as schedule_hrtimeout_range() input...).
> And that way we would save big (hohumm) on instruction cache footprint :)
>
> Hmm, but ktime API does not have sufficiently open-coded handling of the
> ktime_sub()-based distance value - the possibly only way to do an
> "is it negative?" check would be by doing
> ktime_compare(dist, ktime_null_value),
> which might be pointless
> since it's a comparably large effort
> vs. the current ktime_before(), ktime_sub() case.

I agree that we could try to save some math by making the loop more
complicated. On the other hand, one could argue that a sufficiently
good compiler might actually be able to figure some of this stuff out,
since much of this stuff is just inlined 64-bit math comparisons.

That being said, if you want to propose some concrete changes that
save an instruction or two on most architectures and don't make the
code too complicated, I'm happy to spin my patch.


> BTW, another argument for loop rework might be that
> the result of ktime_sub() might already be improper
> (due to being negative!) for
> subsequent invocation of schedule_hrtimeout_range(),
> i.e. there's an argument to be made that
> the loop tail check instead ought to be done as a negative-value check
> directly prior to schedule_hrtimeout_range() invocation.
>
> Hmm, if schedule_hrtimeout_range() can be relied on to
> internally properly be doing the negative check,
> then one could simply say that
> the annoyingly extra calculation via ktime_before() call
> should simply be removed completely.
> Which would be a good step since it would centralize protocol behaviour
> within the inner handling of the helper
> rather than bloating user-side instruction cache footprint.

Ah, so you're saying just do a "while (true)" after changing the
behavior of schedule_hrtimeout_range_clock(). If everyone agrees that
they'd like to see this, I can do it. I'd have to change
schedule_hrtimeout_range_clock() to check for <= 0 instead of just
comparing against 0. ...and that would be an API change for
schedule_hrtimeout_range_clock(), and it seems like that would be yet
another source of bugs. ...and this is to save an instruction? It
doesn't seem worth it.


-Doug

Heiko Stuebner

unread,
Oct 11, 2016, 4:40:06 PM10/11/16
to
It may be my personal ignorance for not finding this explained, but the
function documentation [0] states "min ... Minimum time in usecs to sleep"
which sounds pretty guaranteed to me.

One should of course make sure to not break anything intentionally, but having
things expect to work outside these parameters sounds somewhat broken

If the specified values are flexible beyond their stated range, I guess the
documentation should be updated to reflect that.


[0] https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html

Andreas Mohr

unread,
Oct 11, 2016, 4:50:06 PM10/11/16
to
On Tue, Oct 11, 2016 at 01:02:10PM -0700, Doug Anderson wrote:
> Andreas,
>
> On Tue, Oct 11, 2016 at 12:30 PM, Andreas Mohr <an...@lisas.de> wrote:
> > This loop implementation relies on negative kmin (result of ktime_sub())
> > getting internally handled by schedule_hrtimeout_range() as a 0 result.
> > If that ain't the case, then the loop itself will need to handle
> > negative values.
> > OK, scratch that, due to guaranteed-unchanged values of now and end,
> > result of directly subsequent ktime_sub() is guaranteed to
> > not deviate from what ktime_before() figured.
> > However, this is somewhat differing handling of these two APIs.
>
> I read this as: no bug, feel free to ignore.

Heh, yeah ("an alternate style of review" ;).


> I agree that we could try to save some math by making the loop more
> complicated. On the other hand, one could argue that a sufficiently
> good compiler might actually be able to figure some of this stuff out,
> since much of this stuff is just inlined 64-bit math comparisons.

Indeed. "micro-optimization again".


> > BTW, another argument for loop rework might be that
> > the result of ktime_sub() might already be improper
> > (due to being negative!) for
> > subsequent invocation of schedule_hrtimeout_range(),
> > i.e. there's an argument to be made that
> > the loop tail check instead ought to be done as a negative-value check
> > directly prior to schedule_hrtimeout_range() invocation.
> >
> > Hmm, if schedule_hrtimeout_range() can be relied on to
> > internally properly be doing the negative check,
> > then one could simply say that
> > the annoyingly extra calculation via ktime_before() call
> > should simply be removed completely.
> > Which would be a good step since it would centralize protocol behaviour
> > within the inner handling of the helper
> > rather than bloating user-side instruction cache footprint.
>
> Ah, so you're saying just do a "while (true)" after changing the
> behavior of schedule_hrtimeout_range_clock(). If everyone agrees that
> they'd like to see this, I can do it. I'd have to change
> schedule_hrtimeout_range_clock() to check for <= 0 instead of just
> comparing against 0. ...and that would be an API change for
> schedule_hrtimeout_range_clock(), and it seems like that would be yet
> another source of bugs. ...and this is to save an instruction? It
> doesn't seem worth it.

Yup, I just realized that probably what we'd ideally want is
a very thin decorating wrapper (loop handler)
which cares about nothing else other than
eradicating the relevant "feature" of the inner handler
(namely, premature exit),
and otherwise leaves a much as possible to
central inner handler decision-making implementation.
That to me sounds like
the theoretically most precise (since dead simple) handling.
And even if such exceedingly simple handling is dangerous -
in case of failure we would realize it very soon (infinite loop),
and would then know what will need fixing.


I've been reviewing schedule_hrtimeout_range_clock() as well,
and I'm actually puzzled that it checks for zero value only,
and not less-equal zero.
That to me does not seem like a true-to-the-core protocol
to handle the task at hand.
OTOH perhaps less-equal comparison was deemed to be
much more expensive than a zero check,
and thus possibly has been done in inner handlers only.



For the loop code itself,
I'm not sure whether to pursue it -
given the schedule_hrtimeout_range_clock() API change risks you outlined,
and especially given that
optimization is likely to mostly benefit the "repeat" case only,
which likely would be the less relevant non-hotpath case anyway.

Plus, let's not forget the fact that
even hotpath handling *is* an invocation of the entire delay handler,
and then a couple measly opcodes to have the loop exited reliably. So...


Now, at least we have a
Reviewed-by: Andreas Mohr <and...@users.sf.net>


Oh well, lotsa discussion, some good thoughts,
but no truly revolutionary outcome so far.
However, sometimes the most important thing is
to have had a good educating discussion
(not everything in software circles is about the Get Rich Quick scheme -
you do have to spend some ample time - decades... -
to truly get mental concept things right).

Andreas Mohr

Mark Brown

unread,
Oct 12, 2016, 5:30:04 AM10/12/16
to
On Tue, Oct 11, 2016 at 09:33:15AM -0700, Doug Anderson wrote:
> On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > On Mon, 10 Oct 2016, Douglas Anderson wrote:

> >> Users of usleep_range() expect that it will _never_ return in less time
> >> than the minimum passed parameter. However, nothing in any of the code
> >> ensures this. Specifically:

> > There is no such guarantee for that interface and never has been, so how
> > did you make sure that none of the existing users is relying on this?

> > You can't just can't just declare that all all of the users expect that and
> > be done with it.

> You're right that I can't guarantee that no callers are relying on the
> existing behavior of a wake_up_process() causing usleep_range() to
> return early. I would say, however, that all callers I've seen are
> absolutely relying on the min delay being enforced and I've never
> personally seen a caller relying on being woken up from
> usleep_range(). All the users relying on the min delay being enforced

Indeed. It's *highly* surprising for any sleep interface to undershoot
on delays, the usual thing is that they might delay for longer. If the
function doesn't actually reliably delay for the minimum time then I'd
expect that a large proportion of those conversions and other recent
code that's been added is buggy.

> one of two functions: usleep_atlest() and usleep_wakeable(). As
> argued below I think that usleep_range() name implies that it will at
> least sleep the minimum so I would really like to avoid keeping the
> name usleep_range() and also keeping the existing behavior.

I tend to agree with everything Doug is saying in terms of API
expectations.
signature.asc

Thomas Gleixner

unread,
Oct 12, 2016, 9:20:05 AM10/12/16
to
I'm well aware what Doug wants to do and I'm not saying that this is wrong,
but I'm not going to look at all usleep() usage sites to make sure none is
relying on such a behaviour and gets surprised by the change,

The point is that we had cases over and over where stuff was depending on
implementation bugs which made the buggy behaviour into an expected
behaviour. I'm not saying that this is the case here, but it's not my duty
to make sure it isn't.

So the very minimum I need in the changelog is some mentioning that the
author at least tried to verify that this is not going to break the world
and some more. That's what I meant by:

You can't just can't just declare that all all of the users expect that and
be done with it.

Thanks,

tglx

Guenter Roeck

unread,
Oct 12, 2016, 12:10:06 PM10/12/16
to
On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote:
> Reviewed-by: Brian Norris <brian...@chromium.org>
> Reviewed-by: Andreas Mohr <and...@users.sf.net>

Reviewed-by: Guenter Roeck <li...@roeck-us.net>

The following drivers may expect the function to be interruptible.

drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
drivers/iio/accel/mma9551_core.c:mma9551_sleep()
kernel/trace/ring_buffer.c:rb_test()

A possible solution might be to introduce usleep_range_interruptible()
and use it there.

Note:
drivers/scsi/mvumi.c:mvumi_rescan_bus() uses msleep() but should possibly
use msleep_interruptible() instead.

Guenter

Guenter Roeck

unread,
Oct 12, 2016, 1:00:05 PM10/12/16
to
On Wed, Oct 12, 2016 at 09:27:35AM -0700, Doug Anderson wrote:
> Hi,
>
> On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <li...@roeck-us.net> wrote:
> > drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
> > drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
> > drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
> > drivers/iio/accel/mma9551_core.c:mma9551_sleep()
>
> As far as I can tell these drivers will not suffer unduly from my
> change. Worse case they will delay 20us more, which is listed as the
> max.
>
20 ms.

> Also note that I assume the reason you flagged these is because they
> follow the pattern:
>
> if (sleep_val < 20000)
> usleep_range(sleep_val, 20000);
> else
> msleep_interruptible(sleep_val/1000);
>
Correct

> I will note that usleep_range() is and has always been
> uninterruptible, since the implementation says:
>
> void __sched usleep_range(unsigned long min, unsigned long max)
> {
> __set_current_state(TASK_UNINTERRUPTIBLE);
> do_usleep_range(min, max);
> }
>
Good point.

> So I'm not at all convinced that we are changing behavior here. The
> "interruptible" vs. "uninterruptible" affects whether signals can
> interrupt the sleep, not whether a random wake up of a task can. What
> we really need to know is if they are affected by a wakeup.
>
Yes, you are correct.

> > kernel/trace/ring_buffer.c:rb_test()
>
> I assume that the person who wrote this code was confused since they wrote:
>
> set_current_state(TASK_INTERRUPTIBLE);
> /* Now sleep between a min of 100-300us and a max of 1ms */
> usleep_range(((data->cnt % 3) + 1) * 100, 1000);
>
> That doesn't seem to make sense given the first line of usleep_range().
>
... which, for those who don't pay attention (like me), is
__set_current_state(TASK_UNINTERRUPTIBLE);

> In any case, again I don't think I am changing behavior.
>
> > A possible solution might be to introduce usleep_range_interruptible()
> > and use it there.
>
> This could be a useful function, but I don't think we need it if we
> find someone who needs a wakeup to cut short a sleep. We can just
> call one of the schedule functions directly and use a timeout.
>

Agreed.

>
> Thank you for searching through for stuff and for your review, though!
>
No problem. Thanks for correcting me.

Note that I also searched for use of usleep_range() in conjunction with a
a task wakeup, but did not find anything. I did find a large number of cases,
though, where the explicit assumption is made that the minimum sleep time
is well defined.

Guenter

Doug Anderson

unread,
Oct 12, 2016, 1:00:06 PM10/12/16
to
Hi,

On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <li...@roeck-us.net> wrote:
> drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
> drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
> drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
> drivers/iio/accel/mma9551_core.c:mma9551_sleep()

As far as I can tell these drivers will not suffer unduly from my
change. Worse case they will delay 20us more, which is listed as the
max.

Also note that I assume the reason you flagged these is because they
follow the pattern:

if (sleep_val < 20000)
usleep_range(sleep_val, 20000);
else
msleep_interruptible(sleep_val/1000);

I will note that usleep_range() is and has always been
uninterruptible, since the implementation says:

void __sched usleep_range(unsigned long min, unsigned long max)
{
__set_current_state(TASK_UNINTERRUPTIBLE);
do_usleep_range(min, max);
}

So I'm not at all convinced that we are changing behavior here. The
"interruptible" vs. "uninterruptible" affects whether signals can
interrupt the sleep, not whether a random wake up of a task can. What
we really need to know is if they are affected by a wakeup.

> kernel/trace/ring_buffer.c:rb_test()

I assume that the person who wrote this code was confused since they wrote:

set_current_state(TASK_INTERRUPTIBLE);
/* Now sleep between a min of 100-300us and a max of 1ms */
usleep_range(((data->cnt % 3) + 1) * 100, 1000);

That doesn't seem to make sense given the first line of usleep_range().

In any case, again I don't think I am changing behavior.

> A possible solution might be to introduce usleep_range_interruptible()
> and use it there.

This could be a useful function, but I don't think we need it if we
find someone who needs a wakeup to cut short a sleep. We can just
call one of the schedule functions directly and use a timeout.


Thank you for searching through for stuff and for your review, though!

-Doug

Doug Anderson

unread,
Oct 12, 2016, 1:40:05 PM10/12/16
to
Hi,

On Wed, Oct 12, 2016 at 6:11 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> I'm well aware what Doug wants to do and I'm not saying that this is wrong,
> but I'm not going to look at all usleep() usage sites to make sure none is
> relying on such a behaviour and gets surprised by the change,
>
> The point is that we had cases over and over where stuff was depending on
> implementation bugs which made the buggy behaviour into an expected
> behaviour. I'm not saying that this is the case here, but it's not my duty
> to make sure it isn't.

Every change breaks something. <https://xkcd.com/1172/>. I don't
think we promise bug for bug compatibility for Linux releases, do we?
Though I definitely agree that we shouldn't be breaking something on
purpose and it's good to be careful.


> So the very minimum I need in the changelog is some mentioning that the
> author at least tried to verify that this is not going to break the world
> and some more. That's what I meant by:
>
> You can't just can't just declare that all all of the users expect that and
> be done with it.

Do you have some suggestion about how to do that? In general I'm
hesitant to rely on code analysis of 1620 call sites and that would
really be the only way to "be sure". It would also take a really long
time. I think Guenter was searching for files that contained
usleep_range() and some other text. That's a pretty good idea but it
is definitely not exhaustive since there's no reason that the usleep
and a wakeup are guaranteed to be in the same file. It's also
entirely possible that a wakeup could happen through some other source
than just a direct call to wake_up_process(). It's why I didn't try
it originally.

...but I can do it anyway. I tried "git grep -C10000 usleep_range |
grep wake_up_process". I actually do find some things that I don't
think Guenter found (or maybe he found, analyzed, and rejected):

> drivers/block/cciss.c- wake_up_process(cciss_scan_thread);

We possibly wake up scan_thread(). Sleep is in
cciss_wait_for_mode_change_ack(). Callers of that are
cciss_enter_performant_mode() and cciss_enter_simple_mode(). Callers
of those are cciss_put_controller_into_performant_mode() and
cciss_pci_init() (and the performant mode boils down to
cciss_pci_init() anyway). Caller is cciss_init_one(). That is the
probe function. AKA: the thread that is being woken up isn't the one
doing the usleep_range().

No obvious bug here.

> drivers/memstick/host/rtsx_usb_ms.c- wake_up_process(host->detect_ms);

Function being woken is rtsx_usb_detect_ms_card(). Sleeps are in
ms_power_on() and rtsx_usb_ms_set_param() (with no loops).
ms_power_on() is called by rtsx_usb_ms_set_param() anyway. That
function is stored in msh->set_param. That will be fairly hard to
track down, but it seems unlikely it is called by
rtsx_usb_detect_ms_card() and that we intentionally want the usleep to
end early.

No obvious bug here.


> drivers/scsi/mvumi.c- wake_up_process(mhba->dm_thread);
> drivers/scsi/mvumi.c- wake_up_process(mhba->dm_thread);

Function being woken is mvumi_rescan_bus(). Intended to wake up the
task when it's in schedule(). Even after the schedule there is a
hardcoded msleep(1000). To me the msleep(1000) makes it seem like
this loop can't possibly be too performance critical.

...but even if there are performance critical parts of that loop and
somehow the wake was intended not just to wakeup from the schedule but
also the usleep_range, worse case we will sleep 2 ms extra.

No obvious bug here.


> kernel/time/timer.c- wake_up_process((struct task_struct *)__data);

Generic timer code.


> kernel/trace/ring_buffer.c- wake_up_process(rb_threads[cpu]);

This is test code and already seems a bit confused. ...but it does
appear that the caller might actually be intending the the
wake_up_process() to take effect. Looking closer. I see that the
wakeup is called once at the start of the test, not midway through the
test. So I don't think there's a problem already than the already
identified problem of a useless
"set_current_state(TASK_INTERRUPTIBLE);"

No obvious bug here.

===

So the net result of all the above is that:

* With my git grep + code inspection, I see no obvious bug. I will
admit that I didn't do a super deep search and finding bugs by code
inspection is iffy at best, but it might give us some sort of warm
fuzzy.

* With Guenter's search we saw no obvious bugs.

* Several subsystem maintainers and reviewers of lots of code have
chimed in and say that they're not aware of any code where
usleep_range() was expected to wake up early and they were also aware
of lots of code that would break if usleep_range() returned early.


IMHO: let's land it in linuxnext and give it some bake time. If
nobody screams then it can go into linux proper, possibly to stable
trees.

-Doug

Daniel Kurtz

unread,
Oct 18, 2016, 9:50:07 AM10/18/16
to
Hi Doug,

On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dian...@chromium.org> wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter. However, nothing in any of the code
> ensures this. Specifically:
>
> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
> appropriate timeout set using the hrtimer. If someone else happens to
> wake up our task then we'll happily return from usleep_range() early.

I think this change works, and fixes a real issue, however, I don't
think you are fixing this at the right layer.
The comment for schedule_hrtimeout_range says:

/**
* schedule_hrtimeout_range - sleep until timeout
* @expires: timeout value (ktime_t)
* @delta: slack in expires timeout (ktime_t)
* @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
*
* Make the current task sleep until the given expiry time has
* elapsed. The routine will return immediately unless
* the current task state has been set (see set_current_state()).
*
* The @delta argument gives the kernel the freedom to schedule the
* actual wakeup to a time that is both power and performance friendly.
* The kernel give the normal best effort behavior for "@expires+@delta",
* but may decide to fire the timer earlier, but no earlier than @expires.
*
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
* pass before the routine returns.
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
* delivered to the current task.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
*
* Returns 0 when the timer has expired otherwise -EINTR
*/

The behavior as specified for this function "at least @timeout time is
guaranteed to pass before the routine returns" already guarantees the
behavior you are adding to do_usleep_range() whenever the current task
state is (pre-)set to TASK_UNINTERRUPTIBLE.

Thus, I think the loop around 'schedule()' should be moved to
schedule_hrtimeout_range() itself.
This would also fix direct callers of schedule_hrtimeout_range() that
use TASK_UNINTERRUPTIBLE, although, I could only find one:

pt3_fetch_thread()

-Dan

Doug Anderson

unread,
Oct 18, 2016, 4:30:05 PM10/18/16
to
Dan,
Hmmm, I would agree with you that the behavior of
schedule_hrtimeout_range() doesn't seem to match the function
comments.

...but I'm not sure I agree with you about what to do here.
Specifically I think that whatever we do we need to try to keep
schedule_hrtimeout_range() and schedule_timeout() parallel. For
schedule_timeout() we have the same comments but it's my understanding
that you'd expect that wake_up_process() would wake it up. In any
case, if wake_up_process() doesn't wake it up then it seems like
msleep() and schedule_timeout_uninterruptible() are the same function
with two names, when in fact one is implemented in terms o the other.

NOTE that also it seems as if we need some other return values besides
0 and -EINTR from schedule_hrtimeout_range() (again, to match
schedule_timeout()) since right now we'll return -EINTR if we were
woken up with wake_up_process(). This would be unexpected in the case
where we had TASK_UNINTERRUPTIBLE set.

-Doug

Daniel Kurtz

unread,
Oct 20, 2016, 5:00:05 AM10/20/16
to
Sounds reasonable.
It would be nice to add a note to all of those function comments
though to make them sound less absolute -
"at least @timeout time is guaranteed to pass before the routine
returns unless the current task is explicitly woken up, (e.g. by
wake_up_process())"

Thomas Gleixner

unread,
Oct 20, 2016, 6:00:05 AM10/20/16
to
Agreed.

Douglas Anderson

unread,
Oct 20, 2016, 4:40:04 PM10/20/16
to
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in any of the code
ensures this. Specifically:

usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
schedule_hrtimeout_range_clock() just ends up calling schedule() with an
appropriate timeout set using the hrtimer. If someone else happens to
wake up our task then we'll happily return from usleep_range() early.

NOTES:
- An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems was found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.
- An effort was made to ask several upstream maintainers if they were
aware of people relying on wake_up_process() to wake up
usleep_range(). No maintainers were aware of that but they were aware
of many people relying on usleep_range() never returning before the
minimum.

Reported-by: Tao Huang <huan...@rock-chips.com>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
Reviewed-by: Andreas Mohr <and...@users.sf.net>
Reviewed-by: Brian Norris <brian...@chromium.org>
Reviewed-by: Guenter Roeck <li...@roeck-us.net>
---
Changes in v3:
- Add Reviewed-by tags
- Add notes about validation

Changes in v2:
- Fixed stupid bug that snuck in before posting
- Use ktime_before
- Remove delta from the loop

kernel/time/timer.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

/**
--
2.8.0.rc3.226.g39d4020

Douglas Anderson

unread,
Oct 20, 2016, 4:40:05 PM10/20/16
to
The documentatoin for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claimed that the routines couldn't
possibly return early if the task state was TASK_UNINTERRUPTIBLE. This
was simply not true, since anyone calling wake_up_process() would cause
those routines to exit early.

As some evidence that the documentation was broken (not the code):
- If we changed the code to match the documentation, msleep() would be
identical to schedule_timeout_uninterruptible() and
msleep_interruptible() would be identical to
schedule_timeout_interruptible(). That doesn't seem likely to have
been the intention.
- The schedule() function sleeps until a task is woken up. Logically,
one would expect that the schedule_timeout() function would sleep
until a task is woken up or a timeout occurrs.

As part of the above observation, it can be seen that
schedule_hrtimeout() and schedule_hrtimeout_range() might return -EINTR
even if the task state was TASK_UNINTERRUPTIBLE. This isn't terrible
behavior so we'll document it and keep it as-is. After all, trying to
match schedule_timeout() and return the time left would incure a bunch
of extra calculation cost that isn't needed in all cases.

Suggested-by: Daniel Kurtz <djk...@chromium.org>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
Changes in v3:
- Documentation fix new for v3.

kernel/time/hrtimer.c | 20 ++++++++++++++------
kernel/time/timer.c | 11 +++++++----
2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index bb5ec425dfe0..707ecf47a5fd 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1742,15 +1742,19 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
- * pass before the routine returns.
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process())".
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task.
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
*
- * Returns 0 when the timer has expired otherwise -EINTR
+ * Returns 0 when the timer has expired otherwise -EINTR. Note that
+ * -EINTR can still be returned even if the task state is
+ * TASK_UNINTERRUPTIBLE if the current task is explicitly woken up.
*/
int __sched schedule_hrtimeout_range(ktime_t *expires, u64 delta,
const enum hrtimer_mode mode)
@@ -1772,15 +1776,19 @@ EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
- * pass before the routine returns.
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process())".
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task.
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
*
- * Returns 0 when the timer has expired otherwise -EINTR
+ * Returns 0 when the timer has expired otherwise -EINTR. Note that
+ * -EINTR can still be returned even if the task state is
+ * TASK_UNINTERRUPTIBLE if the current task is explicitly woken up.
*/
int __sched schedule_hrtimeout(ktime_t *expires,
const enum hrtimer_mode mode)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 219439efd56a..b2ca2a6bc4d2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1691,11 +1691,12 @@ static void process_timeout(unsigned long __data)
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
- * pass before the routine returns. The routine will return 0
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process())".
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task. In this case the remaining time
- * in jiffies will be returned, or 0 if the timer expired in time
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
@@ -1704,7 +1705,9 @@ static void process_timeout(unsigned long __data)
* the CPU away without a bound on the timeout. In this case the return
* value will be %MAX_SCHEDULE_TIMEOUT.
*
- * In all cases the return value is guaranteed to be non-negative.
+ * Returns 0 when the timer has expired otherwise the remaining time in
+ * jiffies will be returned. In all cases the return value is guaranteed
+ * to be non-negative.
*/
signed long __sched schedule_timeout(signed long timeout)
{
--
2.8.0.rc3.226.g39d4020

Thomas Gleixner

unread,
Oct 20, 2016, 5:10:05 PM10/20/16
to
On Thu, 20 Oct 2016, Douglas Anderson wrote:
> +++ b/kernel/time/hrtimer.c
> @@ -1742,15 +1742,19 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
> * You can set the task state as follows -
> *
> * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
> - * pass before the routine returns.
> + * pass before the routine returns unless the current task is explicitly
> + * woken up, (e.g. by wake_up_process())".

The double quote is stray.

> *
> * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> - * delivered to the current task.
> + * delivered to the current task or the current task is explicitly woken
> + * up.
> *
> * The current task state is guaranteed to be TASK_RUNNING when this
> * routine returns.
> *
> - * Returns 0 when the timer has expired otherwise -EINTR
> + * Returns 0 when the timer has expired otherwise -EINTR. Note that
> + * -EINTR can still be returned even if the task state is
> + * TASK_UNINTERRUPTIBLE if the current task is explicitly woken up.

I'd prefer to word it this way:

Returns 0 when the timer has expired. If the task was woken before the
timer expired by a signal (only possible in state TASK_INTERRUPTIBLE) or by
an explicit wakeup, it returns -EINTR.


> */
> int __sched schedule_hrtimeout_range(ktime_t *expires, u64 delta,
> const enum hrtimer_mode mode)
> @@ -1772,15 +1776,19 @@ EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
> * You can set the task state as follows -
> *
> * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
> - * pass before the routine returns.
> + * pass before the routine returns unless the current task is explicitly
> + * woken up, (e.g. by wake_up_process())".

See above

> *
> * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> - * delivered to the current task.
> + * delivered to the current task or the current task is explicitly woken
> + * up.
> *
> * The current task state is guaranteed to be TASK_RUNNING when this
> * routine returns.
> *
> - * Returns 0 when the timer has expired otherwise -EINTR
> + * Returns 0 when the timer has expired otherwise -EINTR. Note that
> + * -EINTR can still be returned even if the task state is
> + * TASK_UNINTERRUPTIBLE if the current task is explicitly woken up.

See above

> @@ -1691,11 +1691,12 @@ static void process_timeout(unsigned long __data)
> * You can set the task state as follows -
> *
> * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
> - * pass before the routine returns. The routine will return 0
> + * pass before the routine returns unless the current task is explicitly
> + * woken up, (e.g. by wake_up_process())".
> *
> * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> - * delivered to the current task. In this case the remaining time
> - * in jiffies will be returned, or 0 if the timer expired in time
> + * delivered to the current task or the current task is explicitly woken
> + * up.
> *
> * The current task state is guaranteed to be TASK_RUNNING when this
> * routine returns.
> @@ -1704,7 +1705,9 @@ static void process_timeout(unsigned long __data)
> * the CPU away without a bound on the timeout. In this case the return
> * value will be %MAX_SCHEDULE_TIMEOUT.
> *
> - * In all cases the return value is guaranteed to be non-negative.
> + * Returns 0 when the timer has expired otherwise the remaining time in
> + * jiffies will be returned. In all cases the return value is guaranteed
> + * to be non-negative.

That one is fine.

Thanks,

tglx

Douglas Anderson

unread,
Oct 20, 2016, 5:30:05 PM10/20/16
to
The documentatoin for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claimed that the routines couldn't
possibly return early if the task state was TASK_UNINTERRUPTIBLE. This
was simply not true since anyone calling wake_up_process() would cause
those routines to exit early.

As some evidence that the documentation was broken (not the code):
- If we changed the code to match the documentation, msleep() would be
identical to schedule_timeout_uninterruptible() and
msleep_interruptible() would be identical to
schedule_timeout_interruptible(). That doesn't seem likely to have
been the intention.
- The schedule() function sleeps until a task is woken up. Logically,
one would expect that the schedule_timeout() function would sleep
until a task is woken up or a timeout occurrs.

As part of the above observation, it can be seen that
schedule_hrtimeout() and schedule_hrtimeout_range() might return -EINTR
even if the task state was TASK_UNINTERRUPTIBLE. This isn't terrible
behavior so we'll document it and keep it as-is. After all, trying to
match schedule_timeout() and return the time left would incure a bunch
of extra calculation cost that isn't needed in all cases.

Suggested-by: Daniel Kurtz <djk...@chromium.org>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
Changes in v4:
- Fixed stray double quotes.
- Updated wording as per Thomas Gleixner.

Changes in v3:
- Documentation fix new for v3.

kernel/time/hrtimer.c | 20 ++++++++++++++------
kernel/time/timer.c | 11 +++++++----
2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index bb5ec425dfe0..08be5c99d26b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1742,15 +1742,19 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
- * pass before the routine returns.
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process()).
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task.
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
*
- * Returns 0 when the timer has expired otherwise -EINTR
+ * Returns 0 when the timer has expired. If the task was woken before the
+ * timer expired by a signal (only possible in state TASK_INTERRUPTIBLE) or
+ * by an explicit wakeup, it returns -EINTR.
*/
int __sched schedule_hrtimeout_range(ktime_t *expires, u64 delta,
const enum hrtimer_mode mode)
@@ -1772,15 +1776,19 @@ EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
- * pass before the routine returns.
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process()).
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task.
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
*
- * Returns 0 when the timer has expired otherwise -EINTR
+ * Returns 0 when the timer has expired. If the task was woken before the
+ * timer expired by a signal (only possible in state TASK_INTERRUPTIBLE) or
+ * by an explicit wakeup, it returns -EINTR.
*/
int __sched schedule_hrtimeout(ktime_t *expires,
const enum hrtimer_mode mode)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 219439efd56a..b2ca2a6bc4d2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1691,11 +1691,12 @@ static void process_timeout(unsigned long __data)
* You can set the task state as follows -
*
* %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
- * pass before the routine returns. The routine will return 0
+ * pass before the routine returns unless the current task is explicitly
+ * woken up, (e.g. by wake_up_process())".
*
* %TASK_INTERRUPTIBLE - the routine may return early if a signal is
- * delivered to the current task. In this case the remaining time
- * in jiffies will be returned, or 0 if the timer expired in time
+ * delivered to the current task or the current task is explicitly woken
+ * up.
*
* The current task state is guaranteed to be TASK_RUNNING when this
* routine returns.
@@ -1704,7 +1705,9 @@ static void process_timeout(unsigned long __data)
* the CPU away without a bound on the timeout. In this case the return
* value will be %MAX_SCHEDULE_TIMEOUT.
*
- * In all cases the return value is guaranteed to be non-negative.
+ * Returns 0 when the timer has expired otherwise the remaining time in
+ * jiffies will be returned. In all cases the return value is guaranteed
+ * to be non-negative.

Douglas Anderson

unread,
Oct 20, 2016, 5:30:05 PM10/20/16
to
Changes in v4: None
Changes in v3:
- Add Reviewed-by tags
- Add notes about validation

Changes in v2:
- Fixed stupid bug that snuck in before posting
- Use ktime_before
- Remove delta from the loop

kernel/time/timer.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..219439efd56a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c

Thomas Gleixner

unread,
Oct 20, 2016, 5:30:06 PM10/20/16
to
On Thu, 20 Oct 2016, Douglas Anderson wrote:
> - An effort was made to look for users relying on the old behavior by
> looking for usleep_range() in the same file as wake_up_process().
> No problems was found by this search, though it is conceivable that
> someone could have put the sleep and wakeup in two different files.
> - An effort was made to ask several upstream maintainers if they were
> aware of people relying on wake_up_process() to wake up
> usleep_range(). No maintainers were aware of that but they were aware
> of many people relying on usleep_range() never returning before the
> minimum.

Thanks for going the extra mile !

> static void __sched do_usleep_range(unsigned long min, unsigned long max)
> {
> + ktime_t now, end;
> ktime_t kmin;
> u64 delta;
> + int ret;
>
> - kmin = ktime_set(0, min * NSEC_PER_USEC);
> + now = ktime_get();
> + end = ktime_add_us(now, min);

So you calculate the absolute expiry time here.

> delta = (u64)(max - min) * NSEC_PER_USEC;
> - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> + do {
> + kmin = ktime_sub(end, now);
> + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);

And then you schedule the thing relative. That does not make sense.

> +
> + /*
> + * If schedule_hrtimeout_range() returns 0 then we actually
> + * hit the timeout. If not then we need to re-calculate the
> + * new timeout ourselves.
> + */
> + if (ret == 0)
> + break;
> +
> + now = ktime_get();

And this is broken because schedule_hrtimeout_range() returns with task
state TASK_RUNNING so the next schedule_hrtimeout_range() will return
-EINTR again because nothing sets the task state back to UNINTERRUPTIBLE.
So instead of sleeping you busy loop.

What you really want to do is something like this:

void __sched usleep_range(unsigned long min, unsigned long max)
{
ktime_t expires = ktime_add_us(ktime_get(), min * NSEC_PER_USEC);
ktime_t delta = ktime_set(0, (u64)(max - min) * NSEC_PER_USEC);

for (;;) {
__set_current_state(TASK_UNINTERRUPTIBLE);
/* Do not return before the requested sleep time has elapsed */
if (!schedule_hrtimeout_range(&expires, delta, HRTIMER_MODE_ABS))
break;
}
}

Thanks,

tglx

Thomas Gleixner

unread,
Oct 20, 2016, 5:40:07 PM10/20/16
to
On Thu, 20 Oct 2016, Douglas Anderson wrote:

Please wait for a full review and do not send out patches 5 seconds after
the first mail hits your inbox.

Thanks,

tglx

Doug Anderson

unread,
Oct 20, 2016, 7:40:06 PM10/20/16
to
Hi,
Since you had previously commented on patch 1 and had already
commented on patch 2, I presumed you were done reviewing, but I was
obviously in error. My apologies.

-Doug

Doug Anderson

unread,
Oct 20, 2016, 7:40:10 PM10/20/16
to
Hi,
That explains the mystery of why my delays were always so precise in
the test. I was a bit baffled by the fact that I was ending up with a
delay of almost exactly 50001 or 50002 in my test case.

Thank you for catching!


> What you really want to do is something like this:
>
> void __sched usleep_range(unsigned long min, unsigned long max)
> {
> ktime_t expires = ktime_add_us(ktime_get(), min * NSEC_PER_USEC);
> ktime_t delta = ktime_set(0, (u64)(max - min) * NSEC_PER_USEC);
>
> for (;;) {
> __set_current_state(TASK_UNINTERRUPTIBLE);
> /* Do not return before the requested sleep time has elapsed */
> if (!schedule_hrtimeout_range(&expires, delta, HRTIMER_MODE_ABS))
> break;
> }
> }

The above mostly works other than some small fixups. Thanks!

...other than small fixups, the one substantive change I'll make is to
actually check the timeout in the loop above. If I use your code with
my test, I find that, even though I'm waking up every millisecond I
still end up not exiting the loop until the upper bound of the delay.

Presumably this happens because:

a_time_in_the_past = ktime_sub_us(ktime_get(), 10);
schedule_hrtimeout_range(&a_time_in_the_past, delta, HRTIMER_MODE_ABS)

...delays "delta" nano seconds even though "a_time_in_the_past" is in
the past. I presume that behavior is OK (let me know if it's not).

In the case of usleep_range() if we're waking up anyway, it seems
sensible to spend a few cycles to see if the minimum bound has already
past.


I'll plan to send a new version tomorrow unless I hear that you don't
like the above.


-Doug

Thomas Gleixner

unread,
Oct 21, 2016, 3:00:05 AM10/21/16
to
No problem. As a general rule you should not send updates like a machine
gun. That issue is years old, so there is no rush to fix it just because
some random (probably out of tree) code trips over it.

Thanks,

tglx

Thomas Gleixner

unread,
Oct 21, 2016, 3:20:06 AM10/21/16
to
On Thu, 20 Oct 2016, Doug Anderson wrote:
> > And this is broken because schedule_hrtimeout_range() returns with task
> > state TASK_RUNNING so the next schedule_hrtimeout_range() will return
> > -EINTR again because nothing sets the task state back to UNINTERRUPTIBLE.
> > So instead of sleeping you busy loop.
>
> That explains the mystery of why my delays were always so precise in
> the test. I was a bit baffled by the fact that I was ending up with a
> delay of almost exactly 50001 or 50002 in my test case.

Well, if you see something as a mystery or you are baffled, then you
certainly should figure out why and not just declare: Works for me, but I
don't know why. That's stuff which comes back to hunt you sooner than
later.

> > What you really want to do is something like this:
> >
> > void __sched usleep_range(unsigned long min, unsigned long max)
> > {
> > ktime_t expires = ktime_add_us(ktime_get(), min * NSEC_PER_USEC);
> > ktime_t delta = ktime_set(0, (u64)(max - min) * NSEC_PER_USEC);
> >
> > for (;;) {
> > __set_current_state(TASK_UNINTERRUPTIBLE);
> > /* Do not return before the requested sleep time has elapsed */
> > if (!schedule_hrtimeout_range(&expires, delta, HRTIMER_MODE_ABS))
> > break;
> > }
> > }
>
> The above mostly works other than some small fixups. Thanks!

Yeah, delta is u64. Was too lazy to look it up.

> ...other than small fixups, the one substantive change I'll make is to
> actually check the timeout in the loop above. If I use your code with
> my test, I find that, even though I'm waking up every millisecond I
> still end up not exiting the loop until the upper bound of the delay.
>
> Presumably this happens because:
>
> a_time_in_the_past = ktime_sub_us(ktime_get(), 10);
> schedule_hrtimeout_range(&a_time_in_the_past, delta, HRTIMER_MODE_ABS)
>
> ...delays "delta" nano seconds even though "a_time_in_the_past" is in
> the past. I presume that behavior is OK (let me know if it's not).

It does not delay delta nano seconds. It delays to

(now - 10us) + deltans

The core is free to use the full range for batching or extending idle sleep
times and with a delta > 10us it's expected and correct behaviour.

> In the case of usleep_range() if we're waking up anyway, it seems
> sensible to spend a few cycles to see if the minimum bound has already
> past.

We really can do without that. You are over engineering for something which
is pointless in 99.9% of the use cases. That stuff is what causes bloat. A
few lines of code here and there, which sums up in the end.

Thanks,

tglx

Douglas Anderson

unread,
Oct 21, 2016, 12:00:06 PM10/21/16
to
Requested 50000 - 100000 us. Actually slept for 98896 us

Presumably this problem was not detected before because:
- It's not terribly common to use wake_up_process() directly.
- Other ways for processes to wake up are not typically mixed with
usleep_range().
- There aren't lots of places that use usleep_range(), since many people
call either msleep() or udelay().

NOTES:
- An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems was found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.
- An effort was made to ask several upstream maintainers if they were
aware of people relying on wake_up_process() to wake up
usleep_range(). No maintainers were aware of that but they were aware
of many people relying on usleep_range() never returning before the
minimum.

Reported-by: Tao Huang <huan...@rock-chips.com>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
Changes in v5:
- Don't accidentally busy wait after first wakeup (Thomas Gleixner)
- Removed Reviewed-by tags

Changes in v4: None
Changes in v3:
- Add Reviewed-by tags
- Add notes about validation

Changes in v2:
- Fixed stupid bug that snuck in before posting
- Use ktime_before
- Remove delta from the loop

kernel/time/timer.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..ab65e7bcc2c2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1896,16 +1896,6 @@ unsigned long msleep_interruptible(unsigned int msecs)

EXPORT_SYMBOL(msleep_interruptible);

-static void __sched do_usleep_range(unsigned long min, unsigned long max)
-{
- ktime_t kmin;
- u64 delta;
-
- kmin = ktime_set(0, min * NSEC_PER_USEC);
- delta = (u64)(max - min) * NSEC_PER_USEC;
- schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
-}
-
/**
* usleep_range - Sleep for an approximate time
* @min: Minimum time in usecs to sleep
@@ -1919,7 +1909,15 @@ static void __sched do_usleep_range(unsigned long min, unsigned long max)
*/
void __sched usleep_range(unsigned long min, unsigned long max)
{
- __set_current_state(TASK_UNINTERRUPTIBLE);
- do_usleep_range(min, max);
+ ktime_t expires = ktime_add_us(ktime_get(), min);
+ u64 delta = (u64)(max - min) * NSEC_PER_USEC;
+
+ for (;;) {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ /* Do not return before the requested sleep time has elapsed */
+ if (!schedule_hrtimeout_range(&expires, delta,
+ HRTIMER_MODE_ABS))
+ break;
+ }
}
EXPORT_SYMBOL(usleep_range);
--
2.8.0.rc3.226.g39d4020

Douglas Anderson

unread,
Oct 21, 2016, 12:00:06 PM10/21/16
to
The documentatoin for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claimed that the routines couldn't
possibly return early if the task state was TASK_UNINTERRUPTIBLE. This
was simply not true since anyone calling wake_up_process() would cause
those routines to exit early.

As some evidence that the documentation was broken (not the code):
- If we changed the code to match the documentation, msleep() would be
identical to schedule_timeout_uninterruptible() and
msleep_interruptible() would be identical to
schedule_timeout_interruptible(). That doesn't seem likely to have
been the intention.
- The schedule() function sleeps until a task is woken up. Logically,
one would expect that the schedule_timeout() function would sleep
until a task is woken up or a timeout occurrs.

As part of the above observation, it can be seen that
schedule_hrtimeout() and schedule_hrtimeout_range() might return -EINTR
even if the task state was TASK_UNINTERRUPTIBLE. This isn't terrible
behavior so we'll document it and keep it as-is. After all, trying to
match schedule_timeout() and return the time left would incure a bunch
of extra calculation cost that isn't needed in all cases.

Suggested-by: Daniel Kurtz <djk...@chromium.org>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
---
Changes in v5: None
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ab65e7bcc2c2..330214a19beb 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c

tip-bot for Douglas Anderson

unread,
Oct 26, 2016, 7:30:05 AM10/26/16
to
Commit-ID: 4b7e9cf9c84b09adc428e0433cd376b91f9c52a7
Gitweb: http://git.kernel.org/tip/4b7e9cf9c84b09adc428e0433cd376b91f9c52a7
Author: Douglas Anderson <dian...@chromium.org>
AuthorDate: Fri, 21 Oct 2016 08:58:51 -0700
Committer: Thomas Gleixner <tg...@linutronix.de>
CommitDate: Wed, 26 Oct 2016 13:14:47 +0200

timers: Fix documentation for schedule_timeout() and similar

The documentation for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claim that the routines couldn't possibly
return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
not true since wake_up_process() will cause those routines to exit early.

We cannot make schedule_[hr]timeout() loop until the timeout expires if the
task state is uninterruptible because we have users which rely on the
existing and designed behaviour.

Make the documentation match the (correct) implementation.

schedule_hrtimeout() returns -EINTR even when a uninterruptible task was
woken up. This might look strange, but making the return code depend on the
state is too much of an effort as it would affect all the call sites. There
is no value in doing so, but we spell it out clearly in the documentation.

Suggested-by: Daniel Kurtz <djk...@chromium.org>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
Cc: huan...@rock-chips.com
Cc: he...@sntech.de
Cc: bro...@kernel.org
Cc: brian...@chromium.org
Cc: Andreas Mohr <an...@lisas.de>
Cc: linux-r...@lists.infradead.org
Cc: tony...@rock-chips.com
Cc: John Stultz <john....@linaro.org>
Cc: li...@roeck-us.net
Cc: tsk...@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-2-gi...@chromium.org
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>

---
kernel/time/hrtimer.c | 20 ++++++++++++++------
kernel/time/timer.c | 11 +++++++----
2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index bb5ec42..08be5c9 100644
index 12681c9..88aab86 100644

tip-bot for Douglas Anderson

unread,
Oct 26, 2016, 7:30:05 AM10/26/16
to
Commit-ID: 6c5e9059692567740a4ee51530dffe51a4b9584d
Gitweb: http://git.kernel.org/tip/6c5e9059692567740a4ee51530dffe51a4b9584d
Author: Douglas Anderson <dian...@chromium.org>
AuthorDate: Fri, 21 Oct 2016 08:58:50 -0700
Committer: Thomas Gleixner <tg...@linutronix.de>
CommitDate: Wed, 26 Oct 2016 13:14:46 +0200

timers: Fix usleep_range() in the context of wake_up_process()

Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in the code ensures
this, when the sleeping task is woken by wake_up_process() or any other
mechanism which can wake a task from uninterruptible state.

Neither usleep_range() nor schedule_hrtimeout_range*() have any protection
against wakeups. schedule_hrtimeout_range*() is designed this way despite
the fact that the API documentation does not mention it.

msleep() already has code to handle this case since it will loop as long
as there was still time left. usleep_range() has no such loop, add it.

Presumably this problem was not detected before because usleep_range() is
only used in a few places and the function is mostly used in contexts which
are not exposed to wakeups of any form.

An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems were found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.

An effort was made to ask several upstream maintainers if they were aware
of people relying on wake_up_process() to wake up usleep_range(). No
maintainers were aware of that but they were aware of many people relying
on usleep_range() never returning before the minimum.

Reported-by: Tao Huang <huan...@rock-chips.com>
Signed-off-by: Douglas Anderson <dian...@chromium.org>
Cc: djk...@chromium.org
Cc: li...@roeck-us.net
Cc: tsk...@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-1-gi...@chromium.org
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>

---
kernel/time/timer.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d47980..12681c9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1896,16 +1896,6 @@ unsigned long msleep_interruptible(unsigned int msecs)

EXPORT_SYMBOL(msleep_interruptible);

-static void __sched do_usleep_range(unsigned long min, unsigned long max)
-{
- ktime_t kmin;
- u64 delta;
-
- kmin = ktime_set(0, min * NSEC_PER_USEC);
- delta = (u64)(max - min) * NSEC_PER_USEC;
- schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
-}
-
/**
* usleep_range - Sleep for an approximate time
* @min: Minimum time in usecs to sleep
@@ -1919,7 +1909,14 @@ static void __sched do_usleep_range(unsigned long min, unsigned long max)
*/
void __sched usleep_range(unsigned long min, unsigned long max)
{
- __set_current_state(TASK_UNINTERRUPTIBLE);
- do_usleep_range(min, max);
+ ktime_t exp = ktime_add_us(ktime_get(), min);
+ u64 delta = (u64)(max - min) * NSEC_PER_USEC;
+
+ for (;;) {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ /* Do not return before the requested sleep time has elapsed */
+ if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
+ break;
+ }
}
EXPORT_SYMBOL(usleep_range);
0 new messages