That caused me to audit all the __task_cred usage sites except in
kernel/exit.c.
Most of the usage sites are correct, but some of them trip over
invalid assumptions about the protection which is given by RCU.
- spinlocked/preempt_disabled regions are equivalent to rcu_read_lock():
That's wrong. RCU does not guarantee that.
It has been that way due to implementation details and it still is
valid for CONFIG_TREE_PREEMPT_RCU=n, but there is no guarantee that
this will be the case forever.
- interrupt disabled regions are equivalent to rcu_read_lock():
Wrong again. RCU does not guarantee that.
It's true for current mainline, but again this is an implementation
detail and there is no guarantee by the RCU semantics.
Indeed we want to get rid of that to avoid scalability issues on
large systems, and preempt-rt has already gotten rid of it to a
certain extent.
I'm sure we are lucky that CONFIG_TREE_PREEMPT_RCU=y is not yet
widespread and the code paths are esoteric enough not to trigger those
subtle races (some of them might just fail silently).
Nevertheless we need to fix all invalid assumptions about RCU
protection.
The following patch series fixes all currently known affected __task_cred()
sites, but further auditing of all other RCU users is necessary.
Thanks,
tglx
To back this up, item #2 from Documentation/RCU/checklist.txt says:
2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed
to prevent grace periods from ending prematurely, which
could result in data being unceremoniously freed out from
under your read-side code, which can greatly increase the
actuarial risk of your kernel.
As a rough rule of thumb, any dereference of an RCU-protected
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
or by the appropriate update-side lock.
> - interrupt disabled regions are equivalent to rcu_read_lock():
>
> Wrong again. RCU does not guarantee that.
>
> It's true for current mainline, but again this is an implementation
> detail and there is no guarantee by the RCU semantics.
>
> Indeed we want to get rid of that to avoid scalability issues on
> large systems, and preempt-rt has already gotten rid of it to a
> certain extent.
Same item #2 above covers this.
The only exception is when you use synchronize_sched(), as described
in the "Defer"/"Protect" list near line 323 of the 2.6.32 version of
Documentation/RCU/whatisRCU.txt:
        Defer                   Protect

    a.  synchronize_rcu()       rcu_read_lock() / rcu_read_unlock()
        call_rcu()

    b.  call_rcu_bh()           rcu_read_lock_bh() / rcu_read_unlock_bh()

    c.  synchronize_sched()     preempt_disable() / preempt_enable()
                                local_irq_save() / local_irq_restore()
                                hardirq enter / hardirq exit
                                NMI enter / NMI exit
And yes, I need to update this based on the addition of
rcu_read_lock_sched() and friends. I will be doing another
documentation update soon.
> I'm sure we are lucky that CONFIG_TREE_PREEMPT_RCU=y is not yet
> widespread and the code paths are esoteric enough not to trigger those
> subtle races (some of them might just fail silently).
>
> Nevertheless we need to fix all invalid assumptions about RCU
> protection.
Agreed!!!
> The following patch series fixes all currently known affected __task_cred()
> sites, but further auditing of all other RCU users is necessary.
Thank you very much for putting this series together -- I will take
a quick look at them.
Thanx, Paul
On Wed, 9 Dec 2009, Paul E. McKenney wrote:
> On Thu, Dec 10, 2009 at 12:52:46AM -0000, Thomas Gleixner wrote:
> > While auditing the read_lock(&tasklist_lock) sites for a possible
> > conversion to rcu_read_lock() I stumbled over an unprotected user of
> > __task_cred in kernel/sys.c
> >
> > That caused me to audit all the __task_cred usage sites except in
> > kernel/exit.c.
> >
> > Most of the usage sites are correct, but some of them trip over
> > invalid assumptions about the protection which is given by RCU.
> >
> > - spinlocked/preempt_disabled regions are equivalent to rcu_read_lock():
> >
> > That's wrong. RCU does not guarantee that.
> >
> > It has been that way due to implementation details and it still is
> > valid for CONFIG_TREE_PREEMPT_RCU=n, but there is no guarantee that
> > this will be the case forever.
>
> To back this up, item #2 from Documentation/RCU/checklist.txt says:
Hmm. This seems to be a difference that the tree-RCU things introduced,
no? I wonder if we have other areas where we just knew that a spinlock
would make an rcu read-lock unnecessary (which used to be true..)
Linus
That stopped being true when we merged PREEMPT_RCU, which was a long
time ago.
Ah -- I have a related lockdep question. Is there a primitive that says
whether or not the current task holds at least one lock of any type?
If so, I would like to make rcu_dereference() do at least a little crude
checking for this problem.
Thanx, Paul
>2) remove the stale signal code snippet
...
2009/12/10 Thomas Gleixner <tg...@linutronix.de>:
> Index: linux-2.6-tip/Documentation/kmemcheck.txt
> ===================================================================
> --- linux-2.6-tip.orig/Documentation/kmemcheck.txt
> +++ linux-2.6-tip/Documentation/kmemcheck.txt
> @@ -429,8 +429,7 @@ Let's take a look at it:
> 193 /*
> 194 * We won't get problems with the target's UID changing under us
> 195 * because changing it requires RCU be used, and if t != current, the
> -196 * caller must be holding the RCU readlock (by way of a spinlock) and
> -197 * we use RCU protection here
> +196 * caller must be holding the RCU readlocke
> 198 */
> 199 user = get_uid(__task_cred(t)->user);
> 200 atomic_inc(&user->sigpending);
I am not sure that I really agree with this change. This is not a code
example for the sake of showing how to do a particular thing, it's an
example of real code from the tree.
I don't remember if the document is referring to a particular git
version of the code, but I think it might not, in which case it
doesn't REALLY matter even on the microscopic level.
But I won't make a big fuss about it :-)
Vegard
PS: Upon closer inspection, I noticed that one line (line 197) goes
completely missing, and there seems to be a typo there too: "readlocke".
Still, it's not a huge deal, I admit.
> While auditing the read_lock(&tasklist_lock) sites for a possible
> conversion to rcu_read_lock() I stumbled over an unprotected user of
> __task_cred in kernel/sys.c
I'm sure last time I looked, spinlock primitives implied RCU read locks.
Maybe I was mistaken or maybe it's changed. Whatever, good catch, Thomas!
Thanks,
David
> -197 * we use RCU protection here
> +196 * caller must be holding the RCU readlocke
You mean "readlock" I suspect.
Otherwise, feel free to add:
Acked-by: David Howells <dhow...@redhat.com>
On Fri, 11 Dec 2009, David Howells wrote:
> Thomas Gleixner <tg...@linutronix.de> wrote:
>
> > -197 * we use RCU protection here
> > +196 * caller must be holding the RCU readlocke
>
> You mean "readlock" I suspect.
Or maybe he's talking about ye olde readlocke, used widely for OS research
throughout the middle ages. You still find that spelling in some really
old CS literature.
Linus
It did indeed change with the split of the old synchronize_kernel()
primitive into the synchronize_rcu() and synchronize_sched() primitives
in 2.6.12 -- after this point, synchronize_rcu() was only guaranteed
to respect "real" rcu_read_lock()-based critical sections. But actual
failures would not show up until PREEMPT_RCU was introduced into 2.6.25.
The -rt effort rooted out a bunch of these sorts of problems, but we
clearly missed a few.
Thanx, Paul
Interestingly enough, they also tended to split it into two words and
capitalize it, as can be seen by searching for "Read Locke" at
http://faculty.uml.edu/enelson/modern07.htm
Thanx, Paul
The good thing about "Read Locke" is: there is no "Write Locke".
Thanks,
tglx
How about changing the documentation to explain why you can't just use a spinlock
or local_irq_disable instead of rcu_read_lock? You explained it well in your
[patch 0/9], but that part had not occurred to me yet and having it in the kernel
sources might prevent more people from getting it wrong in the future.
Arnd
That does make a lot of sense... Will add that in.
Thanx, Paul
> Ah -- I have a related lockdep question. Is there a primitive that says
> whether or not the current task holds at least one lock of any type?
> If so, I would like to make rcu_dereference() do at least a little crude
> checking for this problem.
Hmm, no, but that's not hard to do. However, I actually implemented
something like that for RCU a long while ago, and it gave a metric TON
of false positives due to things like the radix tree, which is RCU-safe
but is not required to be used with RCU.
Understood -- my current guess is that there needs to be a way to tag
a variant of the rcu_dereference() API with the conditions that must be
met, for example, either in an rcu-sched read-side critical section or
holding a specific type of lock.
This does make it a little harder to retroactively add checking to
existing calls to rcu_dereference(), but should allow a good balance
between false positives and false negatives going forward.
Seem reasonable, or am I still missing something?
Thanx, Paul
The only concern is drowning in rcu_dereference() annotations. But I
guess that is unavoidable.
I think you can use lock_is_held(&rcu_lock_map), except you need to deal
with the !debug_locks case, because lockdep stops once debug_locks
becomes false, which means lock_is_held() will return rubbish.
So far, I haven't been able to think of anything better. :-/
> I think you can use lock_is_held(&rcu_lock_map), except you need to deal
> with the !debug_locks case, because lockdep stops once debug_locks
> becomes false, which means lock_is_held() will return rubbish.
OK, so I need to do something like the following, then?
debug_locks ? lock_is_held(&rcu_lock_map) : 1
Thanx, Paul
> > I think you can use lock_is_held(&rcu_lock_map), except you need to deal
> > with the !debug_locks case, because lockdep stops once debug_locks
> > becomes false, which means lock_is_held() will return rubbish.
>
> OK, so I need to do something like the following, then?
>
> debug_locks ? lock_is_held(&rcu_lock_map) : 1
Depending on what the safe return value is, yeah. If you want an
assert-like function, false is usually the safe one.
#define lockdep_assert_held(l) WARN_ON(debug_locks && !lockdep_is_held(l))
is the one in-tree user of this.
Sounds good, thank you for the tip!!!
Thanx, Paul