That caused me to audit all the __task_cred usage sites except in
kernel/exit.c.
Most of the usage sites are correct, but some of them trip over
invalid assumptions about the protection which is given by RCU.
- spinlocked/preempt_disabled regions are equivalent to rcu_read_lock():
That's wrong. RCU does not guarantee that.
It has been that way due to implementation details and it still is
valid for CONFIG_TREE_PREEMPT_RCU=n, but there is no guarantee that
this will be the case forever.
- interrupt disabled regions are equivalent to rcu_read_lock():
Wrong again. RCU does not guarantee that.
It's true for current mainline, but again this is an implementation
detail and there is no guarantee by the RCU semantics.
Indeed we want to get rid of that to avoid scalability issues on
large systems, and preempt-rt has already gotten rid of it to a
certain extent.
I'm sure we are lucky that CONFIG_TREE_PREEMPT_RCU=y is not yet
widespread and the code paths are esoteric enough not to trigger those
subtle races (some of them might just fail silently).
Nevertheless we need to fix all invalid assumptions about RCU
protection.
The following patch series fixes all currently known affected __task_cred()
sites, but further auditing of all other RCU users is necessary.
Thanks,
tglx
To back this up, item #2 from Documentation/RCU/checklist.txt says:
2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed
to prevent grace periods from ending prematurely, which
could result in data being unceremoniously freed out from
under your read-side code, which can greatly increase the
actuarial risk of your kernel.
As a rough rule of thumb, any dereference of an RCU-protected
pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
or by the appropriate update-side lock.
> - interrupt disabled regions are equivalent to rcu_read_lock():
>
> Wrong again. RCU does not guarantee that.
>
> It's true for current mainline, but again this is an implementation
> detail and there is no guarantee by the RCU semantics.
>
> Indeed we want to get rid of that to avoid scalability issues on
> large systems, and preempt-rt has already gotten rid of it to a
> certain extent.
Same item #2 above covers this.
The only exception is when you use synchronize_sched(), as described
in the "Defer"/"Protect" list near line 323 of the 2.6.32 version of
Documentation/RCU/whatisRCU.txt:
        Defer                   Protect

    a.  synchronize_rcu()       rcu_read_lock() / rcu_read_unlock()
        call_rcu()

    b.  call_rcu_bh()           rcu_read_lock_bh() / rcu_read_unlock_bh()

    c.  synchronize_sched()     preempt_disable() / preempt_enable()
                                local_irq_save() / local_irq_restore()
                                hardirq enter / hardirq exit
                                NMI enter / NMI exit
And yes, I need to update this based on the addition of
rcu_read_lock_sched() and friends. I will be doing another
documentation update soon.
> I'm sure we are lucky that CONFIG_TREE_PREEMPT_RCU=y is not yet
> widespread and the code paths are esoteric enough not to trigger those
> subtle races (some of them might just fail silently).
>
> Nevertheless we need to fix all invalid assumptions about RCU
> protection.
Agreed!!!
> The following patch series fixes all currently known affected __task_cred()
> sites, but further auditing of all other RCU users is necessary.
Thank you very much for putting this series together -- I will take
a quick look at them.
Thanx, Paul
On Wed, 9 Dec 2009, Paul E. McKenney wrote:
> On Thu, Dec 10, 2009 at 12:52:46AM -0000, Thomas Gleixner wrote:
> > While auditing the read_lock(&tasklist_lock) sites for a possible
> > conversion to rcu_read_lock() I stumbled over an unprotected user of
> > __task_cred in kernel/sys.c
> >
> > That caused me to audit all the __task_cred usage sites except in
> > kernel/exit.c.
> >
> > Most of the usage sites are correct, but some of them trip over
> > invalid assumptions about the protection which is given by RCU.
> >
> > - spinlocked/preempt_disabled regions are equivalent to rcu_read_lock():
> >
> > That's wrong. RCU does not guarantee that.
> >
> > It has been that way due to implementation details and it still is
> > valid for CONFIG_TREE_PREEMPT_RCU=n, but there is no guarantee that
> > this will be the case forever.
>
> To back this up, item #2 from Documentation/RCU/checklist.txt says:
Hmm. This seems to be a difference that the tree-RCU things introduced,
no? I wonder if we have other areas where we just knew that a spinlock
would make an rcu read-lock unnecessary (which used to be true..)
Linus
That stopped being true when we merged PREEMPT_RCU, which was a long
time ago.
Ah -- I have a related lockdep question. Is there a primitive that says
whether or not the current task holds at least one lock of any type?
If so, I would like to make rcu_dereference() do at least a little crude
checking for this problem.
Thanx, Paul
>2) remove the stale signal code snippet
...
2009/12/10 Thomas Gleixner <tg...@linutronix.de>:
> Index: linux-2.6-tip/Documentation/kmemcheck.txt
> ===================================================================
> --- linux-2.6-tip.orig/Documentation/kmemcheck.txt
> +++ linux-2.6-tip/Documentation/kmemcheck.txt
> @@ -429,8 +429,7 @@ Let's take a look at it:
> 193 /*
> 194 * We won't get problems with the target's UID changing under us
> 195 * because changing it requires RCU be used, and if t != current, the
> -196 * caller must be holding the RCU readlock (by way of a spinlock) and
> -197 * we use RCU protection here
> +196 * caller must be holding the RCU readlocke
> 198 */
> 199 user = get_uid(__task_cred(t)->user);
> 200 atomic_inc(&user->sigpending);
I am not sure that I really agree with this change. This is not a code
example for the sake of showing how to do a particular thing, it's an
example of real code from the tree.
I don't remember if the document is referring to a particular git
version of the code, but I think it might not, in which case it
doesn't REALLY matter even on the microscopic level.
But I won't make a big fuss about it :-)
Vegard
PS: Upon closer inspection, I noticed that one line (line 197) goes
completely missing, and there seems to be a typo there too: "readlocke".
Still, it's not a huge deal, I admit.
> While auditing the read_lock(&tasklist_lock) sites for a possible
> conversion to rcu_read_lock() I stumbled over an unprotected user of
> __task_cred in kernel/sys.c
I'm sure last time I looked, spinlock primitives implied RCU read locks.
Maybe I was mistaken or maybe it's changed. Whatever, good catch, Thomas!
Thanks,
David
> -197 * we use RCU protection here
> +196 * caller must be holding the RCU readlocke
You mean "readlock" I suspect.
Otherwise, feel free to add:
Acked-by: David Howells <dhow...@redhat.com>
On Fri, 11 Dec 2009, David Howells wrote:
> Thomas Gleixner <tg...@linutronix.de> wrote:
>
> > -197 * we use RCU protection here
> > +196 * caller must be holding the RCU readlocke
>
> You mean "readlock" I suspect.
Or maybe he's talking about ye olde readlocke, used widely for OS research
throughout the middle ages. You still find that spelling in some really
old CS literature.
Linus
It did indeed change with the split of the old synchronize_kernel()
primitive into the synchronize_rcu() and synchronize_sched() primitives
in 2.6.12 -- after this point, synchronize_rcu() was only guaranteed
to respect "real" rcu_read_lock()-based critical sections. But actual
failures would not show up until PREEMPT_RCU was introduced into 2.6.25.
The -rt effort rooted out a bunch of these sorts of problems, but we
clearly missed a few.
Thanx, Paul
Interestingly enough, they also tended to split it into two words and
capitalize it, as can be seen by searching for "Read Locke" at
http://faculty.uml.edu/enelson/modern07.htm
Thanx, Paul
The good thing about "Read Locke" is: there is no "Write Locke".
Thanks,
tglx
How about changing the documentation to explain why you can't just use a spinlock
or local_irq_disable instead of rcu_read_lock? You explained it well in your
[patch 0/9], but that part had not occurred to me yet and having it in the kernel
sources might prevent more people from getting it wrong in the future.
Arnd
That does make a lot of sense... Will add that in.
Thanx, Paul
> Ah -- I have a related lockdep question. Is there a primitive that says
> whether or not the current task holds at least one lock of any type?
> If so, I would like to make rcu_dereference() do at least a little crude
> checking for this problem.
Hmm, no, but that's not hard to do. However, I actually implemented
something like that for RCU a long while ago, and it gave a metric TON
of false positives due to things like the radix tree, which is RCU-safe
but is not required to be used with RCU.
Understood -- my current guess is that there needs to be a way to tag
a variant of the rcu_dereference() API with the conditions that must be
met, for example, either in an rcu-sched read-side critical section or
holding a specific type of lock.
This does make it a little harder to retroactively add checking to
existing calls to rcu_dereference(), but should allow a good balance
between false positives and false negatives going forward.
Seem reasonable, or am I still missing something?
Thanx, Paul
The only concern is drowning in rcu_dereference() annotations. But I
guess that is unavoidable.
I think you can use lock_is_held(&rcu_lock_map), except you need to deal
with the !debug_locks case, because lockdep stops once debug_locks
becomes false, which means lock_is_held() will return rubbish.
So far, I haven't been able to think of anything better. :-/
> I think you can use lock_is_held(&rcu_lock_map), except you need to deal
> with the !debug_locks case, because lockdep stops once debug_locks
> becomes false, which means lock_is_held() will return rubbish.
OK, so I need to do something like the following, then?
debug_locks ? lock_is_held(&rcu_lock_map) : 1
Thanx, Paul
> > I think you can use lock_is_held(&rcu_lock_map), except you need to deal
> > with the !debug_locks case, because lockdep stops once debug_locks
> > becomes false, which means lock_is_held() will return rubbish.
>
> OK, so I need to do something like the following, then?
>
> debug_locks ? lock_is_held(&rcu_lock_map) : 1
Depending on what the safe return value is, yeah. If you want an
assert-like function, false is usually the safe one.
#define lockdep_assert_held(l) WARN_ON(debug_locks && !lockdep_is_held(l))
is the one in-tree user of this.
Sounds good, thank you for the tip!!!
Thanx, Paul