[PATCH] kcsan, seqlock: Support seqcount_latch

Marco Elver

unread,

Oct 29, 2024, 4:37:20 AM10/29/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

While fuzzing an arm64 kernel, Alexander Potapenko reported:

| BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update
|
| write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0:
| update_fast_timekeeper kernel/time/timekeeping.c:430 [inline]
| timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768
| timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344
| update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360
| tick_do_update_jiffies64+0x164/0x1b0 kernel/time/tick-sched.c:149
| tick_nohz_handler+0xa4/0x2a8 kernel/time/tick-sched.c:232
| __hrtimer_run_queues+0x198/0x33c kernel/time/hrtimer.c:1691
| hrtimer_interrupt+0x16c/0x630 kernel/time/hrtimer.c:1817
| timer_handler drivers/clocksource/arm_arch_timer.c:674 [inline]
| arch_timer_handler_phys+0x60/0x74 drivers/clocksource/arm_arch_timer.c:692
| handle_percpu_devid_irq+0xd8/0x1ec kernel/irq/chip.c:942
| generic_handle_irq_desc include/linux/irqdesc.h:173 [inline]
| handle_irq_desc kernel/irq/irqdesc.c:692 [inline]
| generic_handle_domain_irq+0x5c/0x7c kernel/irq/irqdesc.c:748
| gic_handle_irq+0x78/0x1b0 drivers/irqchip/irq-gic-v3.c:843
| call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:889
| do_interrupt_handler+0x74/0xa8 arch/arm64/kernel/entry-common.c:310
| __el1_irq arch/arm64/kernel/entry-common.c:536 [inline]
| el1_interrupt+0x34/0x54 arch/arm64/kernel/entry-common.c:551
| el1h_64_irq_handler+0x18/0x24 arch/arm64/kernel/entry-common.c:556
| el1h_64_irq+0x64/0x68 arch/arm64/kernel/entry.S:594
| __daif_local_irq_enable arch/arm64/include/asm/irqflags.h:26 [inline]
| arch_local_irq_enable arch/arm64/include/asm/irqflags.h:48 [inline]
| kvm_arch_vcpu_ioctl_run+0x8d4/0xf48 arch/arm64/kvm/arm.c:1259
| kvm_vcpu_ioctl+0x650/0x894 virt/kvm/kvm_main.c:4487
| __do_sys_ioctl fs/ioctl.c:51 [inline]
| __se_sys_ioctl fs/ioctl.c:893 [inline]
| __arm64_sys_ioctl+0xf8/0x170 fs/ioctl.c:893
| [...]
|
| read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1:
| __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline]
| ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489
| init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263
| init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311
| kvm_dev_ioctl+0x304/0x908 virt/kvm/kvm_main.c:1185
| __do_sys_ioctl fs/ioctl.c:51 [inline]
| __se_sys_ioctl fs/ioctl.c:893 [inline]
| __arm64_sys_ioctl+0xf8/0x170 fs/ioctl.c:893
| [...]
|
| value changed: 0x000002f875d33266 -> 0x000002f877416866
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78

This is a false positive data race between a seqcount latch writer and a
reader accessing stale data.

Unlike the regular seqlock interface, the seqcount_latch interface for
latch writers never has a well-defined critical section.

To support with KCSAN, optimistically declare that a fixed number of
memory accesses after raw_write_seqcount_latch() are "atomic". Latch
readers follow a similar pattern as the regular seqlock interface. This
effectively tells KCSAN that data races on accesses to seqcount_latch
protected data should be ignored.

Reviewing current raw_write_seqcount_latch() callers, the most common
patterns involve only few memory accesses, either a single plain C
assignment, or memcpy; therefore, the value of 8 memory accesses after
raw_write_seqcount_latch() is chosen to (a) avoid most false positives,
and (b) avoid excessive number of false negatives (due to inadvertently
declaring most accesses in the proximity of update_fast_timekeeper() as
"atomic").

Reported-by: Alexander Potapenko <gli...@google.com>
Tested-by: Alexander Potapenko <gli...@google.com>
Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN")
Signed-off-by: Marco Elver <el...@google.com>
---
include/linux/seqlock.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index fffeb754880f..e24cf144276e 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -614,6 +614,7 @@ typedef struct {
*/
static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *s)
{
+ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);
/*
* Pairs with the first smp_wmb() in raw_write_seqcount_latch().
* Due to the dependent load, a full smp_rmb() is not needed.
@@ -631,6 +632,7 @@ static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *
static __always_inline int
raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
{
+ kcsan_atomic_next(0);
smp_rmb();
return unlikely(READ_ONCE(s->seqcount.sequence) != start);
}
@@ -721,6 +723,13 @@ static inline void raw_write_seqcount_latch(seqcount_latch_t *s)
smp_wmb(); /* prior stores before incrementing "sequence" */
s->seqcount.sequence++;
smp_wmb(); /* increment "sequence" before following stores */
+
+ /*
+ * Latch writers do not have a well-defined critical section, but to
+ * avoid most false positives, at the cost of false negatives, assume
+ * the next few memory accesses belong to the latch writer.
+ */
+ kcsan_atomic_next(8);
}

#define __SEQLOCK_UNLOCKED(lockname) \
--
2.47.0.163.g1226f6d8fa-goog

Peter Zijlstra

unread,

Oct 29, 2024, 7:49:45 AM10/29/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Tue, Oct 29, 2024 at 09:36:29AM +0100, Marco Elver wrote:
> Reviewing current raw_write_seqcount_latch() callers, the most common
> patterns involve only few memory accesses, either a single plain C
> assignment, or memcpy;

Then I assume you've encountered latch_tree_{insert,erase}() in your
travels, right?

Also, I note that update_clock_read_data() seems to do things
'backwards' and will completely elide your proposed annotation.

> therefore, the value of 8 memory accesses after
> raw_write_seqcount_latch() is chosen to (a) avoid most false positives,
> and (b) avoid excessive number of false negatives (due to inadvertently
> declaring most accesses in the proximity of update_fast_timekeeper() as
> "atomic").

The above latch'ed RB-trees can certainly exceed this magical number 8.

Given there are so very few latch users, would it make sense to
introduce a raw_write_seqcount_latch_end() callback that does
kcsan_atomic_next(0) ? -- or something along those lines? Then you won't
have to assume such a small number.

Marco Elver

unread,

Oct 29, 2024, 9:06:22 AM10/29/24

to Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Tue, 29 Oct 2024 at 12:49, Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Tue, Oct 29, 2024 at 09:36:29AM +0100, Marco Elver wrote:
> > Reviewing current raw_write_seqcount_latch() callers, the most common
> > patterns involve only few memory accesses, either a single plain C
> > assignment, or memcpy;
>
> Then I assume you've encountered latch_tree_{insert,erase}() in your
> travels, right?

Oops. That once certainly exceeds the "8 memory accesses".

> Also, I note that update_clock_read_data() seems to do things
> 'backwards' and will completely elide your proposed annotation.

Hmm, for the first access, yes. This particular oddity could be
"fixed" by surrounding the accesses by
kcsan_nestable_atomic_begin/end(). I don't know if it warrants adding
a raw_write_seqcount_latch_begin().

Preferences?

That's something I considered, but thought I'd try the unintrusive
version first. But since you proposed it here, I'd much prefer that,
too. ;-)
Let me try that.

Thanks,
-- Marco

Peter Zijlstra

unread,

Oct 29, 2024, 9:46:46 AM10/29/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Tue, Oct 29, 2024 at 02:05:38PM +0100, Marco Elver wrote:
> On Tue, 29 Oct 2024 at 12:49, Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > On Tue, Oct 29, 2024 at 09:36:29AM +0100, Marco Elver wrote:
> > > Reviewing current raw_write_seqcount_latch() callers, the most common
> > > patterns involve only few memory accesses, either a single plain C
> > > assignment, or memcpy;
> >
> > Then I assume you've encountered latch_tree_{insert,erase}() in your
> > travels, right?
>
> Oops. That once certainly exceeds the "8 memory accesses".
>
> > Also, I note that update_clock_read_data() seems to do things
> > 'backwards' and will completely elide your proposed annotation.
>
> Hmm, for the first access, yes. This particular oddity could be
> "fixed" by surrounding the accesses by
> kcsan_nestable_atomic_begin/end(). I don't know if it warrants adding
> a raw_write_seqcount_latch_begin().
>
> Preferences?

I *think* it is doable to flip it around to the 'normal' order, but
given I've been near cross-eyed with a head-ache these past two days,
I'm not going to attempt a patch for you, since I'm bound to get it
wrong :/

Marco Elver

unread,

Oct 29, 2024, 4:49:33 PM10/29/24

to Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Tue, Oct 29, 2024 at 02:46PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 29, 2024 at 02:05:38PM +0100, Marco Elver wrote:
> > On Tue, 29 Oct 2024 at 12:49, Peter Zijlstra <pet...@infradead.org> wrote:
> > >
> > > On Tue, Oct 29, 2024 at 09:36:29AM +0100, Marco Elver wrote:
> > > > Reviewing current raw_write_seqcount_latch() callers, the most common
> > > > patterns involve only few memory accesses, either a single plain C
> > > > assignment, or memcpy;
> > >
> > > Then I assume you've encountered latch_tree_{insert,erase}() in your
> > > travels, right?
> >
> > Oops. That once certainly exceeds the "8 memory accesses".
> >
> > > Also, I note that update_clock_read_data() seems to do things
> > > 'backwards' and will completely elide your proposed annotation.
> >
> > Hmm, for the first access, yes. This particular oddity could be
> > "fixed" by surrounding the accesses by
> > kcsan_nestable_atomic_begin/end(). I don't know if it warrants adding
> > a raw_write_seqcount_latch_begin().
> >
> > Preferences?
>
> I *think* it is doable to flip it around to the 'normal' order, but
> given I've been near cross-eyed with a head-ache these past two days,
> I'm not going to attempt a patch for you, since I'm bound to get it
> wrong :/

Something like this?

------ >8 ------

Author: Marco Elver <el...@google.com>
Date: Tue Oct 29 21:16:21 2024 +0100

time/sched_clock: Swap update_clock_read_data() latch writes

Swap the writes to the odd and even copies to make the writer critical
section look like all other seqcount_latch writers.

With that, we can also add the raw_write_seqcount_latch_end() to clearly
denote the end of the writer section.

Signed-off-by: Marco Elver <el...@google.com>

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 68d6c1190ac7..311c90a0e86e 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -119,9 +119,6 @@ unsigned long long notrace sched_clock(void)
*/
static void update_clock_read_data(struct clock_read_data *rd)
{
- /* update the backup (odd) copy with the new data */
- cd.read_data[1] = *rd;
-
/* steer readers towards the odd copy */
raw_write_seqcount_latch(&cd.seq);

@@ -130,6 +127,11 @@ static void update_clock_read_data(struct clock_read_data *rd)

/* switch readers back to the even copy */
raw_write_seqcount_latch(&cd.seq);
+
+ /* update the backup (odd) copy with the new data */
+ cd.read_data[1] = *rd;
+
+ raw_write_seqcount_latch_end(&cd.seq);
}

/*

------ >8 ------

I also noticed your d16317de9b41 ("seqlock/latch: Provide
raw_read_seqcount_latch_retry()") to get rid of explicit instrumentation
in noinstr.

Not sure how to resolve that. We have that objtool support to erase
calls in noinstr code (is_profiling_func), but that's x86 only.

I could also make kcsan_atomic_next(0) noinstr compatible by checking if
the ret IP is in noinstr, and immediately return if it is.

Preferences?

Peter Zijlstra

unread,

Oct 30, 2024, 4:48:25 PM10/30/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

That looks about right :-)

> ------ >8 ------
>
> I also noticed your d16317de9b41 ("seqlock/latch: Provide
> raw_read_seqcount_latch_retry()") to get rid of explicit instrumentation
> in noinstr.
>
> Not sure how to resolve that. We have that objtool support to erase
> calls in noinstr code (is_profiling_func), but that's x86 only.
>
> I could also make kcsan_atomic_next(0) noinstr compatible by checking if
> the ret IP is in noinstr, and immediately return if it is.
>
> Preferences?

Something like this perhaps?

---
arch/x86/kernel/tsc.c | 5 +++--
include/linux/rbtree_latch.h | 14 ++++++++------
include/linux/seqlock.h | 31 ++++++++++++++++++++++++++++++-
kernel/printk/printk.c | 9 +++++----
kernel/time/sched_clock.c | 20 ++++++++++++--------
kernel/time/timekeeping.c | 10 ++++++----
6 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index dfe6847fd99e..67aeaba4ba9c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -174,10 +174,11 @@ static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long ts

c2n = per_cpu_ptr(&cyc2ns, cpu);

- raw_write_seqcount_latch(&c2n->seq);
+ write_seqcount_latch_begin(&c2n->seq);
c2n->data[0] = data;
- raw_write_seqcount_latch(&c2n->seq);
+ write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+ write_seqcount_latch_end(&c2n->seq);
}

static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
diff --git a/include/linux/rbtree_latch.h b/include/linux/rbtree_latch.h
index 6a0999c26c7c..bc992c61b7ce 100644
--- a/include/linux/rbtree_latch.h
+++ b/include/linux/rbtree_latch.h
@@ -145,10 +145,11 @@ latch_tree_insert(struct latch_tree_node *node,
struct latch_tree_root *root,
const struct latch_tree_ops *ops)
{
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch_begin(&root->seq);
__lt_insert(node, root, 0, ops->less);
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch(&root->seq);
__lt_insert(node, root, 1, ops->less);
+ write_seqcount_latch_end(&root->seq);
}

/**
@@ -172,10 +173,11 @@ latch_tree_erase(struct latch_tree_node *node,
struct latch_tree_root *root,
const struct latch_tree_ops *ops)
{
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch_begin(&root->seq);
__lt_erase(node, root, 0);
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch(&root->seq);
__lt_erase(node, root, 1);
+ write_seqcount_latch_end(&root->seq);
}

/**
@@ -204,9 +206,9 @@ latch_tree_find(void *key, struct latch_tree_root *root,
unsigned int seq;

do {
- seq = raw_read_seqcount_latch(&root->seq);
+ seq = read_seqcount_latch(&root->seq);
node = __lt_find(key, root, seq & 1, ops->comp);
- } while (raw_read_seqcount_latch_retry(&root->seq, seq));
+ } while (read_seqcount_latch_retry(&root->seq, seq));

return node;
}
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index fffeb754880f..9c2751087185 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -621,6 +621,12 @@ static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *
return READ_ONCE(s->seqcount.sequence);
}

+static __always_inline unsigned read_seqcount_latch(const seqcount_latch_t *s)
+{
+ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);
+ return raw_read_seqcount_latch(s);
+}
+
/**
* raw_read_seqcount_latch_retry() - end a seqcount_latch_t read section
* @s: Pointer to seqcount_latch_t
@@ -635,6 +641,13 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)

return unlikely(READ_ONCE(s->seqcount.sequence) != start);
}

+static __always_inline int
+read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
+{
+ kcsan_atomic_next(0);
+ return raw_read_seqcount_latch_retry(s, start);
+}
+
/**
* raw_write_seqcount_latch() - redirect latch readers to even/odd copy
* @s: Pointer to seqcount_latch_t
@@ -716,13 +729,29 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
* When data is a dynamic data structure; one should use regular RCU
* patterns to manage the lifetimes of the objects within.
*/
-static inline void raw_write_seqcount_latch(seqcount_latch_t *s)
+static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)

{
smp_wmb(); /* prior stores before incrementing "sequence" */
s->seqcount.sequence++;
smp_wmb(); /* increment "sequence" before following stores */
}

+static __always_inline void write_seqcount_latch_begin(seqcount_latch_t *s)
+{
+ kcsan_nestable_atomic_begin();
+ raw_write_seqcount_latch(s);
+}
+
+static __always_inline void write_seqcount_latch(seqcount_latch_t *s)
+{
+ raw_write_seqcount_latch(s);
+}
+
+static __always_inline void write_seqcount_latch_end(seqcount_latch_t *s)
+{
+ kcsan_nestable_atomic_end();
+}
+
#define __SEQLOCK_UNLOCKED(lockname) \
{ \
.seqcount = SEQCNT_SPINLOCK_ZERO(lockname, &(lockname).lock), \
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index beb808f4c367..19911c8fa7b6 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -560,10 +560,11 @@ bool printk_percpu_data_ready(void)
/* Must be called under syslog_lock. */
static void latched_seq_write(struct latched_seq *ls, u64 val)
{
- raw_write_seqcount_latch(&ls->latch);
+ write_seqcount_latch_begin(&ls->latch);
ls->val[0] = val;
- raw_write_seqcount_latch(&ls->latch);
+ write_seqcount_latch(&ls->latch);
ls->val[1] = val;
+ write_seqcount_latch_end(&ls->latch);
}

/* Can be called from any context. */
@@ -574,10 +575,10 @@ static u64 latched_seq_read_nolock(struct latched_seq *ls)
u64 val;

do {
- seq = raw_read_seqcount_latch(&ls->latch);
+ seq = read_seqcount_latch(&ls->latch);
idx = seq & 0x1;
val = ls->val[idx];
- } while (raw_read_seqcount_latch_retry(&ls->latch, seq));
+ } while (read_seqcount_latch_retry(&ls->latch, seq));

return val;
}
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 68d6c1190ac7..4958b40ba6c9 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -71,13 +71,13 @@ static __always_inline u64 cyc_to_ns(u64 cyc, u32 mult, u32 shift)

notrace struct clock_read_data *sched_clock_read_begin(unsigned int *seq)
{
- *seq = raw_read_seqcount_latch(&cd.seq);
+ *seq = read_seqcount_latch(&cd.seq);
return cd.read_data + (*seq & 1);
}

notrace int sched_clock_read_retry(unsigned int seq)
{
- return raw_read_seqcount_latch_retry(&cd.seq, seq);
+ return read_seqcount_latch_retry(&cd.seq, seq);
}

unsigned long long noinstr sched_clock_noinstr(void)
@@ -102,7 +102,9 @@ unsigned long long notrace sched_clock(void)
{
unsigned long long ns;
preempt_disable_notrace();
+ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);
ns = sched_clock_noinstr();
+ kcsan_atomic_next(0);
preempt_enable_notrace();
return ns;
}
@@ -119,17 +121,19 @@ unsigned long long notrace sched_clock(void)

*/
static void update_clock_read_data(struct clock_read_data *rd)
{
- /* update the backup (odd) copy with the new data */
- cd.read_data[1] = *rd;
-
/* steer readers towards the odd copy */

- raw_write_seqcount_latch(&cd.seq);
+ write_seqcount_latch_begin(&cd.seq);

/* now its safe for us to update the normal (even) copy */
cd.read_data[0] = *rd;

/* switch readers back to the even copy */

- raw_write_seqcount_latch(&cd.seq);
+ write_seqcount_latch(&cd.seq);

+
+ /* update the backup (odd) copy with the new data */
+ cd.read_data[1] = *rd;
+

+ write_seqcount_latch_end(&cd.seq);
}

/*
@@ -267,7 +271,7 @@ void __init generic_sched_clock_init(void)
*/
static u64 notrace suspended_sched_clock_read(void)
{
- unsigned int seq = raw_read_seqcount_latch(&cd.seq);
+ unsigned int seq = read_seqcount_latch(&cd.seq);

return cd.read_data[seq & 1].epoch_cyc;
}
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 7e6f409bf311..2ca26bfeb8f3 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -424,16 +424,18 @@ static void update_fast_timekeeper(const struct tk_read_base *tkr,
struct tk_read_base *base = tkf->base;

/* Force readers off to base[1] */
- raw_write_seqcount_latch(&tkf->seq);
+ write_seqcount_latch_begin(&tkf->seq);

/* Update base[0] */
memcpy(base, tkr, sizeof(*base));

/* Force readers back to base[0] */
- raw_write_seqcount_latch(&tkf->seq);
+ write_seqcount_latch(&tkf->seq);

/* Update base[1] */
memcpy(base + 1, base, sizeof(*base));
+
+ write_seqcount_latch_end(&tkf->seq);
}

static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
@@ -443,11 +445,11 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf)
u64 now;

do {
- seq = raw_read_seqcount_latch(&tkf->seq);
+ seq = read_seqcount_latch(&tkf->seq);
tkr = tkf->base + (seq & 0x01);
now = ktime_to_ns(tkr->base);
now += __timekeeping_get_ns(tkr);
- } while (raw_read_seqcount_latch_retry(&tkf->seq, seq));
+ } while (read_seqcount_latch_retry(&tkf->seq, seq));

return now;
}

Marco Elver

unread,

Oct 31, 2024, 3:00:18 AM10/31/24

to Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

Looks good.

Let me try to assemble the pieces into a patch. (Your SOB will be
needed - either now or later.)

Thanks,
-- Marco

Peter Zijlstra

unread,

Oct 31, 2024, 5:14:41 AM10/31/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Thu, Oct 31, 2024 at 08:00:00AM +0100, Marco Elver wrote:
> Looks good.
>
> Let me try to assemble the pieces into a patch. (Your SOB will be
> needed - either now or later.)

Feel free to add:

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

Peter Zijlstra

unread,

Oct 31, 2024, 5:20:32 AM10/31/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

On Wed, Oct 30, 2024 at 09:48:15PM +0100, Peter Zijlstra wrote:
> diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
> index 68d6c1190ac7..4958b40ba6c9 100644
> --- a/kernel/time/sched_clock.c
> +++ b/kernel/time/sched_clock.c

> @@ -102,7 +102,9 @@ unsigned long long notrace sched_clock(void)
> {
> unsigned long long ns;
> preempt_disable_notrace();
> + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);
> ns = sched_clock_noinstr();
> + kcsan_atomic_next(0);
> preempt_enable_notrace();
> return ns;
> }

You might want to consider also folding something like this in.
That should give this instrumented version instrumentation :-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 68d6c1190ac7..db26f343233f 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -80,7 +80,7 @@ notrace int sched_clock_read_retry(unsigned int seq)
return raw_read_seqcount_latch_retry(&cd.seq, seq);
}

-unsigned long long noinstr sched_clock_noinstr(void)
+static __always_inline unsigned long long __sched_clock(void)
{
struct clock_read_data *rd;
unsigned int seq;
@@ -98,11 +98,16 @@ unsigned long long noinstr sched_clock_noinstr(void)
return res;
}

+unsigned long long noinstr sched_clock_noinstr(void)
+{
+ return __sched_clock();
+}
+

unsigned long long notrace sched_clock(void)
{
unsigned long long ns;
preempt_disable_notrace();

- ns = sched_clock_noinstr();
+ ns = __sched_clock();
preempt_enable_notrace();
return ns;
}

Marco Elver

unread,

Nov 4, 2024, 11:19:26 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

While fuzzing an arm64 kernel, Alexander Potapenko reported:

| BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update
|
| write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0:
| update_fast_timekeeper kernel/time/timekeeping.c:430 [inline]
| timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768
| timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344
| update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360

| [...]
|
| read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1:
| __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline]
| ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489
| init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263
| init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311

| [...]
|
| value changed: 0x000002f875d33266 -> 0x000002f877416866
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78

This is a false positive data race between a seqcount latch writer and a reader

accessing stale data. Since its introduction, KCSAN has never understood the
seqcount_latch interface (due to being unannotated).

Unlike the regular seqlock interface, the seqcount_latch interface for latch

writers never has had a well-defined critical section, making it difficult to
teach tooling where the critical section starts and ends.

This series introduces an instrumentable (non-raw) seqcount_latch interface,
with which we can clearly denote writer critical sections. This both helps
readability and tooling like KCSAN to understand when the writer is done
updating all latch copies.

Changelog
=========

v2:
* New interface, courtesy of Peter Zijlstra. This simplifies things and we
avoid instrumenting the raw interface which is now reserved for noinstr
functions.
* Fix for read_seqbegin/retry() found during testing of new changes.

v1: https://lkml.kernel.org/r/20241029083658....@google.com

Marco Elver (5):

time/sched_clock: Swap update_clock_read_data() latch writes

time/sched_clock: Broaden sched_clock()'s instrumentation coverage
kcsan, seqlock: Support seqcount_latch_t
seqlock, treewide: Switch to non-raw seqcount_latch interface
kcsan, seqlock: Fix incorrect assumption in read_seqbegin()

Documentation/locking/seqlock.rst | 2 +-
arch/x86/kernel/tsc.c | 5 +-
include/linux/rbtree_latch.h | 20 ++++---
include/linux/seqlock.h | 98 +++++++++++++++++++++++--------
kernel/printk/printk.c | 9 +--
kernel/time/sched_clock.c | 34 +++++++----
kernel/time/timekeeping.c | 12 ++--
7 files changed, 123 insertions(+), 57 deletions(-)

--
2.47.0.163.g1226f6d8fa-goog

Marco Elver

unread,

Nov 4, 2024, 11:19:29 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

Swap the writes to the odd and even copies to make the writer critical
section look like all other seqcount_latch writers.

Signed-off-by: Marco Elver <el...@google.com>
---

v2:
* New patch.
---
kernel/time/sched_clock.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 68d6c1190ac7..85595fcf6aa2 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c

@@ -119,9 +119,6 @@ unsigned long long notrace sched_clock(void)
*/
static void update_clock_read_data(struct clock_read_data *rd)
{
- /* update the backup (odd) copy with the new data */
- cd.read_data[1] = *rd;
-
/* steer readers towards the odd copy */
raw_write_seqcount_latch(&cd.seq);

@@ -130,6 +127,9 @@ static void update_clock_read_data(struct clock_read_data *rd)

/* switch readers back to the even copy */
raw_write_seqcount_latch(&cd.seq);
+
+ /* update the backup (odd) copy with the new data */
+ cd.read_data[1] = *rd;
}

/*
--
2.47.0.163.g1226f6d8fa-goog

Marco Elver

unread,

Nov 4, 2024, 11:19:31 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

Most of sched_clock()'s implementation is ineligible for instrumentation
due to relying on sched_clock_noinstr().

Split the implementation off into an __always_inline function
__sched_clock(), which is then used by the noinstr and instrumentable
version, to allow more of sched_clock() to be covered by various
instrumentation.

This will allow instrumentation with the various sanitizers (KASAN,
KCSAN, KMSAN, UBSAN). For KCSAN, we know that raw seqcount_latch usage
without annotations will result in false positive reports: tell it that
all of __sched_clock() is "atomic" for the latch writer; later changes
in this series will take care of the readers.

Link: https://lore.kernel.org/all/20241030204...@noisy.programming.kicks-ass.net/
Co-developed-by: Peter Zijlstra (Intel) <pet...@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

Signed-off-by: Marco Elver <el...@google.com>
---
v2:
* New patch.
---

kernel/time/sched_clock.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 85595fcf6aa2..29bdf309dae8 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c

@@ -80,7 +80,7 @@ notrace int sched_clock_read_retry(unsigned int seq)
return raw_read_seqcount_latch_retry(&cd.seq, seq);
}

-unsigned long long noinstr sched_clock_noinstr(void)
+static __always_inline unsigned long long __sched_clock(void)
{
struct clock_read_data *rd;
unsigned int seq;

@@ -98,11 +98,23 @@ unsigned long long noinstr sched_clock_noinstr(void)

return res;
}

+unsigned long long noinstr sched_clock_noinstr(void)
+{
+ return __sched_clock();
+}
+

unsigned long long notrace sched_clock(void)

{
unsigned long long ns;
preempt_disable_notrace();
- ns = sched_clock_noinstr();

+ /*
+ * All of __sched_clock() is a seqcount_latch reader critical section,
+ * but relies on the raw helpers which are uninstrumented. For KCSAN,
+ * mark all accesses in __sched_clock() as atomic.
+ */
+ kcsan_nestable_atomic_begin();
+ ns = __sched_clock();
+ kcsan_nestable_atomic_end();
preempt_enable_notrace();
return ns;
}
--
2.47.0.163.g1226f6d8fa-goog

Marco Elver

unread,

Nov 4, 2024, 11:19:34 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Alexander Potapenko

While fuzzing an arm64 kernel, Alexander Potapenko reported:

| BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update
|
| write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0:
| update_fast_timekeeper kernel/time/timekeeping.c:430 [inline]
| timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768
| timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344
| update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360
| [...]
|
| read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1:
| __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline]
| ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489
| init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263
| init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311
| [...]
|
| value changed: 0x000002f875d33266 -> 0x000002f877416866
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78

This is a false positive data race between a seqcount latch writer and a reader
accessing stale data. Since its introduction, KCSAN has never understood the
seqcount_latch interface (due to being unannotated).

Unlike the regular seqlock interface, the seqcount_latch interface for latch
writers never has had a well-defined critical section, making it difficult to
teach tooling where the critical section starts and ends.

Introduce an instrumentable (non-raw) seqcount_latch interface, with

which we can clearly denote writer critical sections. This both helps
readability and tooling like KCSAN to understand when the writer is done
updating all latch copies.

Link: https://lore.kernel.org/all/20241030204...@noisy.programming.kicks-ass.net/
Reported-by: Alexander Potapenko <gli...@google.com>

Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN")

Co-developed-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Marco Elver <el...@google.com>
---
v2:

* Introduce new interface, courtesy of Peter Zijlstra. Adjust
documentation along with its introduction.
---
Documentation/locking/seqlock.rst | 2 +-
include/linux/seqlock.h | 86 +++++++++++++++++++++++++------
2 files changed, 72 insertions(+), 16 deletions(-)

diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst
index bfda1a5fecad..ec6411d02ac8 100644
--- a/Documentation/locking/seqlock.rst
+++ b/Documentation/locking/seqlock.rst
@@ -153,7 +153,7 @@ Use seqcount_latch_t when the write side sections cannot be protected
from interruption by readers. This is typically the case when the read
side can be invoked from NMI handlers.

-Check `raw_write_seqcount_latch()` for more information.
+Check `write_seqcount_latch()` for more information.

.. _seqlock_t:
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index fffeb754880f..45eee0e5dca0 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -621,6 +621,23 @@ static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t *
return READ_ONCE(s->seqcount.sequence);
}

+/**
+ * read_seqcount_latch() - pick even/odd latch data copy
+ * @s: Pointer to seqcount_latch_t
+ *
+ * See write_seqcount_latch() for details and a full reader/writer usage
+ * example.
+ *
+ * Return: sequence counter raw value. Use the lowest bit as an index for
+ * picking which data copy to read. The full counter must then be checked
+ * with read_seqcount_latch_retry().
+ */

+static __always_inline unsigned read_seqcount_latch(const seqcount_latch_t *s)
+{
+ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);
+ return raw_read_seqcount_latch(s);
+}
+
/**
* raw_read_seqcount_latch_retry() - end a seqcount_latch_t read section
* @s: Pointer to seqcount_latch_t

@@ -635,9 +652,34 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)

return unlikely(READ_ONCE(s->seqcount.sequence) != start);
}

+/**
+ * read_seqcount_latch_retry() - end a seqcount_latch_t read section
+ * @s: Pointer to seqcount_latch_t
+ * @start: count, from read_seqcount_latch()
+ *
+ * Return: true if a read section retry is required, else false
+ */

+static __always_inline int
+read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
+{
+ kcsan_atomic_next(0);
+ return raw_read_seqcount_latch_retry(s, start);
+}
+
/**
* raw_write_seqcount_latch() - redirect latch readers to even/odd copy
* @s: Pointer to seqcount_latch_t

+ */

+static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s)

+{
+ smp_wmb(); /* prior stores before incrementing "sequence" */
+ s->seqcount.sequence++;
+ smp_wmb(); /* increment "sequence" before following stores */
+}
+
+/**
+ * write_seqcount_latch_begin() - redirect latch readers to odd copy
+ * @s: Pointer to seqcount_latch_t
*
* The latch technique is a multiversion concurrency control method that allows
* queries during non-atomic modifications. If you can guarantee queries never
@@ -665,17 +707,11 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
*
* void latch_modify(struct latch_struct *latch, ...)
* {
- * smp_wmb(); // Ensure that the last data[1] update is visible
- * latch->seq.sequence++;
- * smp_wmb(); // Ensure that the seqcount update is visible
- *
+ * write_seqcount_latch_begin(&latch->seq);
* modify(latch->data[0], ...);
- *
- * smp_wmb(); // Ensure that the data[0] update is visible
- * latch->seq.sequence++;
- * smp_wmb(); // Ensure that the seqcount update is visible
- *
+ * write_seqcount_latch(&latch->seq);
* modify(latch->data[1], ...);
+ * write_seqcount_latch_end(&latch->seq);
* }
*
* The query will have a form like::
@@ -686,13 +722,13 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)
* unsigned seq, idx;
*
* do {
- * seq = raw_read_seqcount_latch(&latch->seq);
+ * seq = read_seqcount_latch(&latch->seq);
*
* idx = seq & 0x01;
* entry = data_query(latch->data[idx], ...);
*
* // This includes needed smp_rmb()
- * } while (raw_read_seqcount_latch_retry(&latch->seq, seq));
+ * } while (read_seqcount_latch_retry(&latch->seq, seq));
*
* return entry;
* }
@@ -716,11 +752,31 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start)

* When data is a dynamic data structure; one should use regular RCU
* patterns to manage the lifetimes of the objects within.
*/
-static inline void raw_write_seqcount_latch(seqcount_latch_t *s)

+static __always_inline void write_seqcount_latch_begin(seqcount_latch_t *s)
{
- smp_wmb(); /* prior stores before incrementing "sequence" */
- s->seqcount.sequence++;
- smp_wmb(); /* increment "sequence" before following stores */

+ kcsan_nestable_atomic_begin();
+ raw_write_seqcount_latch(s);
+}
+

+/**
+ * write_seqcount_latch() - redirect latch readers to even copy
+ * @s: Pointer to seqcount_latch_t
+ */

+static __always_inline void write_seqcount_latch(seqcount_latch_t *s)
+{
+ raw_write_seqcount_latch(s);
+}
+

+/**
+ * write_seqcount_latch_end() - end a seqcount_latch_t write section
+ * @s: Pointer to seqcount_latch_t
+ *
+ * Marks the end of a seqcount_latch_t writer section, after all copies of the
+ * latch-protected data have been updated.
+ */

+static __always_inline void write_seqcount_latch_end(seqcount_latch_t *s)
+{
+ kcsan_nestable_atomic_end();
}

#define __SEQLOCK_UNLOCKED(lockname) \
--
2.47.0.163.g1226f6d8fa-goog

Marco Elver

unread,

Nov 4, 2024, 11:19:36 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

Switch all instrumentable users of the seqcount_latch interface over to
the non-raw interface.

Link: https://lore.kernel.org/all/20241030204...@noisy.programming.kicks-ass.net/

Co-developed-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Signed-off-by: Marco Elver <el...@google.com>
---
v2:

* New patch.
---
arch/x86/kernel/tsc.c | 5 +++--
include/linux/rbtree_latch.h | 20 +++++++++++---------
kernel/printk/printk.c | 9 +++++----
kernel/time/sched_clock.c | 12 +++++++-----
kernel/time/timekeeping.c | 12 +++++++-----
5 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index dfe6847fd99e..67aeaba4ba9c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -174,10 +174,11 @@ static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long ts

c2n = per_cpu_ptr(&cyc2ns, cpu);

- raw_write_seqcount_latch(&c2n->seq);
+ write_seqcount_latch_begin(&c2n->seq);
c2n->data[0] = data;
- raw_write_seqcount_latch(&c2n->seq);
+ write_seqcount_latch(&c2n->seq);
c2n->data[1] = data;
+ write_seqcount_latch_end(&c2n->seq);
}

static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
diff --git a/include/linux/rbtree_latch.h b/include/linux/rbtree_latch.h

index 6a0999c26c7c..2f630eb8307e 100644
--- a/include/linux/rbtree_latch.h
+++ b/include/linux/rbtree_latch.h
@@ -14,7 +14,7 @@
*
* If we need to allow unconditional lookups (say as required for NMI context
* usage) we need a more complex setup; this data structure provides this by
- * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * employing the latch technique -- see @write_seqcount_latch_begin -- to
* implement a latched RB-tree which does allow for unconditional lookups by
* virtue of always having (at least) one stable copy of the tree.
*
@@ -132,7 +132,7 @@ __lt_find(void *key, struct latch_tree_root *ltr, int idx,
* @ops: operators defining the node order
*
* It inserts @node into @root in an ordered fashion such that we can always
- * observe one complete tree. See the comment for raw_write_seqcount_latch().
+ * observe one complete tree. See the comment for write_seqcount_latch_begin().
*
* The inserts use rcu_assign_pointer() to publish the element such that the
* tree structure is stored before we can observe the new @node.

@@ -145,10 +145,11 @@ latch_tree_insert(struct latch_tree_node *node,
struct latch_tree_root *root,
const struct latch_tree_ops *ops)
{
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch_begin(&root->seq);
__lt_insert(node, root, 0, ops->less);
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch(&root->seq);
__lt_insert(node, root, 1, ops->less);
+ write_seqcount_latch_end(&root->seq);
}

/**

@@ -159,7 +160,7 @@ latch_tree_insert(struct latch_tree_node *node,
*
* Removes @node from the trees @root in an ordered fashion such that we can
* always observe one complete tree. See the comment for
- * raw_write_seqcount_latch().
+ * write_seqcount_latch_begin().
*
* It is assumed that @node will observe one RCU quiescent state before being
* reused of freed.

@@ -172,10 +173,11 @@ latch_tree_erase(struct latch_tree_node *node,
struct latch_tree_root *root,
const struct latch_tree_ops *ops)
{
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch_begin(&root->seq);
__lt_erase(node, root, 0);
- raw_write_seqcount_latch(&root->seq);
+ write_seqcount_latch(&root->seq);
__lt_erase(node, root, 1);
+ write_seqcount_latch_end(&root->seq);
}

/**
@@ -204,9 +206,9 @@ latch_tree_find(void *key, struct latch_tree_root *root,
unsigned int seq;

do {
- seq = raw_read_seqcount_latch(&root->seq);
+ seq = read_seqcount_latch(&root->seq);
node = __lt_find(key, root, seq & 1, ops->comp);
- } while (raw_read_seqcount_latch_retry(&root->seq, seq));
+ } while (read_seqcount_latch_retry(&root->seq, seq));

return node;
}

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 29bdf309dae8..fcca4e72f1ef 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c

@@ -71,13 +71,13 @@ static __always_inline u64 cyc_to_ns(u64 cyc, u32 mult, u32 shift)

notrace struct clock_read_data *sched_clock_read_begin(unsigned int *seq)
{
- *seq = raw_read_seqcount_latch(&cd.seq);
+ *seq = read_seqcount_latch(&cd.seq);
return cd.read_data + (*seq & 1);
}

notrace int sched_clock_read_retry(unsigned int seq)

{
- return raw_read_seqcount_latch_retry(&cd.seq, seq);
+ return read_seqcount_latch_retry(&cd.seq, seq);
}

static __always_inline unsigned long long __sched_clock(void)

@@ -132,16 +132,18 @@ unsigned long long notrace sched_clock(void)

static void update_clock_read_data(struct clock_read_data *rd)
{

/* steer readers towards the odd copy */

- raw_write_seqcount_latch(&cd.seq);
+ write_seqcount_latch_begin(&cd.seq);

/* now its safe for us to update the normal (even) copy */
cd.read_data[0] = *rd;

/* switch readers back to the even copy */

- raw_write_seqcount_latch(&cd.seq);
+ write_seqcount_latch(&cd.seq);

/* update the backup (odd) copy with the new data */

cd.read_data[1] = *rd;
+
+ write_seqcount_latch_end(&cd.seq);
}

/*

@@ -279,7 +281,7 @@ void __init generic_sched_clock_init(void)

*/
static u64 notrace suspended_sched_clock_read(void)
{
- unsigned int seq = raw_read_seqcount_latch(&cd.seq);
+ unsigned int seq = read_seqcount_latch(&cd.seq);

return cd.read_data[seq & 1].epoch_cyc;
}
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c

index 7e6f409bf311..18752983e834 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -411,7 +411,7 @@ static inline u64 timekeeping_get_ns(const struct tk_read_base *tkr)
* We want to use this from any context including NMI and tracing /
* instrumenting the timekeeping code itself.
*
- * Employ the latch technique; see @raw_write_seqcount_latch.
+ * Employ the latch technique; see @write_seqcount_latch.
*
* So if a NMI hits the update of base[0] then it will use base[1]
* which is still consistent. In the worst case this can result is a

--
2.47.0.163.g1226f6d8fa-goog

Marco Elver

unread,

Nov 4, 2024, 11:19:40 AM11/4/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

During testing of the preceding changes, I noticed that in some cases,
current->kcsan_ctx.in_flat_atomic remained true until task exit. This is
obviously wrong, because _all_ accesses for the given task will be
treated as atomic, resulting in false negatives i.e. missed data races.

Debugging led to fs/dcache.c, where we can see this usage of seqlock:

struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
{
struct dentry *dentry;
unsigned seq;

do {
seq = read_seqbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
} while (read_seqretry(&rename_lock, seq));
[...]

As can be seen, read_seqretry() is never called if dentry != NULL;
consequently, current->kcsan_ctx.in_flat_atomic will never be reset to
false by read_seqretry().

Give up on the wrong assumption of "assume closing read_seqretry()", and
rely on the already-present annotations in read_seqcount_begin/retry().

Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN")

Signed-off-by: Marco Elver <el...@google.com>
---
v2:
* New patch.
---

include/linux/seqlock.h | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 45eee0e5dca0..5298765d6ca4 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -810,11 +810,7 @@ static __always_inline void write_seqcount_latch_end(seqcount_latch_t *s)
*/
static inline unsigned read_seqbegin(const seqlock_t *sl)
{
- unsigned ret = read_seqcount_begin(&sl->seqcount);
-
- kcsan_atomic_next(0); /* non-raw usage, assume closing read_seqretry() */
- kcsan_flat_atomic_begin();
- return ret;
+ return read_seqcount_begin(&sl->seqcount);
}

/**
@@ -830,12 +826,6 @@ static inline unsigned read_seqbegin(const seqlock_t *sl)
*/
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
{
- /*
- * Assume not nested: read_seqretry() may be called multiple times when
- * completing read critical section.
- */
- kcsan_flat_atomic_end();
-
return read_seqcount_retry(&sl->seqcount, start);
}

--
2.47.0.163.g1226f6d8fa-goog

Peter Zijlstra

unread,

Nov 5, 2024, 4:13:47 AM11/5/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

OK, so this takes us back to kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX)
and kcsan_atomic_next(0).

Which I suppose is safe, except it doesn't nest properly.

Anyway, these all look really nice, let me go queue them up.

Thanks!

Marco Elver

unread,

Nov 5, 2024, 4:23:29 AM11/5/24

to el...@google.com, Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

Oops, typo'd the commit message:

On Mon, 4 Nov 2024 at 17:19, Marco Elver <el...@google.com> wrote:
>
> Most of sched_clock()'s implementation is ineligible for instrumentation
> due to relying on sched_clock_noinstr().
>
> Split the implementation off into an __always_inline function
> __sched_clock(), which is then used by the noinstr and instrumentable
> version, to allow more of sched_clock() to be covered by various
> instrumentation.
>
> This will allow instrumentation with the various sanitizers (KASAN,
> KCSAN, KMSAN, UBSAN). For KCSAN, we know that raw seqcount_latch usage
> without annotations will result in false positive reports: tell it that
> all of __sched_clock() is "atomic" for the latch writer; later changes

s/writer/reader/

> in this series will take care of the readers.

s/readers/writers/

... might be less confusing. If you apply, kindly fix up the commit
message, so that future people will be less confused. The code comment
is correct.

Thanks,
-- Marco

Marco Elver

unread,

Nov 5, 2024, 4:29:28 AM11/5/24

to Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

On Tue, 5 Nov 2024 at 10:13, Peter Zijlstra <pet...@infradead.org> wrote:

> > static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
> > {
> > - /*
> > - * Assume not nested: read_seqretry() may be called multiple times when
> > - * completing read critical section.
> > - */
> > - kcsan_flat_atomic_end();
> > -
> > return read_seqcount_retry(&sl->seqcount, start);
> > }
>
> OK, so this takes us back to kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX)
> and kcsan_atomic_next(0).
>
> Which I suppose is safe, except it doesn't nest properly.

Yes correct - we just give up on trying to be special here. It would
be nice to also have readers have a clear critical section, but that
seems a lot harder to enforce with a bunch of them having rather
convoluted control flow. :-/

> Anyway, these all look really nice, let me go queue them up.

Many thanks!

Peter Zijlstra

unread,

Nov 5, 2024, 4:34:06 AM11/5/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Linus Torvalds

On Mon, Nov 04, 2024 at 04:43:09PM +0100, Marco Elver wrote:

> During testing of the preceding changes, I noticed that in some cases,
> current->kcsan_ctx.in_flat_atomic remained true until task exit. This is
> obviously wrong, because _all_ accesses for the given task will be
> treated as atomic, resulting in false negatives i.e. missed data races.
>
> Debugging led to fs/dcache.c, where we can see this usage of seqlock:
>
> struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
> {
> struct dentry *dentry;
> unsigned seq;
>
> do {
> seq = read_seqbegin(&rename_lock);
> dentry = __d_lookup(parent, name);
> if (dentry)
> break;
> } while (read_seqretry(&rename_lock, seq));
> [...]

How's something like this completely untested hack?

struct dentry *dentry;

read_seqcount_scope (&rename_lock) {

dentry = __d_lookup(parent, name);
if (dentry)
break;
}

But perhaps naming isn't right, s/_scope/_loop/ ?

--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -829,6 +829,33 @@ static inline unsigned read_seqretry(con
return read_seqcount_retry(&sl->seqcount, start);
}

+
+static inline unsigned read_seq_scope_begin(const struct seqlock_t *sl)
+{
+ unsigned ret = read_seqcount_begin(&sl->seqcount);
+ kcsan_atomic_next(0);
+ kcsan_flat_atomic_begin();
+ return ret;
+}
+
+static inline void read_seq_scope_end(unsigned *seq)
+{
+ kcsan_flat_atomic_end();
+}
+
+static inline bool read_seq_scope_retry(const struct seqlock_t *sl, unsigned *seq)
+{
+ bool done = !read_seqcount_retry(&sl->seqcount, *seq);
+ if (!done)
+ *seq = read_seqcount_begin(&sl->seqcount);
+ return done;
+}
+
+#define read_seqcount_scope(sl) \
+ for (unsigned seq __cleanup(read_seq_scope_end) = \
+ read_seq_scope_begin(sl), done = 0; \
+ !done; done = read_seq_scope_retry(sl, &seq))
+
/*
* For all seqlock_t write side functions, use the internal
* do_write_seqcount_begin() instead of generic write_seqcount_begin().

Peter Zijlstra

unread,

Nov 5, 2024, 4:35:19 AM11/5/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org

So done. Thanks!

Peter Zijlstra

unread,

Nov 5, 2024, 4:41:21 AM11/5/24

to Marco Elver, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Linus Torvalds

On Tue, Nov 05, 2024 at 10:34:00AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2024 at 04:43:09PM +0100, Marco Elver wrote:
> > During testing of the preceding changes, I noticed that in some cases,
> > current->kcsan_ctx.in_flat_atomic remained true until task exit. This is
> > obviously wrong, because _all_ accesses for the given task will be
> > treated as atomic, resulting in false negatives i.e. missed data races.
> >
> > Debugging led to fs/dcache.c, where we can see this usage of seqlock:
> >
> > struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
> > {
> > struct dentry *dentry;
> > unsigned seq;
> >
> > do {
> > seq = read_seqbegin(&rename_lock);
> > dentry = __d_lookup(parent, name);
> > if (dentry)
> > break;
> > } while (read_seqretry(&rename_lock, seq));
> > [...]
>
>
> How's something like this completely untested hack?
>
>
> struct dentry *dentry;
>
> read_seqcount_scope (&rename_lock) {
> dentry = __d_lookup(parent, name);
> if (dentry)
> break;
> }
>
>
> But perhaps naming isn't right, s/_scope/_loop/ ?

It is also confused between seqcount and seqlock. So perhaps it should
read:

read_seqcount_loop (&rename_lock.seqcount) {
...
}

instead.

Marco Elver

unread,

Nov 5, 2024, 4:51:29 AM11/5/24

to Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Mark Rutland, Dmitry Vyukov, kasa...@googlegroups.com, linux-...@vger.kernel.org, Linus Torvalds

On Tue, 5 Nov 2024 at 10:34, Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Mon, Nov 04, 2024 at 04:43:09PM +0100, Marco Elver wrote:
> > During testing of the preceding changes, I noticed that in some cases,
> > current->kcsan_ctx.in_flat_atomic remained true until task exit. This is
> > obviously wrong, because _all_ accesses for the given task will be
> > treated as atomic, resulting in false negatives i.e. missed data races.
> >
> > Debugging led to fs/dcache.c, where we can see this usage of seqlock:
> >
> > struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
> > {
> > struct dentry *dentry;
> > unsigned seq;
> >
> > do {
> > seq = read_seqbegin(&rename_lock);
> > dentry = __d_lookup(parent, name);
> > if (dentry)
> > break;
> > } while (read_seqretry(&rename_lock, seq));
> > [...]
>
>
> How's something like this completely untested hack?
>
>
> struct dentry *dentry;
>
> read_seqcount_scope (&rename_lock) {
> dentry = __d_lookup(parent, name);
> if (dentry)
> break;
> }
>
>
> But perhaps naming isn't right, s/_scope/_loop/ ?

_loop seems straightforward.

> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -829,6 +829,33 @@ static inline unsigned read_seqretry(con
> return read_seqcount_retry(&sl->seqcount, start);
> }
>
> +
> +static inline unsigned read_seq_scope_begin(const struct seqlock_t *sl)
> +{
> + unsigned ret = read_seqcount_begin(&sl->seqcount);
> + kcsan_atomic_next(0);
> + kcsan_flat_atomic_begin();
> + return ret;
> +}
> +
> +static inline void read_seq_scope_end(unsigned *seq)
> +{
> + kcsan_flat_atomic_end();

If we are guaranteed to always have one _begin paired by a matching
_end, we can s/kcsan_flat_atomic/kcsan_nestable_atomic/ for these.

> +}
> +
> +static inline bool read_seq_scope_retry(const struct seqlock_t *sl, unsigned *seq)
> +{
> + bool done = !read_seqcount_retry(&sl->seqcount, *seq);
> + if (!done)
> + *seq = read_seqcount_begin(&sl->seqcount);
> + return done;
> +}
> +
> +#define read_seqcount_scope(sl) \
> + for (unsigned seq __cleanup(read_seq_scope_end) = \
> + read_seq_scope_begin(sl), done = 0; \
> + !done; done = read_seq_scope_retry(sl, &seq))
> +

That's nice! I think before we fully moved over to C11, I recall Mark
and I discussed something like that (on IRC?) but gave up until C11
landed and then we forgot. ;-)

Reply all

Reply to author

Forward

[PATCH] kcsan, seqlock: Support seqcount_latch_t

Marco Elver

Peter Zijlstra

Marco Elver

Peter Zijlstra

Marco Elver

Peter Zijlstra

Marco Elver

Peter Zijlstra

Peter Zijlstra

Marco Elver

Marco Elver

Marco Elver

Marco Elver

Marco Elver

Marco Elver

Peter Zijlstra

Marco Elver

Marco Elver

Peter Zijlstra

Peter Zijlstra

Peter Zijlstra

Marco Elver