[syzbot] [kernel?] KCSAN: data-race in __fput / __tty_hangup (4)

54 views
Skip to first unread message

syzbot

unread,
Apr 21, 2023, 4:18:43 AM4/21/23
to gre...@linuxfoundation.org, jiri...@kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: fcd476ea6a88 Merge tag 'urgent-rcu.2023.03.28a' of git://g..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=146618b9c80000
kernel config: https://syzkaller.appspot.com/x/.config?x=f7350c77b8056a38
dashboard link: https://syzkaller.appspot.com/bug?extid=b7c3ba8cdc2f6cf83c21
compiler: Debian clang version 15.0.7, GNU ld (GNU Binutils for Debian) 2.35.2

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/f3d8ce4faab0/disk-fcd476ea.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/fc53d9dee279/vmlinux-fcd476ea.xz
kernel image: https://storage.googleapis.com/syzbot-assets/22ad755d39b2/bzImage-fcd476ea.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b7c3ba...@syzkaller.appspotmail.com

==================================================================
BUG: KCSAN: data-race in __fput / __tty_hangup

write to 0xffff88813b31e828 of 8 bytes by task 17702 on cpu 0:
__tty_hangup+0x1e2/0x510 drivers/tty/tty_io.c:622
tty_vhangup+0x17/0x20 drivers/tty/tty_io.c:701
pty_close+0x262/0x280 drivers/tty/pty.c:79
tty_release+0x204/0x930 drivers/tty/tty_io.c:1753
__fput+0x245/0x570 fs/file_table.c:321
____fput+0x15/0x20 fs/file_table.c:349
task_work_run+0x123/0x160 kernel/task_work.c:179
get_signal+0xe5c/0xfe0 kernel/signal.c:2635
arch_do_signal_or_restart+0x89/0x2b0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop+0x6d/0xe0 kernel/entry/common.c:168
exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff88813b31e828 of 8 bytes by task 17705 on cpu 1:
__fput+0x21c/0x570 fs/file_table.c:320
____fput+0x15/0x20 fs/file_table.c:349
task_work_run+0x123/0x160 kernel/task_work.c:179
get_signal+0xe5c/0xfe0 kernel/signal.c:2635
arch_do_signal_or_restart+0x89/0x2b0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop+0x6d/0xe0 kernel/entry/common.c:168
exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0xffffffff84e91ed0 -> 0xffffffff84e91dc0

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 17705 Comm: syz-executor.0 Not tainted 6.3.0-rc4-syzkaller-00034-gfcd476ea6a88 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/17/2023
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Dmitry Vyukov

unread,
Apr 21, 2023, 4:21:46 AM4/21/23
to syzbot, gre...@linuxfoundation.org, jiri...@kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Fri, 21 Apr 2023 at 10:18, syzbot
<syzbot+b7c3ba...@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: fcd476ea6a88 Merge tag 'urgent-rcu.2023.03.28a' of git://g..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=146618b9c80000
> kernel config: https://syzkaller.appspot.com/x/.config?x=f7350c77b8056a38
> dashboard link: https://syzkaller.appspot.com/bug?extid=b7c3ba8cdc2f6cf83c21
> compiler: Debian clang version 15.0.7, GNU ld (GNU Binutils for Debian) 2.35.2
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/f3d8ce4faab0/disk-fcd476ea.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/fc53d9dee279/vmlinux-fcd476ea.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/22ad755d39b2/bzImage-fcd476ea.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b7c3ba...@syzkaller.appspotmail.com

If I am reading this correctly, this race can lead to NULL derefs
among other things.
hung_up_tty_fops does not have splice_read, while other fops have.

So the following code in splice can execute NULL callback:

if (unlikely(!in->f_op->splice_read))
return warn_unsupported(in, "read");
return in->f_op->splice_read(in, ppos, pipe, len, flags);

Tetsuo Handa

unread,
Apr 21, 2023, 11:12:13 AM4/21/23
to Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/21 17:21, Dmitry Vyukov wrote:
> If I am reading this correctly, this race can lead to NULL derefs
> among other things.
> hung_up_tty_fops does not have splice_read, while other fops have.
>
> So the following code in splice can execute NULL callback:
>
> if (unlikely(!in->f_op->splice_read))
> return warn_unsupported(in, "read");
> return in->f_op->splice_read(in, ppos, pipe, len, flags);
>

__fput(file) is called when the last reference to file is released.
Since __tty_hangup() traverses tty->tty_files under tty->files_lock,
tty_add_file() needs to hold a ref before adding to tty->tty_files
in order to defer concurrent __fput() by other threads?

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 36fb945fdad4..2838703d48cf 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -197,7 +197,7 @@ void tty_add_file(struct tty_struct *tty, struct file *file)
struct tty_file_private *priv = file->private_data;

priv->tty = tty;
- priv->file = file;
+ priv->file = get_file(file);

spin_lock(&tty->files_lock);
list_add(&priv->list, &tty->tty_files);
@@ -228,6 +228,7 @@ static void tty_del_file(struct file *file)
spin_lock(&tty->files_lock);
list_del(&priv->list);
spin_unlock(&tty->files_lock);
+ fput(file);
tty_free_file(file);
}


Tetsuo Handa

unread,
Apr 21, 2023, 12:02:20 PM4/21/23
to Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/22 0:12, Tetsuo Handa wrote:
> __fput(file) is called when the last reference to file is released.
> Since __tty_hangup() traverses tty->tty_files under tty->files_lock,
> tty_add_file() needs to hold a ref before adding to tty->tty_files
> in order to defer concurrent __fput() by other threads?

Hmm, this did not work; get_file() prevents tty_release().

Tetsuo Handa

unread,
Apr 23, 2023, 9:29:12 AM4/23/23
to Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/21 17:21, Dmitry Vyukov wrote:
Hmm, it seems to me that we need multiple patches (which would become too big to
backport) for fixing this bug.

First step (which Dmitry mentioned) is to avoid potential NULL pointer dereferences
caused by

if (!f_op->$callbackname) {
return error;
}
return f_op->$callbackname($args);

pattern, for the next step will touch too many locations to change all at once whereas
the first step could be handled by implementing dummy function for all missing $callbackname.

Next step is to convert from

if (!f_op->$callbackname) {
return error;
}
return f_op->$callbackname($args);

to

fn = READ_ONCE(f_op->$callbackname);
if (!fn) {
return error;
}
return fn($args);

pattern.

Last step is to silence KCSAN by wrapping READ_ONCE()/WRITE_ONCE() using data_race() macro.
Each data_race() usage wants a comment, thus would want a macro if possible...

Greg Kroah-Hartman

unread,
Apr 23, 2023, 10:00:20 AM4/23/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Fix the bug properly first. And then only start to worry about stable
kernels, don't let the existance of them affect your development efforts
at all please.

thanks,

greg k-h

Greg Kroah-Hartman

unread,
Apr 23, 2023, 10:03:33 AM4/23/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Sun, Apr 23, 2023 at 10:28:58PM +0900, Tetsuo Handa wrote:
I do not understand, what "callbackname" is the problem here? Just
splice_read? Something else? And where would it need to be modified
and why can't we do it in one place only (i.e. install a default handler
instead.)

Also, pointer function operations like this are dirt slow, be wary of
adding additional ones if possible, on fast paths.

> Next step is to convert from
>
> if (!f_op->$callbackname) {
> return error;
> }
> return f_op->$callbackname($args);
>
> to
>
> fn = READ_ONCE(f_op->$callbackname);
> if (!fn) {
> return error;
> }
> return fn($args);
>
> pattern.

Why? What does this solve differently than the first one? What can
change the fops pointer between the check and call path? If something
can change it, then do NOT make that type of check in the first place
(or put a proper lock in place.)

thanks,

greg k-h

Tetsuo Handa

unread,
Apr 23, 2023, 10:18:10 AM4/23/23
to Greg Kroah-Hartman, Dmitry Vyukov, syzbot, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/23 23:03, Greg Kroah-Hartman wrote:
>> Next step is to convert from
>>
>> if (!f_op->$callbackname) {
>> return error;
>> }
>> return f_op->$callbackname($args);
>>
>> to
>>
>> fn = READ_ONCE(f_op->$callbackname);
>> if (!fn) {
>> return error;
>> }
>> return fn($args);
>>
>> pattern.
>
> Why? What does this solve differently than the first one? What can
> change the fops pointer between the check and call path? If something
> can change it, then do NOT make that type of check in the first place
> (or put a proper lock in place.)

__tty_hangup() is changing filp->f_op like

spin_lock(&tty->files_lock);
/* This breaks for file handles being sent over AF_UNIX sockets ? */
list_for_each_entry(priv, &tty->tty_files, list) {
filp = priv->file;
if (filp->f_op->write_iter == redirected_tty_write)
cons_filp = filp;
if (filp->f_op->write_iter != tty_write)
continue;
closecount++;
__tty_fasync(-1, filp, 0); /* can't block */
filp->f_op = &hung_up_tty_fops;
}
spin_unlock(&tty->files_lock);

but filp->f_op readers are (of course) not protected by tty->files_lock.
Like Dmitry mentioned

hung_up_tty_fops does not have splice_read, while other fops have.

, patterns like

if (unlikely(!in->f_op->splice_read))
return warn_unsupported(in, "read");
return in->f_op->splice_read(in, ppos, pipe, len, flags);

is not safe (if compiler generates a code that re-reads the pointer).

Using READ_ONCE() is for preventing the compiler to re-read, and using data_race()
is for teaching KCSAN that race while happens during READ_ONCE()/WRITE_ONCE()
is acceptable.

>> First step (which Dmitry mentioned) is to avoid potential NULL pointer dereferences
>> caused by
>>
>> if (!f_op->$callbackname) {
>> return error;
>> }
>> return f_op->$callbackname($args);
>>
>> pattern, for the next step will touch too many locations to change all at once whereas
>> the first step could be handled by implementing dummy function for all missing $callbackname.
>
> I do not understand, what "callbackname" is the problem here? Just
> splice_read? Something else?

I haven't checked.

> And where would it need to be modified
> and why can't we do it in one place only (i.e. install a default handler
> instead.)

Yes, adding a dummy splice_read callback handler to hung_up_tty_fops is
what I call the first step.

Al Viro

unread,
Apr 23, 2023, 7:34:48 PM4/23/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Sat, Apr 22, 2023 at 12:12:02AM +0900, Tetsuo Handa wrote:
> On 2023/04/21 17:21, Dmitry Vyukov wrote:
> > If I am reading this correctly, this race can lead to NULL derefs
> > among other things.
> > hung_up_tty_fops does not have splice_read, while other fops have.
> >
> > So the following code in splice can execute NULL callback:
> >
> > if (unlikely(!in->f_op->splice_read))
> > return warn_unsupported(in, "read");
> > return in->f_op->splice_read(in, ppos, pipe, len, flags);
> >
>
> __fput(file) is called when the last reference to file is released.
> Since __tty_hangup() traverses tty->tty_files under tty->files_lock,
> tty_add_file() needs to hold a ref before adding to tty->tty_files
> in order to defer concurrent __fput() by other threads?
>
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index 36fb945fdad4..2838703d48cf 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -197,7 +197,7 @@ void tty_add_file(struct tty_struct *tty, struct file *file)
> struct tty_file_private *priv = file->private_data;
>
> priv->tty = tty;
> - priv->file = file;
> + priv->file = get_file(file);

This is broken. Simple open() + close() on a tty will leak the damn thing -
that extra reference will stick around, possibly all the way until reboot.

As for the original report - add a (failing) ->splice_read() in hung_ut_tty_fops
to deal with the original problem.

Tetsuo Handa

unread,
Apr 23, 2023, 7:56:04 PM4/23/23
to Al Viro, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/24 8:34, Al Viro wrote:
> As for the original report - add a (failing) ->splice_read() in hung_ut_tty_fops
> to deal with the original problem.

Yes, adding a dummy splice_read callback is OK for avoiding NULL pointer dereference.
But we need more changes for avoiding KCSAN race reporting.

Are you OK with https://lkml.kernel.org/r/6bec279c-07b3-d6f1...@I-love.SAKURA.ne.jp
which will require touching so many locations ?

If you want tty layer handle this race without rewriting all f_op dereferences,
we would need to replace

filp->f_op = &hung_up_tty_fops;

with

data_race(filp->some_flags_for_tty = true);

rather than

data_race(filp->f_op = &hung_up_tty_fops);

and check

if (data_race(filp->some_flags_for_tty)) {
return error;
}

from each "struct tty_operations" callback function.

Al Viro

unread,
Apr 23, 2023, 8:44:42 PM4/23/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
What struct tty_operations? It's file_operations, unfortunately, and
their calls are on quite a few fast paths.

Do you mean doing that in method instances that are present in tty_fops
you mean doing that in method instances that are present in tty_fops
and different in hung_up_tty_fops?

Tetsuo Handa

unread,
Apr 23, 2023, 9:09:20 PM4/23/23
to Al Viro, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Jiri Slaby, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/24 9:44, Al Viro wrote:
> Do you mean doing that in method instances that are present in tty_fops
> you mean doing that in method instances that are present in tty_fops
> and different in hung_up_tty_fops?

I meant, remove hung_up_tty_fops, and embed callbacks defined in hung_up_tty_fops into
tty_fops. For example, tty_read() (from "struct file_operations tty_fops"->read_iter)
will be changed like

static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
{
int i;
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
struct tty_struct *tty = file_tty(file);
struct tty_ldisc *ld;

+ if (data_race(file->tty_hangup)) {
+ return hung_up_tty_read(iocb, to);
+ }
+
if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
if (!tty || tty_io_error(tty))
return -EIO;

/* We want to wait for the line discipline to sort out in this
* situation.
*/
ld = tty_ldisc_ref_wait(tty);
if (!ld)
return hung_up_tty_read(iocb, to);
i = -EIO;
if (ld->ops->read)
i = iterate_tty_read(ld, tty, file, to);
tty_ldisc_deref(ld);

if (i > 0)
tty_update_time(&inode->i_atime);

return i;
}

in order to avoid wrapping filp->f_op->* dereferences using data_race().

Tetsuo Handa

unread,
Apr 25, 2023, 10:47:25 AM4/25/23
to Al Viro, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
tty_open() is doing "filp->f_op = &tty_fops;".
__tty_hangup() is doing "filp->f_op = &hung_up_tty_fops;".
tty_hung_up_p() is doing "filp->f_op == &hung_up_tty_fops",
and most functions are checking tty_hung_up_p().

Therefore, if we can remember whether __tty_hangup() did
"filp->f_op = &hung_up_tty_fops;" without changing filp->f_op,
we can reuse tty_hung_up_p().

Since tty_open() allocates sizeof("struct tty_file_private") bytes
of memory associated with each "struct file", we can record whether
__tty_hangup() did "filp->f_op = &hung_up_tty_fops;" using this memory.

Therefore, diff will be something like this?
(Explanation and question follows this diff.)

drivers/tty/tty_io.c | 36 +++++++++++++++++-------------------
include/linux/tty.h | 1 +
2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/include/linux/tty.h b/include/linux/tty.h
index 093935e97f42..d7fa18f8c526 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -255,6 +255,7 @@ struct tty_file_private {
struct tty_struct *tty;
struct file *file;
struct list_head list;
+ bool hung;
};

/**
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 36fb945fdad4..bebd236679f0 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -182,7 +182,7 @@ int tty_alloc_file(struct file *file)
{
struct tty_file_private *priv;

- priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;

@@ -491,17 +491,6 @@ static const struct file_operations console_fops = {
.fasync = tty_fasync,
};

-static const struct file_operations hung_up_tty_fops = {
- .llseek = no_llseek,
- .read_iter = hung_up_tty_read,
- .write_iter = hung_up_tty_write,
- .poll = hung_up_tty_poll,
- .unlocked_ioctl = hung_up_tty_ioctl,
- .compat_ioctl = hung_up_tty_compat_ioctl,
- .release = tty_release,
- .fasync = hung_up_tty_fasync,
-};
-
static DEFINE_SPINLOCK(redirect_lock);
static struct file *redirect;

@@ -619,7 +608,8 @@ static void __tty_hangup(struct tty_struct *tty, int exit_session)
continue;
closecount++;
__tty_fasync(-1, filp, 0); /* can't block */
- filp->f_op = &hung_up_tty_fops;
+ /* Accept race with tty_hung_up_p() test. */
+ data_race(priv->hung = true);
}
spin_unlock(&tty->files_lock);

@@ -743,7 +733,8 @@ void tty_vhangup_session(struct tty_struct *tty)
*/
int tty_hung_up_p(struct file *filp)
{
- return (filp && filp->f_op == &hung_up_tty_fops);
+ /* Accept race with __tty_hangup(). */
+ return filp && data_race(((struct tty_file_private *) filp->private_data)->hung);
}
EXPORT_SYMBOL(tty_hung_up_p);

@@ -911,6 +902,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
struct tty_struct *tty = file_tty(file);
struct tty_ldisc *ld;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_read(iocb, to);
if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
if (!tty || tty_io_error(tty))
@@ -1073,6 +1066,8 @@ static ssize_t file_tty_write(struct file *file, struct kiocb *iocb, struct iov_
struct tty_ldisc *ld;
ssize_t ret;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_write(iocb, from);
if (tty_paranoia_check(tty, file_inode(file), "tty_write"))
return -EIO;
if (!tty || !tty->ops->write || tty_io_error(tty))
@@ -2159,11 +2154,6 @@ static int tty_open(struct inode *inode, struct file *filp)
return retval;

schedule();
- /*
- * Need to reset f_op in case a hangup happened.
- */
- if (tty_hung_up_p(filp))
- filp->f_op = &tty_fops;
goto retry_open;
}
clear_bit(TTY_HUPPED, &tty->flags);
@@ -2197,6 +2187,8 @@ static __poll_t tty_poll(struct file *filp, poll_table *wait)
struct tty_ldisc *ld;
__poll_t ret = 0;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_poll(filp, wait);
if (tty_paranoia_check(tty, file_inode(filp), "tty_poll"))
return 0;

@@ -2249,6 +2241,8 @@ static int tty_fasync(int fd, struct file *filp, int on)
struct tty_struct *tty = file_tty(filp);
int retval = -ENOTTY;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_fasync(fd, filp, on);
tty_lock(tty);
if (!tty_hung_up_p(filp))
retval = __tty_fasync(fd, filp, on);
@@ -2658,6 +2652,8 @@ long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
int retval;
struct tty_ldisc *ld;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;

@@ -2943,6 +2939,8 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
return tty_ioctl(file, cmd, arg);
}

+ if (tty_hung_up_p(file))
+ return hung_up_tty_compat_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;


Since tty_read() is called via file->f_op->read_iter() from call_read_iter() from
generic_file_splice_read(), no change is needed for tty_fops->splice_read.

Since tty_write() is called via file->f_op->write_iter() from call_write_iter()
from do_iter_readv_writev() from do_iter_write() from vfs_iter_write() from
iter_file_splice_write(), no change is needed for tty_fops->splice_write.

Since both tty_fops and hung_up_tty_fops point to the same callback for
llseek/release, no change is needed for tty_fops->llseek and tty_fops->release.

Regarding tty_read()/tty_write()/tty_poll()/tty_ioctl()/tty_compat_ioctl(),
respectively embed hung_up_tty_read()/hung_up_tty_write()/hung_up_tty_poll()/
hung_up_tty_ioctl()/hung_up_tty_compat_ioctl() right before tty_paranoia_check().

Regarding tty_fasync(), embed hung_up_tty_fasync() right before tty_lock() in
tty_fasync() rather than tty_paranoia_check() in __tty_fasync(), for
tty_hung_up_p() is checked right after tty_lock() returned.

I don't know what changes are required for tty_open() and tty_show_fdinfo().
I assume that making no change for tty_show_fdinfo() is harmless.
But how does "hung_up_tty_fops does not call tty_open()" affects here?

Al Viro

unread,
Apr 25, 2023, 12:03:57 PM4/25/23
to Tetsuo Handa, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Tue, Apr 25, 2023 at 11:47:20PM +0900, Tetsuo Handa wrote:
> @@ -743,7 +733,8 @@ void tty_vhangup_session(struct tty_struct *tty)
> */
> int tty_hung_up_p(struct file *filp)
> {
> - return (filp && filp->f_op == &hung_up_tty_fops);
> + /* Accept race with __tty_hangup(). */
> + return filp && data_race(((struct tty_file_private *) filp->private_data)->hung);
> }
> EXPORT_SYMBOL(tty_hung_up_p);

Umm... Are you sure we never call that for non-TTY files? Seeing that
it's exported and all such... For internal uses (tty_read(), etc.) the
check for file being NULL is pointless; for general-purpose primitive
we probably want to check ->f_op as well...

> @@ -2159,11 +2154,6 @@ static int tty_open(struct inode *inode, struct file *filp)
> return retval;
>
> schedule();
> - /*
> - * Need to reset f_op in case a hangup happened.
> - */
> - if (tty_hung_up_p(filp))
> - filp->f_op = &tty_fops;

That needs a comment in commit message - hangup indication doesn't
need to be reset here, since tty_release() has freed the entire thing
that used to hang off the ->private_data.

> I don't know what changes are required for tty_open() and tty_show_fdinfo().
> I assume that making no change for tty_show_fdinfo() is harmless.
> But how does "hung_up_tty_fops does not call tty_open()" affects here?

It doesn't - file->f_op->open() is only called once, and only on fresh
struct file.

Tetsuo Handa

unread,
Apr 25, 2023, 6:09:39 PM4/25/23
to Al Viro, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/26 1:03, Al Viro wrote:
> On Tue, Apr 25, 2023 at 11:47:20PM +0900, Tetsuo Handa wrote:
>> @@ -743,7 +733,8 @@ void tty_vhangup_session(struct tty_struct *tty)
>> */
>> int tty_hung_up_p(struct file *filp)
>> {
>> - return (filp && filp->f_op == &hung_up_tty_fops);
>> + /* Accept race with __tty_hangup(). */
>> + return filp && data_race(((struct tty_file_private *) filp->private_data)->hung);
>> }
>> EXPORT_SYMBOL(tty_hung_up_p);
>
> Umm... Are you sure we never call that for non-TTY files? Seeing that
> it's exported and all such... For internal uses (tty_read(), etc.) the
> check for file being NULL is pointless; for general-purpose primitive
> we probably want to check ->f_op as well...

Indeed. Since drivers/tty/hvc/hvsi.c drivers/tty/n_hdlc.c drivers/tty/n_tty.c
are calling tty_hung_up_p(), this check should be

filp && filp->f_op == &tty_fops && data_race(((struct tty_file_private *) filp->private_data)->hung);

.

>> I don't know what changes are required for tty_open() and tty_show_fdinfo().
>> I assume that making no change for tty_show_fdinfo() is harmless.
>> But how does "hung_up_tty_fops does not call tty_open()" affects here?
>
> It doesn't - file->f_op->open() is only called once, and only on fresh
> struct file.

OK. Then, we can continue tty_hung_up_p() approach.

Tetsuo Handa

unread,
Apr 26, 2023, 7:05:11 AM4/26/23
to Al Viro, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
syzbot is reporting data race between __tty_hangup() and __fput(), for
filp->f_op readers are not holding tty->files_lock.

Dmitry Vyukov mentioned that this race has possibility of NULL pointer
dereference, for tty_fops implements e.g. splice_read callback whereas
hung_up_tty_fops does not.

CPU0 CPU1
---- ----
do_splice_to() {
__tty_hangup() {
// f_op->splice_read was generic_file_splice_read
if (unlikely(!in->f_op->splice_read))
return warn_unsupported(in, "read");
filp->f_op = &hung_up_tty_fops;
// f_op->splice_read is now NULL
return in->f_op->splice_read(in, ppos, pipe, len, flags);
}
}

If we care about only NULL pointer dereference, implementing missing
callbacks to hung_up_tty_fops is fine. But if we also care about KCSAN
reports, we will need to wrap all filp->f_op usages which are reachable
via tty_fops callbacks using data_race().

Such wrapping is overkill as a fix for tty_io code. Therefore, instead of
implementing missing callbacks, stop updating filp->f_op and remove
hung_up_tty_fops. Then, changes will be limited to within tty_io code.

tty_open() is doing "filp->f_op = &tty_fops;".
__tty_hangup() is doing "filp->f_op = &hung_up_tty_fops;".
tty_hung_up_p() is doing "filp->f_op == &hung_up_tty_fops", and
most functions are checking tty_hung_up_p().

Since tty_open() allocates "struct tty_file_private" for each
"struct file", we can remember whether __tty_hangup() was called
by adding a flag to "struct tty_file_private".

Below is detail of where/what to change.

Regarding __tty_hangup(), replace "filp->f_op = &hung_up_tty_fops;" with
setting the above-mentioned flag.

Regarding tty_hungup_p(), replace "filp->f_op == &hung_up_tty_fops" with
"filp->f_op == &tty_fops" and check the above-mentioned flag.

Regarding tty_open(), just remove "filp->f_op = &tty_fops;" because
"struct tty_file_private" was already released by tty_del_file() from
tty_release() when control reaches this line.

Regarding tty_{read,write,poll,ioctl,compat_ioctl}(), respectively embed
hung_up_tty_{read,write,poll,ioctl,compat_ioctl}() right before
tty_paranoia_check().

Regarding tty_fasync(), embed hung_up_tty_fasync() right before tty_lock()
in tty_fasync() rather than tty_paranoia_check() in __tty_fasync(), for
tty_hung_up_p() is checked right after tty_lock() returned.

Since tty_read() is called via file->f_op->read_iter() from
call_read_iter() from generic_file_splice_read(), no change is needed for
tty_fops->splice_read.

Since tty_write() is called via file->f_op->write_iter() from
call_write_iter() from do_iter_readv_writev() from do_iter_write() from
vfs_iter_write() from iter_file_splice_write(), no change is needed for
tty_fops->splice_write.

Since both tty_fops and hung_up_tty_fops point to the same callback for
llseek/release, no change is needed for tty_fops->{llseek,release}.

Finally, remove hung_up_tty_fops.

Reported-by: syzbot <syzbot+b7c3ba...@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=b7c3ba8cdc2f6cf83c21
Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvy...@google.com>
---
drivers/tty/tty_io.c | 37 ++++++++++++++++++-------------------
include/linux/tty.h | 1 +
2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 36fb945fdad4..c1748e4da233 100644
@@ -743,7 +733,9 @@ void tty_vhangup_session(struct tty_struct *tty)
*/
int tty_hung_up_p(struct file *filp)
{
- return (filp && filp->f_op == &hung_up_tty_fops);
+ return filp && filp->f_op == &tty_fops &&
+ /* Accept race with __tty_hangup(). */
+ data_race(((struct tty_file_private *) filp->private_data)->hung);
}
EXPORT_SYMBOL(tty_hung_up_p);

@@ -911,6 +903,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
struct tty_struct *tty = file_tty(file);
struct tty_ldisc *ld;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_read(iocb, to);
if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
if (!tty || tty_io_error(tty))
@@ -1073,6 +1067,8 @@ static ssize_t file_tty_write(struct file *file, struct kiocb *iocb, struct iov_
struct tty_ldisc *ld;
ssize_t ret;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_write(iocb, from);
if (tty_paranoia_check(tty, file_inode(file), "tty_write"))
return -EIO;
if (!tty || !tty->ops->write || tty_io_error(tty))
@@ -2159,11 +2155,6 @@ static int tty_open(struct inode *inode, struct file *filp)
return retval;

schedule();
- /*
- * Need to reset f_op in case a hangup happened.
- */
- if (tty_hung_up_p(filp))
- filp->f_op = &tty_fops;
goto retry_open;
}
clear_bit(TTY_HUPPED, &tty->flags);
@@ -2197,6 +2188,8 @@ static __poll_t tty_poll(struct file *filp, poll_table *wait)
struct tty_ldisc *ld;
__poll_t ret = 0;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_poll(filp, wait);
if (tty_paranoia_check(tty, file_inode(filp), "tty_poll"))
return 0;

@@ -2249,6 +2242,8 @@ static int tty_fasync(int fd, struct file *filp, int on)
struct tty_struct *tty = file_tty(filp);
int retval = -ENOTTY;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_fasync(fd, filp, on);
tty_lock(tty);
if (!tty_hung_up_p(filp))
retval = __tty_fasync(fd, filp, on);
@@ -2658,6 +2653,8 @@ long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
int retval;
struct tty_ldisc *ld;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;

@@ -2943,6 +2940,8 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
return tty_ioctl(file, cmd, arg);
}

+ if (tty_hung_up_p(file))
+ return hung_up_tty_compat_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;

diff --git a/include/linux/tty.h b/include/linux/tty.h
index 093935e97f42..d7fa18f8c526 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -255,6 +255,7 @@ struct tty_file_private {
struct tty_struct *tty;
struct file *file;
struct list_head list;
+ bool hung;
};

/**
--
2.34.1

Tetsuo Handa

unread,
Apr 28, 2023, 12:41:21 PM4/28/23
to Nathan Chancellor, Al Viro, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 2023/04/29 1:27, Nathan Chancellor wrote:
> I see this change has shown up in -next as commit 4c87e9e5479b ("tty:
> tty_io: remove hung_up_tty_fops"), where it causes the following warning
> for configurations without CONFIG_COMPAT (I used ARCH=arm defconfig):
>
> drivers/tty/tty_io.c:446:13: warning: 'hung_up_tty_compat_ioctl' defined but not used [-Wunused-function]
> 446 | static long hung_up_tty_compat_ioctl(struct file *file,
> | ^~~~~~~~~~~~~~~~~~~~~~~~
>
> I am not sure if you just added that patch for additional test coverage
> or for final acceptance but the following diff resolves this warning for
> me, perhaps it can be folded in for a v2?

Thank you for reporting. Yes, moving the definition will solve the warning.

I'm testing this patch using my volatile tree before getting merged to a
permanent tree.

Al Viro

unread,
Apr 28, 2023, 1:11:55 PM4/28/23
to Tetsuo Handa, Nathan Chancellor, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Sat, Apr 29, 2023 at 01:41:02AM +0900, Tetsuo Handa wrote:
> On 2023/04/29 1:27, Nathan Chancellor wrote:
> > I see this change has shown up in -next as commit 4c87e9e5479b ("tty:
> > tty_io: remove hung_up_tty_fops"), where it causes the following warning
> > for configurations without CONFIG_COMPAT (I used ARCH=arm defconfig):
> >
> > drivers/tty/tty_io.c:446:13: warning: 'hung_up_tty_compat_ioctl' defined but not used [-Wunused-function]
> > 446 | static long hung_up_tty_compat_ioctl(struct file *file,
> > | ^~~~~~~~~~~~~~~~~~~~~~~~
> >
> > I am not sure if you just added that patch for additional test coverage
> > or for final acceptance but the following diff resolves this warning for
> > me, perhaps it can be folded in for a v2?
>
> Thank you for reporting. Yes, moving the definition will solve the warning.

IDGI... Why do you need to keep that function at all? Compare it
with hung_up_tty_ioctl() - they are token-for-token identical; the only
difference is the function name...

Nathan Chancellor

unread,
Apr 28, 2023, 1:18:27 PM4/28/23
to Tetsuo Handa, Al Viro, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hi Tetsuo,
I see this change has shown up in -next as commit 4c87e9e5479b ("tty:
tty_io: remove hung_up_tty_fops"), where it causes the following warning
for configurations without CONFIG_COMPAT (I used ARCH=arm defconfig):

drivers/tty/tty_io.c:446:13: warning: 'hung_up_tty_compat_ioctl' defined but not used [-Wunused-function]
446 | static long hung_up_tty_compat_ioctl(struct file *file,
| ^~~~~~~~~~~~~~~~~~~~~~~~

I am not sure if you just added that patch for additional test coverage
or for final acceptance but the following diff resolves this warning for
me, perhaps it can be folded in for a v2?

Cheers,
Nathan

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 553182753098..9588ee356646 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -443,12 +443,6 @@ static long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
}

-static long hung_up_tty_compat_ioctl(struct file *file,
- unsigned int cmd, unsigned long arg)
-{
- return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
-}
-
static int hung_up_tty_fasync(int fd, struct file *file, int on)
{
return -ENOTTY;
@@ -2812,6 +2806,12 @@ struct serial_struct32 {
compat_int_t reserved;
};

+static long hung_up_tty_compat_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
+}
+
static int compat_tty_tiocsserial(struct tty_struct *tty,
struct serial_struct32 __user *ss)
{

Greg Kroah-Hartman

unread,
Apr 28, 2023, 1:31:40 PM4/28/23
to Tetsuo Handa, Nathan Chancellor, Al Viro, Jiri Slaby, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
How are you adding new stuff to linux-next right now when the tree
should be "closed"? And I haven't reviewed this yet, what tree has
picked it up?

Please wait until after -rc1 is out for new stuff, you know this...

thanks,

greg k-h

Tetsuo Handa

unread,
Apr 29, 2023, 6:44:03 AM4/29/23
to Al Viro, Nathan Chancellor, Jiri Slaby, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Indeed hung_up_tty_ioctl() and hung_up_tty_compat_ioctl() are identical.
We can remove hung_up_tty_compat_ioctl() if we don't consider

ld = tty_ldisc_ref_wait(tty);
if (!ld)
- return hung_up_tty_compat_ioctl(file, cmd, arg);
+ return hung_up_tty_ioctl(file, cmd, arg);
if (ld->ops->compat_ioctl)
retval = ld->ops->compat_ioctl(tty, cmd, arg);

at tty_compat_ioctl() as strange looking.

Maybe adding "inline" keyword to hung_up_tty_*() definitions
because hung_up_tty_fops is removed?

Guenter Roeck

unread,
Apr 29, 2023, 11:59:21 AM4/29/23
to Greg Kroah-Hartman, Tetsuo Handa, Nathan Chancellor, Al Viro, Jiri Slaby, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
That happens all the time, and quite often the "late" additions are applied
to mainline immediately afterwards and end up causing problems there.
I can only hope that this won't happen with this one.

> picked it up?
>
> Please wait until after -rc1 is out for new stuff, you know this...
>

This patch is supposed to fix a data race, so maybe it wasn't considered
"new". Just guessing, of course.

Guenter

Geert Uytterhoeven

unread,
May 2, 2023, 3:14:13 AM5/2/23
to Guenter Roeck, Greg Kroah-Hartman, Tetsuo Handa, Nathan Chancellor, Al Viro, Jiri Slaby, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Sat, Apr 29, 2023 at 5:56 PM Guenter Roeck <li...@roeck-us.net> wrote:
> On Fri, Apr 28, 2023 at 07:31:35PM +0200, Greg Kroah-Hartman wrote:
> > On Sat, Apr 29, 2023 at 01:41:02AM +0900, Tetsuo Handa wrote:
> > > On 2023/04/29 1:27, Nathan Chancellor wrote:
> > > > I see this change has shown up in -next as commit 4c87e9e5479b ("tty:
> > > > tty_io: remove hung_up_tty_fops"), where it causes the following warning
> > > > for configurations without CONFIG_COMPAT (I used ARCH=arm defconfig):
> > > >
> > > > drivers/tty/tty_io.c:446:13: warning: 'hung_up_tty_compat_ioctl' defined but not used [-Wunused-function]
> > > > 446 | static long hung_up_tty_compat_ioctl(struct file *file,
> > > > | ^~~~~~~~~~~~~~~~~~~~~~~~
> > > >
> > > > I am not sure if you just added that patch for additional test coverage
> > > > or for final acceptance but the following diff resolves this warning for
> > > > me, perhaps it can be folded in for a v2?
> > >
> > > Thank you for reporting. Yes, moving the definition will solve the warning.
> > >
> > > I'm testing this patch using my volatile tree before getting merged to a
> > > permanent tree.
> >
> > How are you adding new stuff to linux-next right now when the tree
> > should be "closed"? And I haven't reviewed this yet, what tree has
>
> That happens all the time, and quite often the "late" additions are applied
> to mainline immediately afterwards and end up causing problems there.
> I can only hope that this won't happen with this one.

And it might be picked up by stable ;-(

> > picked it up?
> >
> > Please wait until after -rc1 is out for new stuff, you know this...
>
> This patch is supposed to fix a data race, so maybe it wasn't considered
> "new". Just guessing, of course.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Tetsuo Handa

unread,
May 13, 2023, 9:02:49 PM5/13/23
to Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
Finally, remove hung_up_tty_fops and mark callbacks used by
hung_up_tty_fops as "inline".

Reported-by: syzbot <syzbot+b7c3ba...@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b7c3ba8cdc2f6cf83c21
Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvy...@google.com>
---
Changes in v2:
Mark callbacks used by hung_up_tty_fops as "inline" in order to fix
-Wunused-function warning when CONFIG_COMPAT=n, reported by
Nathan Chancellor <nat...@kernel.org> and Arnd Bergmann <ar...@kernel.org>.

drivers/tty/tty_io.c | 49 ++++++++++++++++++++++----------------------
include/linux/tty.h | 1 +
2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index c84be40fb8df..bff0a2ffa68f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -182,7 +182,7 @@ int tty_alloc_file(struct file *file)
{
struct tty_file_private *priv;

- priv = kmalloc(sizeof(*priv), GFP_KERNEL);
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;

@@ -421,35 +421,35 @@ struct tty_driver *tty_find_polling_driver(char *name, int *line)
EXPORT_SYMBOL_GPL(tty_find_polling_driver);
#endif

-static ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
+static inline ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
{
return 0;
}

-static ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
+static inline ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
{
return -EIO;
}

/* No kernel lock held - none needed ;) */
-static __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
+static inline __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
{
return EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | EPOLLRDNORM | EPOLLWRNORM;
}

-static long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
+static inline long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
}

-static long hung_up_tty_compat_ioctl(struct file *file,
+static inline long hung_up_tty_compat_ioctl(struct file *file,
unsigned int cmd, unsigned long arg)
{
return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
}

-static int hung_up_tty_fasync(int fd, struct file *file, int on)
+static inline int hung_up_tty_fasync(int fd, struct file *file, int on)
{
return -ENOTTY;
}
index e8d5d9997aca..9ea30598c576 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -255,6 +255,7 @@ struct tty_file_private {
struct tty_struct *tty;
struct file *file;
struct list_head list;
+ bool hung;
};

/**
--
2.18.4

Greg Kroah-Hartman

unread,
May 30, 2023, 6:44:45 AM5/30/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sun, May 14, 2023 at 10:02:26AM +0900, Tetsuo Handa wrote:
> syzbot is reporting data race between __tty_hangup() and __fput(), for
> filp->f_op readers are not holding tty->files_lock.
>
> Dmitry Vyukov mentioned that this race has possibility of NULL pointer
> dereference, for tty_fops implements e.g. splice_read callback whereas
> hung_up_tty_fops does not.
>
> CPU0 CPU1
> ---- ----
> do_splice_to() {
> __tty_hangup() {
> // f_op->splice_read was generic_file_splice_read
> if (unlikely(!in->f_op->splice_read))
> return warn_unsupported(in, "read");
> filp->f_op = &hung_up_tty_fops;
> // f_op->splice_read is now NULL
> return in->f_op->splice_read(in, ppos, pipe, len, flags);
> }
> }
>
> If we care about only NULL pointer dereference, implementing missing
> callbacks to hung_up_tty_fops is fine. But if we also care about KCSAN
> reports, we will need to wrap all filp->f_op usages which are reachable
> via tty_fops callbacks using data_race().

I'm missing something here. Why would KCSAN report problems if we
implement the needed callbacks in hung_up_tty_fops? And what reports
would they be?

And why would data_race() help here?
Why is this zeroing out everything now? Just because you added one
bool? Why not just set the bool properly instead?

> if (!priv)
> return -ENOMEM;
>
> @@ -421,35 +421,35 @@ struct tty_driver *tty_find_polling_driver(char *name, int *line)
> EXPORT_SYMBOL_GPL(tty_find_polling_driver);
> #endif
>
> -static ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
> +static inline ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
> {
> return 0;
> }
>
> -static ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
> +static inline ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
> {
> return -EIO;
> }
>
> /* No kernel lock held - none needed ;) */
> -static __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
> +static inline __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
> {
> return EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | EPOLLRDNORM | EPOLLWRNORM;
> }
>
> -static long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
> +static inline long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
> }
>
> -static long hung_up_tty_compat_ioctl(struct file *file,
> +static inline long hung_up_tty_compat_ioctl(struct file *file,
> unsigned int cmd, unsigned long arg)
> {
> return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
> }

Marking these as inline, and then treating them as a function pointer,
seems like a horrid way to work around a compiler warning. As they
really are not inline functions anymore, but yet the compiler doesn't
know that. Odds are once the compiler gets smarter, the warnings will
return, so please, solve this properly.
Why accept it? Say why it's not really an issue here.

> }
> spin_unlock(&tty->files_lock);
>
> @@ -743,7 +733,9 @@ void tty_vhangup_session(struct tty_struct *tty)
> */
> int tty_hung_up_p(struct file *filp)
> {
> - return (filp && filp->f_op == &hung_up_tty_fops);
> + return filp && filp->f_op == &tty_fops &&
> + /* Accept race with __tty_hangup(). */
> + data_race(((struct tty_file_private *) filp->private_data)->hung);

Same here.

> }
> EXPORT_SYMBOL(tty_hung_up_p);
>
> @@ -911,6 +903,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
> struct tty_struct *tty = file_tty(file);
> struct tty_ldisc *ld;
>
> + if (tty_hung_up_p(file))
> + return hung_up_tty_read(iocb, to);

What happens if you hang up _right_ after this check? There's no
locking here, right? Same everywhere else you have this pattern, you
made the race window smaller, but it's still there from what I can see.

> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -255,6 +255,7 @@ struct tty_file_private {
> struct tty_struct *tty;
> struct file *file;
> struct list_head list;
> + bool hung;

No hint as to what "hung" means here?

thanks,

greg k-h

Tetsuo Handa

unread,
May 30, 2023, 7:58:01 AM5/30/23
to Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On 2023/05/30 19:44, Greg Kroah-Hartman wrote:
> On Sun, May 14, 2023 at 10:02:26AM +0900, Tetsuo Handa wrote:
>> If we care about only NULL pointer dereference, implementing missing
>> callbacks to hung_up_tty_fops is fine. But if we also care about KCSAN
>> reports, we will need to wrap all filp->f_op usages which are reachable
>> via tty_fops callbacks using data_race().
>
> I'm missing something here. Why would KCSAN report problems if we
> implement the needed callbacks in hung_up_tty_fops? And what reports
> would they be?

Unlike atomic operations such as atomic_read()/atomic_set(), normal read/write
operations are not atomic for KCSAN. KCSAN reports some value being changed
during a read/write.

In this report, KCSAN detected that __tty_hangup() changed the value of
filp->f_op from 0xffffffff84e91ed0 to 0xffffffff84e91dc0 at

filp->f_op = &hung_up_tty_fops;

line when __fput() was reading the value of filp->f_op at

if (file->f_op->release)

line.

Even if we implement the needed callbacks in hung_up_tty_fops,
KCSAN will continue reporting that the value of filp->f_op changes.

>
> And why would data_race() help here?

data_race() tells KCSAN not to report.
data_race() is used when the race KCSAN checks is harmless.



>> @@ -182,7 +182,7 @@ int tty_alloc_file(struct file *file)
>> {
>> struct tty_file_private *priv;
>>
>> - priv = kmalloc(sizeof(*priv), GFP_KERNEL);
>> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>
> Why is this zeroing out everything now? Just because you added one
> bool? Why not just set the bool properly instead?

Because I consider that this function is not performance critical where
avoid increasing code size by zeroing out everything is acceptable.



>> -static long hung_up_tty_compat_ioctl(struct file *file,
>> +static inline long hung_up_tty_compat_ioctl(struct file *file,
>> unsigned int cmd, unsigned long arg)
>> {
>> return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
>> }
>
> Marking these as inline, and then treating them as a function pointer,
> seems like a horrid way to work around a compiler warning. As they
> really are not inline functions anymore, but yet the compiler doesn't
> know that. Odds are once the compiler gets smarter, the warnings will
> return, so please, solve this properly.

Since this patch removes "struct file_operations hung_up_tty_fops"
which was the only source of treating as a function pointer,
these inlined functions are no longer treated as a function pointer.



>> @@ -619,7 +608,8 @@ static void __tty_hangup(struct tty_struct *tty, int exit_session)
>> continue;
>> closecount++;
>> __tty_fasync(-1, filp, 0); /* can't block */
>> - filp->f_op = &hung_up_tty_fops;
>> + /* Accept race with tty_hung_up_p() test. */
>> + data_race(priv->hung = true);
>
> Why accept it? Say why it's not really an issue here.

Because whether tty_hung_up_p() sees true or false due to concurrent
access does not matter. The race KCSAN reported is harmless (unless
callbacks suddenly disappear).



>> @@ -743,7 +733,9 @@ void tty_vhangup_session(struct tty_struct *tty)
>> */
>> int tty_hung_up_p(struct file *filp)
>> {
>> - return (filp && filp->f_op == &hung_up_tty_fops);
>> + return filp && filp->f_op == &tty_fops &&
>> + /* Accept race with __tty_hangup(). */
>> + data_race(((struct tty_file_private *) filp->private_data)->hung);
>
> Same here.

Because whether __tty_hangup() already changed from false to true due to
concurrent access does not matter. The race KCSAN reported is harmless (unless
callbacks suddenly disappear).



>> @@ -911,6 +903,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
>> struct tty_struct *tty = file_tty(file);
>> struct tty_ldisc *ld;
>>
>> + if (tty_hung_up_p(file))
>> + return hung_up_tty_read(iocb, to);
>
> What happens if you hang up _right_ after this check? There's no
> locking here, right? Same everywhere else you have this pattern, you
> made the race window smaller, but it's still there from what I can see.

We cannot close the race window without introducing locking,
but we don't need to close the race window.

The race KCSAN found in this report is harmless, as long as callbacks
reachable via filp->f_op does not disappear.

This patch prevents filp->f_op from suddenly disappearing callbacks,
by not changing the value of filp->f_op.



>> @@ -255,6 +255,7 @@ struct tty_file_private {
>> struct tty_struct *tty;
>> struct file *file;
>> struct list_head list;
>> + bool hung;
>
> No hint as to what "hung" means here?

Whether __tty_hangup() was called or not.

Greg Kroah-Hartman

unread,
May 30, 2023, 8:51:10 AM5/30/23
to Tetsuo Handa, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Tue, May 30, 2023 at 08:57:42PM +0900, Tetsuo Handa wrote:
> On 2023/05/30 19:44, Greg Kroah-Hartman wrote:
> > On Sun, May 14, 2023 at 10:02:26AM +0900, Tetsuo Handa wrote:
> >> If we care about only NULL pointer dereference, implementing missing
> >> callbacks to hung_up_tty_fops is fine. But if we also care about KCSAN
> >> reports, we will need to wrap all filp->f_op usages which are reachable
> >> via tty_fops callbacks using data_race().
> >
> > I'm missing something here. Why would KCSAN report problems if we
> > implement the needed callbacks in hung_up_tty_fops? And what reports
> > would they be?
>
> Unlike atomic operations such as atomic_read()/atomic_set(), normal read/write
> operations are not atomic for KCSAN. KCSAN reports some value being changed
> during a read/write.
>
> In this report, KCSAN detected that __tty_hangup() changed the value of
> filp->f_op from 0xffffffff84e91ed0 to 0xffffffff84e91dc0 at
>
> filp->f_op = &hung_up_tty_fops;
>
> line when __fput() was reading the value of filp->f_op at
>
> if (file->f_op->release)
>
> line.
>
> Even if we implement the needed callbacks in hung_up_tty_fops,
> KCSAN will continue reporting that the value of filp->f_op changes.

That sounds like a bug in KCSAN, let's not add loads of infrastructure
just because we have bad tools.

> > And why would data_race() help here?
>
> data_race() tells KCSAN not to report.
> data_race() is used when the race KCSAN checks is harmless.

Again, document it, and also perhaps, not use KCSAN? :)

> >> @@ -182,7 +182,7 @@ int tty_alloc_file(struct file *file)
> >> {
> >> struct tty_file_private *priv;
> >>
> >> - priv = kmalloc(sizeof(*priv), GFP_KERNEL);
> >> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> >
> > Why is this zeroing out everything now? Just because you added one
> > bool? Why not just set the bool properly instead?
>
> Because I consider that this function is not performance critical where
> avoid increasing code size by zeroing out everything is acceptable.

It happens on open() which yes, is not performance critical, but you are
now requiring it where before this was not required. Which isn't always
so obvious, right?

> >> @@ -911,6 +903,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
> >> struct tty_struct *tty = file_tty(file);
> >> struct tty_ldisc *ld;
> >>
> >> + if (tty_hung_up_p(file))
> >> + return hung_up_tty_read(iocb, to);
> >
> > What happens if you hang up _right_ after this check? There's no
> > locking here, right? Same everywhere else you have this pattern, you
> > made the race window smaller, but it's still there from what I can see.
>
> We cannot close the race window without introducing locking,
> but we don't need to close the race window.
>
> The race KCSAN found in this report is harmless, as long as callbacks
> reachable via filp->f_op does not disappear.

Which we can fix. So let's fix that and then not worry about these
false-positives with KCSAN as it's obviously wrong. That would make for
a much smaller and simpler and easier-to-maintain-over-time change.

Please do that instead.

> This patch prevents filp->f_op from suddenly disappearing callbacks,
> by not changing the value of filp->f_op.
>
>
>
> >> @@ -255,6 +255,7 @@ struct tty_file_private {
> >> struct tty_struct *tty;
> >> struct file *file;
> >> struct list_head list;
> >> + bool hung;
> >
> > No hint as to what "hung" means here?
>
> Whether __tty_hangup() was called or not.

How will you know this in 5 years when you see this new field?
Documentation matters.

thanks,

greg k-h

Tetsuo Handa

unread,
Apr 27, 2024, 2:20:58 AMApr 27
to Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby, Linus Torvalds
syzbot is reporting data race between __tty_hangup() and __fput(), for
filp->f_op readers are not holding tty->files_lock.

Dmitry Vyukov mentioned that this race has possibility of NULL pointer
dereference, for tty_fops implements e.g. splice_read callback whereas
hung_up_tty_fops does not.

CPU0 CPU1
---- ----
do_splice_to() {
__tty_hangup() {
// f_op->splice_read was generic_file_splice_read
if (unlikely(!in->f_op->splice_read))
return warn_unsupported(in, "read");
filp->f_op = &hung_up_tty_fops;
// f_op->splice_read is now NULL
return in->f_op->splice_read(in, ppos, pipe, len, flags);
}
}

If we care about only NULL pointer dereference, implementing missing
callbacks to hung_up_tty_fops is fine. But if we also care about KCSAN
reports, we will need to wrap all filp->f_op usages which are reachable
via tty_fops callbacks using data_race().

Changes in v3:
Accept some of comments from Greg Kroah-Hartman
at https://lkml.kernel.org/r/2023053048-saved-undated-9adf@gregkh .

> That sounds like a bug in KCSAN, let's not add loads of infrastructure
> just because we have bad tools.
Not a bug in KCSAN. KCSAN is reporting that current code has race window
for NULL pointer dereference. This patch closes the race window.

> Again, document it, and also perhaps, not use KCSAN? :)
data_race() is already explained. Not using KCSAN is not a solution.

> It happens on open() which yes, is not performance critical, but you are
> now requiring it where before this was not required. Which isn't always
> so obvious, right?
Updated to do explicit initialization.

> Which we can fix. So let's fix that and then not worry about these
> false-positives with KCSAN as it's obviously wrong. That would make for
> a much smaller and simpler and easier-to-maintain-over-time change.
I disagree. We will surely forget to add corresponding dummy callback to
hung_up_tty_fops when a new callback is added to "struct file_operations".
Not changing filp->f_op after once published is the safer approach.

> How will you know this in 5 years when you see this new field?
> Documentation matters.
Added comment.

Changes in v2:
Mark callbacks used by hung_up_tty_fops as "inline" in order to fix
-Wunused-function warning when CONFIG_COMPAT=n, reported by
Nathan Chancellor <nat...@kernel.org> and Arnd Bergmann <ar...@kernel.org>.

drivers/tty/tty_io.c | 48 ++++++++++++++++++++++----------------------
include/linux/tty.h | 1 +
2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 407b0d87b7c1..88b0e3221195 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -187,6 +187,7 @@ int tty_alloc_file(struct file *file)
if (!priv)
return -ENOMEM;

+ priv->hung = false;
file->private_data = priv;

return 0;
@@ -420,35 +421,35 @@ struct tty_driver *tty_find_polling_driver(char *name, int *line)
EXPORT_SYMBOL_GPL(tty_find_polling_driver);
#endif

-static ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
+static inline ssize_t hung_up_tty_read(struct kiocb *iocb, struct iov_iter *to)
{
return 0;
}

-static ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
+static inline ssize_t hung_up_tty_write(struct kiocb *iocb, struct iov_iter *from)
{
return -EIO;
}

/* No kernel lock held - none needed ;) */
-static __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
+static inline __poll_t hung_up_tty_poll(struct file *filp, poll_table *wait)
{
return EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP | EPOLLRDNORM | EPOLLWRNORM;
}

-static long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
+static inline long hung_up_tty_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
}

-static long hung_up_tty_compat_ioctl(struct file *file,
+static inline long hung_up_tty_compat_ioctl(struct file *file,
unsigned int cmd, unsigned long arg)
{
return cmd == TIOCSPGRP ? -ENOTTY : -EIO;
}

-static int hung_up_tty_fasync(int fd, struct file *file, int on)
+static inline int hung_up_tty_fasync(int fd, struct file *file, int on)
{
return -ENOTTY;
}
@@ -490,17 +491,6 @@ static const struct file_operations console_fops = {
.fasync = tty_fasync,
};

-static const struct file_operations hung_up_tty_fops = {
- .llseek = no_llseek,
- .read_iter = hung_up_tty_read,
- .write_iter = hung_up_tty_write,
- .poll = hung_up_tty_poll,
- .unlocked_ioctl = hung_up_tty_ioctl,
- .compat_ioctl = hung_up_tty_compat_ioctl,
- .release = tty_release,
- .fasync = hung_up_tty_fasync,
-};
-
static DEFINE_SPINLOCK(redirect_lock);
static struct file *redirect;

@@ -618,7 +608,8 @@ static void __tty_hangup(struct tty_struct *tty, int exit_session)
continue;
closecount++;
__tty_fasync(-1, filp, 0); /* can't block */
- filp->f_op = &hung_up_tty_fops;
+ /* Accept race with tty_hung_up_p() test. */
+ data_race(priv->hung = true);
}
spin_unlock(&tty->files_lock);

@@ -742,7 +733,9 @@ void tty_vhangup_session(struct tty_struct *tty)
*/
int tty_hung_up_p(struct file *filp)
{
- return (filp && filp->f_op == &hung_up_tty_fops);
+ return filp && filp->f_op == &tty_fops &&
+ /* Accept race with __tty_hangup(). */
+ data_race(((struct tty_file_private *) filp->private_data)->hung);
}
EXPORT_SYMBOL(tty_hung_up_p);

@@ -921,6 +914,8 @@ static ssize_t tty_read(struct kiocb *iocb, struct iov_iter *to)
struct tty_ldisc *ld;
ssize_t ret;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_read(iocb, to);
if (tty_paranoia_check(tty, inode, "tty_read"))
return -EIO;
if (!tty || tty_io_error(tty))
@@ -1080,6 +1075,8 @@ static ssize_t file_tty_write(struct file *file, struct kiocb *iocb, struct iov_
struct tty_ldisc *ld;
ssize_t ret;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_write(iocb, from);
if (tty_paranoia_check(tty, file_inode(file), "tty_write"))
return -EIO;
if (!tty || !tty->ops->write || tty_io_error(tty))
@@ -2166,11 +2163,6 @@ static int tty_open(struct inode *inode, struct file *filp)
return retval;

schedule();
- /*
- * Need to reset f_op in case a hangup happened.
- */
- if (tty_hung_up_p(filp))
- filp->f_op = &tty_fops;
goto retry_open;
}
clear_bit(TTY_HUPPED, &tty->flags);
@@ -2204,6 +2196,8 @@ static __poll_t tty_poll(struct file *filp, poll_table *wait)
struct tty_ldisc *ld;
__poll_t ret = 0;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_poll(filp, wait);
if (tty_paranoia_check(tty, file_inode(filp), "tty_poll"))
return 0;

@@ -2256,6 +2250,8 @@ static int tty_fasync(int fd, struct file *filp, int on)
struct tty_struct *tty = file_tty(filp);
int retval = -ENOTTY;

+ if (tty_hung_up_p(filp))
+ return hung_up_tty_fasync(fd, filp, on);
tty_lock(tty);
if (!tty_hung_up_p(filp))
retval = __tty_fasync(fd, filp, on);
@@ -2684,6 +2680,8 @@ long tty_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
int retval;
struct tty_ldisc *ld;

+ if (tty_hung_up_p(file))
+ return hung_up_tty_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;

@@ -2969,6 +2967,8 @@ static long tty_compat_ioctl(struct file *file, unsigned int cmd,
return tty_ioctl(file, cmd, arg);
}

+ if (tty_hung_up_p(file))
+ return hung_up_tty_compat_ioctl(file, cmd, arg);
if (tty_paranoia_check(tty, file_inode(file), "tty_ioctl"))
return -EINVAL;

diff --git a/include/linux/tty.h b/include/linux/tty.h
index 2b2e6f0a54d6..4bf958efd6be 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -254,6 +254,7 @@ struct tty_file_private {
struct tty_struct *tty;
struct file *file;
struct list_head list;
+ bool hung; /* Whether __tty_hangup() was called or not. */
};

/**
--
2.18.4

Linus Torvalds

unread,
Apr 27, 2024, 3:02:53 PMApr 27
to Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Fri, 26 Apr 2024 at 23:21, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> syzbot is reporting data race between __tty_hangup() and __fput(), for
> filp->f_op readers are not holding tty->files_lock.

Hmm. I looked round, and we actually have another case of this:
snd_card_disconnect() also does

mfile->file->f_op = &snd_shutdown_f_ops;

and I don't think tty->files_lock (or, in the sound case,
&card->files_lock) is at all relevant, since the users of f_ops don't
use it or care.

That said, I really think we'd be better off just keeping the current
model, and have the "you get one or the other". For the two cases that
do this, do that f_op replacement with a WRITE_ONCE(), and just make
the rule be that you have to have all the same ops in both the
original and the shutdown version.

I do *not* think it's at all better to replace (in two different
places) the racy f_op thing with another racy 'hungup' flag.

The sound case is actually a bit more involved, since it tries to deal
with module counts. That looks potentially bogus. It does

fops_get(mfile->file->f_op);

after it has installed the snd_shutdown_f_ops, but in snd_open() it
has done the proper

replace_fops(file, new_fops);

which actually drops the module count for the old one. So the sound
case seems to possibly leak a module ref on disconnect. That's a
separate issue, though.

Linus

Linus

Tetsuo Handa

unread,
Apr 28, 2024, 6:20:14 AMApr 28
to Linus Torvalds, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On 2024/04/28 4:02, Linus Torvalds wrote:
> On Fri, 26 Apr 2024 at 23:21, Tetsuo Handa
> <penguin...@i-love.sakura.ne.jp> wrote:
>>
>> syzbot is reporting data race between __tty_hangup() and __fput(), for
>> filp->f_op readers are not holding tty->files_lock.
>
> Hmm. I looked round, and we actually have another case of this:
> snd_card_disconnect() also does
>
> mfile->file->f_op = &snd_shutdown_f_ops;

OK. That one needs to be fixed as well.

>
> and I don't think tty->files_lock (or, in the sound case,
> &card->files_lock) is at all relevant, since the users of f_ops don't
> use it or care.

More precisely, the users of f_op can't access it. For example,
do_splice_read() cannot understand that "in" argument refers to a tty
device and therefore will not know about tty->files_lock.

>
> That said, I really think we'd be better off just keeping the current
> model, and have the "you get one or the other". For the two cases that
> do this, do that f_op replacement with a WRITE_ONCE(), and just make
> the rule be that you have to have all the same ops in both the
> original and the shutdown version.

If we keep the current model, WRITE_ONCE() is not sufficient.

My understanding is that KCSAN's report like
https://elixir.bootlin.com/linux/v6.9-rc5/source/Documentation/dev-tools/kcsan.rst#L56
will remain unless we wrap all f_op readers using data_race() macro. That is,
we will need to define a wrapper like

static inline struct file_operations *f_op(struct file *file)
{
/*
* Ignore race in order to silence KCSAN, for __tty_hangup() or
* snd_card_disconnect() might update f_op while file is in use.
*/
return data_race(file->f_op);
}

and do for example

- if (unlikely(!in->f_op->splice_read))
+ if (unlikely(!f_op(in)->splice_read))

for https://elixir.bootlin.com/linux/v6.9-rc5/source/fs/splice.c#L977 and

- return in->f_op->splice_read(in, ppos, pipe, len, flags);
+ return f_op(in)->splice_read(in, ppos, pipe, len, flags);

for https://elixir.bootlin.com/linux/v6.9-rc5/source/fs/splice.c#L985 .

Are VFS people happy with such change? I guess that VFS people assume that
file->f_op does not get updated while file is in use. Also, such data_race()
usage does not match one of situations listed in
https://elixir.bootlin.com/linux/v6.9-rc5/source/tools/memory-model/Documentation/access-marking.txt#L58 .

>
> I do *not* think it's at all better to replace (in two different
> places) the racy f_op thing with another racy 'hungup' flag.

This approach allows VFS people to assume that file->f_op does not
get updated while file is in use.

>
> The sound case is actually a bit more involved, since it tries to deal
> with module counts. That looks potentially bogus. It does
>
> fops_get(mfile->file->f_op);
>
> after it has installed the snd_shutdown_f_ops, but in snd_open() it
> has done the proper
>
> replace_fops(file, new_fops);

replace_fops() is intended to be used *ONLY* from ->open() instances.

Linus Torvalds

unread,
Apr 28, 2024, 2:50:35 PMApr 28
to Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sun, 28 Apr 2024 at 03:20, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
>
> If we keep the current model, WRITE_ONCE() is not sufficient.
>
> My understanding is that KCSAN's report like

I find it obnoxious that these are NOT REAL PROBLEMS.

It's KCSAN that is broken and doesn't allow us to just tell it to
sanely ignore things.

I don't want to add stupid and pointless annotations for a broken tooling.

Can you instead just ask the KCSAN people to have some mode where we
can annotate a pointer as a "use one or the other", and just shut that
thing up that way?

Because no, we're not adding some idiotic "f_op()" wrapper just to
shut KCSAN up about a non-issue.

Linus

Marco Elver

unread,
Apr 29, 2024, 9:56:06 AMApr 29
to Linus Torvalds, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby, Paul E. McKenney
On Sun, 28 Apr 2024 at 20:50, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Sun, 28 Apr 2024 at 03:20, Tetsuo Handa
> <penguin...@i-love.sakura.ne.jp> wrote:
> >
> > If we keep the current model, WRITE_ONCE() is not sufficient.

FWIW, the original report here came from syzbot, which is configured
so that the WRITE_ONCE() is sufficient
(CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=n,
CONFIG_KCSAN_IGNORE_ATOMICS=y ... long names, I know), because this
was an idiom we ran into in the past and just wanted to filter them
out (for better or worse).

That being said, the reader side still has a real problem, even if
hidden in some KCSAN configs. Up to you if the WRITE_ONCE() is
sufficient or not, at least on syzbot this case wouldn't resurface
(for now).

> > My understanding is that KCSAN's report like
>
> I find it obnoxious that these are NOT REAL PROBLEMS.
>
> It's KCSAN that is broken and doesn't allow us to just tell it to
> sanely ignore things.
>
> I don't want to add stupid and pointless annotations for a broken tooling.

KCSAN is the messenger in this case that our mental model vs. what our
language/memory model gives us is wrong: we already have our own
memory model (!= C11), but we still have data races, and have to deal
with the fallout. Data races (here: 2 plain unmarked accesses of the
pointers) still exist, and the compiler is still free to optimize them
(miscompile them according to our mental model).

Assuming the data race is not a problem assumes all compilers on all
architectures won't mess up the accesses.

This comes up over and over, and the problem hasn't gone away. Our
compilers still don't know about the kernel doing things outside the
scope of standard C - we can beat the compiler into submission with
lots of flags, but we know [1] compilers break our assumptions. What's
the long-term fix? I don't know, besides trying to teach compilers
more of what Linux wants.

[1] https://lpc.events/event/16/contributions/1174/attachments/1108/2121/Status%20Report%20-%20Broken%20Dependency%20Orderings%20in%20the%20Linux%20Kernel.pdf

> Can you instead just ask the KCSAN people to have some mode where we
> can annotate a pointer as a "use one or the other", and just shut that
> thing up that way?

Saying "use one or the other" pointer, implies atomicity of the load
and store. Telling KCSAN about our assumption of what we think the
compiler should do is the wrong way around - we should tell the
compiler. Which part of the code is telling the compiler we want
atomicity of the loaded pointers?

A WRITE_ONCE() / READ_ONCE() pair would do it here. What should we use instead?


Thanks,
-- Marco

Linus Torvalds

unread,
Apr 29, 2024, 11:38:49 AMApr 29
to Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby, Paul E. McKenney
On Mon, 29 Apr 2024 at 06:56, Marco Elver <el...@google.com> wrote:
>
> A WRITE_ONCE() / READ_ONCE() pair would do it here. What should we use instead?

Why would we annotate a "any other code generation is insane" issues at all?

When we do chained pointer loads in

file->f_op->op()

and we say "I don't care what value I get for the middle one", I don't
see the value in annotating that at all.

There is no compiler that will sanely and validly do a pointer chain
load by *anything* but a load. And it doesn't matter to us if it then
spills and reloads, it will *STILL* be a load.

We're not talking about "extract different bits in separate
operations". We're talking about following one pointer that can point
to two separate static values.

Reality matters. A *lot* more than some "C standard" that we already
have ignored for decades because it's not strong enough.

Linus

Paul E. McKenney

unread,
May 1, 2024, 2:46:00 PMMay 1
to Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Mon, Apr 29, 2024 at 08:38:28AM -0700, Linus Torvalds wrote:
> On Mon, 29 Apr 2024 at 06:56, Marco Elver <el...@google.com> wrote:
> >
> > A WRITE_ONCE() / READ_ONCE() pair would do it here. What should we use instead?
>
> Why would we annotate a "any other code generation is insane" issues at all?
>
> When we do chained pointer loads in
>
> file->f_op->op()
>
> and we say "I don't care what value I get for the middle one", I don't
> see the value in annotating that at all.

In code that isn't being actively developed, sticks to known patterns (as
above), or hides the lockless accesses behind a good API, this can make
a lot of sense. And I certainly have talked to a few people who feel
that KCSAN is nothing but an irritant, and are not willing to make any
concessions whatsoever to it. In fact, many of them seem to wish that
it would disappear completely. Of course, that wish is their privilege.

But in RCU, new patterns are often being created, and so I am quite
happy to give KCSAN additional information in order to help it help me.
I am also quite happy to run KCSAN with its most aggressive settings,
also to help it help me. In my experience, it is way easier having KCSAN
spot a data-race bug than to have to find it the hard way, but perhaps I
am just showing my age. In addition, KCSAN does a tireless and thorough
(if somewhat simple-minded) code review of the full RCU code base,
and can easily be persuaded to do so each and every day, if desired.
Just *you* try doing that manually, whatever your age! ;-)

Plus, the documentation benefits are significant. "Wait, which of
these unmarked accesses is to a shared variable?" In the wee hours of
the morning while chasing a bug. Especially if the person doing the
chasing is an innocent bystander who is not already an expert on the
code currently being investigated. :-/

Oddly enough, the simplest concurrency designs also want a maximally
aggressive KCSAN. If you are using pure locking with absolutely no
lockless accesses, then any data race at all is a bug. Again, it is
a lot easier for KCSAN to tell you that you forgot to acquire the lock
than to find out the hard way.

> There is no compiler that will sanely and validly do a pointer chain
> load by *anything* but a load. And it doesn't matter to us if it then
> spills and reloads, it will *STILL* be a load.
>
> We're not talking about "extract different bits in separate
> operations". We're talking about following one pointer that can point
> to two separate static values.
>
> Reality matters. A *lot* more than some "C standard" that we already
> have ignored for decades because it's not strong enough.

Agreed, but it also appears that different developers and maintainers in
different parts of the kernel are looking for different things from KCSAN.

To illustrate my personal concerns, I confess to being a bit disgusted by
those pontificating on software reliability, especially when they compare
it unfavorably to things like house construction. The difference is of
course that the average house is not under active attack by nation states.
In contrast, whether we like it or not, the Linux kernel is under active
attack by nation states, organized crime, and who knows what all else.
For RCU at least, I will take all the help I can get, even if it requires
me to do a little bit of work up front.

In short, I for one do greatly value KCSAN's help. Along with that of
a great many other tools, none of which are perfect, but all of which
are helpful.

Thanx, Paul

Linus Torvalds

unread,
May 1, 2024, 2:56:47 PMMay 1
to pau...@kernel.org, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, 1 May 2024 at 11:46, Paul E. McKenney <pau...@kernel.org> wrote:
>
> In short, I for one do greatly value KCSAN's help. Along with that of
> a great many other tools, none of which are perfect, but all of which
> are helpful.

It's not that I don't value what KCSAN does, but I really think this
is a KCSAN issue.

I absolutely *detest* these crazy "randomly add data race annotations".

Could we instead annotate particular structure fields? I don't want to
mark things actually "volatile", because that then causes the compiler
to generate absolutely horrendous code. But some KCSAN equivalent of
"this field has data races, and we don't care" kind of annotation
would be lovely..

Linus

Paul E. McKenney

unread,
May 1, 2024, 3:02:35 PMMay 1
to Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
That would give the poor sleep-deprived innocent bystander some way
to figure out which fields were shared, so that does sound like a good
improvement!

I would naively expect that KCSAN's ability to handle volatile fields
would make this doable, but there is much that I do not know about
KCSAN internals. So I must defer to Marco on this one.

Thanx, Paul

Marco Elver

unread,
May 1, 2024, 4:15:32 PMMay 1
to pau...@kernel.org, Linus Torvalds, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
This is relatively trivial:

#ifdef __SANITIZE_THREAD__
#define __data_racy volatile
#endif

KCSAN will just ignore racy access to them (it will pretend they are "marked").

In some cases it might cause the compiler to complain if converting a
volatile pointer to a non-volatile pointer, but I suspect that the
corresponding pointer should then similarly be marked as __data_racy.
The fact that without casting the attribute is "viral" is probably WAI
even in this case.

Do we want this kind of attribute?

Thanks,
-- Marco

Linus Torvalds

unread,
May 1, 2024, 5:06:51 PMMay 1
to Marco Elver, pau...@kernel.org, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, 1 May 2024 at 13:15, Marco Elver <el...@google.com> wrote:
>
> This is relatively trivial:
>
> #ifdef __SANITIZE_THREAD__
> #define __data_racy volatile
> #endif

I really wouldn't want to make a code generation difference, but I
guess when the sanitizer is on, the compiler generating crap code
isn't a huge deal.

> In some cases it might cause the compiler to complain if converting a
> volatile pointer to a non-volatile pointer

No. Note that it's not the *pointer* that is volatile, it's the
structure member.

So it would be something like

const struct file_operations * __data_racy f_op;

and only the load of f_op would be volatile - not the pointer itself.

Of course, if somebody then does "&file->f_op" to get a pointer to a
pointer, *that* would now be a volatile pointer, but I don't see
people doing that.

So I guess this might be a way forward. Anybody want to verify?

Now, the "hung_up_tty_fops" *do* need to be expanded to have hung up
ops for every op that is non-NULL in the normal tty ops. That was a
real bug. We'd also want to add a big comment to the tty fops to make
sure anybody who adds a new tty f_op member to make sure to populate
the hung up version too.

Linus

Linus Torvalds

unread,
May 1, 2024, 5:20:57 PMMay 1
to Marco Elver, pau...@kernel.org, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, 1 May 2024 at 14:06, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> So it would be something like
>
> const struct file_operations * __data_racy f_op;
>
> and only the load of f_op would be volatile - not the pointer itself.

Noe that in reality, we'd actually prefer the compiler to treat that
"__data_racy" as volatile in the sense of "don't reload this value",
but at the same time be the opposite of volatile in the sense that
using one read multiple times is actually a good idea.

IOW, the problem is rematerialization ("read the value more than once
when there is just one access in the source"), not strictly a "read
the value separately each time it is accessed".

We've actually had that before: it's not that we want each access to
force a read from memory, we want to avoid a TOCTOU race.

Many of our "READ_ONCE()" uses are of that kind, and using "volatile"
sadly generates horrible code, but is the only way to tell the
compiler to not ever rematerialize the value by loading it _twice_.

I'd love to see an extension where "const volatile" basically means
exactly that: the volatile tells the compiler that it can't
rematerialize by doing the load multiple times, but the "const" would
say that if the compiler sees two or more accesses, it can still CSE
them.

Oh well. Thankfully it's not a hugely common code generation problem.
It comes up every once in a while, and I think the last time this
worry came up, I think we had gcc people tell us that they don't
actually ever rematerialize loads from memory.

Of course, that was an implementation issue, not a guarantee.

Linus

Paul E. McKenney

unread,
May 1, 2024, 5:49:19 PMMay 1
to Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, May 01, 2024 at 02:20:35PM -0700, Linus Torvalds wrote:
> On Wed, 1 May 2024 at 14:06, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > So it would be something like
> >
> > const struct file_operations * __data_racy f_op;
> >
> > and only the load of f_op would be volatile - not the pointer itself.
>
> Noe that in reality, we'd actually prefer the compiler to treat that
> "__data_racy" as volatile in the sense of "don't reload this value",
> but at the same time be the opposite of volatile in the sense that
> using one read multiple times is actually a good idea.
>
> IOW, the problem is rematerialization ("read the value more than once
> when there is just one access in the source"), not strictly a "read
> the value separately each time it is accessed".
>
> We've actually had that before: it's not that we want each access to
> force a read from memory, we want to avoid a TOCTOU race.
>
> Many of our "READ_ONCE()" uses are of that kind, and using "volatile"
> sadly generates horrible code, but is the only way to tell the
> compiler to not ever rematerialize the value by loading it _twice_.
>
> I'd love to see an extension where "const volatile" basically means
> exactly that: the volatile tells the compiler that it can't
> rematerialize by doing the load multiple times, but the "const" would
> say that if the compiler sees two or more accesses, it can still CSE
> them.

No promises, other than that if we don't ask, they won't say "yes".

Let me see what can be done.

Thanx, Paul

Paul E. McKenney

unread,
May 1, 2024, 6:32:36 PMMay 1
to Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, May 01, 2024 at 02:49:17PM -0700, Paul E. McKenney wrote:
> On Wed, May 01, 2024 at 02:20:35PM -0700, Linus Torvalds wrote:
> > On Wed, 1 May 2024 at 14:06, Linus Torvalds
> > <torv...@linux-foundation.org> wrote:

[ . . . ]

> > I'd love to see an extension where "const volatile" basically means
> > exactly that: the volatile tells the compiler that it can't
> > rematerialize by doing the load multiple times, but the "const" would
> > say that if the compiler sees two or more accesses, it can still CSE
> > them.

Except that "const volatile" already means "you cannot write to it,
and reads will not be fused". :-/

> No promises, other than that if we don't ask, they won't say "yes".
>
> Let me see what can be done.

From a semantics viewpoint __atomic_load_n(&x, __ATOMIC_RELAXED) would
work for loading from x. The compilers that I tried currently do not
fuse loads, but they are allowed to do so.

Or is there something I am missing that would make this not work?
Aside from compilers not yet optimizing this case.

Thanx, Paul

Marco Elver

unread,
May 2, 2024, 10:15:07 AMMay 2
to Linus Torvalds, pau...@kernel.org, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, 1 May 2024 at 23:06, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, 1 May 2024 at 13:15, Marco Elver <el...@google.com> wrote:
> >
> > This is relatively trivial:
> >
> > #ifdef __SANITIZE_THREAD__
> > #define __data_racy volatile
> > #endif
>
> I really wouldn't want to make a code generation difference, but I
> guess when the sanitizer is on, the compiler generating crap code
> isn't a huge deal.
>
> > In some cases it might cause the compiler to complain if converting a
> > volatile pointer to a non-volatile pointer
>
> No. Note that it's not the *pointer* that is volatile, it's the
> structure member.
>
> So it would be something like
>
> const struct file_operations * __data_racy f_op;
>
> and only the load of f_op would be volatile - not the pointer itself.
>
> Of course, if somebody then does "&file->f_op" to get a pointer to a
> pointer, *that* would now be a volatile pointer, but I don't see
> people doing that.

This is the case I thought of. I still think everything is working as
intended then, since passing a pointer to a __data_racy variable
should be done with pointers to __data_racy (just like other type
qualifiers - the rules are by virtue of implementation equivalent to
volatile). Not a problem, just an observation.

> So I guess this might be a way forward. Anybody want to verify?

I sent a patch to add the type qualifier - in a simple test I added it
does what we want:
https://lore.kernel.org/all/20240502141242....@google.com/T/#u

I'll leave it to Tetsuo to amend the original patch if __data_racy makes sense.

Thanks,
-- Marco

Boqun Feng

unread,
May 2, 2024, 12:37:51 PMMay 2
to Paul E. McKenney, Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Wed, May 01, 2024 at 03:32:34PM -0700, Paul E. McKenney wrote:
> On Wed, May 01, 2024 at 02:49:17PM -0700, Paul E. McKenney wrote:
> > On Wed, May 01, 2024 at 02:20:35PM -0700, Linus Torvalds wrote:
> > > On Wed, 1 May 2024 at 14:06, Linus Torvalds
> > > <torv...@linux-foundation.org> wrote:
>
> [ . . . ]
>
> > > I'd love to see an extension where "const volatile" basically means
> > > exactly that: the volatile tells the compiler that it can't
> > > rematerialize by doing the load multiple times, but the "const" would
> > > say that if the compiler sees two or more accesses, it can still CSE
> > > them.
>
> Except that "const volatile" already means "you cannot write to it,
> and reads will not be fused". :-/
>
> > No promises, other than that if we don't ask, they won't say "yes".
> >
> > Let me see what can be done.
>
> >From a semantics viewpoint __atomic_load_n(&x, __ATOMIC_RELAXED) would
> work for loading from x. The compilers that I tried currently do not
> fuse loads, but they are allowed to do so.
>

Yeah, I wonder the same, from what I read, "const volatile" seems to
be just a (non-volatile) relaxed atomic load.

Regards,
Boqun

Tetsuo Handa

unread,
May 2, 2024, 12:42:41 PMMay 2
to Marco Elver, Linus Torvalds, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On 2024/05/02 23:14, Marco Elver wrote:
> I sent a patch to add the type qualifier - in a simple test I added it
> does what we want:
> https://lore.kernel.org/all/20240502141242....@google.com/T/#u

Want some updates to Documentation/process/volatile-considered-harmful.rst
because __data_racy is for patches to add volatile variables ?

Patches to remove volatile variables are generally welcome - as long as
they come with a justification which shows that the concurrency issues have
been properly thought through.

>
> I'll leave it to Tetsuo to amend the original patch if __data_racy makes sense.

OK if below change is acceptable.

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1012,7 +1012,7 @@ struct file {
struct file_ra_state f_ra;
struct path f_path;
struct inode *f_inode; /* cached value */
- const struct file_operations *f_op;
+ const __data_racy struct file_operations *f_op;

u64 f_version;
#ifdef CONFIG_SECURITY

Hmm, debugfs assumes that f_op does not change?

fs/debugfs/file.c: In function 'full_proxy_release':
fs/debugfs/file.c:357:45: warning: initialization discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
const struct file_operations *proxy_fops = filp->f_op;
^~~~

Marco Elver

unread,
May 2, 2024, 1:21:09 PMMay 2
to Tetsuo Handa, Linus Torvalds, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Thu, 2 May 2024 at 18:42, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> On 2024/05/02 23:14, Marco Elver wrote:
> > I sent a patch to add the type qualifier - in a simple test I added it
> > does what we want:
> > https://lore.kernel.org/all/20240502141242....@google.com/T/#u
>
> Want some updates to Documentation/process/volatile-considered-harmful.rst
> because __data_racy is for patches to add volatile variables ?

This has nothing to do with volatile. It's merely an implementation
artifact that in CONFIG_KCSAN builds __data_racy translates to
"volatile": the compiler will emit special instrumentation for
volatile accesses so that KCSAN thinks they are "marked". However,
volatile is and has been an implementation detail of certain
primitives like READ_ONCE()/WRITE_ONCE(), although as a developer
using this interface we should not be concerned with the fact that
there's volatile underneath. In a perfect world the compiler would
give us a better "tool" than volatile, but we have to make do with the
tools we have at our disposal today.

> Patches to remove volatile variables are generally welcome - as long as
> they come with a justification which shows that the concurrency issues have
> been properly thought through.

My suggestion is to forget about "volatile" and simply pretend it's
data_race() but as a type qualifier, like the bit of documentation I
added to Documentation/dev-tools/kcsan.rst in the patch.

> >
> > I'll leave it to Tetsuo to amend the original patch if __data_racy makes sense.
>
> OK if below change is acceptable.
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1012,7 +1012,7 @@ struct file {
> struct file_ra_state f_ra;
> struct path f_path;
> struct inode *f_inode; /* cached value */
> - const struct file_operations *f_op;
> + const __data_racy struct file_operations *f_op;
>
> u64 f_version;
> #ifdef CONFIG_SECURITY
>
> Hmm, debugfs assumes that f_op does not change?
>
> fs/debugfs/file.c: In function 'full_proxy_release':
> fs/debugfs/file.c:357:45: warning: initialization discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> const struct file_operations *proxy_fops = filp->f_op;
> ^~~~

Exactly as I pointed out elsewhere: pointers to __data_racy fields now
have to become __data_racy as well:

const struct file_operations __data_racy *proxy_fops = filp->f_op;

should be what you want there. The type system is in fact helping us
here as intended. :-)

Linus Torvalds

unread,
May 2, 2024, 1:30:14 PMMay 2
to Tetsuo Handa, Marco Elver, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Thu, 2 May 2024 at 09:42, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> OK if below change is acceptable.
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1012,7 +1012,7 @@ struct file {
> struct file_ra_state f_ra;
> struct path f_path;
> struct inode *f_inode; /* cached value */
> - const struct file_operations *f_op;
> + const __data_racy struct file_operations *f_op;

No, this is very wrong.

It's not the *pointer* that is __data_racy. It's the structure *fied*.

So that should be

const struct file_operations *__data_racy f_op;

which is very different.

> Hmm, debugfs assumes that f_op does not change?
>
> fs/debugfs/file.c: In function 'full_proxy_release':
> fs/debugfs/file.c:357:45: warning: initialization discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> const struct file_operations *proxy_fops = filp->f_op;
> ^~~~

This error is a direct result of placing the __data_racy in the wrong place.

It's not that the _result_ of reading filp->f_op is racy. It's the
read of filp->f_op itself that is.

Yes, this is unusual. The *common* thing is to mark pointers as being
volatile. But this really is something entirely different from that.

Linus

Al Viro

unread,
May 2, 2024, 2:14:58 PMMay 2
to Linus Torvalds, Tetsuo Handa, Marco Elver, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Jiri Slaby
On Thu, May 02, 2024 at 10:29:52AM -0700, Linus Torvalds wrote:

> Yes, this is unusual. The *common* thing is to mark pointers as being
> volatile. But this really is something entirely different from that.

The common thing is to mark pointers are pointers to volatile;
calling them "volatile pointers" is common and incorrect, and the only
reason why that sloppy turn of phrase persists is that real "volatile
pointers" are rare...

Marco,
struct foo volatile *p;
declares p as a (non-volatile) pointer to volatile struct foo.
struct foo * volatile p;
declares p as volatile pointer to (non-volatile) struct foo.

The former is a statement about the objects whose addresses might
be stored in p; the latter is a statement about the object p itself.

Replace volatile with const and it becomes easier to experiment with:
char const *p;
char s[] = "barf";
char * const q = s;
...
p = "yuck"; - fine, p itself can be modified
*p = 'a'; - error *p can not be modified, it's an l-value of type const char
q = s + 1; - error, can't modify q
*q = 'a'; - fine, *q is l-value of type char
p = q; - fine, right-hand side of assignment loses the top
qualifier, so q (const pointer to char as l-value)
becomes a plain pointer to char, which can be
converted to pointer to const char, and stored in
p (l-value of type pointer to const char)
strlen(q); - almost the same story, except that it's passing
an argument rather than assignment; they act the
same way.
strcpy(q, "s"); - almost the same, except that here the type of
argument is pointer to char rather than pointer to
const char (strlen() promises not to modify the
string passed to it, strcpy() obviously doesn't)
strcpy(p, "s"); - error; pointer to char converts to a pointer
to const char, but not the other way round.

The situations where you want a const (or volatile) pointer (as opposed to
pointer to const or volatile object) are rare, but this is exactly what
you are asking for - you want to say that the value of 'f_op' member
in any struct file can change at any time. That value is an address of
some instance of struct file_operations and what you want to express is
the property of f_op member itself, not that of the objects whose addresses
might end up stored there.

So having a driver do
const struct file_operations *ops = file->f_op;
is fine - it's basically "take the value of 'file'; it will be an address
of some struct file instance. Fetch 'f_op' from that instance, without
any assumptions of the stability of that member. Use whatever value
you find there as initial value of 'ops'".

That's fine, and since nobody is going to change 'ops' itself behind your
back, you don't need any qualifiers on it. The type of 'ops' here is
"(unqualified) pointer to const struct file_operations".

Marco Elver

unread,
May 2, 2024, 3:29:47 PMMay 2
to Al Viro, Linus Torvalds, Tetsuo Handa, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Jiri Slaby
On Thu, 2 May 2024 at 20:14, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> On Thu, May 02, 2024 at 10:29:52AM -0700, Linus Torvalds wrote:
>
> > Yes, this is unusual. The *common* thing is to mark pointers as being
> > volatile. But this really is something entirely different from that.
>
> The common thing is to mark pointers are pointers to volatile;
> calling them "volatile pointers" is common and incorrect, and the only
> reason why that sloppy turn of phrase persists is that real "volatile
> pointers" are rare...
>
> Marco,

I think we agree on what we want. I misread the intention of Tetsuo in
[1], and provided incorrect feedback.

[1] https://lore.kernel.org/all/CANpmjNPtoKf1ysbKd=E8o753JT0DzBanzF...@mail.gmail.com/T/#u
Is this feedback for the __data_racy attribute patch [2], or a comment
for the patch "tty: tty_io: remove hung_up_tty_fops"? [ With the
former I can help, with the latter Tetsuo can help. ]

[2] https://lore.kernel.org/all/20240502141242....@google.com/

The __data_racy attribute should behave like any other type qualifier
(be it const, volatile), and what you point out above applies equally,
no doubt about it. But I think it's important to treat it as a
completely distinct type qualifier - volatile is an implementation
detail (in non-KCSAN kernels it's a no-op).

Thanks,
-- Marco

Tetsuo Handa

unread,
May 2, 2024, 7:54:30 PMMay 2
to Linus Torvalds, Marco Elver, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On 2024/05/03 2:29, Linus Torvalds wrote:
>> Hmm, debugfs assumes that f_op does not change?
>>
>> fs/debugfs/file.c: In function 'full_proxy_release':
>> fs/debugfs/file.c:357:45: warning: initialization discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
>> const struct file_operations *proxy_fops = filp->f_op;
>> ^~~~
>
> This error is a direct result of placing the __data_racy in the wrong place.

Oops. But it seems to me that the problem is that

>
> It's not that the _result_ of reading filp->f_op is racy. It's the
> read of filp->f_op itself that is.

debugfs (or maybe any filesystems that wraps f_op) caches filp->f_op into
private data

struct debugfs_fsdata {
const struct file_operations *real_fops; // <= may not be up to dated?
(...snip...)
}

and cannot update cached real_fops value when

filp->f_op = &hung_up_tty_fops;

or

mfile->file->f_op = &snd_shutdown_f_ops;

is called.

Such filesystems need to be updated to cache "struct file *" rather than
"struct file_operations *" ?

Linus Torvalds

unread,
May 2, 2024, 9:12:33 PMMay 2
to Tetsuo Handa, Marco Elver, pau...@kernel.org, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Thu, 2 May 2024 at 16:54, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> debugfs (or maybe any filesystems that wraps f_op) caches filp->f_op into
> private data

That's just for debugfs uses. If you somehow try to do that on a tty,
you are quite welcome to keep both broken pieces.

Linus

Paul E. McKenney

unread,
May 3, 2024, 7:59:15 PMMay 3
to Boqun Feng, Linus Torvalds, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Thu, May 02, 2024 at 09:37:17AM -0700, Boqun Feng wrote:
> On Wed, May 01, 2024 at 03:32:34PM -0700, Paul E. McKenney wrote:
> > On Wed, May 01, 2024 at 02:49:17PM -0700, Paul E. McKenney wrote:
> > > On Wed, May 01, 2024 at 02:20:35PM -0700, Linus Torvalds wrote:
> > > > On Wed, 1 May 2024 at 14:06, Linus Torvalds
> > > > <torv...@linux-foundation.org> wrote:
> >
> > [ . . . ]
> >
> > > > I'd love to see an extension where "const volatile" basically means
> > > > exactly that: the volatile tells the compiler that it can't
> > > > rematerialize by doing the load multiple times, but the "const" would
> > > > say that if the compiler sees two or more accesses, it can still CSE
> > > > them.
> >
> > Except that "const volatile" already means "you cannot write to it,
> > and reads will not be fused". :-/
> >
> > > No promises, other than that if we don't ask, they won't say "yes".
> > >
> > > Let me see what can be done.
> >
> > >From a semantics viewpoint __atomic_load_n(&x, __ATOMIC_RELAXED) would
> > work for loading from x. The compilers that I tried currently do not
> > fuse loads, but they are allowed to do so.
>
> Yeah, I wonder the same, from what I read, "const volatile" seems to
> be just a (non-volatile) relaxed atomic load.

Hmmm... Maybe something like this very lightly tested patch?

(I did not immediately see a use case for WRITE_ONCE_MERGEABLE(),
but that is likely a failure of imagination on my part.)

------------------------------------------------------------------------

diff --git a/include/asm-generic/rwonce.h b/include/asm-generic/rwonce.h
index 8d0a6280e9824..55e87a8aec56f 100644
--- a/include/asm-generic/rwonce.h
+++ b/include/asm-generic/rwonce.h
@@ -79,6 +79,15 @@ unsigned long __read_once_word_nocheck(const void *addr)
(typeof(x))__read_once_word_nocheck(&(x)); \
})

+/*
+ * Use READ_ONCE_MERGEABLE() and WRITE_ONCE_MERGEABLE() when you need to
+ * avoid duplicating or tearing a load or store, respectively, but when
+ * it is OK to merge nearby loads and stores. It must also be OK for a
+ * later nearby load to take its value directly from a prior store.
+ */
+#define READ_ONCE_MERGEABLE(x) __atomic_load_n(&x, __ATOMIC_RELAXED)
+#define WRITE_ONCE_MERGEABLE(x, val) __atomic_store_n(&x, val, __ATOMIC_RELAXED)
+
static __no_kasan_or_inline
unsigned long read_word_at_a_time(const void *addr)
{
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d5507ac1bbf19..b37c0dbde8cde 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -459,7 +459,7 @@ static void adjust_jiffies_till_sched_qs(void)
return;
}
/* Otherwise, set to third fqs scan, but bound below on large system. */
- j = READ_ONCE(jiffies_till_first_fqs) +
+ j = READ_ONCE_MERGEABLE(jiffies_till_first_fqs) +
2 * READ_ONCE(jiffies_till_next_fqs);
if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;

Linus Torvalds

unread,
May 3, 2024, 8:14:43 PMMay 3
to pau...@kernel.org, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Fri, 3 May 2024 at 16:59, Paul E. McKenney <pau...@kernel.org> wrote:
>
> Hmmm... Maybe something like this very lightly tested patch?

I'm a bit nervous about using the built-in atomics, when it's not
clear what the compiler will do on various architectures.

Gcc documentation talks about __atomic_is_lock_free(), which makes me
think that on various architectures it might end up doing some "fall
back to helper functions" cases (possibly for odd architectures).

IOW: I don't think the patch is wrong, but I do think we need to
verify that all compilers we support generate the obvious code for
this, and we don't have some "fall back to function calls".

Linus

Paul E. McKenney

unread,
May 4, 2024, 1:08:43 AMMay 4
to Linus Torvalds, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Fri, May 03, 2024 at 05:14:22PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 16:59, Paul E. McKenney <pau...@kernel.org> wrote:
> >
> > Hmmm... Maybe something like this very lightly tested patch?
>
> I'm a bit nervous about using the built-in atomics, when it's not
> clear what the compiler will do on various architectures.
>
> Gcc documentation talks about __atomic_is_lock_free(), which makes me
> think that on various architectures it might end up doing some "fall
> back to helper functions" cases (possibly for odd architectures).

Right now, both GCC and Clang complain if you give __atomic_load_n()
something other than a pointer or a sufficiently small scalar on x86.

Let's see, starting with READ_ONCE()...

ARM7-a Clang complains about even single bytes (ARM7-a GCC is
fine).

ARMv8 works like x86 for both GCC and Clang,

AVR GCC and M68K Clang generate calls to helper functions, so they
need to implement {READ,WRITE}_ONCE_MERGEABLE() in terms of the
current {READ,WRITE}_ONCE() macros.

M68K GCC works like x86, but generates a call to a helper function
for an 8-byte load. Which means that the 8-byte case needs to
generate a build error.

Hexagon Clang works like x86.

Loongarch GCC works like x86. Ditto S390, sh, and xtensa GCC.

MIPS Clang works like x86, but throws a build error for long long,
which might be OK given 32-bit. MIPS GCC handles long long also.

MIPS64 and MIPS EL GCC and Clang work like x86, as do both compilers
for POWERPC and POWERPC LE. And for RISC-V 32 and 64 bit.

I based these on this godbolt: https://godbolt.org/z/rrKnnE8nb
The #ifs on lines select the 8-byte and structure case, respectively,
and you can pick your compiler. I just used the latest versions
of each compiler for each architecture, so there might well be
a few more surprises.

> IOW: I don't think the patch is wrong, but I do think we need to
> verify that all compilers we support generate the obvious code for
> this, and we don't have some "fall back to function calls".

You are right, this is going to need some arch-specific code for a few
of the architectures. Hey, I was hoping!!!

The compilers do not currently optimize these things, but things appear
to me to be heading in that direction.

Thanx, Paul

Linus Torvalds

unread,
May 4, 2024, 1:50:51 PMMay 4
to pau...@kernel.org, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Fri, 3 May 2024 at 22:08, Paul E. McKenney <pau...@kernel.org> wrote:
>
> You are right, this is going to need some arch-specific code for a few
> of the architectures. Hey, I was hoping!!!
>
> The compilers do not currently optimize these things, but things appear
> to me to be heading in that direction.

Ok, so it sounds like right now it makes no sense - presumably
__atomic_load_n() doesn't actually generate better code than
READ_ONCE() does as-is, and we have the issue with having to make it
per-architecture anyway.

But maybe in a couple of years we can revisit this when / if it
actually generates better code and is more widely applicable.

Linus

Paul E. McKenney

unread,
May 4, 2024, 2:18:49 PMMay 4
to Linus Torvalds, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
Completely agreed.

Here is my current thoughts for possible optimizations of non-volatile
memory_order_relaxed atomics:

o Loads from the same variable that can legitimately be
reordered to be adjacent to one another can be fused
into a single load.

o Stores to the same variable that can legitimately be
reordered to be adjacent to one another can be replaced
by the last store in the series.

o Loads and stores may not be invented.

o The only way that a computation based on the value from
a given load can instead use some other load is if the
two loads are fused into a single load.

Anything that I am missing?

Thanx, Paul

Linus Torvalds

unread,
May 4, 2024, 3:11:32 PMMay 4
to pau...@kernel.org, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sat, 4 May 2024 at 11:18, Paul E. McKenney <pau...@kernel.org> wrote:
>
> Here is my current thoughts for possible optimizations of non-volatile
> memory_order_relaxed atomics:
>
> o Loads from the same variable that can legitimately be
> reordered to be adjacent to one another can be fused
> into a single load.

Let's call this "Rule 1"

I think you can extend this to also "can be forwarded from a previous store".

I also think you're too strict in saying "fused into a single load".
Let me show an example below.

> o Stores to the same variable that can legitimately be
> reordered to be adjacent to one another can be replaced
> by the last store in the series.

Rule 2.

Ack, although again, I think you're being a bit too strict in your
language, and the rule can be relaxed.

> o Loads and stores may not be invented.

Rule 3.

I think you can and should relax this. You can invent loads all you want.

> o The only way that a computation based on the value from
> a given load can instead use some other load is if the
> two loads are fused into a single load.

Rule 4.

I'm not convinced that makes sense, and I don't think it's true as written.

I think I understand what you are trying to say, but I think you're
saying it in a way that only confuses a compiler person.

In particular, the case I do not think is true is very much the
"spill" case: if you have code like this:

a = expression involving '__atomic_load_n(xyz, RELAXED)'

then it's perfectly fine to spill the result of that load and reload
the value. So the "computation based on the value" *is* actually based
on "some other load" (the reload).

I really *REALLY* think you need to explain the semantics in concrete
terms that a compiler writer would understand and agree with.

So to explain your rules to an actual compiler person (and relax the
semantics a bit) I would rewrite your rules as:

Rule 1: a strictly dominating load can be replaced by the value of a
preceding load or store

Ruie 2: a strictly dominating store can remove preceding stores

Rule 3: stores cannot be done speculatively (put another way: a
subsequent dominating store can only *remove* a store entirely, it
can't turn the store into one with speculative data)

Rule 4: loads cannot be rematerialized (ie a load can be *combined*
as per Rule 1, but a load cannot be *split* into two loads)

Anyway, let's get to the examples of *why* I think your language was
bad and your rules were too strict.

Let's start with your Rule 3, where you said:

- Loads and stores may not be invented

and while I think this should be very very true for stores, I think
inventing loads is not only valid, but a good idea. Example:

if (a)
b = __atomic_load_n(ptr) + 1;

can perfectly validly just be written as

tmp = __atomic_load_n(ptr);
if (a)
b = tmp+1;

which in turn may allow other optimizations (ie depending on how 'b'
is used, the conditional may go away entirely, and you just end up
with 'b = tmp+!!a').

There's nothing wrong with extra loads that aren't used.

And to make that more explicit, let's look at Rule 1:

Example of Rule 1 (together with the above case):

if (a)
b = __atomic_load_n(ptr) + 1;
else
b = __atomic_load_n(ptr) + 2;
c = __atomic_load_n(ptr) + 3;

then that can perfectly validly re-write this all as

tmp = __atomic_load_n(ptr);
b = a ? tmp+1 : tmp+2;
c = tmp + 3;

because my version of Rule 1 allows the dominating load used for 'c'
to be replaced by the value of a preceding load that was used for 'a'
and 'b'.

And to give an example of Rule 2, where you said "reordered to be
adjacent", I'm saying that all that matters is being strictly
dominant, so

if (a)
__atomic_store_n(ptr,1);
else
__atomic_store_n(ptr,2);
__atomic_store_n(ptr,3);

can be perfectly validly be combined into just

__atomic_store_n(ptr,3);

because the third store completely dominates the two others, even if
in the intermediate form they are not necessarily ever "adjacent".

(Your "adjacency" model might still be valid in how you could turn
first of the first stores to be a fall-through, then remove it, and
then turn the other to be a fall-through and then remove it, so maybe
your language isn't _tecnically_ wrong, But I think the whole
"dominating store" is how a compiler writer would think about it).

Anyway, the above are all normal optimizations that compilers
*already* do, and the only real issue is that with memory ordering,
the "dominance" thing will need to take into account the ordering
requirements of other memory operations with stricter memory ordering
in between. So, for example, if you have

__atomic_store_n(ptr,1, RELAXED);
__atomic_load_n(otherptr,2, ACQUIRE);
__atomic_store_n(ptr,2, RELAXED);

then the second store doesn't dominate the first store, because
there's a stricter memory ordering instruction in between.

But I think the *important* rule is that "Rule 4", which is what
actually gives us the true "atomicity" rule:

Loads cannot be rematerialized

IOW, when we do a

a = __atomic_load_n(ptr);

and use the value of 'a' at any point later, it can *never* have
turned into a re-load of that ptr value. Yes, the load might have been
combined with an earlier one, and the load may have been moved up
earlier (by the "you can invent loads and then combine them with the
later real one" rule).

So even atomic loads can be combined, they can be moved to outside of
loops etc. BUT THEY CANNOT BE RE-DONE.

I think that's really the fundamnetally different rule, and the one
that makes us use 'volatile' in many cases.

So we have a "READ_ONCE()" macro, but in many cases it's really
"READ_AT_MOST_ONCE()". It's usually not that you can't optimize the
read away - you can - but you absolutely must not turn it into two
reads.

Turning a read into two reads is what makes those TOCTOU issues problematic.

Honestly, I'd love to just have that "Rule 4" be there for *all*
variable accesses. Not just __atomic ones. If we had a compiler flag
like "-fno-TUCTOU", I'd enable it.

Linus

Linus Torvalds

unread,
May 4, 2024, 3:25:33 PMMay 4
to pau...@kernel.org, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sat, 4 May 2024 at 12:11, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Anyway, the above are all normal optimizations that compilers
> *already* do, and the only real issue is that with memory ordering,
> the "dominance" thing will need to take into account the ordering
> requirements of other memory operations with stricter memory ordering
> in between. So, for example, if you have
>
> __atomic_store_n(ptr,1, RELAXED);
> __atomic_load_n(otherptr,2, ACQUIRE);
> __atomic_store_n(ptr,2, RELAXED);
>
> then the second store doesn't dominate the first store, because
> there's a stricter memory ordering instruction in between.

Actually, that was a bad example, because in that pattern, the second
store does actually dominate (the second store can not move up beyond
the ACQUIRE, but the first store can indeed move down after it, so
dominance analysis does actually allow the second store to strictly
dominate the first one).

So the ACQUIRE would need to be SEQ for my example to be valid.

Of course, usually the barrier that stops domination is something
entirely different. Even without an actual conditional (which is
almost certainly the most common reason why dominance goes away), it
might be a function call (which could do any arbitrary memory ordering
operations - there was a clang bug in this very case) or something
like an explicit full memory barrier.

Anyway, take that email as a "written in the MUA on the fly". There
might be other thinkos in there. But I think the big picture was
right.

Linus

Paul E. McKenney

unread,
May 4, 2024, 6:04:52 PMMay 4
to Linus Torvalds, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sat, May 04, 2024 at 12:11:10PM -0700, Linus Torvalds wrote:
> On Sat, 4 May 2024 at 11:18, Paul E. McKenney <pau...@kernel.org> wrote:
> >
> > Here is my current thoughts for possible optimizations of non-volatile
> > memory_order_relaxed atomics:
> >
> > o Loads from the same variable that can legitimately be
> > reordered to be adjacent to one another can be fused
> > into a single load.
>
> Let's call this "Rule 1"
>
> I think you can extend this to also "can be forwarded from a previous store".

Agreed, with constraints based on intervening synchronization.

> I also think you're too strict in saying "fused into a single load".
> Let me show an example below.

I certainly did intend to make any errors in the direction of being
too strict.

> > o Stores to the same variable that can legitimately be
> > reordered to be adjacent to one another can be replaced
> > by the last store in the series.
>
> Rule 2.
>
> Ack, although again, I think you're being a bit too strict in your
> language, and the rule can be relaxed.
>
> > o Loads and stores may not be invented.
>
> Rule 3.
>
> I think you can and should relax this. You can invent loads all you want.

I might be misunderstanding you, but given my understanding, I disagree.
Consider this example:

x = __atomic_load(&a, RELAXED);
r0 = x * x + 2 * x + 1;

It would not be good for a load to be invented as follows:

x = __atomic_load(&a, RELAXED);
invented = __atomic_load(&a, RELAXED);
r0 = x * x + 2 * invented + 1;

In the first example, we know that r0 is a perfect square, at least
assuming that x is small enough to avoid wrapping. In the second
example, x might not be equal to the value from the invented load,
and r0 might not be a perfect square.

I believe that we really need the compiler to keep basic arithmetic
working.

That said, I agree that this disallows directly applying current
CSE optimizations, which might make some people sad. But we do need
code to work regardless.

Again, it is quite possible that I am misunderstanding you here.

> > o The only way that a computation based on the value from
> > a given load can instead use some other load is if the
> > two loads are fused into a single load.
>
> Rule 4.
>
> I'm not convinced that makes sense, and I don't think it's true as written.
>
> I think I understand what you are trying to say, but I think you're
> saying it in a way that only confuses a compiler person.
>
> In particular, the case I do not think is true is very much the
> "spill" case: if you have code like this:
>
> a = expression involving '__atomic_load_n(xyz, RELAXED)'
>
> then it's perfectly fine to spill the result of that load and reload
> the value. So the "computation based on the value" *is* actually based
> on "some other load" (the reload).

As in the result is stored to a compiler temporary and then reloaded
from that temporary? Agreed, that would be just fine. In contrast,
spilling and reloading from xyz would not be good at all.

> I really *REALLY* think you need to explain the semantics in concrete
> terms that a compiler writer would understand and agree with.

Experience would indicate that I should not dispute sentence. ;-)

> So to explain your rules to an actual compiler person (and relax the
> semantics a bit) I would rewrite your rules as:
>
> Rule 1: a strictly dominating load can be replaced by the value of a
> preceding load or store
>
> Ruie 2: a strictly dominating store can remove preceding stores
>
> Rule 3: stores cannot be done speculatively (put another way: a
> subsequent dominating store can only *remove* a store entirely, it
> can't turn the store into one with speculative data)
>
> Rule 4: loads cannot be rematerialized (ie a load can be *combined*
> as per Rule 1, but a load cannot be *split* into two loads)

I still believe that synchronization operations need a look-in, and
I am not sure what is being dominated in your Rules 1 and 2 (all
subsequent execution?), but let's proceed.

> Anyway, let's get to the examples of *why* I think your language was
> bad and your rules were too strict.
>
> Let's start with your Rule 3, where you said:
>
> - Loads and stores may not be invented
>
> and while I think this should be very very true for stores, I think
> inventing loads is not only valid, but a good idea. Example:
>
> if (a)
> b = __atomic_load_n(ptr) + 1;
>
> can perfectly validly just be written as
>
> tmp = __atomic_load_n(ptr);
> if (a)
> b = tmp+1;
>
> which in turn may allow other optimizations (ie depending on how 'b'
> is used, the conditional may go away entirely, and you just end up
> with 'b = tmp+!!a').
>
> There's nothing wrong with extra loads that aren't used.

From a functional viewpoint, if the value isn't used, then agreed,
inventing the load is harmless. But there are some code sequences where
I really wouldn't want the extra cache miss.

> And to make that more explicit, let's look at Rule 1:
>
> Example of Rule 1 (together with the above case):
>
> if (a)
> b = __atomic_load_n(ptr) + 1;
> else
> b = __atomic_load_n(ptr) + 2;
> c = __atomic_load_n(ptr) + 3;
>
> then that can perfectly validly re-write this all as
>
> tmp = __atomic_load_n(ptr);
> b = a ? tmp+1 : tmp+2;
> c = tmp + 3;
>
> because my version of Rule 1 allows the dominating load used for 'c'
> to be replaced by the value of a preceding load that was used for 'a'
> and 'b'.

OK, I thought that nodes early in the control-flow graph dominated
nodes that are later in that graph, but I am not a compiler expert.

In any case, I agree with this transformation. This is making three
loads into one load, and there is no intervening synchronization to gum
up the works.

> And to give an example of Rule 2, where you said "reordered to be
> adjacent", I'm saying that all that matters is being strictly
> dominant, so
>
> if (a)
> __atomic_store_n(ptr,1);
> else
> __atomic_store_n(ptr,2);
> __atomic_store_n(ptr,3);
>
> can be perfectly validly be combined into just
>
> __atomic_store_n(ptr,3);
>
> because the third store completely dominates the two others, even if
> in the intermediate form they are not necessarily ever "adjacent".

I agree with this transformation as well. But suppose that the code
also contained an smp_mb() right after that "if" statement. Given that,
it is not hard to construct a larger example in which dropping the first
two stores would be problematic.

> (Your "adjacency" model might still be valid in how you could turn
> first of the first stores to be a fall-through, then remove it, and
> then turn the other to be a fall-through and then remove it, so maybe
> your language isn't _tecnically_ wrong, But I think the whole
> "dominating store" is how a compiler writer would think about it).

I was thinking in terms of first transforming the code as follows:

if (a) {
__atomic_store_n(ptr,1);
__atomic_store_n(ptr,3);
} else {
__atomic_store_n(ptr,2);
__atomic_store_n(ptr,3);
}

(And no, I would not expect a real compiler to do this!)

Then it is clearly OK to further transform into the following:

if (a) {
__atomic_store_n(ptr,3);
} else {
__atomic_store_n(ptr,3);
}

At which point both branches of the "if" statement are doing the
same thing, so:

__atomic_store_n(ptr,3);

On to your next email!

Thanx, Paul

Paul E. McKenney

unread,
May 4, 2024, 6:17:41 PMMay 4
to Linus Torvalds, Boqun Feng, Marco Elver, Tetsuo Handa, Greg Kroah-Hartman, Dmitry Vyukov, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com, Nathan Chancellor, Arnd Bergmann, Al Viro, Jiri Slaby
On Sat, May 04, 2024 at 12:25:12PM -0700, Linus Torvalds wrote:
> On Sat, 4 May 2024 at 12:11, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > Anyway, the above are all normal optimizations that compilers
> > *already* do, and the only real issue is that with memory ordering,
> > the "dominance" thing will need to take into account the ordering
> > requirements of other memory operations with stricter memory ordering
> > in between. So, for example, if you have
> >
> > __atomic_store_n(ptr,1, RELAXED);
> > __atomic_load_n(otherptr,2, ACQUIRE);
> > __atomic_store_n(ptr,2, RELAXED);

I am assuming that I should ignore the "2," in that load.

> > then the second store doesn't dominate the first store, because
> > there's a stricter memory ordering instruction in between.
>
> Actually, that was a bad example, because in that pattern, the second
> store does actually dominate (the second store can not move up beyond
> the ACQUIRE, but the first store can indeed move down after it, so
> dominance analysis does actually allow the second store to strictly
> dominate the first one).

Agreed, and the stores can be combined as a result of the fact that
the two stores can be reordered to be adjacent to one another.

> So the ACQUIRE would need to be SEQ for my example to be valid.

And here the C and C++ memory models get very strange due to mixing
memory_order_seq_cst and non-memory_order_seq_cst.

But if there was a Linux-kernel smp_mb() immediately after that first
store, then the compiler would not be allowed to combine the two stores.
Though that is true regardless because of the smp_mb()'s barrier().

> Of course, usually the barrier that stops domination is something
> entirely different. Even without an actual conditional (which is
> almost certainly the most common reason why dominance goes away), it
> might be a function call (which could do any arbitrary memory ordering
> operations - there was a clang bug in this very case) or something
> like an explicit full memory barrier.

If there was something like, then the two stores could not be combined,
from what I can see.

__atomic_store_n(ptr,1, RELAXED);
__atomic_load_n(otherptr, ACQUIRE);
__atomic_store_n(otherptr, 4, RELEASE);
__atomic_store_n(ptr,2, RELAXED);

I freely confess that I don't know how to map that into dominance
relations, which means that I have no idea what this example means
in terms of your rules.

> Anyway, take that email as a "written in the MUA on the fly". There
> might be other thinkos in there. But I think the big picture was
> right.

If things go as they usually do, there will be quite a few litmus tests
between here and something credible. ;-)

Thank you for the tutorial on dominance in CFGs!

Thanx, Paul
Reply all
Reply to author
Forward
0 new messages