[syzbot] [mptcp?] general protection fault in proc_scheduler

syzbot

Jan 2, 2025, 9:12:28 AM
to da...@davemloft.net, edum...@google.com, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mat...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: ccb98ccef0e5 Merge tag 'platform-drivers-x86-v6.13-4' of g..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=128f6ac4580000
kernel config: https://syzkaller.appspot.com/x/.config?x=86dd15278dbfe19f
dashboard link: https://syzkaller.appspot.com/bug?extid=e364f774c6f57f2c86d1
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d24eb225cff7/disk-ccb98cce.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/dd81532f8240/vmlinux-ccb98cce.xz
kernel image: https://storage.googleapis.com/syzbot-assets/18b08e4bbf40/bzImage-ccb98cce.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+e364f7...@syzkaller.appspotmail.com

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 1 UID: 0 PID: 5924 Comm: syz-executor Not tainted 6.13.0-rc5-syzkaller-00004-gccb98ccef0e5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206

RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
proc_sys_call_handler+0x403/0x5d0 fs/proc/proc_sysctl.c:601
__kernel_write_iter+0x318/0xa80 fs/read_write.c:612
__kernel_write+0xf6/0x140 fs/read_write.c:632
do_acct_process+0xcb0/0x14a0 kernel/acct.c:539
acct_pin_kill+0x2d/0x100 kernel/acct.c:192
pin_kill+0x194/0x7c0 fs/fs_pin.c:44
mnt_pin_kill+0x61/0x1e0 fs/fs_pin.c:81
cleanup_mnt+0x3ac/0x450 fs/namespace.c:1366
task_work_run+0x14e/0x250 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xad8/0x2d70 kernel/exit.c:938
do_group_exit+0xd3/0x2a0 kernel/exit.c:1087
get_signal+0x2576/0x2610 kernel/signal.c:3017
arch_do_signal_or_restart+0x90/0x7e0 arch/x86/kernel/signal.c:337
exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
do_syscall_64+0xda/0x250 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fee3cb87a6a
Code: Unable to access opcode bytes at 0x7fee3cb87a40.
RSP: 002b:00007fffcccac688 EFLAGS: 00000202 ORIG_RAX: 0000000000000037
RAX: 0000000000000000 RBX: 00007fffcccac710 RCX: 00007fee3cb87a6a
RDX: 0000000000000041 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007fffcccac6ac R09: 00007fffcccacac7
R10: 00007fffcccac710 R11: 0000000000000202 R12: 00007fee3cd49500
R13: 00007fffcccac6ac R14: 0000000000000000 R15: 00007fee3cd4b000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess), 1 bytes skipped:
0: 42 80 3c 38 00 cmpb $0x0,(%rax,%r15,1)
5: 0f 85 fe 02 00 00 jne 0x309
b: 4d 8b a4 24 08 09 00 mov 0x908(%r12),%r12
12: 00
13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1a: fc ff df
1d: 49 8d 7c 24 28 lea 0x28(%r12),%rdi
22: 48 89 fa mov %rdi,%rdx
25: 48 c1 ea 03 shr $0x3,%rdx
* 29: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2d: 0f 85 cc 02 00 00 jne 0x2ff
33: 4d 8b 7c 24 28 mov 0x28(%r12),%r15
38: 48 rex.W
39: 8d .byte 0x8d
3a: 84 24 c8 test %ah,(%rax,%rcx,8)
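
A note on reading the trace: the trapping instruction is the KASAN shadow-memory check that precedes the load from offset 0x28, not the load itself. The following is a minimal user-space sketch of that address arithmetic, assuming the constants visible in the dump above (the shadow offset loaded into RAX and the `shr $0x3` scaling). The helper name `kasan_shadow_addr` is made up for illustration and is not a kernel function.

```c
#include <stdint.h>

/* Shadow offset as loaded into RAX by the movabs in the disassembly. */
#define SHADOW_OFFSET 0xdffffc0000000000ULL
/* One shadow byte covers 8 bytes of memory (the "shr $0x3" above). */
#define SHADOW_SCALE_SHIFT 3

/* Hypothetical helper: address of the shadow byte checked for addr. */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
}
```

With RDI == 0x28, `kasan_shadow_addr(0x28)` evaluates to 0xdffffc0000000005, the non-canonical address in the oops line. That is consistent with a NULL base pointer (R12 == 0) being offset by 0x28.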


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Eric Dumazet

Jan 2, 2025, 10:21:58 AM
to syzbot, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mat...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
I thought acct(2) was only allowing regular files.

acct_on() indeed has :

	if (!S_ISREG(file_inode(file)->i_mode)) {
		kfree(acct);
		filp_close(file, NULL);
		return -EACCES;
	}

It seems there are other ways to call do_acct_process() targeting a sysfs file ?

Hillf Danton

Jan 3, 2025, 5:33:12 AM
to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Thu, Jan 2, 2025 at 3:12 PM syzbot
> syzbot found the following issue on:
>
> HEAD commit: ccb98ccef0e5 Merge tag 'platform-drivers-x86-v6.13-4' of g..
> git tree: upstream
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000

#syz test

--- x/net/mptcp/ctrl.c
+++ y/net/mptcp/ctrl.c
@@ -122,7 +122,7 @@ static int mptcp_set_scheduler(const str
 static int proc_scheduler(const struct ctl_table *ctl, int write,
 			  void *buffer, size_t *lenp, loff_t *ppos)
 {
-	const struct net *net = current->nsproxy->net_ns;
+	const struct net *net;
 	char val[MPTCP_SCHED_NAME_MAX];
 	struct ctl_table tbl = {
 		.data = val,
@@ -130,6 +130,9 @@ static int proc_scheduler(const struct c
 	};
 	int ret;

+	if (current->flags & PF_EXITING)
+		return -ENXIO;
+	net = current->nsproxy->net_ns;
 	strscpy(val, mptcp_get_scheduler(net), MPTCP_SCHED_NAME_MAX);

 	ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
@@ -161,9 +164,12 @@ static int proc_blackhole_detect_timeout
 				      int write, void *buffer, size_t *lenp,
 				      loff_t *ppos)
 {
-	struct mptcp_pernet *pernet = mptcp_get_pernet(current->nsproxy->net_ns);
+	struct mptcp_pernet *pernet;
 	int ret;

+	if (current->flags & PF_EXITING)
+		return -ENXIO;
+	pernet = mptcp_get_pernet(current->nsproxy->net_ns);
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (write && ret == 0)
 		atomic_set(&pernet->active_disable_times, 0);
--

syzbot

Jan 3, 2025, 5:52:04 AM
to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-by: syzbot+e364f7...@syzkaller.appspotmail.com
Tested-by: syzbot+e364f7...@syzkaller.appspotmail.com

Tested on:

commit: 0bc21e70 MAINTAINERS: Remove Olof from SoC maintainers
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17380edf980000
kernel config: https://syzkaller.appspot.com/x/.config?x=86dd15278dbfe19f
dashboard link: https://syzkaller.appspot.com/bug?extid=e364f774c6f57f2c86d1
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
patch: https://syzkaller.appspot.com/x/patch.diff?x=12f558b0580000

Note: testing is done by a robot and is best-effort only.

Matthieu Baerts

Jan 4, 2025, 1:38:59 PM
to Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro
Hi Eric,

Thank you for the bug report!
(...)

> I thought acct(2) was only allowing regular files.
>
> acct_on() indeed has :
>
> if (!S_ISREG(file_inode(file)->i_mode)) {
> kfree(acct);
> filp_close(file, NULL);
> return -EACCES;
> }
>
> It seems there are other ways to call do_acct_process() targeting a sysfs file ?

Just to be sure I'm not misunderstanding your comment: do you mean that
here, the issue is *not* in MPTCP code where we get the 'struct net'
pointer via 'current->nsproxy->net_ns', but in the FS part, right?

Here, we have an issue because 'current->nsproxy' is NULL, but is it
normal? Or should we simply exit with an error if it is the case because
we are in an exiting phase?

I'm just a bit confused, because it looks like 'net' is retrieved from
different places elsewhere when dealing with sysfs: some get it from
'current' like us, some assign 'net' to 'table->extra2', others get it
from 'table->data' (via a container_of()), etc. Maybe we should not use
'current->nsproxy->net_ns' here then?

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.

Eric Dumazet

Jan 4, 2025, 1:53:36 PM
to Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro
I do think this is a bug in process accounting, not in networking.

It might make sense to output a record on a regular file, but probably
not on any other files.

diff --git a/kernel/acct.c b/kernel/acct.c
index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
 	const struct cred *orig_cred;
 	struct file *file = acct->file;

+	if (S_ISREG(file_inode(file)->i_mode))
+		return;
+
 	/*
 	 * Accounting records are not subject to resource limits.
 	 */

Al Viro

Jan 4, 2025, 2:00:21 PM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sat, Jan 04, 2025 at 07:53:22PM +0100, Eric Dumazet wrote:

> I do think this is a bug in process accounting, not in networking.
>
> It might make sense to output a record on a regular file, but probably
> not on any other files.
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> const struct cred *orig_cred;
> struct file *file = acct->file;
>
> + if (S_ISREG(file_inode(file)->i_mode))
> + return;

... won't help, since the file in question *is* a regular file. IOW, it's
a wrong predicate here.
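
The point that a sysctl entry passes an S_ISREG() test can be checked from user space. Below is a minimal sketch, assuming a Linux system with procfs mounted; `path_is_regular` is a hypothetical helper written for this note, not something from the thread or the kernel.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if path is a regular file according to
 * S_ISREG(), 0 if it is some other file type, -1 if stat() fails.
 */
static int path_is_regular(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return S_ISREG(st.st_mode) ? 1 : 0;
}
```

On such a system, calling it on an entry under /proc/sys (for example /proc/sys/kernel/ostype) is expected to return 1, which is why a file-type check in acct_on() or do_acct_process() cannot tell a sysctl file apart from an ordinary accounting file.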

Matthieu Baerts

Jan 4, 2025, 2:12:11 PM
to Al Viro, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Hi Al, Eric,
On my side, I am not able to reproduce the issue with this patch.
Without it, the issue is very easy to reproduce. (But I don't know
whether the patch has other consequences that prevent the issue from
happening: when looking at the logs with the patch, I don't see the
heaps of "Process accounting resumed" messages that I had before.)

Matthieu Baerts

Jan 4, 2025, 2:12:15 PM
to Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro, Joel Granados
Hi Eric,

(+cc Joel)

Thank you for your reply!
OK, thank you, that's clearer.

So this is then more a question for Joel, right?

Do you plan to send this patch to him?

#syz set subsystems: fs

Al Viro

Jan 4, 2025, 3:09:20 PM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sat, Jan 04, 2025 at 07:53:22PM +0100, Eric Dumazet wrote:
> I do think this is a bug in process accounting, not in networking.
>
> It might make sense to output a record on a regular file, but probably
> not on any other files.
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> const struct cred *orig_cred;
> struct file *file = acct->file;
>
> + if (S_ISREG(file_inode(file)->i_mode))
> + return;

Wait, what? OK, that will stop attempts to write there - or to any
other regular file.

If you modify that to
if (!S_ISREG(...))
you seem to have intended, it won't break the normal behaviour but it
won't help with sysctls.

Al Viro

Jan 4, 2025, 3:21:33 PM
to Matthieu Baerts, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sat, Jan 04, 2025 at 08:11:49PM +0100, Matthieu Baerts wrote:
> >> + if (S_ISREG(file_inode(file)->i_mode))
^^^^^^^^^
> >> + return;
> >
> > ... won't help, since the file in question *is* a regular file. IOW, it's
> > a wrong predicate here.
>
> On my side, it looks like I'm not able to reproduce the issue with this
> patch. Without it, it is very easy to reproduce it. (But I don't know if
> there are other consequences that would avoid the issue to happen: when
> looking at the logs, with the patch, I don't have heaps of "Process
> accounting resumed" messages that I had before.)

Unsurprisingly so, since it rejects all regular files due to a typo;
fix that and you'll see that the oops is still there.

The real issue (and the one that affects more than just this scenario) is
the use of current->nsproxy->net_ns to get to the damn thing.

Why not something like
static int proc_scheduler(const struct ctl_table *ctl, int write,
			  void *buffer, size_t *lenp, loff_t *ppos)
{
	char (*data)[MPTCP_SCHED_NAME_MAX] = ctl->data;
	char val[MPTCP_SCHED_NAME_MAX];
	struct ctl_table tbl = {
		.data = val,
		.maxlen = MPTCP_SCHED_NAME_MAX,
	};
	struct mptcp_sched_ops *sched;
	int ret;

	/* start from the value stored for the netns this table belongs to */
	strscpy(val, *data, MPTCP_SCHED_NAME_MAX);

	ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
	if (write && ret == 0) {
		rcu_read_lock();
		sched = mptcp_sched_find(val);
		if (sched)
			strscpy(*data, val, MPTCP_SCHED_NAME_MAX);
		else
			ret = -ENOENT;
		rcu_read_unlock();
	}
	return ret;
}

seeing that the data object you really want to access is
mptcp_get_pernet(net)->scheduler and you have that pointer
stored in table->data at the registration time?
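
The idea of reaching the storage through the table entry rather than through the calling task can be modelled in plain user-space C. This is only a sketch under stated assumptions: `pernet_model`, `ctl_model`, `ctl_register` and `ctl_write` are hypothetical stand-ins invented for this note, and `NAME_MAX_LEN` merely plays the role of MPTCP_SCHED_NAME_MAX.

```c
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 16	/* stand-in for MPTCP_SCHED_NAME_MAX */

/* Hypothetical stand-ins for per-netns state and a sysctl table entry. */
struct pernet_model {
	char scheduler[NAME_MAX_LEN];
};

struct ctl_model {
	void *data;	/* set once, at registration time */
};

/* Registration binds the entry to one namespace's storage. */
static void ctl_register(struct ctl_model *ctl, struct pernet_model *pernet)
{
	ctl->data = pernet->scheduler;
}

/* The handler reaches the storage through ctl->data only. */
static int ctl_write(const struct ctl_model *ctl, const char *val)
{
	char (*name)[NAME_MAX_LEN] = ctl->data;

	if (strlen(val) >= NAME_MAX_LEN)
		return -1;
	strcpy(*name, val);
	return 0;
}
```

Because the handler never looks at the calling task, it keeps working when that task has no namespace pointer left, and it always modifies the namespace the entry was registered for.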

Eric Dumazet

Jan 5, 2025, 3:32:50 AM
to Al Viro, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sat, Jan 4, 2025 at 9:21 PM Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> On Sat, Jan 04, 2025 at 08:11:49PM +0100, Matthieu Baerts wrote:
> > >> + if (S_ISREG(file_inode(file)->i_mode))
> ^^^^^^^^^
> > >> + return;
> > >
> > > ... won't help, since the file in question *is* a regular file. IOW, it's
> > > a wrong predicate here.
> >
> > On my side, it looks like I'm not able to reproduce the issue with this
> > patch. Without it, it is very easy to reproduce it. (But I don't know if
> > there are other consequences that would avoid the issue to happen: when
> > looking at the logs, with the patch, I don't have heaps of "Process
> > accounting resumed" messages that I had before.)
>
> Unsurprisingly so, since it rejects all regular files due to a typo;
> fix that and you'll see that the oops is still there.
>
> The real issue (and the one that affects more than just this scenario) is
> the use of current->nsproxy->net to get to the damn thing.

According to grep, we have many other places directly reading
current->nsproxy->net_ns
For instance in net/sctp/sysctl.c
Should we change them all ?

Perhaps an alternative would be to add a generic check in
proc_sys_call_handler()

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 27a283d85a6e7df1a7edbfb513ce75832363e2e6..84968b10ce86e7fd88c6e3c43f52b601394b056f 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -576,6 +576,8 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	error = -EINVAL;
 	if (!table->proc_handler)
 		goto out;
+	if (unlikely(current->flags & PF_EXITING))
+		goto out;

 	/* don't even try if the size is too large */
 	error = -ENOMEM;


Thanks.
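
The effect of such a guard can be illustrated with a small user-space model. It is a sketch, not kernel code: `task_model`, `nsproxy_model`, `net_model`, `task_exit` and `handler_get_net` are hypothetical names, `TASK_EXITING` stands in for PF_EXITING, and the model assumes (as described in this thread) that the exiting task's nsproxy pointer is already NULL when the handler runs.

```c
#include <errno.h>
#include <stddef.h>

#define TASK_EXITING 0x1	/* stand-in for PF_EXITING */

struct net_model { int id; };
struct nsproxy_model { struct net_model *net_ns; };

struct task_model {
	unsigned int flags;
	struct nsproxy_model *nsproxy;	/* dropped during exit */
};

/* Model of exit: mark the task as exiting, then drop its nsproxy. */
static void task_exit(struct task_model *t)
{
	t->flags |= TASK_EXITING;
	t->nsproxy = NULL;
}

/* Guarded lookup: bail out with -EINVAL instead of dereferencing NULL. */
static int handler_get_net(const struct task_model *t, struct net_model **out)
{
	if (t->flags & TASK_EXITING)
		return -EINVAL;
	*out = t->nsproxy->net_ns;
	return 0;
}
```

The guard avoids the NULL dereference, but it does not address which namespace a handler should act on while the task is still alive.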

Al Viro

Jan 5, 2025, 6:29:55 AM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sun, Jan 05, 2025 at 09:32:36AM +0100, Eric Dumazet wrote:

> According to grep, we have many other places directly reading
> current->nsproxy->net_ns
> For instance in net/sctp/sysctl.c
> Should we change them all ?

Depends - do you want their contents match the netns of opener (as,
AFAICS, for ipv4 sysctls) or that of the reader?

Eric Dumazet

Jan 5, 2025, 11:52:34 AM
to Al Viro, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
I am only worried that a malicious user could crash the host with
current kernels: not through this MPTCP crash specifically, but through
all the unaware users of current->nsproxy in sysctl handlers.

Back to MPTCP :

Using the convention used in other mptcp sysctls (enabled,
add_addr_timeout, checksum_enabled, allow_join_initial_addr_port...)
is better for consistency.

Matthieu Baerts

Jan 5, 2025, 12:03:43 PM
to Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Hi Eric,
Indeed, I can do the modifications to stop using current->nsproxy in
MPTCP. I can do the same in SCTP.

Do you plan to send your patch modifying proc_sysctl.c? It is just to
know if I should mark my patches as fixes, and split them to ease the
backports -- each helper using current->nsproxy has been introduced in
different commits -- or if I can send them to net-next instead.

Matthieu Baerts

Jan 5, 2025, 12:03:46 PM
to Al Viro, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Hi Al,

On 04/01/2025 21:21, Al Viro wrote:
> The real issue (and the one that affects more than just this scenario) is
> the use of current->nsproxy->net to get to the damn thing.
>
> Why not something like

(...)

> seeing that the data object you really want to access is
> mptcp_get_pernet(net)->scheduler and you have that pointer
> stored in table->data at the registration time?

Good point, thank you for the suggestion! :)

I will do this modification.

Al Viro

Jan 5, 2025, 2:54:42 PM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sun, Jan 05, 2025 at 05:52:19PM +0100, Eric Dumazet wrote:
> On Sun, Jan 5, 2025 at 12:29 PM Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > On Sun, Jan 05, 2025 at 09:32:36AM +0100, Eric Dumazet wrote:
> >
> > > According to grep, we have many other places directly reading
> > > current->nsproxy->net_ns
> > > For instance in net/sctp/sysctl.c
> > > Should we change them all ?
> >
> > Depends - do you want their contents match the netns of opener (as,
> > AFAICS, for ipv4 sysctls) or that of the reader?
>
> I am only worried that a malicious user could crash the host with
> current kernels,
> not about this MPTCP crash, but all unaware users of current->nsproxy
> in sysctl handlers.

I don't hate your mitigation in proc_sysctl.c, but IMO there are two
problems mixed here - one is that access to a per-netns sysctl table
should probably act on the netns it had been created for, which may not
coincide with the reader's/writer's netns, and another is that access to
current->nsproxy->net_ns would simply oops if attempted when
current->nsproxy had been dropped.

So I suspect that current->nsproxy->net_ns shouldn't be used in
per-netns sysctls, for consistency's sake (note that it can get more
serious than just consistency, if you have e.g. a spinlock taken
in something hanging off the current netns to protect access to
something table->data points to).

As for the mitigation in fs/proc/proc_sysctl.c... might be useful,
if it comes with a clear comment about the reasons it's there.

Al Viro

Jan 5, 2025, 3:51:01 PM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sun, Jan 05, 2025 at 07:54:34PM +0000, Al Viro wrote:

> So I suspect that current->nsproxy->netns shouldn't be used in
> per-netns sysctls for consistency sake (note that it can get more
> serious than just consistency, if you have e.g. a spinlock taken
> in something hanging off current netns to protect access to
> something table->data points to).
>
> As for the mitigation in fs/proc/proc_sysctl.c... might be useful,
> if it comes with a clear comment about the reasons it's there.

FWIW, looks like we have two such in mptcp (with sysctls next to
those definitely accessing the netns of opener rather than reader/writer),
two in rds (both inconsistent on the write side -
	struct net *net = current->nsproxy->net_ns;
	int err;

	err = proc_dointvec_minmax(ctl, write, buffer, lenp, fpos);
	if (err < 0) {
		pr_warn("Invalid input. Must be >= %d\n",
			*(int *)(ctl->extra1));
		return err;
	}
	if (write)
		rds_tcp_sysctl_reset(net);
will modify ctl->data, which points to &rtn->{snd,rcv}buf_size, with
rtn == net_generic(net, rds_tcp_netid) and net being for opener's netns
and then call rds_tcp_sysctl_reset(net) with net being the writer's
netns) and 6 in sctp. At least some of sctp ones are also inconsistent
on the write side; e.g.
static int proc_sctp_do_rto_min(const struct ctl_table *ctl, int write,
				void *buffer, size_t *lenp, loff_t *ppos)
{
	struct net *net = current->nsproxy->net_ns;
	unsigned int min = *(unsigned int *) ctl->extra1;
	unsigned int max = *(unsigned int *) ctl->extra2;
	struct ctl_table tbl;
	int ret, new_value;

	memset(&tbl, 0, sizeof(struct ctl_table));
	tbl.maxlen = sizeof(unsigned int);

	if (write)
		tbl.data = &new_value;
	else
		tbl.data = &net->sctp.rto_min;

	ret = proc_dointvec(&tbl, write, buffer, lenp, ppos);
	if (write && ret == 0) {
		if (new_value > max || new_value < min)
			return -EINVAL;

		net->sctp.rto_min = new_value;
	}

	return ret;
}
has max taken from ctl->extra2, which is &net->sctp.rto_max of the
opener's netns, but the value capped by that is stored into
net->sctp.rto_min of the *writer's* netns. So the logic that is supposed
to prevent rto_min > rto_max can be bypassed; no idea how far that can
escalate, but it's clearly not what the code intends.

So I'd rather document the "don't assume that current->nsproxy->net_ns will
point to the same netns this ctl is for" rule and fix those 10 instances - at
least some smell seriously fishy. It's not just the acct(2) weirdness and
the damage may be worse than an oops...
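
The opener/writer mismatch described for rto_min can be reproduced in a small user-space model. This is a sketch under assumptions: `rto_netns`, `rto_ctl`, `rto_ctl_register` and `write_rto_min` are hypothetical names, the lower bound of 1 is an illustrative value, and the write helper only mirrors the shape of the quoted handler (bounds read through the table entry, store into whichever namespace is passed in as the writer's).

```c
#include <errno.h>

/* Illustrative lower bound that extra1 points to in this model. */
static const int rto_min_floor = 1;

/* Hypothetical per-namespace state. */
struct rto_netns {
	int rto_min;
	int rto_max;
};

/* Hypothetical table entry: bounds are bound at registration time. */
struct rto_ctl {
	const int *extra1;	/* lower bound */
	const int *extra2;	/* upper bound: owner's rto_max */
};

static void rto_ctl_register(struct rto_ctl *ctl, const struct rto_netns *owner)
{
	ctl->extra1 = &rto_min_floor;
	ctl->extra2 = &owner->rto_max;
}

/* Validate against the entry's bounds, store into the writer's netns. */
static int write_rto_min(const struct rto_ctl *ctl, struct rto_netns *writer,
			 int new_value)
{
	if (new_value > *ctl->extra2 || new_value < *ctl->extra1)
		return -EINVAL;
	writer->rto_min = new_value;
	return 0;
}
```

With an owner namespace whose rto_max is 60000 and a writer namespace whose rto_max is 100, a write of 5000 passes validation and leaves the writer with rto_min greater than its own rto_max, which is the bypass described above.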

Al Viro

Jan 5, 2025, 4:12:04 PM
to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
On Sun, Jan 05, 2025 at 08:50:56PM +0000, Al Viro wrote:

> has max taken from ctl->extra2, which is &net->sctp.rto_max of the
> opener's netns, but the value capped by that in stored into
> net->sctp.rto_min of *writer's* netns. So the logics that is supposed
> to prevent rto_min > rto_max can be bypassed; no idea how much can that
> escalate to, but it's clearly not what the code intends.

Speaking of which, the logic that tries to maintain rto_min <= rto_max is
broken in another way. There's no exclusion in those suckers. IOW, if
we have set rto_min to 1 and rto_max to 10000, two processes can try to
write 1000 to rto_min and 10 to rto_max resp., with successful validations
done against the original state in both, followed by the actual stores.
The result is rto_min == 1000 and rto_max == 10, which is probably not what
one wants there...

IOW, the validation and stores should be atomic; the same goes for another
pair (pf_retrans <= ps_retrans). Again, I've no idea how severe it is,
but the result seems to be at least contrary to the expectations of the code
authors...
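
One way to make validation and store atomic is to perform both under a single lock. The sketch below shows that shape in user space with a pthread mutex. It is not the kernel fix: `rto_limits`, `set_rto_min` and `set_rto_max` are hypothetical names, and the lower bound of 1 is an assumption made for the example.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical pair of limits protected by one lock. */
struct rto_limits {
	pthread_mutex_t lock;
	int rto_min;
	int rto_max;
};

/* Check and store happen inside the same critical section. */
static int set_rto_min(struct rto_limits *l, int v)
{
	int ret = 0;

	pthread_mutex_lock(&l->lock);
	if (v < 1 || v > l->rto_max)
		ret = -EINVAL;
	else
		l->rto_min = v;
	pthread_mutex_unlock(&l->lock);
	return ret;
}

static int set_rto_max(struct rto_limits *l, int v)
{
	int ret = 0;

	pthread_mutex_lock(&l->lock);
	if (v < l->rto_min)
		ret = -EINVAL;
	else
		l->rto_max = v;
	pthread_mutex_unlock(&l->lock);
	return ret;
}
```

Starting from rto_min == 1 and rto_max == 10000, whichever of the two writes (1000 to rto_min, 10 to rto_max) takes the lock first succeeds, and the other is then rejected against the updated state, so the invariant rto_min <= rto_max holds.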

Matthieu Baerts

Jan 6, 2025, 9:27:55 AM
to Joel Granados, Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Hi Joel, Eric, Al,

On 06/01/2025 14:32, Joel Granados wrote:
> If this is the case, can you point me to the place where this happens?
>
>>>>
>>>> Just to be sure I'm not misunderstanding your comment: do you mean that
>>>> here, the issue is *not* in MPTCP code where we get the 'struct net'
>>>> pointer via 'current->nsproxy->net_ns', but in the FS part, right?
>>>>
>>>> Here, we have an issue because 'current->nsproxy' is NULL, but is it
>>>> normal? Or should we simply exit with an error if it is the case because
>>>> we are in an exiting phase?
>>>>
>>>> I'm just a bit confused, because it looks like 'net' is retrieved from
>>>> different places elsewhere when dealing with sysfs: some get it from
>>>> 'current' like us, some assign 'net' to 'table->extra2', others get it
>>>> from 'table->data' (via a container_of()), etc. Maybe we should not use
>>>> 'current->nsproxy->net_ns' here then?
>>>
>>> I do think this is a bug in process accounting, not in networking.
>>>
>>> It might make sense to output a record on a regular file, but probably
>>> not on any other files.
> It for sure does not make sense to output a record on a sysctl file that
> has a maxlen of just 3*sizeof(int) (kernel/acct.c:79).
>
>>>
>>> diff --git a/kernel/acct.c b/kernel/acct.c
>>> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
>>> 100644
>>> --- a/kernel/acct.c
>>> +++ b/kernel/acct.c
>>> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
>>> const struct cred *orig_cred;
>>> struct file *file = acct->file;
>>>
>>> + if (S_ISREG(file_inode(file)->i_mode))
>>> + return;
>>> +
> This does not seem to handle the actual culprit, which is: why is
> the sysctl file being used for the accounting?
>
>>> /*
>>> * Accounting records are not subject to resource limits.
>>> */
>>
>> OK, thank you, that's clearer.
>>
>> So this is then more a question for Joel, right?
>>
>> Do you plan to send this patch to him?
>>
>> #syz set subsystems: fs
>>
>> Cheers,
>> Matt
>> --
>> Sponsored by the NGI0 Core fund.
>>
>
> So what is happening is that:
> 1. The accounting file is set to a non-sysctl file.
> 2. And when accounting tries to write to this file, you get the
> behaviour explained in this mail?
>
> Please correct me if I have misread the situation.

@Joel: Thank you for your reply!

I'm sorry, I'm not sure whether I can help here. I hope Eric and/or Al
can jump in.

What I can say is that the original issue has been found by syzbot, and
the reproducer [1] shows that 3 syscalls have been used:
- openat('/proc/sys/net/mptcp/scheduler')
- mprotect()
- acct()

Please also note that the conversation continued in a sub-thread where
you are not in the Cc list, see [2]. In short, Eric suggested another
patch, this time in the sysctl code (fs/proc/proc_sysctl.c), and Al
recommended dropping the use of 'current->nsproxy'.

On my side, I'm looking at dropping the use of 'current->nsproxy' in
sysctl callbacks. I guess such patches will be seen as fixes, except if
Eric's new patch is enough for stable?

[1] https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000
[2]
https://lore.kernel.org/netdev/67769ecb.050a022...@google.com/T/#m862d0913ebfcec5e462a9c33b47bc3f6440a2900

Joel Granados

Jan 6, 2025, 9:45:16 AM
to Matthieu Baerts, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro
On Sat, Jan 04, 2025 at 08:11:52PM +0100, Matthieu Baerts wrote:
If this is the case, can you point me to the place where this happens?

> >>
> >> Just to be sure I'm not misunderstanding your comment: do you mean that
> >> here, the issue is *not* in MPTCP code where we get the 'struct net'
> >> pointer via 'current->nsproxy->net_ns', but in the FS part, right?
> >>
> >> Here, we have an issue because 'current->nsproxy' is NULL, but is it
> >> normal? Or should we simply exit with an error if it is the case because
> >> we are in an exiting phase?
> >>
> >> I'm just a bit confused, because it looks like 'net' is retrieved from
> >> different places elsewhere when dealing with sysfs: some get it from
> >> 'current' like us, some assign 'net' to 'table->extra2', others get it
> >> from 'table->data' (via a container_of()), etc. Maybe we should not use
> >> 'current->nsproxy->net_ns' here then?
> >
> > I do think this is a bug in process accounting, not in networking.
> >
> > It might make sense to output a record on a regular file, but probably
> > not on any other files.
It for sure does not make sense to output a record on a sysctl file that
has a maxlen of just 3*sizeof(int) (kernel/acct.c:79).

> >
> > diff --git a/kernel/acct.c b/kernel/acct.c
> > index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> > 100644
> > --- a/kernel/acct.c
> > +++ b/kernel/acct.c
> > @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> > const struct cred *orig_cred;
> > struct file *file = acct->file;
> >
> > + if (S_ISREG(file_inode(file)->i_mode))
> > + return;
> > +
This does not seem to handle the actual culprit, which is: why is
the sysctl file being used for the accounting?

> > /*
> > * Accounting records are not subject to resource limits.
> > */
>
> OK, thank you, that's clearer.
>
> So this is then more a question for Joel, right?
>
> Do you plan to send this patch to him?
>
> #syz set subsystems: fs
>
> Cheers,
> Matt
> --
> Sponsored by the NGI0 Core fund.
>

So what is happening is that:
1. The accounting file is set to a non-sysctl file.
2. And when accounting tries to write to this file, you get the
behaviour explained in this mail?

Please correct me if I have misread the situation.

Best


--

Joel Granados

Eric Dumazet

Jan 6, 2025, 10:27:18 AM
to Matthieu Baerts, Joel Granados, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
It might be less risky in terms of backports to patch mptcp and others.

I.e., just use Al's suggestion.

Thanks !

Matthieu Baerts

Jan 6, 2025, 10:34:45 AM
to Eric Dumazet, Joel Granados, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Hi Eric,

Thank you for your reply!
Thank you, will do! In fact, I already modified the kernel on my side,
but it is hard for me to validate that for the moment: it is nice to
have many trees around, but less when they fall on cables :)

Joel Granados

Jan 8, 2025, 9:37:27 AM
to Matthieu Baerts, Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot
Perfect. Thx for the summary. I'll remove this thread from my radar as
it seems that a fix has already been found.

Best

--

Joel Granados