[syzbot] [mptcp?] general protection fault in proc_scheduler

syzbot

Jan 2, 2025, 9:12:28 AM

to da...@davemloft.net, edum...@google.com, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mat...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com

Hello,

syzbot found the following issue on:

HEAD commit: ccb98ccef0e5 Merge tag 'platform-drivers-x86-v6.13-4' of g..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=128f6ac4580000
kernel config: https://syzkaller.appspot.com/x/.config?x=86dd15278dbfe19f
dashboard link: https://syzkaller.appspot.com/bug?extid=e364f774c6f57f2c86d1
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d24eb225cff7/disk-ccb98cce.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/dd81532f8240/vmlinux-ccb98cce.xz
kernel image: https://storage.googleapis.com/syzbot-assets/18b08e4bbf40/bzImage-ccb98cce.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+e364f7...@syzkaller.appspotmail.com

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 1 UID: 0 PID: 5924 Comm: syz-executor Not tainted 6.13.0-rc5-syzkaller-00004-gccb98ccef0e5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206

RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
proc_sys_call_handler+0x403/0x5d0 fs/proc/proc_sysctl.c:601
__kernel_write_iter+0x318/0xa80 fs/read_write.c:612
__kernel_write+0xf6/0x140 fs/read_write.c:632
do_acct_process+0xcb0/0x14a0 kernel/acct.c:539
acct_pin_kill+0x2d/0x100 kernel/acct.c:192
pin_kill+0x194/0x7c0 fs/fs_pin.c:44
mnt_pin_kill+0x61/0x1e0 fs/fs_pin.c:81
cleanup_mnt+0x3ac/0x450 fs/namespace.c:1366
task_work_run+0x14e/0x250 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xad8/0x2d70 kernel/exit.c:938
do_group_exit+0xd3/0x2a0 kernel/exit.c:1087
get_signal+0x2576/0x2610 kernel/signal.c:3017
arch_do_signal_or_restart+0x90/0x7e0 arch/x86/kernel/signal.c:337
exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
do_syscall_64+0xda/0x250 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fee3cb87a6a
Code: Unable to access opcode bytes at 0x7fee3cb87a40.
RSP: 002b:00007fffcccac688 EFLAGS: 00000202 ORIG_RAX: 0000000000000037
RAX: 0000000000000000 RBX: 00007fffcccac710 RCX: 00007fee3cb87a6a
RDX: 0000000000000041 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007fffcccac6ac R09: 00007fffcccacac7
R10: 00007fffcccac710 R11: 0000000000000202 R12: 00007fee3cd49500
R13: 00007fffcccac6ac R14: 0000000000000000 R15: 00007fee3cd4b000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess), 1 bytes skipped:
0: 42 80 3c 38 00 cmpb $0x0,(%rax,%r15,1)
5: 0f 85 fe 02 00 00 jne 0x309
b: 4d 8b a4 24 08 09 00 mov 0x908(%r12),%r12
12: 00
13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1a: fc ff df
1d: 49 8d 7c 24 28 lea 0x28(%r12),%rdi
22: 48 89 fa mov %rdi,%rdx
25: 48 c1 ea 03 shr $0x3,%rdx
* 29: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2d: 0f 85 cc 02 00 00 jne 0x2ff
33: 4d 8b 7c 24 28 mov 0x28(%r12),%r15
38: 48 rex.W
39: 8d .byte 0x8d
3a: 84 24 c8 test %ah,(%rax,%rcx,8)
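
The faulting address can be cross-checked against the disassembly above. Generic KASAN on x86-64 keeps one shadow byte per 8-byte granule at (addr >> 3) + 0xdffffc0000000000, which is the constant loaded by the movabs. A minimal user-space sketch of that arithmetic follows; the helper names are made up for illustration and are not kernel functions. It suggests the trapping check was for a load at offset 0x28 from a NULL pointer in %r12:

```c
#include <stdint.h>

/* Shadow offset, taken from the movabs in the disassembly above. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL
/* One shadow byte covers 8 bytes of memory (the "shr $0x3"). */
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Hypothetical helper: address of the shadow byte for addr. */
static uint64_t kasan_shadow_of(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* Hypothetical helper: first address covered by the same shadow byte. */
static uint64_t granule_start(uint64_t addr)
{
	return addr & ~((1ULL << KASAN_SHADOW_SCALE_SHIFT) - 1);
}
```

With addr = 0x28 (RDI in the register dump) the shadow address is 0xdffffc0000000005, matching both the reported non-canonical address and RDX = 5; the granule 0x28-0x2f matches the range in the KASAN line.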

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Eric Dumazet

Jan 2, 2025, 10:21:58 AM

to syzbot, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mat...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com

I thought acct(2) was only allowing regular files.

acct_on() indeed has :

if (!S_ISREG(file_inode(file)->i_mode)) {
kfree(acct);
filp_close(file, NULL);
return -EACCES;
}

It seems there are other ways to call do_acct_process() targeting a sysfs file?

Hillf Danton

Jan 3, 2025, 5:33:12 AM

to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com

On Thu, Jan 2, 2025 at 3:12 PM syzbot

> syzbot found the following issue on:
>
> HEAD commit: ccb98ccef0e5 Merge tag 'platform-drivers-x86-v6.13-4' of g..
> git tree: upstream

> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000

#syz test

--- x/net/mptcp/ctrl.c
+++ y/net/mptcp/ctrl.c
@@ -122,7 +122,7 @@ static int mptcp_set_scheduler(const str
static int proc_scheduler(const struct ctl_table *ctl, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
- const struct net *net = current->nsproxy->net_ns;
+ const struct net *net;
char val[MPTCP_SCHED_NAME_MAX];
struct ctl_table tbl = {
.data = val,
@@ -130,6 +130,9 @@ static int proc_scheduler(const struct c
};
int ret;

+ if (current->flags & PF_EXITING)
+ return -ENXIO;
+ net = current->nsproxy->net_ns;
strscpy(val, mptcp_get_scheduler(net), MPTCP_SCHED_NAME_MAX);

ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
@@ -161,9 +164,12 @@ static int proc_blackhole_detect_timeout
int write, void *buffer, size_t *lenp,
loff_t *ppos)
{
- struct mptcp_pernet *pernet = mptcp_get_pernet(current->nsproxy->net_ns);
+ struct mptcp_pernet *pernet;
int ret;

+ if (current->flags & PF_EXITING)
+ return -ENXIO;
+ pernet = mptcp_get_pernet(current->nsproxy->net_ns);
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (write && ret == 0)
atomic_set(&pernet->active_disable_times, 0);
--

syzbot

Jan 3, 2025, 5:52:04 AM

to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-by: syzbot+e364f7...@syzkaller.appspotmail.com
Tested-by: syzbot+e364f7...@syzkaller.appspotmail.com

Tested on:

commit: 0bc21e70 MAINTAINERS: Remove Olof from SoC maintainers
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17380edf980000

kernel config: https://syzkaller.appspot.com/x/.config?x=86dd15278dbfe19f
dashboard link: https://syzkaller.appspot.com/bug?extid=e364f774c6f57f2c86d1
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40

patch: https://syzkaller.appspot.com/x/patch.diff?x=12f558b0580000

Note: testing is done by a robot and is best-effort only.

Matthieu Baerts

Jan 4, 2025, 1:38:59 PM

to Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro

Hi Eric,

Thank you for the bug report!

(...)

> I thought acct(2) was only allowing regular files.
>
> acct_on() indeed has :
>
> if (!S_ISREG(file_inode(file)->i_mode)) {
> kfree(acct);
> filp_close(file, NULL);
> return -EACCES;
> }
>
> It seems there are other ways to call do_acct_process() targeting a sysfs file ?

Just to be sure I'm not misunderstanding your comment: do you mean that
here, the issue is *not* in MPTCP code where we get the 'struct net'
pointer via 'current->nsproxy->net_ns', but in the FS part, right?

Here, we have an issue because 'current->nsproxy' is NULL, but is it
normal? Or should we simply exit with an error if it is the case because
we are in an exiting phase?

I'm just a bit confused, because it looks like 'net' is retrieved from
different places elsewhere when dealing with sysfs: some get it from
'current' like us, some assign 'net' to 'table->extra2', others get it
from 'table->data' (via a container_of()), etc. Maybe we should not use
'current->nsproxy->net_ns' here then?

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
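
For readers unfamiliar with the 'table->data' plus container_of() pattern mentioned in the message above, here is a minimal user-space sketch. The structure and its fields are hypothetical stand-ins, not the real 'struct mptcp_pernet' layout, and the macro is a simplified version of the kernel's:

```c
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical per-netns state; field names are illustrative only. */
struct fake_pernet {
	int enabled;
	unsigned int add_addr_timeout;
	char scheduler[16];
};

/* Recover the owning structure from the pointer kept in table->data. */
static struct fake_pernet *pernet_from_data(void *data)
{
	return container_of(data, struct fake_pernet, scheduler);
}
```

The point of the pattern is that the handler finds its per-netns state from a pointer fixed at registration time, without looking at 'current' at all.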

Eric Dumazet

Jan 4, 2025, 1:53:36 PM

to Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro

I do think this is a bug in process accounting, not in networking.

It might make sense to output a record on a regular file, but probably
not on any other files.

diff --git a/kernel/acct.c b/kernel/acct.c
index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
const struct cred *orig_cred;
struct file *file = acct->file;

+ if (S_ISREG(file_inode(file)->i_mode))
+ return;
+
/*
* Accounting records are not subject to resource limits.
*/

Al Viro

Jan 4, 2025, 2:00:21 PM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sat, Jan 04, 2025 at 07:53:22PM +0100, Eric Dumazet wrote:

> I do think this is a bug in process accounting, not in networking.
>
> It might make sense to output a record on a regular file, but probably
> not on any other files.
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> const struct cred *orig_cred;
> struct file *file = acct->file;
>
> + if (S_ISREG(file_inode(file)->i_mode))
> + return;

... won't help, since the file in question *is* a regular file. IOW, it's
a wrong predicate here.

Matthieu Baerts

Jan 4, 2025, 2:12:11 PM

to Al Viro, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Hi Al, Eric,

On my side, I am not able to reproduce the issue with this patch
applied. Without it, the issue is very easy to reproduce. (But I don't
know whether the patch has other side effects that prevent the issue
from happening: with the patch, the logs no longer contain the heaps of
"Process accounting resumed" messages that I had before.)

Matthieu Baerts

Jan 4, 2025, 2:12:15 PM

to Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro, Joel Granados

Hi Eric,

(+cc Joel)

Thank you for your reply!

OK, thank you, that's clearer.

So this is then more a question for Joel, right?

Do you plan to send this patch to him?

#syz set subsystems: fs

Al Viro

Jan 4, 2025, 3:09:20 PM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sat, Jan 04, 2025 at 07:53:22PM +0100, Eric Dumazet wrote:

> I do think this is a bug in process accounting, not in networking.
>
> It might make sense to output a record on a regular file, but probably
> not on any other files.
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> const struct cred *orig_cred;
> struct file *file = acct->file;
>
> + if (S_ISREG(file_inode(file)->i_mode))
> + return;

Wait, what? OK, that will stop attempts to write there - or to any
other regular file.

If you modify that to
if (!S_ISREG(...))
you seem to have intended, it won't break the normal behaviour but it
won't help with sysctls.

Al Viro

Jan 4, 2025, 3:21:33 PM

to Matthieu Baerts, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sat, Jan 04, 2025 at 08:11:49PM +0100, Matthieu Baerts wrote:
> >> + if (S_ISREG(file_inode(file)->i_mode))
^^^^^^^^^

> >> + return;
> >
> > ... won't help, since the file in question *is* a regular file. IOW, it's
> > a wrong predicate here.
>
> On my side, it looks like I'm not able to reproduce the issue with this
> patch. Without it, it is very easy to reproduce it. (But I don't know if
> there are other consequences that would avoid the issue to happen: when
> looking at the logs, with the patch, I don't have heaps of "Process
> accounting resumed" messages that I had before.)

Unsurprisingly so, since it rejects all regular files due to a typo;
fix that and you'll see that the oops is still there.

The real issue (and the one that affects more than just this scenario) is
the use of current->nsproxy->net_ns to get to the damn thing.

Why not something like

static int proc_scheduler(const struct ctl_table *ctl, int write,
			  void *buffer, size_t *lenp, loff_t *ppos)
{
	char (*data)[MPTCP_SCHED_NAME_MAX] = ctl->data;
	char val[MPTCP_SCHED_NAME_MAX];
	struct ctl_table tbl = {
		.data = val,
		.maxlen = MPTCP_SCHED_NAME_MAX,
	};
	struct mptcp_sched_ops *sched;
	int ret;

	strscpy(val, *data, MPTCP_SCHED_NAME_MAX);

	ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
	if (write && ret == 0) {
		rcu_read_lock();
		sched = mptcp_sched_find(val);
		if (sched)
			strscpy(*data, val, MPTCP_SCHED_NAME_MAX);
		else
			ret = -ENOENT;
		rcu_read_unlock();
	}
	return ret;
}

seeing that the data object you really want to access is
mptcp_get_pernet(net)->scheduler and you have that pointer
stored in table->data at the registration time?
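
The idea can be tried outside the kernel with a reduced model. Everything below is a stand-in invented for illustration: 'fake_ctl_table' models only the .data member of struct ctl_table, and 'sched_exists' replaces mptcp_sched_find() with a fixed list of names.

```c
#include <stdio.h>
#include <string.h>

#define SCHED_NAME_MAX 16

/* Minimal stand-in for struct ctl_table; only .data is modelled. */
struct fake_ctl_table {
	void *data;
};

/* Hypothetical lookup standing in for mptcp_sched_find(). */
static int sched_exists(const char *name)
{
	return strcmp(name, "default") == 0 || strcmp(name, "other") == 0;
}

/*
 * Write path that reaches its target only through ctl->data, so it
 * never consults the calling task's namespace.
 * Returns 0 on success, -1 if the scheduler name is unknown.
 */
static int set_scheduler(const struct fake_ctl_table *ctl, const char *name)
{
	char (*data)[SCHED_NAME_MAX] = ctl->data;

	if (!sched_exists(name))
		return -1;
	snprintf(*data, SCHED_NAME_MAX, "%s", name);
	return 0;
}
```

A table registered with .data pointing at one namespace's buffer can only ever modify that buffer, whichever task performs the write and whatever state that task is in.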

Eric Dumazet

Jan 5, 2025, 3:32:50 AM

to Al Viro, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sat, Jan 4, 2025 at 9:21 PM Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> On Sat, Jan 04, 2025 at 08:11:49PM +0100, Matthieu Baerts wrote:
> > >> + if (S_ISREG(file_inode(file)->i_mode))
> ^^^^^^^^^
> > >> + return;
> > >
> > > ... won't help, since the file in question *is* a regular file. IOW, it's
> > > a wrong predicate here.
> >
> > On my side, it looks like I'm not able to reproduce the issue with this
> > patch. Without it, it is very easy to reproduce it. (But I don't know if
> > there are other consequences that would avoid the issue to happen: when
> > looking at the logs, with the patch, I don't have heaps of "Process
> > accounting resumed" messages that I had before.)
>
> Unsurprisingly so, since it rejects all regular files due to a typo;
> fix that and you'll see that the oops is still there.
>
> The real issue (and the one that affects more than just this scenario) is
> the use of current->nsproxy->net to get to the damn thing.

According to grep, many other places read current->nsproxy->net_ns
directly, for instance net/sctp/sysctl.c.
Should we change them all?

Perhaps an alternative would be to add a generic check in
proc_sys_call_handler()

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 27a283d85a6e7df1a7edbfb513ce75832363e2e6..84968b10ce86e7fd88c6e3c43f52b601394b056f 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -576,6 +576,8 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
error = -EINVAL;
if (!table->proc_handler)
goto out;
+ if (unlikely(current->flags & PF_EXITING))
+ goto out;

/* don't even try if the size is too large */
error = -ENOMEM;

Thanks.

Al Viro

Jan 5, 2025, 6:29:55 AM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sun, Jan 05, 2025 at 09:32:36AM +0100, Eric Dumazet wrote:

> According to grep, we have many other places directly reading
> current->nsproxy->net_ns
> For instance in net/sctp/sysctl.c
> Should we change them all ?

Depends - do you want their contents match the netns of opener (as,
AFAICS, for ipv4 sysctls) or that of the reader?

Eric Dumazet

Jan 5, 2025, 11:52:34 AM

to Al Viro, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

My worry is only that a malicious user could crash the host with
current kernels: not through this particular MPTCP crash, but through
all the unaware users of current->nsproxy in sysctl handlers.

Back to MPTCP:

Following the convention used by the other mptcp sysctls (enabled,
add_addr_timeout, checksum_enabled, allow_join_initial_addr_port...)
is better for consistency.

Matthieu Baerts

Jan 5, 2025, 12:03:43 PM

to Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Hi Eric,

Indeed, I can do the modifications to stop using current->nsproxy in
MPTCP. I can do the same in SCTP.

Do you plan to send your patch modifying proc_sysctl.c? It is just to
know if I should mark my patches as fixes, and split them to ease the
backports -- each helper using current->nsproxy has been introduced in
different commits -- or if I can send them to net-next instead.

Matthieu Baerts

Jan 5, 2025, 12:03:46 PM

to Al Viro, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Hi Al,

On 04/01/2025 21:21, Al Viro wrote:
> The real issue (and the one that affects more than just this scenario) is
> the use of current->nsproxy->net to get to the damn thing.
>
> Why not something like

(...)

> seeing that the data object you really want to access is
> mptcp_get_pernet(net)->scheduler and you have that pointer
> stored in table->data at the registration time?

Good point, thank you for the suggestion! :)

I will do this modification.

Al Viro

Jan 5, 2025, 2:54:42 PM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sun, Jan 05, 2025 at 05:52:19PM +0100, Eric Dumazet wrote:
> On Sun, Jan 5, 2025 at 12:29 PM Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > On Sun, Jan 05, 2025 at 09:32:36AM +0100, Eric Dumazet wrote:
> >
> > > According to grep, we have many other places directly reading
> > > current->nsproxy->net_ns
> > > For instance in net/sctp/sysctl.c
> > > Should we change them all ?
> >
> > Depends - do you want their contents match the netns of opener (as,
> > AFAICS, for ipv4 sysctls) or that of the reader?
>
> I am only worried that a malicious user could crash the host with
> current kernels,
> not about this MPTP crash, but all unaware users of current->nsproxy
> in sysctl handlers.

I don't hate your mitigation in proc_sysctl.c, but IMO there are two
problems mixed here. One is that access to a per-netns sysctl table
should probably act on the netns the table was created for, which may
not coincide with the reader's/writer's netns. The other is that
access to current->nsproxy->net_ns would simply oops if attempted after
current->nsproxy had been dropped.

So I suspect that current->nsproxy->net_ns shouldn't be used in
per-netns sysctls, for consistency's sake (note that it can get more
serious than just consistency, if you have e.g. a spinlock taken
in something hanging off the current netns to protect access to
something table->data points to).

As for the mitigation in fs/proc/proc_sysctl.c... might be useful,
if it comes with a clear comment about the reasons it's there.

Al Viro

Jan 5, 2025, 3:51:01 PM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sun, Jan 05, 2025 at 07:54:34PM +0000, Al Viro wrote:

> So I suspect that current->nsproxy->netns shouldn't be used in
> per-netns sysctls for consistency sake (note that it can get more
> serious than just consistency, if you have e.g. a spinlock taken
> in something hanging off current netns to protect access to
> something table->data points to).
>
> As for the mitigation in fs/proc/proc_sysctl.c... might be useful,
> if it comes with a clear comment about the reasons it's there.

FWIW, looks like we have two such in mptcp (with sysctls next to
those definitely accessing the netns of opener rather than reader/writer),
two in rds (both inconsistent on the write side -

	struct net *net = current->nsproxy->net_ns;
	int err;

	err = proc_dointvec_minmax(ctl, write, buffer, lenp, fpos);
	if (err < 0) {
		pr_warn("Invalid input. Must be >= %d\n",
			*(int *)(ctl->extra1));
		return err;
	}
	if (write)
		rds_tcp_sysctl_reset(net);
will modify ctl->data, which points to &rtn->{snd,rcv}buf_size, with
rtn == net_generic(net, rds_tcp_netid) and net being for opener's netns
and then call rds_tcp_sysctl_reset(net) with net being the writer's
netns) and 6 in sctp. At least some of sctp ones are also inconsistent
on the write side; e.g.
static int proc_sctp_do_rto_min(const struct ctl_table *ctl, int write,
				void *buffer, size_t *lenp, loff_t *ppos)
{
	struct net *net = current->nsproxy->net_ns;
	unsigned int min = *(unsigned int *) ctl->extra1;
	unsigned int max = *(unsigned int *) ctl->extra2;
	struct ctl_table tbl;
	int ret, new_value;

	memset(&tbl, 0, sizeof(struct ctl_table));
	tbl.maxlen = sizeof(unsigned int);

	if (write)
		tbl.data = &new_value;
	else
		tbl.data = &net->sctp.rto_min;

	ret = proc_dointvec(&tbl, write, buffer, lenp, ppos);
	if (write && ret == 0) {
		if (new_value > max || new_value < min)
			return -EINVAL;

		net->sctp.rto_min = new_value;
	}

	return ret;
}
has max taken from ctl->extra2, which is &net->sctp.rto_max of the
opener's netns, but the value capped by that is stored into
net->sctp.rto_min of the *writer's* netns. So the logic that is supposed
to prevent rto_min > rto_max can be bypassed; no idea how far that can
escalate, but it's clearly not what the code intends.

So I'd rather document the "don't assume that current->nsproxy->net_ns will
point to the same netns this ctl is for" rule and fix those 10 instances - at
least some smell seriously fishy. It's not just the acct(2) weirdness and
the damage may be worse than an oops...

Al Viro

Jan 5, 2025, 4:12:04 PM

to Eric Dumazet, Matthieu Baerts, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

On Sun, Jan 05, 2025 at 08:50:56PM +0000, Al Viro wrote:

> has max taken from ctl->extra2, which is &net->sctp.rto_max of the
> opener's netns, but the value capped by that in stored into
> net->sctp.rto_min of *writer's* netns. So the logics that is supposed
> to prevent rto_min > rto_max can be bypassed; no idea how much can that
> escalate to, but it's clearly not what the code intends.

Speaking of which, the logic that tries to maintain rto_min <= rto_max is
broken in another way. There's no exclusion in those suckers. IOW, if
we have set rto_min to 1 and rto_max to 10000, two processes can try to
write 1000 to rto_min and 10 to rto_max resp., with successful validations
done against the original state in both, followed by actual stores.
Result is rto_min == 1000 and rto_max == 10, which is probably not what
one wants there...

IOW, the validation and stores should be atomic; the same goes for another
pair (pf_retrans <= ps_retrans). Again, I've no idea how severe it is,
but result seems to be at least contrary to expectation of the code
authors...
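
One way to picture "validation and stores should be atomic" is to do both under a single lock. The sketch below is a user-space model only: the structure, the function names and the C11 atomic_flag spinlock are assumptions made for this illustration, not what the SCTP code does or what a kernel fix would necessarily use.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical RTO pair guarded by one lock; not the kernel's layout. */
struct rto_limits {
	atomic_flag lock;
	unsigned int rto_min;
	unsigned int rto_max;
};

static void rto_lock(struct rto_limits *l)
{
	while (atomic_flag_test_and_set_explicit(&l->lock,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void rto_unlock(struct rto_limits *l)
{
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* Check and store happen in the same critical section. */
static int set_rto_min(struct rto_limits *l, unsigned int val)
{
	int ret = 0;

	rto_lock(l);
	if (val > l->rto_max)
		ret = -EINVAL;
	else
		l->rto_min = val;
	rto_unlock(l);
	return ret;
}

static int set_rto_max(struct rto_limits *l, unsigned int val)
{
	int ret = 0;

	rto_lock(l);
	if (val < l->rto_min)
		ret = -EINVAL;
	else
		l->rto_max = val;
	rto_unlock(l);
	return ret;
}
```

With this shape, whichever of the two writers from the example above takes the lock second is validated against the state left by the first, so one of them fails with -EINVAL instead of both succeeding.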

Matthieu Baerts

Jan 6, 2025, 9:27:55 AM

to Joel Granados, Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Hi Joel, Eric, Al,

On 06/01/2025 14:32, Joel Granados wrote:

> If this is the case, can you point me to the place where this happens?

>
>>>>
>>>> Just to be sure I'm not misunderstanding your comment: do you mean that
>>>> here, the issue is *not* in MPTCP code where we get the 'struct net'
>>>> pointer via 'current->nsproxy->net_ns', but in the FS part, right?
>>>>
>>>> Here, we have an issue because 'current->nsproxy' is NULL, but is it
>>>> normal? Or should we simply exit with an error if it is the case because
>>>> we are in an exiting phase?
>>>>
>>>> I'm just a bit confused, because it looks like 'net' is retrieved from
>>>> different places elsewhere when dealing with sysfs: some get it from
>>>> 'current' like us, some assign 'net' to 'table->extra2', others get it
>>>> from 'table->data' (via a container_of()), etc. Maybe we should not use
>>>> 'current->nsproxy->net_ns' here then?
>>>
>>> I do think this is a bug in process accounting, not in networking.
>>>
>>> It might make sense to output a record on a regular file, but probably
>>> not on any other files.

> It for sure does not make sense to output a record on a sysctl file that
> has a maxlen of just 3*sizeof(int) (kernel/acct.c:79).

>
>>>
>>> diff --git a/kernel/acct.c b/kernel/acct.c
>>> index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
>>> 100644
>>> --- a/kernel/acct.c
>>> +++ b/kernel/acct.c
>>> @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
>>> const struct cred *orig_cred;
>>> struct file *file = acct->file;
>>>
>>> + if (S_ISREG(file_inode(file)->i_mode))
>>> + return;
>>> +

> This seems like it does not handle the actual culprit which is. Why is
> the sysctl file being used for the accounting.

>
>>> /*
>>> * Accounting records are not subject to resource limits.
>>> */
>>
>> OK, thank you, that's clearer.
>>
>> So this is then more a question for Joel, right?
>>
>> Do you plan to send this patch to him?
>>
>> #syz set subsystems: fs
>>
>> Cheers,
>> Matt
>> --
>> Sponsored by the NGI0 Core fund.
>>
>

> So what is happening is that:
> 1. The accounting file is set to a non-sysctl file.
> 2. And when accounting tries to write to this file, you get the
> behaviour explained in this mail?
>
> Please correct me if I have miss-read the situation.

@Joel: Thank you for your reply!

I'm sorry, I'm not sure whether I can help here. I hope Eric and/or Al
can jump in.

What I can say is that the original issue has been found by syzbot, and
the reproducer [1] shows that 3 syscalls have been used:
- openat('/proc/sys/net/mptcp/scheduler')
- mprotect()
- acct()

Please also note that the conversation continued in a sub-thread where
you are not in the Cc list, see [2]. In short, Eric suggested another
patch only for sysfs, and Al recommended dropping the use of
'current->nsproxy'.

On my side, I'm looking at dropping the use of 'current->nsproxy' in
sysctl callbacks. I guess such patches will be seen as fixes, except if
Eric's new patch is enough for stable?

[1] https://syzkaller.appspot.com/x/repro.syz?x=1245eaf8580000
[2]
https://lore.kernel.org/netdev/67769ecb.050a022...@google.com/T/#m862d0913ebfcec5e462a9c33b47bc3f6440a2900

Joel Granados

Jan 6, 2025, 9:45:16 AM

to Matthieu Baerts, Eric Dumazet, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot, Al Viro

On Sat, Jan 04, 2025 at 08:11:52PM +0100, Matthieu Baerts wrote:

If this is the case, can you point me to the place where this happens?

> >>

> >> Just to be sure I'm not misunderstanding your comment: do you mean that
> >> here, the issue is *not* in MPTCP code where we get the 'struct net'
> >> pointer via 'current->nsproxy->net_ns', but in the FS part, right?
> >>
> >> Here, we have an issue because 'current->nsproxy' is NULL, but is it
> >> normal? Or should we simply exit with an error if it is the case because
> >> we are in an exiting phase?
> >>
> >> I'm just a bit confused, because it looks like 'net' is retrieved from
> >> different places elsewhere when dealing with sysfs: some get it from
> >> 'current' like us, some assign 'net' to 'table->extra2', others get it
> >> from 'table->data' (via a container_of()), etc. Maybe we should not use
> >> 'current->nsproxy->net_ns' here then?
> >
> > I do think this is a bug in process accounting, not in networking.
> >
> > It might make sense to output a record on a regular file, but probably
> > not on any other files.

It for sure does not make sense to output a record on a sysctl file that
has a maxlen of just 3*sizeof(int) (kernel/acct.c:79).

> >

> > diff --git a/kernel/acct.c b/kernel/acct.c
> > index 179848ad33e978a557ce695a0d6020aa169177c6..a211305cb930f6860d02de7f45ebd260ae03a604
> > 100644
> > --- a/kernel/acct.c
> > +++ b/kernel/acct.c
> > @@ -495,6 +495,9 @@ static void do_acct_process(struct bsd_acct_struct *acct)
> > const struct cred *orig_cred;
> > struct file *file = acct->file;
> >
> > + if (S_ISREG(file_inode(file)->i_mode))
> > + return;
> > +

This does not seem to handle the actual culprit, which is: why is
the sysctl file being used for the accounting?

> > /*
> > * Accounting records are not subject to resource limits.
> > */
>
> OK, thank you, that's clearer.
>
> So this is then more a question for Joel, right?
>
> Do you plan to send this patch to him?
>
> #syz set subsystems: fs
>
> Cheers,
> Matt
> --
> Sponsored by the NGI0 Core fund.
>

So what is happening is that:
1. The accounting file is set to a non-sysctl file.
2. And when accounting tries to write to this file, you get the
behaviour explained in this mail?

Please correct me if I have misread the situation.

Best

--

Joel Granados

Eric Dumazet

Jan 6, 2025, 10:27:18 AM

to Matthieu Baerts, Joel Granados, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

It might be less risky in terms of backports to patch mptcp and others.

I.e., just use Al's suggestion.

Thanks !

Matthieu Baerts

Jan 6, 2025, 10:34:45 AM

to Eric Dumazet, Joel Granados, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Hi Eric,

Thank you for your reply!

Thank you, will do! In fact, I already modified the kernel on my side,
but it is hard for me to validate that for the moment: it is nice to
have many trees around, but less when they fall on cables :)

Joel Granados

Jan 8, 2025, 9:37:27 AM

to Matthieu Baerts, Eric Dumazet, Al Viro, da...@davemloft.net, gel...@kernel.org, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, mart...@kernel.org, mp...@lists.linux.dev, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com, syzbot

Perfect. Thx for the summary. I'll remove this thread from my radar as
it seems that a fix has already been found.

Best

--

Joel Granados
