KASAN: use-after-free Read in sock_release

92 views
Skip to first unread message

syzbot

unread,
Nov 29, 2017, 3:33:04 AM11/29/17
to da...@davemloft.net, linux-...@vger.kernel.org, net...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzkaller hit the following crash on
1d3b78bbc6e983fabb3fbf91b76339bf66e4a12c
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.

Unfortunately, I don't have any reproducer for this bug yet.


device syz3 left promiscuous mode
device syz3 entered promiscuous mode
==================================================================
BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0 net/socket.c:601
Read of size 8 at addr ffff8801c8dd1d10 by task syz-executor4/31085

CPU: 0 PID: 31085 Comm: syz-executor4 Not tainted 4.14.0+ #129
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
sock_release+0x1c6/0x1e0 net/socket.c:601
sock_close+0x16/0x20 net/socket.c:1125
__fput+0x333/0x7f0 fs/file_table.c:210
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:865
do_group_exit+0x149/0x400 kernel/exit.c:968
get_signal+0x73f/0x16c0 kernel/signal.c:2335
do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:809
exit_to_usermode_loop+0x214/0x310 arch/x86/entry/common.c:158
prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline]
syscall_return_slowpath+0x490/0x550 arch/x86/entry/common.c:264
entry_SYSCALL_64_fastpath+0x94/0x96
RIP: 0033:0x452879
RSP: 002b:00007fb1c2435ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000758100 RCX: 0000000000452879
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000758100
RBP: 0000000000758100 R08: 0000000000000304 R09: 00000000007580d8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000a6f7ff R14: 00007fb1c24369c0 R15: 000000000000000e

Allocated by task 31066:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3613
kmalloc include/linux/slab.h:499 [inline]
sock_alloc_inode+0xb4/0x300 net/socket.c:253
alloc_inode+0x65/0x180 fs/inode.c:208
new_inode_pseudo+0x69/0x190 fs/inode.c:890
sock_alloc+0x41/0x270 net/socket.c:565
__sock_create+0x148/0x850 net/socket.c:1225
sock_create net/socket.c:1301 [inline]
SYSC_socket net/socket.c:1331 [inline]
SyS_socket+0xeb/0x200 net/socket.c:1311
entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 3039:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3491 [inline]
kfree+0xca/0x250 mm/slab.c:3806
__rcu_reclaim kernel/rcu/rcu.h:190 [inline]
rcu_do_batch kernel/rcu/tree.c:2758 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline]
__rcu_process_callbacks kernel/rcu/tree.c:2979 [inline]
rcu_process_callbacks+0xe79/0x17d0 kernel/rcu/tree.c:2996
__do_softirq+0x29d/0xbb2 kernel/softirq.c:285

The buggy address belongs to the object at ffff8801c8dd1cc0
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 80 bytes inside of
128-byte region [ffff8801c8dd1cc0, ffff8801c8dd1d40)
The buggy address belongs to the page:
page:ffffea0007237440 count:1 mapcount:0 mapping:ffff8801c8dd1000 index:0x0
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffff8801c8dd1000 0000000000000000 0000000100000015
raw: ffffea0007086320 ffffea00070588a0 ffff8801db000640 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801c8dd1c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c8dd1c80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff8801c8dd1d00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
^
ffff8801c8dd1d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c8dd1e00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzk...@googlegroups.com.
Please credit me with: Reported-by: syzbot <syzk...@googlegroups.com>

syzbot will keep track of this bug report.
Once a fix for this bug is committed, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
config.txt
raw.log

Cong Wang

unread,
Nov 29, 2017, 2:37:25 PM11/29/17
to syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel, Linus Torvalds, Al Viro
(Cc'ing fs people...)

On Wed, Nov 29, 2017 at 12:33 AM, syzbot
<bot+9abea25706ae350223...@syzkaller.appspotmail.com>
wrote:
This looks more like a fs issue than network, my fs knowledge
is not good enough to justify why the hell the inode could be
destroyed before we release the fd.

My _guess_ is that it is because we defer the ____fput() to a
task work. If this is the case, then fs layer is not guilty for this.

On the other hand, if we have to blame net layer, it does look
suspicious on the RCU usage in sock_release() where we
claim RCU protection but I don't see we hold any RCU lock
there. Also, the code that deferences sock->wq is pretty much
useless now, at least I don't see it catches any bug though.


diff --git a/net/socket.c b/net/socket.c
index 42d8e9c9ccd5..b2390b5591a9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -598,9 +598,6 @@ void sock_release(struct socket *sock)
module_put(owner);
}

- if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
- pr_err("%s: fasync list not empty!\n", __func__);
-
this_cpu_sub(sockets_in_use, 1);
if (!sock->file) {
iput(SOCK_INODE(sock));

Linus Torvalds

unread,
Nov 29, 2017, 3:24:56 PM11/29/17
to Cong Wang, Al Viro, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
On Wed, Nov 29, 2017 at 11:37 AM, Cong Wang <xiyou.w...@gmail.com> wrote:
> (Cc'ing fs people...)
>
> On Wed, Nov 29, 2017 at 12:33 AM, syzbot wrote:
>> BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0 net/socket.c:601

Lovely.

Yeah, that is:

601 if (rcu_dereference_protected(sock->wq, 1)->fasync_list)

and as you say, that "rcu_dereference_protected()" is confusing, but
that should be ok because we have a ref to the inode, and we're really
just testing that the pointer is zero.

The call trace here is:

>> sock_release+0x1c6/0x1e0 net/socket.c:601
>> sock_close+0x16/0x20 net/socket.c:1125
>> __fput+0x333/0x7f0 fs/file_table.c:210
>> ____fput+0x15/0x20 fs/file_table.c:244
>> task_work_run+0x199/0x270 kernel/task_work.c:113

and there is no RCU protection anywhere, but it's really just a sanity
check, and the access _should_ be ok.

The stale access does seem to be because 'sock' (embedded in the
inode) itself that has been free'd:

>> Allocated by task 31066:
>> kmalloc include/linux/slab.h:499 [inline]
>> sock_alloc_inode+0xb4/0x300 net/socket.c:253
>> alloc_inode+0x65/0x180 fs/inode.c:208
>> new_inode_pseudo+0x69/0x190 fs/inode.c:890
>> sock_alloc+0x41/0x270 net/socket.c:565
>> __sock_create+0x148/0x850 net/socket.c:1225
>> sock_create net/socket.c:1301 [inline]
>> SYSC_socket net/socket.c:1331 [inline]
>> SyS_socket+0xeb/0x200 net/socket.c:1311
>
> This looks more like a fs issue than network, my fs knowledge
> is not good enough to justify why the hell the inode could be
> destroyed before we release the fd.

Ugh. The inode freeing really is confusing and fairly involved, but
the last free *should* happen as part of the final dput() that is done
at the end of __fput().

So in __fput() calls into the

if (file->f_op->release)
file->f_op->release(inode, file);

then the inode should still be around, because the final ref won't be
done until later. And RCU simply shouldn't be an issue, because of
that reference count on the inode.

So it smells like some reference counting went wrong. The socket inode
creation is a bit confusing, and then in "sock_release()" we do have
that

if (!sock->file) {
iput(SOCK_INODE(sock));
return;
}
sock->file = NULL;

which *also* tries to free the inode. I'm not sure what the logic (and
what the locking) behind that code all is.

What *is* the locking for "sock->file" anyway?

Al, can you take a look on the vfs side? But I'm inclined to blame the
socket code, because if we really had a "inode free'd early" issue at
a vfs level, I'd have expected us to see infinite chaos.

Linus

Eric Dumazet

unread,
Nov 29, 2017, 3:49:21 PM11/29/17
to Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel, Linus Torvalds, Al Viro
There is rcu protection for sock->wq, and the 1 in
rcu_dereference_protected(sock->wq, 1) is because we do not have a
lockdep convenient way to express that we are the last user of sock,
and about to free it.


> Also, the code that deferences sock->wq is pretty much
> useless now, at least I don't see it catches any bug though.
>
>
> diff --git a/net/socket.c b/net/socket.c
> index 42d8e9c9ccd5..b2390b5591a9 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -598,9 +598,6 @@ void sock_release(struct socket *sock)
>                 module_put(owner);
>         }
>
> -       if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
> -               pr_err("%s: fasync list not empty!\n", __func__);
> -
>

At this point, sock->wq must be valid, and freed later (by us)

This really looks like some other bug, and a late effect.








Al Viro

unread,
Nov 29, 2017, 9:07:27 PM11/29/17
to Linus Torvalds, Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
On Wed, Nov 29, 2017 at 12:24:55PM -0800, Linus Torvalds wrote:
> Ugh. The inode freeing really is confusing and fairly involved, but
> the last free *should* happen as part of the final dput() that is done
> at the end of __fput().

Note that struct socket is coallocated with its inode. _Normally_
from sock_alloc() (and that's the case here, apparently), but in several
cases it's embedded into another object. TUN and TAP - definitely,
might have been other added. Those should never be passed to sock_release()
at all.

> So in __fput() calls into the
>
> if (file->f_op->release)
> file->f_op->release(inode, file);
>
> then the inode should still be around, because the final ref won't be
> done until later. And RCU simply shouldn't be an issue, because of
> that reference count on the inode.
>
> So it smells like some reference counting went wrong. The socket inode
> creation is a bit confusing, and then in "sock_release()" we do have
> that
>
> if (!sock->file) {
> iput(SOCK_INODE(sock));
> return;
> }
> sock->file = NULL;
>
> which *also* tries to free the inode. I'm not sure what the logic (and
> what the locking) behind that code all is.

If socket has never gone through sock_alloc_file(), sock_release() on it
is called explicitly and frees the sucker. If it has been through
sock_alloc_file(), we must not call sock_release() directly and freeing
is done by iput() from final fput().

> What *is* the locking for "sock->file" anyway?

Pretty much assign-once - zeroing it in the end of sock_release() is
pure cosmetics (we'd damn better have no other references to that
sucker left anywhere; there's still a reference to embedded inode,
but that's it).

FWIW, looking through the callers of sock_alloc_file()... we might be
better off if it did sock_release() on failure. Then the calling
conventions become "sock_alloc_file() means not calling sock_release()
directly - either it'll be done by the final fput() on resulting file,
or by sock_alloc_file() itself".

Look:
1) in lustre:
sock_filp = sock_alloc_file(sock, 0, NULL);
if (IS_ERR(sock_filp)) {
sock_release(sock);
rc = PTR_ERR(sock_filp);
goto out;
}
2) in net/9p:
file = sock_alloc_file(csocket, 0, NULL);
if (IS_ERR(file)) {
pr_err("%s (%d): failed to map fd\n",
__func__, task_pid_nr(current));
sock_release(csocket);
kfree(p);
return PTR_ERR(file);
}
3) in sctp:
*newfile = sock_alloc_file(newsock, 0, NULL);
if (IS_ERR(*newfile)) {
put_unused_fd(retval);
sock_release(newsock);
retval = PTR_ERR(*newfile);
*newfile = NULL;
return retval;
}
4) in accept4():
newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
if (IS_ERR(newfile)) {
err = PTR_ERR(newfile);
put_unused_fd(newfd);
sock_release(newsock);
goto out_put;
}

5) called in sock_map_fd(), and the sole caller is
retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
if (retval < 0)
goto out_release;
...
out_release:
sock_release(sock);
return retval;
(with no fallthrough or other goto into out_release)

6) the second caller in socketpair():
newfile2 = sock_alloc_file(sock2, flags, NULL);
if (IS_ERR(newfile2)) {
err = PTR_ERR(newfile2);
goto out_fput_1;
}
...
out_fput_1:
fput(newfile1);
put_unused_fd(fd2);
put_unused_fd(fd1);
sock_release(sock2);
goto out;
(again, no fallthrough or other goto into out_fput_1)

7) the first caller in socketpair():
newfile1 = sock_alloc_file(sock1, flags, NULL);
if (IS_ERR(newfile1)) {
err = PTR_ERR(newfile1);
goto out_put_unused_both;
}
...
out_put_unused_both:
put_unused_fd(fd2);
out_put_unused_1:
put_unused_fd(fd1);
out_release_both:
sock_release(sock2);
out_release_1:
sock_release(sock1);
out:
return err;
No fallthrough or goto either. Sure, we get a failure exit unshared,
but AFAICS some reordering can simplify things quite a bit there.

8) kcm_clone(). Fucked in head - we allocate socket, then file, *THEN*
sock, then attach sock to socket (already attached to file), then finally
deign to initialize sock (already attached to socket, which is attached
to file). And, surprise, surprise, failure exits are all wrong.
Moreover, calling conventions are broken by design - after we'd put
the damn file into descriptor table we return the pointer to sock
to the caller. By that time it might have bloody well been destroyed
by close(2) from another thread; good thing the caller doesn't use
the damn thing. Unfortunately, it is doing this:
if (!err) {
if (copy_to_user((void __user *)arg, &info,
sizeof(info))) {
err = -EFAULT;
sys_close(info.fd);
}
}
which is also bogus - again, the descriptor might have been already
closed by another thread, with another one put in its place, etc.
The right way to do that is to do fd_install() _last_. After the
last failure exit, i.e. after copy_to_user() in that case.

KCM would've looked as a likely cause of that shit, if not for the
fact that object had been created by socket(2). It definitely needs
fixing, but that's not the cause of PITA here.

Incidentally, grepping for sys_close() shows another piece of fun in
net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S
IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should
never, ever be used that way. Sigh...

Al Viro

unread,
Nov 29, 2017, 9:24:39 PM11/29/17
to Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel, Linus Torvalds
IDGI. We are running into the object pointed to by sock->wq
already freed, right? So how the hell had we managed to _fetch_
the pointer in the first place? Freeing had been scheduled
by
wq = rcu_dereference_protected(ei->socket.wq, 1);
kfree_rcu(wq, rcu);
kmem_cache_free(sock_inode_cachep, ei);
so we should have
* sock_destroy_inode() run on another CPU while we are
in the middle of sock_release(), sock->wq fetched by sock_release(),
sock->wq fed to kfree_rcu() by sock_destroy_inode() *AND* freed
before sock_release() got around to dereferencing it.

Not impossible to hit, but... why hadn't we run into
much wider window? If that sock_destroy_inode() on another
CPU had gotten to the call right after that kfree_rcu(), we
would've seen use-after-free on attempt to fetch ->wq...

And it goes without saying that sock_destroy_inode() should
not have happened in parallel to sock_release(), or, for that matter,
to anything else done to struct socket instance...

Al Viro

unread,
Nov 29, 2017, 11:17:08 PM11/29/17
to Linus Torvalds, Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote:

> FWIW, looking through the callers of sock_alloc_file()... we might be
> better off if it did sock_release() on failure. Then the calling
> conventions become "sock_alloc_file() means not calling sock_release()
> directly - either it'll be done by the final fput() on resulting file,
> or by sock_alloc_file() itself".

FWIW^2: vfs.git#work.net is (completely untested) implementation of
that. KCM fixes + sock_alloc_file() calling conventions change.

That's _not_ a pull request, it obviously needs testing and review on
netdev. I like the way it looks, though...

Al Viro (3):
socketpair(): allocate descriptors first
fix kcm_clone()
make sock_alloc_file() do sock_release() on failures
Diffstat:
drivers/staging/lustre/lnet/lnet/lib-socket.c | 8 ++---
net/9p/trans_fd.c | 1 -
net/kcm/kcmsock.c | 68 ++++++++++++++---------------------------
net/sctp/socket.c | 1 -
net/socket.c | 110 +++++++++++++++++++++++++++----------------------------------------
5 files changed, 69 insertions(+), 119 deletions(-)

Cumulative diff:

diff --git a/drivers/staging/lustre/lnet/lnet/lib-socket.c b/drivers/staging/lustre/lnet/lnet/lib-socket.c
index 539a26444f31..7d49d4865298 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-socket.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-socket.c
@@ -71,16 +71,12 @@ lnet_sock_ioctl(int cmd, unsigned long arg)
}

sock_filp = sock_alloc_file(sock, 0, NULL);
- if (IS_ERR(sock_filp)) {
- sock_release(sock);
- rc = PTR_ERR(sock_filp);
- goto out;
- }
+ if (IS_ERR(sock_filp))
+ return PTR_ERR(sock_filp);

rc = kernel_sock_unlocked_ioctl(sock_filp, cmd, arg);

fput(sock_filp);
-out:
return rc;
}

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 985046ae4231..80f5c79053a4 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -839,7 +839,6 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
if (IS_ERR(file)) {
pr_err("%s (%d): failed to map fd\n",
__func__, task_pid_nr(current));
- sock_release(csocket);
kfree(p);
return PTR_ERR(file);
}
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 0b750a22c4b9..d4e98f20fc2a 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1625,60 +1625,30 @@ static struct proto kcm_proto = {
};

/* Clone a kcm socket. */
-static int kcm_clone(struct socket *osock, struct kcm_clone *info,
- struct socket **newsockp)
+static struct file *kcm_clone(struct socket *osock)
{
struct socket *newsock;
struct sock *newsk;
- struct file *newfile;
- int err, newfd;

- err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
- goto out;
+ return ERR_PTR(-ENFILE);

newsock->type = osock->type;
newsock->ops = osock->ops;

__module_get(newsock->ops->owner);

- newfd = get_unused_fd_flags(0);
- if (unlikely(newfd < 0)) {
- err = newfd;
- goto out_fd_fail;
- }
-
- newfile = sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name);
- if (IS_ERR(newfile)) {
- err = PTR_ERR(newfile);
- goto out_sock_alloc_fail;
- }
-
newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL,
&kcm_proto, true);
if (!newsk) {
- err = -ENOMEM;
- goto out_sk_alloc_fail;
+ sock_release(newsock);
+ return ERR_PTR(-ENOMEM);
}
-
sock_init_data(newsock, newsk);
init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux);

- fd_install(newfd, newfile);
- *newsockp = newsock;
- info->fd = newfd;
-
- return 0;
-
-out_sk_alloc_fail:
- fput(newfile);
-out_sock_alloc_fail:
- put_unused_fd(newfd);
-out_fd_fail:
- sock_release(newsock);
-out:
- return err;
+ return sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name);
}

static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
@@ -1708,17 +1678,25 @@ static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
}
case SIOCKCMCLONE: {
struct kcm_clone info;
- struct socket *newsock = NULL;
-
- err = kcm_clone(sock, &info, &newsock);
- if (!err) {
- if (copy_to_user((void __user *)arg, &info,
- sizeof(info))) {
- err = -EFAULT;
- sys_close(info.fd);
- }
- }
+ struct file *file;
+
+ info.fd = get_unused_fd_flags(0);
+ if (unlikely(info.fd < 0))
+ return info.fd;

+ file = kcm_clone(sock);
+ if (IS_ERR(file)) {
+ put_unused_fd(info.fd);
+ return PTR_ERR(file);
+ }
+ if (copy_to_user((void __user *)arg, &info,
+ sizeof(info))) {
+ put_unused_fd(info.fd);
+ fput(file);
+ return -EFAULT;
+ }
+ fd_install(info.fd, file);
+ err = 0;
break;
}
default:
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 3204a9b29407..8bb5163d6331 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5080,7 +5080,6 @@ static int sctp_getsockopt_peeloff_common(struct sock *sk, sctp_peeloff_arg_t *p
*newfile = sock_alloc_file(newsock, 0, NULL);
if (IS_ERR(*newfile)) {
put_unused_fd(retval);
- sock_release(newsock);
retval = PTR_ERR(*newfile);
*newfile = NULL;
return retval;
diff --git a/net/socket.c b/net/socket.c
index 42d8e9c9ccd5..05f361faec45 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -406,8 +406,10 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
name.len = strlen(name.name);
}
path.dentry = d_alloc_pseudo(sock_mnt->mnt_sb, &name);
- if (unlikely(!path.dentry))
+ if (unlikely(!path.dentry)) {
+ sock_release(sock);
return ERR_PTR(-ENOMEM);
+ }
path.mnt = mntget(sock_mnt);

d_instantiate(path.dentry, SOCK_INODE(sock));
@@ -415,9 +417,11 @@ struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
file = alloc_file(&path, FMODE_READ | FMODE_WRITE,
&socket_file_ops);
if (IS_ERR(file)) {
- /* drop dentry, keep inode */
+ /* drop dentry, keep inode for a bit */
ihold(d_inode(path.dentry));
path_put(&path);
+ /* ... and now kill it properly */
+ sock_release(sock);
return file;
}

@@ -1330,19 +1334,9 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)

retval = sock_create(family, type, protocol, &sock);
if (retval < 0)
- goto out;
-
- retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
- if (retval < 0)
- goto out_release;
-
-out:
- /* It may be already another descriptor 8) Not kernel problem. */
- return retval;
+ return retval;

-out_release:
- sock_release(sock);
- return retval;
+ return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

/*
@@ -1366,87 +1360,72 @@ SYSCALL_DEFINE4(socketpair, int, family, int, type, int, protocol,
flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

/*
+ * reserve descriptors and make sure we won't fail
+ * to return them to userland.
+ */
+ fd1 = get_unused_fd_flags(flags);
+ if (unlikely(fd1 < 0))
+ return fd1;
+
+ fd2 = get_unused_fd_flags(flags);
+ if (unlikely(fd2 < 0)) {
+ put_unused_fd(fd1);
+ return fd2;
+ }
+
+ err = put_user(fd1, &usockvec[0]);
+ if (err)
+ goto out;
+
+ err = put_user(fd2, &usockvec[1]);
+ if (err)
+ goto out;
+
+ /*
* Obtain the first socket and check if the underlying protocol
* supports the socketpair call.
*/

err = sock_create(family, type, protocol, &sock1);
- if (err < 0)
+ if (unlikely(err < 0))
goto out;

err = sock_create(family, type, protocol, &sock2);
- if (err < 0)
- goto out_release_1;
-
- err = sock1->ops->socketpair(sock1, sock2);
- if (err < 0)
- goto out_release_both;
-
- fd1 = get_unused_fd_flags(flags);
- if (unlikely(fd1 < 0)) {
- err = fd1;
- goto out_release_both;
+ if (unlikely(err < 0)) {
+ sock_release(sock1);
+ goto out;
}

- fd2 = get_unused_fd_flags(flags);
- if (unlikely(fd2 < 0)) {
- err = fd2;
- goto out_put_unused_1;
+ err = sock1->ops->socketpair(sock1, sock2);
+ if (unlikely(err < 0)) {
+ sock_release(sock2);
+ sock_release(sock1);
+ goto out;
}

newfile1 = sock_alloc_file(sock1, flags, NULL);
if (IS_ERR(newfile1)) {
err = PTR_ERR(newfile1);
- goto out_put_unused_both;
+ sock_release(sock2);
+ goto out;
}

newfile2 = sock_alloc_file(sock2, flags, NULL);
if (IS_ERR(newfile2)) {
err = PTR_ERR(newfile2);
- goto out_fput_1;
+ fput(newfile1);
+ goto out;
}

- err = put_user(fd1, &usockvec[0]);
- if (err)
- goto out_fput_both;
-
- err = put_user(fd2, &usockvec[1]);
- if (err)
- goto out_fput_both;
-
audit_fd_pair(fd1, fd2);

fd_install(fd1, newfile1);
fd_install(fd2, newfile2);
- /* fd1 and fd2 may be already another descriptors.
- * Not kernel problem.
- */
-
return 0;

-out_fput_both:
- fput(newfile2);
- fput(newfile1);
- put_unused_fd(fd2);
- put_unused_fd(fd1);
- goto out;
-
-out_fput_1:
- fput(newfile1);
- put_unused_fd(fd2);
- put_unused_fd(fd1);
- sock_release(sock2);
- goto out;
-
-out_put_unused_both:
+out:
put_unused_fd(fd2);
-out_put_unused_1:
put_unused_fd(fd1);
-out_release_both:
- sock_release(sock2);
-out_release_1:
- sock_release(sock1);
-out:
return err;
}

@@ -1562,7 +1541,6 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
if (IS_ERR(newfile)) {
err = PTR_ERR(newfile);
put_unused_fd(newfd);
- sock_release(newsock);
goto out_put;
}

Christoph Hellwig

unread,
Nov 30, 2017, 8:18:42 AM11/30/17
to Al Viro, Linus Torvalds, Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote:
> Incidentally, grepping for sys_close() shows another piece of fun in
> net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S
> IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should
> never, ever be used that way. Sigh...

Would be great do unexport the thing. Except that we also have
binfmt_misc (which looks legit) and autofs4, which on crack decided
that close() isn't a fun syscall, they'd much rather have an ioctl
that does exactly the same..

Al Viro

unread,
Nov 30, 2017, 10:46:59 AM11/30/17
to Christoph Hellwig, Linus Torvalds, Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
Yes, since binfmt_misc one is guaranteed that its descriptor table is
not shared - all callchains go through do_execveat_common(), where we'd
use unshare_files(). autofs one is... not in good taste, but still
safe; there the descriptor is preexisting and it's essentially a weird
way of spelling close(2). References from syscall tables are, of course,
OK. init/*.c uses are done pretty much from userland - they could have
been straight syscalls, if not for the lack of klibc in kernel tree.
Everything else, though...

IMO we need a whack-a-mole list somewhere; "new callers of sys_close()
anywhere outside of init/* and syscall tables" definitely should be
on it...

Dmitry Vyukov

unread,
Feb 13, 2018, 2:17:03 PM2/13/18
to Al Viro, Christoph Hellwig, Linus Torvalds, Cong Wang, syzbot, David Miller, LKML, Linux Kernel Network Developers, syzkall...@googlegroups.com, linux-fsdevel
#syz fix: fix kcm_clone()
Reply all
Reply to author
Forward
0 new messages