Use-after-free in ep_remove_wait_queue

Dmitry Vyukov

Oct 12, 2015, 7:08:15 AM
to Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook
Hello,

The following program causes a use-after-free in the kernel:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <syscall.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h> /* for syscall(); missing from the generated code */

int main()
{
        /* map fixed pages used as scratch memory by the calls below */
        long r0 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul,
                          0x32ul, 0xfffffffffffffffful, 0x0ul);
        long r1 = syscall(SYS_mmap, 0x20000000ul, 0x1000ul, 0x3ul,
                          0x32ul, 0xfffffffffffffffful, 0x0ul);
        /* AF_UNIX socketpair (type 3 = SOCK_RAW, treated as SOCK_DGRAM
           by af_unix); the two fds land at 0x20000ffc and 0x20001000 */
        long r2 = syscall(SYS_socketpair, 0x1ul, 0x3ul, 0x1ul, 0x20000ffcul);
        long r3 = -1;
        if (r2 != -1)
                r3 = *(uint32_t*)0x20000ffc;
        long r4 = -1;
        if (r2 != -1)
                r4 = *(uint32_t*)0x20001000;
        long r5 = syscall(SYS_epoll_create, 0x1ul);
        long r6 = syscall(SYS_mmap, 0x20003000ul, 0x1000ul, 0x3ul,
                          0x32ul, 0xfffffffffffffffful, 0x0ul);
        /* dup3 closes the socket originally behind fd r3; fd r3 now
           refers to r4's socket, whose peer is the released socket */
        long r7 = syscall(SYS_dup3, r4, r3, 0x80000ul);
        /* struct epoll_event at 0x20003000: events = 0x6, data = 0x6 */
        *(uint32_t*)0x20003000 = 0x6;
        *(uint32_t*)0x20003004 = 0x2;
        *(uint64_t*)0x20003008 = 0x6;
        /* EPOLL_CTL_ADD: unix_dgram_poll() registers on the released
           peer's peer_wait queue */
        long r11 = syscall(SYS_epoll_ctl, r5, 0x1ul, r3, 0x20003000ul);
        long r12 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul,
                           0x32ul, 0xfffffffffffffffful, 0x0ul);
        memcpy((void*)0x20002000, "\x00", 1);
        /* sendmsg to the dead peer drops the last reference to it */
        long r14 = syscall(SYS_write, r7, 0x20002000ul, 0x1ul);
        *(uint64_t*)0x20001a4d = 0x5;
        long r16 = syscall(SYS_epoll_pwait, r5, 0x20004000ul, 0x1ul,
                           0x1ul, 0x20001a4dul, 0x8ul);
        return 0;
}


on commit bcee19f424a0d8c26ecf2607b73c690802658b29
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)

BUG: KASan: use after free in remove_wait_queue+0xfb/0x120 at addr
ffff88002db3cf50
Write of size 8 by task syzkaller_execu/10568
================================================================
BUG UNIX (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------

Call Trace:
[< inline >] __list_del include/linux/list.h:89
[< inline >] list_del include/linux/list.h:107
[< inline >] __remove_wait_queue include/linux/wait.h:145
[<ffffffff811bfbab>] remove_wait_queue+0xfb/0x120 kernel/sched/wait.c:50
[< inline >] ep_remove_wait_queue fs/eventpoll.c:524
[<ffffffff8154094b>] ep_unregister_pollwait.isra.7+0x10b/0x1c0
fs/eventpoll.c:542
[<ffffffff81541417>] ep_free+0x97/0x190 fs/eventpoll.c:759 (discriminator 3)
[<ffffffff81541554>] ep_eventpoll_release+0x44/0x60 fs/eventpoll.c:791
[<ffffffff814772ad>] __fput+0x21d/0x6e0 fs/file_table.c:208
[<ffffffff814777f5>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff81150094>] task_work_run+0x164/0x1f0 kernel/task_work.c:115
(discriminator 1)
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff810fe7ae>] do_exit+0xa4e/0x2d40 kernel/exit.c:746
[<ffffffff81104426>] do_group_exit+0xf6/0x340 kernel/exit.c:874
[< inline >] SYSC_exit_group kernel/exit.c:885
[<ffffffff8110468d>] SyS_exit_group+0x1d/0x20 kernel/exit.c:883
[<ffffffff82e424d1>] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Allocated in sk_prot_alloc+0x69/0x340 age=6 cpu=1 pid=10568
[< none >] __slab_alloc+0x426/0x470 mm/slub.c:2402
[< inline >] slab_alloc_node mm/slub.c:2470
[< inline >] slab_alloc mm/slub.c:2512
[< none >] kmem_cache_alloc+0x10d/0x140 mm/slub.c:2517
[< none >] sk_prot_alloc+0x69/0x340 net/core/sock.c:1329
[< none >] sk_alloc+0x33/0x280 net/core/sock.c:1404
[< none >] unix_create1+0x5e/0x3d0 net/unix/af_unix.c:648
[< none >] unix_create+0x14d/0x1f0 net/unix/af_unix.c:707
[< none >] __sock_create+0x1f1/0x4c0 net/socket.c:1169
[< inline >] sock_create net/socket.c:1209
[< inline >] SYSC_socketpair net/socket.c:1281
[< none >] SyS_socketpair+0x10f/0x4d0 net/socket.c:1260
[< none >] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Freed in sk_destruct+0x2e9/0x400 age=9 cpu=1 pid=10568
[< none >] __slab_free+0x12d/0x280 mm/slub.c:2587 (discriminator 1)
[< inline >] slab_free mm/slub.c:2736
[< none >] kmem_cache_free+0x161/0x180 mm/slub.c:2745
[< inline >] sk_prot_free net/core/sock.c:1374
[< none >] sk_destruct+0x2e9/0x400 net/core/sock.c:1452
[< none >] __sk_free+0x57/0x200 net/core/sock.c:1460
[< none >] sk_free+0x30/0x40 net/core/sock.c:1471
[< inline >] sock_put include/net/sock.h:1593
[< none >] unix_dgram_sendmsg+0xeaf/0x1100 net/unix/af_unix.c:1571
[< inline >] sock_sendmsg_nosec net/socket.c:610
[< none >] sock_sendmsg+0xca/0x110 net/socket.c:620
[< none >] sock_write_iter+0x216/0x3a0 net/socket.c:819
[< inline >] new_sync_write fs/read_write.c:478
[< none >] __vfs_write+0x2ed/0x3d0 fs/read_write.c:491
[< none >] vfs_write+0x173/0x4a0 fs/read_write.c:538
[< inline >] SYSC_write fs/read_write.c:585
[< none >] SyS_write+0x108/0x220 fs/read_write.c:577
[< none >] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Slab 0xffffea0000b6ce00 objects=26 used=4 fp=0xffff88002db3cc00
flags=0x100000000004080
INFO: Object 0xffff88002db3cc00 @offset=19456 fp=0xffff88002db3c280
CPU: 1 PID: 10568 Comm: syzkaller_execu Tainted: G B 4.3.0-rc2+ #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
00000000ffffffff ffff880026647a80 ffffffff81a425a0 ffff88002e17c000
ffff88002db3cc00 ffff88002db38000 ffff880026647ab0 ffffffff8145a554
ffff88002e17c000 ffffea0000b6ce00 ffff88002db3cc00 ffff88002db3cf40

Memory state around the buggy address:
ffff88002db3ce00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88002db3ce80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88002db3cf00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88002db3cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88002db3d000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
=========================================================

Found with syzkaller fuzzer.
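
The autogenerated reproducer boils down to roughly the following
hand-written sketch (a distillation for illustration, not part of the
original report; believed to exercise the same paths shown in the
allocation/free traces above):

/* Sketch: how epoll ends up holding a stale peer_wait pointer. */
#include <sys/socket.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
        int fds[2], ep;
        struct epoll_event ev = { .events = EPOLLOUT };

        socketpair(AF_UNIX, SOCK_DGRAM, 0, fds);
        ep = epoll_create(1);

        /* Close one end. Its struct sock stays alive because the other
         * end still references it as its peer, but unix_release_sock()
         * has already cleared the dead socket's own peer pointer. */
        close(fds[1]);

        /* unix_dgram_poll() sees unix_peer(other) != sk and issues the
         * second sock_poll_wait() on the dead peer's peer_wait queue. */
        epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

        /* unix_dgram_sendmsg() detects the dead peer, disconnects and
         * drops the last reference, freeing the peer (the "Freed in"
         * trace above) while epoll still links into its peer_wait. */
        write(fds[0], "x", 1);

        /* ep_free() -> ep_remove_wait_queue() now writes to freed
         * memory; in the reproducer this happens at process exit. */
        close(ep);
        return 0;
}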

Michal Kubecek

Oct 12, 2015, 8:04:28 AM
to Dmitry Vyukov, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause, Jason Baron, Rainer Weikusat
On Mon, Oct 12, 2015 at 01:07:55PM +0200, Dmitry Vyukov wrote:
> Hello,
>
> The following program causes a use-after-free in the kernel:
>
...
...
Probably the issue discussed in

http://thread.gmane.org/gmane.linux.kernel/2057497/

and previous related threads.

Michal Kubecek

Eric Dumazet

Oct 12, 2015, 8:14:33 AM
to Michal Kubecek, Dmitry Vyukov, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, syzk...@googlegroups.com, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause, Jason Baron, Rainer Weikusat
On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:

> Probably the issue discussed in
>
> http://thread.gmane.org/gmane.linux.kernel/2057497/
>
> and previous related threads.
>

Same issue, but Dmitry apparently did not trust me.


Dmitry Vyukov

Oct 12, 2015, 8:17:57 AM
to syzk...@googlegroups.com, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause, Jason Baron, Rainer Weikusat
Just wanted to make sure. Let's consider this as fixed.

Dmitry Vyukov

Nov 6, 2015, 8:06:40 AM
to syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause, Jason Baron, Rainer Weikusat
Is it meant to be fixed now?

I am still seeing use-after-free reports in ep_unregister_pollwait on
d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).

Jason Baron

Nov 6, 2015, 9:58:11 AM
to Dmitry Vyukov, syzkaller, Rainer Weikusat, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hi,

Indeed - No formal patch has been posted for this yet - we've agreed
that Rainer is going to post the patch for this issue.

Thanks,

-Jason

Rainer Weikusat

Nov 6, 2015, 10:16:01 AM
to Jason Baron, Dmitry Vyukov, syzkaller, Rainer Weikusat, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Jason Baron <jba...@akamai.com> writes:
> On 11/06/2015 08:06 AM, Dmitry Vyukov wrote:
>> On Mon, Oct 12, 2015 at 2:17 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>>> On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet <eric.d...@gmail.com> wrote:
>>>> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>>>>
>>>>> Probably the issue discussed in
>>>>>
>>>>> http://thread.gmane.org/gmane.linux.kernel/2057497/
>>>>>
>>>>> and previous related threads.
>>>>>
>>>>
>>>> Same issue, but Dmitry apparently did not trust me.
>>>
>>> Just wanted to make sure. Let's consider this as fixed.
>>
>>
>> Is it meant to be fixed now?
>>
>> I am still seeing use-after-free reports in ep_unregister_pollwait on
>> d1e41ff11941784f469f17795a4d9425c2eb4b7a (Nov 5).
>>
>
> Indeed - No formal patch has been posted for this yet - we've agreed
> that Rainer is going to post the patch for this issue.

I should be able to put some more work into this on Sunday or possibly
(but unlikely) later today (supposed to mean that I'll also actually do
that).

Rainer Weikusat

Nov 9, 2015, 9:41:10 AM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep even though none of the messages presently enqueued on the
server receive queue were sent by them. In order to ensure that these
will be woken up once space becomes available again, the present
unix_dgram_poll routine does a second sock_poll_wait call with the
peer_wait wait queue of the server socket as queue argument
(unix_dgram_recvmsg does a wake up on this queue after a datagram was
received). This is inherently problematic because the server socket is
only guaranteed to remain alive for as long as the client still holds a
reference to it. In case the connection is dissolved via connect or by
the dead peer detection logic in unix_dgram_sendmsg, the server socket
may be freed even though "the polling mechanism" (in particular, epoll)
still holds a pointer to the corresponding peer_wait queue. There's no
way to forcibly deregister a wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat <rwei...@mobileactivedefense.com>

---
"Why do things always end up messy and complicated"? Patch is against
4.3.0.


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};

static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4f263e3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,122 @@ found:
return s;
}

+/* Support code for asymmetrically connected dgram sockets
+ *
+ * If a datagram socket is connected to a socket not itself connected
+ * to the first socket (eg, /dev/log), clients may only enqueue more
+ * messages if the present receive queue of the server socket is not
+ * "too large". This means there's a second writeability condition
+ * poll and sendmsg need to test. The dgram recv code will do a wake
+ * up on the peer_wait wait queue of a socket upon reception of a
+ * datagram which needs to be propagated to sleeping would-be writers
+ * since these might not have sent anything so far. This can't be
+ * accomplished via poll_wait because the lifetime of the server
+ * socket might be less than that of its clients if these break their
+ * association with it or if the server socket is closed while clients
+ * are still connected to it and there's no way to inform "a polling
+ * implementation" that it should let go of a certain wait queue
+ *
+ * In order to propagate a wake up, a wait_queue_t of the client
+ * socket is enqueued on the peer_wait queue of the server socket
+ * whose wake function does a wake_up on the ordinary client socket
+ * wait queue. This connection is established whenever a write (or
+ * poll for write) hits the flow control condition and broken when the
+ * association to the server socket is dissolved or after a wake up
+ * was relayed.
+ */
+
+static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+ void *key)
+{
+ struct unix_sock *u;
+ wait_queue_head_t *u_sleep;
+
+ u = container_of(q, struct unix_sock, peer_wake);
+
+ __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
+ &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ /* relaying can only happen while the wq still exists */
+ u_sleep = sk_sleep(&u->sk);
+ if (u_sleep)
+ wake_up_interruptible_poll(u_sleep, key);
+
+ return 0;
+}
+
+static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (!u->peer_wake.private) {
+ u->peer_wake.private = other;
+ __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+static int unix_dgram_peer_wake_disconnect(struct sock *sk, struct sock *other)
+{
+ struct unix_sock *u, *u_other;
+ int rc;
+
+ u = unix_sk(sk);
+ u_other = unix_sk(other);
+ rc = 0;
+
+ spin_lock(&u_other->peer_wait.lock);
+
+ if (u->peer_wake.private == other) {
+ __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+ u->peer_wake.private = NULL;
+
+ rc = 1;
+ }
+
+ spin_unlock(&u_other->peer_wait.lock);
+ return rc;
+}
+
+/* Needs sk unix state lock. Otherwise lockless check if data can (likely)
+ * be sent.
+ */
+static inline int unix_dgram_peer_recv_ready(struct sock *sk,
+ struct sock *other)
+{
+ return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+/* Needs sk unix state lock. After recv_ready indicated not ready,
+ * establish peer_wait connection if still needed.
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +546,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +782,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1150,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);

if (other != old_peer)
@@ -1565,6 +1691,13 @@ restart:
unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);

unix_dgram_disconnected(sk, other);
@@ -1590,10 +1723,14 @@ restart:
goto out_unlock;
}

- if (unix_peer(other) != sk && unix_recvq_full(other)) {
+ if (!unix_dgram_peer_recv_ready(sk, other)) {
if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
+
+ goto restart;
}

timeo = unix_wait_for_peer(other, timeo);
@@ -2453,14 +2590,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;

writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other &&
+ !unix_dgram_peer_recv_ready(sk, other) &&
+ unix_dgram_peer_wake_me(sk, other))
+ writable = 0;
+
+ unix_state_unlock(sk);
}

if (writable)

David Miller

Nov 9, 2015, 1:25:30 PM
to rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
From: Rainer Weikusat <rwei...@mobileactivedefense.com>
Date: Mon, 09 Nov 2015 14:40:48 +0000

> + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
> + &u->peer_wake);

This is more simply:

__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

> +static inline int unix_dgram_peer_recv_ready(struct sock *sk,
> + struct sock *other)

Please do not use the inline keyword in foo.c files, let the compiler
decide.

Thanks.

Jason Baron

Nov 9, 2015, 5:44:34 PM
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
[...]

> @@ -1590,10 +1723,14 @@ restart:
> goto out_unlock;
> }
>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> + if (!unix_dgram_peer_recv_ready(sk, other)) {
> if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> + if (unix_dgram_peer_wake_me(sk, other)) {
> + err = -EAGAIN;
> + goto out_unlock;
> + }
> +
> + goto restart;
> }


So this will cause 'unix_state_lock(other)' to be called twice in a
row if we 'goto restart' (and hence will softlock the box). It just
needs a 'unix_state_unlock(other);' before the 'goto restart'.

I also tested this patch with a single unix server and 200 client
threads doing roughly epoll() followed by write() until -EAGAIN in a
loop. The throughput for the test was roughly the same as current
upstream, but the cpu usage was a lot higher. I think it's because this
patch takes the server wait queue lock in the _poll() routine. This
causes a lot of contention. The previous patch you posted for this,
where you did not clear the wait queue on every wakeup and thus didn't
need the queue lock in poll() (unless we were adding to it), performed
much better.

However, the previous patch which tested better didn't add to the remote
queue when it was full on sendmsg() - so it wouldn't be correct for
epoll ET. Adding to the remote queue for every sendmsg() that fails does
seem undesirable, if we aren't even doing poll(). So I'm not sure if
just going back to the previous patch is a great option either... I'm
also not sure how realistic the test case I have is. It would be great
if we had some other workloads to test against.

Thanks,

-Jason
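
The workload described above can be approximated with something like the
following sketch (a hypothetical reconstruction, not Jason's actual test
program; the socket path, message size and counts are made up):

/* One AF_UNIX dgram server, NTHREADS clients doing epoll() + write()
 * until -EAGAIN. Build with: cc -pthread bench.c
 */
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/epoll.h>
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define NTHREADS 200
#define NMSGS 10000

static struct sockaddr_un srv_addr;

static void *client(void *arg)
{
        int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
        int ep = epoll_create(1);
        struct epoll_event ev = { .events = EPOLLOUT };
        char buf[64] = "x";
        int sent = 0;

        connect(fd, (struct sockaddr *)&srv_addr, sizeof(srv_addr));
        fcntl(fd, F_SETFL, O_NONBLOCK);
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

        while (sent < NMSGS) {
                epoll_wait(ep, &ev, 1, -1);     /* wait for writability */
                while (sent < NMSGS && write(fd, buf, sizeof(buf)) > 0)
                        sent++;                 /* write until -EAGAIN */
        }
        close(ep);
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t th[NTHREADS];
        char buf[64];
        long i;
        int srv = socket(AF_UNIX, SOCK_DGRAM, 0);

        srv_addr.sun_family = AF_UNIX;
        strcpy(srv_addr.sun_path, "/tmp/dgram-bench");
        unlink(srv_addr.sun_path);
        bind(srv, (struct sockaddr *)&srv_addr, sizeof(srv_addr));

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&th[i], NULL, client, NULL);

        /* server side: just drain datagrams as fast as possible */
        for (i = 0; i < (long)NTHREADS * NMSGS; i++)
                read(srv, buf, sizeof(buf));

        for (i = 0; i < NTHREADS; i++)
                pthread_join(th[i], NULL);
        close(srv);
        return 0;
}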

Rainer Weikusat

Nov 10, 2015, 12:16:55 PM
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
David Miller <da...@davemloft.net> writes:
> From: Rainer Weikusat <rwei...@mobileactivedefense.com>
> Date: Mon, 09 Nov 2015 14:40:48 +0000
>
>> + __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
>> + &u->peer_wake);
>
> This is more simply:
>
> __remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait, q);

Thank you for pointing this out.

Rainer Weikusat

Nov 10, 2015, 12:39:03 PM
to Jason Baron, Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Jason Baron <jba...@akamai.com> writes:
> On 11/09/2015 09:40 AM, Rainer Weikusat wrote:

[...]

>> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> + if (!unix_dgram_peer_recv_ready(sk, other)) {
>> if (!timeo) {
>> - err = -EAGAIN;
>> - goto out_unlock;
>> + if (unix_dgram_peer_wake_me(sk, other)) {
>> + err = -EAGAIN;
>> + goto out_unlock;
>> + }
>> +
>> + goto restart;
>> }
>
>
> So this will cause 'unix_state_lock(other) to be called twice in a
> row if we 'goto restart' (and hence will softlock the box). It just
> needs a 'unix_state_unlock(other);' before the 'goto restart'.

The goto restart was nonsense to begin with in this code path:
Restarting something is necessary after sleeping for some time but for
the case above, execution just continues. I've changed that (updated
patch should follow 'soon') to

if (!unix_dgram_peer_recv_ready(sk, other)) {
        if (timeo) {
                timeo = unix_wait_for_peer(other, timeo);

                err = sock_intr_errno(timeo);
                if (signal_pending(current))
                        goto out_free;

                goto restart;
        }

        if (unix_dgram_peer_wake_me(sk, other)) {
                err = -EAGAIN;
                goto out_unlock;
        }
}

> I also tested this patch with a single unix server and 200 client
> threads doing roughly epoll() followed by write() until -EAGAIN in a
> loop. The throughput for the test was roughly the same as current
> upstream, but the cpu usage was a lot higher. I think its b/c this patch
> takes the server wait queue lock in the _poll() routine. This causes a
> lot of contention. The previous patch you posted for this where you did
> not clear the wait queue on every wakeup and thus didn't need the queue
> lock in poll() (unless we were adding to it), performed much better.

I'm somewhat unsure what to make of that: The previous patch would also
take the wait queue lock whenever poll was about to return 'not
writable' because of the length of the server receive queue unless
another thread using the same client socket also noticed this and
enqueued this same socket already. And "hundreds of clients using a
single client socket in order to send data to a single server socket"
doesn't seem very realistic to me.

Also, this code shouldn't usually be executed as the server should
usually be capable of keeping up with the data sent by clients. If it's
permanently incapable of that, you're effectively performing a
(successful) DDOS against it. Which should result in "high CPU
utilization" in either case. It may be possible to improve this by
tuning/ changing the flow control mechanism. Out of my head, I'd suggest
making the queue longer (the default value is 10) and delaying wake ups
until the server actually did catch up, IOW, the receive queue is empty
or almost empty. But this ought to be done with a different patch.

Rainer Weikusat

Nov 10, 2015, 4:55:47 PM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Signed-off-by: Rainer Weikusat <rwei...@mobileactivedefense.com>
---

- use wait_queue_t passed as argument to _relay

- fix possible deadlock and logic error in _dgram_sendmsg by
straightening the control flow ("spaghetti code considered
confusing")

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};

static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..4297d8e 100644
+ q);
+static int unix_dgram_peer_recv_ready(struct sock *sk,
@@ -1590,19 +1723,21 @@ restart:
goto out_unlock;
}

- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (!unix_dgram_peer_recv_ready(sk, other)) {
+ if (timeo) {
+ timeo = unix_wait_for_peer(other, timeo);

- timeo = unix_wait_for_peer(other, timeo);
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;

- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ goto restart;
+ }

- goto restart;
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}

if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -2453,14 +2588,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,

Hannes Frederic Sowa

Nov 11, 2015, 7:28:21 AM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hello,

This whole patch seems pretty complicated to me.

Can't we just remove the unix_recvq_full checks altogether and unify
unix_dgram_poll with unix_poll?

If we want to be cautious we could simply make unix_max_dgram_qlen limit
the number of skbs which are in flight from a sending socket. The skb
destructor can then decrement this. This seems much simpler.

Would this work?

Thanks,
Hannes
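
For concreteness, the suggested accounting might look roughly like the
sketch below. Everything here is hypothetical: the dgrams_in_flight
field, the destructor, and the hook points are invented, and
sysctl_max_dgram_qlen stands in for the per-namespace limit.

/* Hypothetical sketch of per-sender in-flight accounting; not a
 * posted patch. 'dgrams_in_flight' is an invented atomic_t assumed
 * to be added to struct unix_sock.
 */
static void unix_dgram_skb_destructor(struct sk_buff *skb)
{
        struct unix_sock *u = unix_sk(skb->sk); /* skb->sk == sender */

        /* datagram consumed; if this just dropped us below the limit,
         * the sender became writable again */
        if (atomic_dec_return(&u->dgrams_in_flight) ==
            sysctl_max_dgram_qlen - 1)
                wake_up_interruptible_poll(sk_sleep(&u->sk),
                                           POLLOUT | POLLWRNORM | POLLWRBAND);
        sock_wfree(skb);        /* keep the normal write-space accounting */
}

/* sendmsg side: the in-flight count would replace the
 * unix_recvq_full(other) test */
if (atomic_inc_return(&unix_sk(sk)->dgrams_in_flight) >
    sysctl_max_dgram_qlen) {
        atomic_dec(&unix_sk(sk)->dgrams_in_flight);
        err = -EAGAIN;
        goto out_unlock;
}
skb->destructor = unix_dgram_skb_destructor;

As Rainer points out in the reply below, this limits per-sender traffic
only and thus doesn't propagate writability to sockets which haven't
sent anything yet.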

Rainer Weikusat

Nov 11, 2015, 11:12:47 AM
to Hannes Frederic Sowa, Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hannes Frederic Sowa <han...@stressinduktion.org> writes:
> On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

> This whole patch seems pretty complicated to me.
>
> > Can't we just remove the unix_recvq_full checks altogether and unify
> unix_dgram_poll with unix_poll?
>
> If we want to be cautious we could simply make unix_max_dgram_qlen limit
> the number of skbs which are in flight from a sending socket. The skb
> destructor can then decrement this. This seems much simpler.
>
> Would this work?

In the way this is intended to work, cf

http://marc.info/?t=115627606000002&r=1&w=2

only if the limit would also apply to sockets which didn't send anything
so far. Which means it'll end up in the exact same situation as before:
Sending something using a certain socket may not be possible because of
data sent by other sockets, so either code trying to send using this
socket ends up busy-waiting for "space available again" even though it's
trying to use select/ poll/ epoll/ $whatnot to get notified of this
condition and sleep until then, or this notification needs to be
propagated to sleeping threads which didn't get to send anything yet.

Jason Baron

Nov 11, 2015, 12:35:40 PM
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hi Rainer,

> +
> +/* Needs sk unix state lock. After recv_ready indicated not ready,
> + * establish peer_wait connection if still needed.
> + */
> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
> +{
> + int connected;
> +
> + connected = unix_dgram_peer_wake_connect(sk, other);
> +
> + if (unix_recvq_full(other))
> + return 1;
> +
> + if (connected)
> + unix_dgram_peer_wake_disconnect(sk, other);
> +
> + return 0;
> +}
> +

So the comment above this function says 'needs unix state lock', however
the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
in unix_dgram_poll() has the 'sk' lock. So this looks racy.

Also, another tweak on this scheme: instead of calling
'__remove_wait_queue()' in unix_dgram_peer_wake_relay(), we could
instead simply mark each item in the queue as 'WQ_FLAG_EXCLUSIVE'. Then,
since 'unix_dgram_recvmsg()' does an exclusive wakeup, the queue has
effectively been disabled (minus the first exclusive item in the list,
which can just return if it's marked exclusive). This means that in
dgram_poll(), we add to the list if we have not yet been added, and if
we are on the list, we do a remove and then an add, removing the
exclusive flag. Thus, all the waiters that need a wakeup are at the
beginning of the queue, and the disabled ones are at the end with the
'WQ_FLAG_EXCLUSIVE' flag set.

This does make the list potentially long, but if we only walk it to the
point where we are doing the wakeup, it has no impact. I like the fact
that in this scheme the wakeup doesn't have to call remove against a
long list of waiters - it's just setting the exclusive flag.

Thanks,

-Jason
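
A rough sketch of this parking idea, under the 4.3-era wait queue API
(the relay function is Rainer's; the flag manipulation and re-arming
logic are hypothetical and unreviewed):

/* Hypothetical variant of unix_dgram_peer_wake_relay(): park the
 * entry by setting WQ_FLAG_EXCLUSIVE instead of unlinking it. An
 * exclusive wakeup (nr_exclusive == 1) in __wake_up_common() stops
 * at the first entry that carries WQ_FLAG_EXCLUSIVE and whose
 * callback returns nonzero, so parked entries shield the tail of
 * the list from further wakeups.
 */
static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode,
                                      int flags, void *key)
{
        struct unix_sock *u = container_of(q, struct unix_sock, peer_wake);
        wait_queue_head_t *u_sleep;

        if (q->flags & WQ_FLAG_EXCLUSIVE)
                return 1;       /* already parked: stop the walk here */

        q->flags |= WQ_FLAG_EXCLUSIVE;  /* park instead of unlinking */

        u_sleep = sk_sleep(&u->sk);
        if (u_sleep)
                wake_up_interruptible_poll(u_sleep, key);

        return 0;
}

/* Re-arming in unix_dgram_peer_wake_connect(), with
 * u_other->peer_wait.lock held: move a parked entry back to the
 * head of the list and clear the flag.
 */
if (u->peer_wake.flags & WQ_FLAG_EXCLUSIVE) {
        __remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
        u->peer_wake.flags &= ~WQ_FLAG_EXCLUSIVE;
        __add_wait_queue(&u_other->peer_wait, &u->peer_wake);
}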

Hannes Frederic Sowa

Nov 11, 2015, 1:52:12 PM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hi,

On Wed, Nov 11, 2015, at 17:12, Rainer Weikusat wrote:
> Hannes Frederic Sowa <han...@stressinduktion.org> writes:
> > On Tue, Nov 10, 2015, at 22:55, Rainer Weikusat wrote:
> >> An AF_UNIX datagram socket being the client in an n:1 association with
> >> some server socket is only allowed to send messages to the server if the
> >> receive queue of this socket contains at most sk_max_ack_backlog
> >> datagrams.
>
> [...]
>
> > This whole patch seems pretty complicated to me.
> >
> > > Can't we just remove the unix_recvq_full checks altogether and unify
> > unix_dgram_poll with unix_poll?
> >
> > If we want to be cautious we could simply make unix_max_dgram_qlen limit
> > the number of skbs which are in flight from a sending socket. The skb
> > destructor can then decrement this. This seems much simpler.
> >
> > Would this work?
>
> In the way this is intended to work, cf
>
> http://marc.info/?t=115627606000002&r=1&w=2

Oh, I see, we don't limit closed but still referenced sockets. This
actually makes sense given how fd handling is implemented, just as a
range check.

Have you checked if we can somehow deregister the socket in the poll
event framework? You wrote that it does not provide such a function but
maybe it would be easy to add?

Thanks,
Hannes

Rainer Weikusat

Nov 12, 2015, 2:12:02 PM
to Jason Baron, Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Jason Baron <jba...@akamai.com> writes:
>> +
>> +/* Needs sk unix state lock. After recv_ready indicated not ready,
>> + * establish peer_wait connection if still needed.
>> + */
>> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
>> +{
>> + int connected;
>> +
>> + connected = unix_dgram_peer_wake_connect(sk, other);
>> +
>> + if (unix_recvq_full(other))
>> + return 1;
>> +
>> + if (connected)
>> + unix_dgram_peer_wake_disconnect(sk, other);
>> +
>> + return 0;
>> +}
>> +
>
> So the comment above this function says 'needs unix state lock', however
> the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
> in unix_dgram_poll() has the 'sk' lock. So this looks racy.

That's one thing which is broken with this patch. Judging from a 'quick'
look at the _dgram_sendmsg code, the unix_state_lock(other) will need to
be turned into a unix_state_double_lock(sk, other) and the remaining
code changed accordingly (since all of the checks must be done without
unlocking other).

There's also something else seriously wrong with the present patch: Some
code in unix_dgram_connect presently (with this change) looks like this:

/*
 * If it was connected, reconnect.
 */
if (unix_peer(sk)) {
        struct sock *old_peer = unix_peer(sk);
        unix_peer(sk) = other;

        if (unix_dgram_peer_wake_disconnect(sk, other))
                wake_up_interruptible_poll(sk_sleep(sk),
                                           POLLOUT |
                                           POLLWRNORM |
                                           POLLWRBAND);

        unix_state_double_unlock(sk, other);

        if (other != old_peer)
                unix_dgram_disconnected(sk, old_peer);
        sock_put(old_peer);

and trying to disconnect from a peer the socket is just being
connected to is - of course - "flowering tomfoolery" (literal
translation of the German "bluehender Bloedsinn") --- it needs to
disconnect from old_peer instead.

I'll address the suggestion and send an updated patch "later today" (may
become "early tomorrow"). I have some code addressing both issues, but
that's part of a release of 'our' kernel fork (ie, 3.2.54-based) I'll
need to do 'soon'.

Rainer Weikusat

Nov 13, 2015, 1:52:04 PM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
"Believed to be least buggy version"

- disconnect from former peer in _dgram_connect

- use unix_state_double_lock in _dgram_sendmsg to ensure
recv_ready/ wake_me preconditions are met (noted by Jason
Baron)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};

static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..30e7c56 100644
+/* Preconditions for the following two functions:
+ *
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+
+static int unix_dgram_peer_recv_ready(struct sock *sk,
+ struct sock *other)
+{
+ return unix_peer(other) == sk || !unix_recvq_full(other);
+}
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);

if (other != old_peer)
@@ -1548,7 +1674,7 @@ restart:
goto out_free;
}

- unix_state_lock(other);
+ unix_state_double_lock(sk, other);
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
@@ -1562,9 +1688,15 @@ restart:
sock_put(other);

err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);

unix_dgram_disconnected(sk, other);
@@ -1590,21 +1722,27 @@ restart:
goto out_unlock;
}

- if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
+ if (timeo) {
+ unix_state_unlock(sk);

- timeo = unix_wait_for_peer(other, timeo);
+ timeo = unix_wait_for_peer(other, timeo);

- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;

- goto restart;
+ goto restart;
+ }
+
+ if (unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}

+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,7 +1756,7 @@ restart:
return len;

out_unlock:
- unix_state_unlock(other);
+ unix_state_double_unlock(sk, other);
out_free:
kfree_skb(skb);
out:
@@ -2453,14 +2591,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,

Rainer Weikusat

Nov 13, 2015, 2:06:28 PM
to Hannes Frederic Sowa, Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
I thought about this but this would amount to adding a general interface
for the sole purpose of enabling the af_unix code to talk to the
eventpoll code and I don't really like this idea: IMHO, there should be
at least two users (preferably three) before creating any kind of
'abstract interface'. An even more ideal "castle in the air"
(hypothetical) solution would be "change the eventpoll.c code such that
it won't be affected if a wait queue just goes away". That's at least
theoretically possible (although it might not be in practice).

I wouldn't mind doing that (assuming it was possible) if it was just
for the kernels my employer uses because I'm aware of the uses these
will be put to and in control of the corresponding userland code. But
for "general Linux code", changing epoll in order to help the af_unix
code is more potential trouble than it's worth: Exchanging a relatively
unimportant bug in some module for a much more visibly damaging bug in a
central facility would be a bad tradeoff.

Jason Baron

Nov 13, 2015, 5:17:33 PM
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause


On 11/13/2015 01:51 PM, Rainer Weikusat wrote:

[...]

>
> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
> - if (!timeo) {
> - err = -EAGAIN;
> - goto out_unlock;
> - }
> + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {

Remind me why the 'unix_peer(sk) == other' is added here? If the remote
is not connected we still want to make sure that we don't overflow the
remote rcv queue, right?

In terms of this added 'double' lock for both sk and other, where
previously we just held the 'other' lock: I think we could continue to
just hold the 'other' lock unless the remote queue is full, so something
like:

if (unix_peer(other) != sk && unix_recvq_full(other)) {
        bool need_wakeup = false;

        ....skipping the blocking case...

        err = -EAGAIN;
        if (!other_connected)
                goto out_unlock;
        unix_state_unlock(other);
        unix_state_lock(sk);

        /* if remote peer has changed under us, the connect()
         * will wake up any pending waiter, just return -EAGAIN
         */
        if (unix_peer(sk) == other) {
                /* In case we see there is space available,
                 * queue the wakeup and we will try again. This
                 * should be an unlikely condition */
                if (!unix_dgram_peer_wake_me(sk, other))
                        need_wakeup = true;
        }
        unix_state_unlock(sk);
        if (need_wakeup)
                wake_up_interruptible_poll(sk_sleep(sk), POLLOUT
                                           | POLLWRNORM | POLLWRBAND);
        goto out_free;
}

So I'm not sure if the 'double' lock really affects any workload, but
the above might be a way to avoid it.

Also - it might be helpful to add a 'Fixes:' tag referencing where this
issue started, in the changelog.

Worth mentioning too is that this patch should improve the polling case
here dramatically, as we currently wake the entire queue on every remote
read even when we have room in the rcv buffer. So this patch will cut
down on the context-switching rate dramatically from what we currently
have.

Thanks,

-Jason

Rainer Weikusat

Nov 15, 2015, 1:33:15 PM
to Jason Baron, Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Jason Baron <jba...@akamai.com> writes:
> On 11/13/2015 01:51 PM, Rainer Weikusat wrote:
>
> [...]
>
>>
>> - if (unix_peer(other) != sk && unix_recvq_full(other)) {
>> - if (!timeo) {
>> - err = -EAGAIN;
>> - goto out_unlock;
>> - }
>> + if (unix_peer(sk) == other && !unix_dgram_peer_recv_ready(sk, other)) {
>
> Remind me why the 'unix_peer(sk) == other' is added here? If the remote
> is not connected we still want to make sure that we don't overflow the
> the remote rcv queue, right?

Good point. The check is actually wrong there as the original code would
also check the limit in case of an unconnected send to a socket found
via address lookup. It belongs into the 2nd if (where I originally put
it).

>
> In terms of this added 'double' lock for both sk and other, where
> previously we just held the 'other' lock. I think we could continue to
> just hold the 'other' lock unless the remote queue is full, so something
> like:
>
> if (unix_peer(other) != sk && unix_recvq_full(other)) {
> bool need_wakeup = false;
>
> ....skipping the blocking case...
>
> err = -EAGAIN;
> if (!other_connected)
> goto out_unlock;
> unix_state_unlock(other);
> unix_state_lock(sk);

That was my original idea. The problem with this is that the code
starting after the _lock and running until the main code path unlock has
to be executed in one go with the other lock held as the results of the
tests above this one may become invalid as soon as the other lock is
released. This means instead of continuing execution with the send code
proper after the block in case other became receive-ready between the
first and the second test (possible because _dgram_recvmsg does not
take the unix state lock), the whole procedure starting with acquiring
the other lock would need to be restarted. Given sufficiently unfavorable
circumstances, this could even turn into an endless loop which couldn't
be interrupted (unless code for this was added).

[...]

> we currently wake the entire queue on every remote read even when we
> have room in the rcv buffer. So this patch will cut down on ctxt
> switching rate dramatically from what we currently have.

In my opinion, this could be improved by making the throttling mechanism
work like a flip flop: If the queue length hits the limit after a
_sendmsg, set a "no more applicants" flag blocking future sends until
cleared (checking the flag would replace the present
check). After the receive queue ran empty (or almost empty),
_dgram_recvmsg would clear the flag and do a wakeup. But this should be
an independent patch (if implemented).
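
Sketched out, with the caveat that 'recv_backed_up' is an invented
field assumed added to struct unix_sock and all locking is elided, the
flip flop could look like:

/* Hypothetical flip-flop throttle, not a posted patch. */

/* _dgram_sendmsg, after enqueueing: trip the flag at the limit */
if (unix_recvq_full(other))
        unix_sk(other)->recv_backed_up = 1;

/* the writability test becomes a flag check instead of a
 * queue length comparison */
if (unix_sk(other)->recv_backed_up) {
        err = -EAGAIN;
        goto out_unlock;
}

/* _dgram_recvmsg: clear the flag and wake writers only once the
 * queue has (almost) drained, not on every received datagram */
if (u->recv_backed_up && skb_queue_len(&sk->sk_receive_queue) <= 1) {
        u->recv_backed_up = 0;
        wake_up_interruptible_sync_poll(&u->peer_wait,
                                        POLLOUT | POLLWRNORM | POLLWRBAND);
}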

Rainer Weikusat

Nov 16, 2015, 5:15:57 PM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
- fix logic in _dgram_sendmsg: queue limit also needs to be
enforced for unconnected sends

- drop _recv_ready helper function: I'm usually a big fan of
functional decomposition but in this case, the abstraction
seemed to obscure things rather than making them easier to
understand

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
#define UNIX_GC_CANDIDATE 0
#define UNIX_GC_MAYBE_CYCLE 1
struct socket_wq peer_wq;
+ wait_queue_t peer_wake;
};

static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3f4974d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,112 @@ found:
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+ int connected;
+
+ connected = unix_dgram_peer_wake_connect(sk, other);
+
+ if (unix_recvq_full(other))
+ return 1;
+
+ if (connected)
+ unix_dgram_peer_wake_disconnect(sk, other);
+
+ return 0;
+}
+
static inline int unix_writable(struct sock *sk)
{
return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +536,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
skpair->sk_state_change(skpair);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
}
+
+ unix_dgram_peer_wake_disconnect(sk, skpair);
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
}
@@ -664,6 +772,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
INIT_LIST_HEAD(&u->link);
mutex_init(&u->readlock); /* single task reading lock */
init_waitqueue_head(&u->peer_wait);
+ init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
unix_insert_socket(unix_sockets_unbound(sk), sk);
out:
if (sk == NULL)
@@ -1031,6 +1140,13 @@ restart:
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;
+
+ if (unix_dgram_peer_wake_disconnect(sk, old_peer))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_double_unlock(sk, other);

if (other != old_peer)
@@ -1548,7 +1664,7 @@ restart:
goto out_free;
}

- unix_state_lock(other);
+ unix_state_double_lock(sk, other);
err = -EPERM;
if (!unix_may_send(sk, other))
goto out_unlock;
@@ -1562,9 +1678,15 @@ restart:
sock_put(other);

err = 0;
- unix_state_lock(sk);
if (unix_peer(sk) == other) {
unix_peer(sk) = NULL;
+
+ if (unix_dgram_peer_wake_disconnect(sk, other))
+ wake_up_interruptible_poll(sk_sleep(sk),
+ POLLOUT |
+ POLLWRNORM |
+ POLLWRBAND);
+
unix_state_unlock(sk);

unix_dgram_disconnected(sk, other);
@@ -1591,20 +1713,27 @@ restart:
}

if (unix_peer(other) != sk && unix_recvq_full(other)) {
- if (!timeo) {
- err = -EAGAIN;
- goto out_unlock;
- }
+ if (timeo) {
+ unix_state_unlock(sk);

- timeo = unix_wait_for_peer(other, timeo);
+ timeo = unix_wait_for_peer(other, timeo);

- err = sock_intr_errno(timeo);
- if (signal_pending(current))
- goto out_free;
+ err = sock_intr_errno(timeo);
+ if (signal_pending(current))
+ goto out_free;

- goto restart;
+ goto restart;
+ }
+
+ if (unix_peer(sk) != other ||
+ unix_dgram_peer_wake_me(sk, other)) {
+ err = -EAGAIN;
+ goto out_unlock;
+ }
}

+ unix_state_unlock(sk);
+
if (sock_flag(other, SOCK_RCVTSTAMP))
__net_timestamp(skb);
maybe_add_creds(skb, sock, other);
@@ -1618,7 +1747,7 @@ restart:
return len;

out_unlock:
- unix_state_unlock(other);
+ unix_state_double_unlock(sk, other);
out_free:
kfree_skb(skb);
out:
@@ -2453,14 +2582,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
return mask;

writable = unix_writable(sk);
- other = unix_peer_get(sk);
- if (other) {
- if (unix_peer(other) != sk) {
- sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
- if (unix_recvq_full(other))
- writable = 0;
- }
- sock_put(other);
+ if (writable) {
+ unix_state_lock(sk);
+
+ other = unix_peer(sk);
+ if (other && unix_peer(other) != sk &&
+ unix_recvq_full(other) &&

Rainer Weikusat

Nov 16, 2015, 5:28:58 PM
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
---

Additional remark about "5456f09aaf88/ af_unix: fix unix_dgram_poll()
behavior for EPOLLOUT event": This shouldn't be an issue anymore with
this change even though it restores the "only when writable" behaviour,
as the wake up relay will also be set up once _dgram_sendmsg returned
EAGAIN for a send attempt on an n:1 connected socket.

Jason Baron

Nov 17, 2015, 11:08:23 AM
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause


On 11/15/2015 01:32 PM, Rainer Weikusat wrote:

>
> That was my original idea. The problem with this is that the code
> starting after the _lock and running until the main code path unlock has
> to be executed in one go with the other lock held as the results of the
> tests above this one may become invalid as soon as the other lock is
> released. This means instead of continuing execution with the send code
> proper after the block in case other became receive-ready between the
> first and the second test (possible because _dgram_recvmsg does not
> take the unix state lock), the whole procedure starting with acquiring
> the other lock would need to be restarted. Given sufficiently unfavorable
> circumstances, this could even turn into an endless loop which couldn't
> be interrupted. (unless code for this was added).
>

hmmm - I think we can avoid it by doing the wakeup from the write path
in the rare case that the queue has emptied - and avoid the double lock. IE:

unix_state_unlock(other);
unix_state_lock(sk);
err = -EAGAIN;
if (unix_peer(sk) == other) {
        unix_dgram_peer_wake_connect(sk, other);
        if (skb_queue_len(&other->sk_receive_queue) == 0)
                need_wakeup = true;
}
unix_state_unlock(sk);
if (need_wakeup)
        wake_up_interruptible_poll(sk_sleep(sk), POLLOUT
                                   | POLLWRNORM | POLLWRBAND);
goto out_free;


Thanks,

-Jason

Jason Baron

Nov 17, 2015, 11:13:46 AM
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Hi,

My only comment was about potentially avoiding the double lock in the
write path, otherwise this looks ok to me.

Thanks,

-Jason

Rainer Weikusat

Nov 17, 2015, 1:39:07 PM
to Jason Baron, Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
This should probably rather be

        if (unix_dgram_peer_wake_connect(sk, other) &&
            skb_queue_len(&other->sk_receive_queue) == 0)
                need_wakeup = 1;

as there's no need to do the wake up if someone else already connected.
And then, the double lock could be avoided at the expense of returning a
gratuitous EAGAIN to the caller and throwing away all of the work
_dgram_sendmsg did so far (eg, allocate a skb, copy the data into the
kernel, do all the other checks).

This would enable another thread to do one of the following things in
parallel with the 'locked' part of _dgram_sendmsg

1) connect sk to a socket != other
2) use sk to send to a socket != other
3) do a shutdown on sk
4) determine write-readiness of sk via poll callback

IMHO, the only thing which could possibly matter is 2) and my suggestion
for this would rather be "use a send socket per sending thread if this
matters to you" than "cause something to fail which could as well
have succeeded".

David Miller

Nov 17, 2015, 3:14:28 PM
to rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
From: Rainer Weikusat <rwei...@mobileactivedefense.com>
Date: Mon, 16 Nov 2015 22:28:40 +0000
So because of a corner case of epoll handling and sender socket release,
every single datagram sendmsg has to do a double lock now?

I do not dispute the correctness of your fix at this point, but that
added cost in the fast path is really too high.

Rainer Weikusat

unread,
Nov 17, 2015, 4:38:05 PM11/17/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
David Miller <da...@davemloft.net> writes:
> From: Rainer Weikusat <rwei...@mobileactivedefense.com>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1 association with
>> some server socket is only allowed to send messages to the server if the
>> receive queue of this socket contains at most sk_max_ack_backlog
>> datagrams.

[...]

>> Signed-off-by: Rainer Weikusat <rwei...@mobileactivedefense.com>
>> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
>
> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

This leaves only the option of a somewhat incorrect solution, and what
is or isn't acceptable in this respect is difficult to decide. The
basic options would be

- return EAGAIN even if sending became possible (Jason's most
recent suggestions)

- retry sending a limited number of times, eg, once, before
returning EAGAIN, on the grounds that this is nicer to the
application and that redoing all the stuff up to the _lock in
dgram_sendmsg can possibly/ likely be avoided

Which one do you prefer?

Rainer Weikusat

unread,
Nov 17, 2015, 5:09:59 PM11/17/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes:

[...]

> The basic options would be
>
> - return EAGAIN even if sending became possible (Jason's most
> recent suggestions)
>
> - retry sending a limited number of times, eg, once, before
> returning EAGAIN, on the grounds that this is nicer to the
> application and that redoing all the stuff up to the _lock in
> dgram_sendmsg can possibly/ likely be avoided

A third option: Use trylock to acquire the sk lock. If this succeeds,
there's no risk of deadlocking anyone even when acquiring the locks in
the wrong order. This could look as follows (NB: I didn't even compile this,
I just wrote the code to get an idea how complicated it would be):

	int need_wakeup;

	[...]

	need_wakeup = 0;
	err = 0;
	if (spin_trylock(&unix_sk(sk)->lock)) {
		if (unix_peer(sk) != other ||
		    unix_dgram_peer_wake_me(sk, other))
			err = -EAGAIN;
	} else {
		err = -EAGAIN;

		unix_state_unlock(other);
		unix_state_lock(sk);

		need_wakeup = unix_peer(sk) != other &&
			      unix_dgram_peer_wake_connect(sk, other) &&
			      skb_queue_len(&other->sk_receive_queue) == 0;
	}

	unix_state_unlock(sk);

	if (err) {

Rainer Weikusat

unread,
Nov 17, 2015, 5:49:09 PM11/17/15
to Rainer Weikusat, David Miller, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
Rainer Weikusat <rwei...@mobileactivedefense.com> writes:

[...]

> This leaves only the option of a somewhat incorrect solution and what is
> or isn't acceptable in this respect is somewhat difficult to decide. The
> basic options would be

[...]
> - retry sending a limited number of times, eg, once, before
> returning EAGAIN, on the grounds that this is nicer to the
> application and that redoing all the stuff up to the _lock in
> dgram_sendmsg can possibly/ likely be avoided

Since it's better to have a specific example of something: Here's
another 'code sketch' of this option (hopefully with fewer errors this
time; there's an int restart = 0 above):

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		int need_wakeup;

		[...]

		need_wakeup = 0;
		err = 0;
		unix_state_unlock(other);
		unix_state_lock(sk);

		if (unix_peer(sk) == other) {
			if (++restart == 2) {
				need_wakeup = unix_dgram_peer_wake_connect(sk, other) &&
					      skb_queue_len(&other->sk_receive_queue) == 0;
				err = -EAGAIN;
			} else if (unix_dgram_peer_wake_me(sk, other))
				err = -EAGAIN;
		} else
			err = -EAGAIN;

		unix_state_unlock(sk);

		if (err || !restart) {
			if (need_wakeup)
				wake_up_interruptible_poll(sk_sleep(sk),
							   POLLOUT |
							   POLLWRNORM |
							   POLLWRBAND);

			goto out_free;
		}

		goto restart;
	}

I don't particularly like that either; if taking both locks
unconditionally is not an option, the best choice seems to me to be
returning the spurious EAGAIN, as that's the simplest one.
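
A spurious EAGAIN is also indistinguishable, to a correctly written
nonblocking sender, from a genuine one: such a sender already has to
wait for POLLOUT and retry. A minimal user-space sketch of that pattern
- my illustration, not code from the thread - assuming fd is a
nonblocking AF_UNIX datagram socket:

	#include <poll.h>
	#include <errno.h>
	#include <sys/socket.h>

	/* send, waiting for POLLOUT and retrying on EAGAIN; a spurious
	 * EAGAIN merely costs one extra loop iteration */
	static ssize_t send_retry(int fd, const void *buf, size_t len)
	{
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n;

		for (;;) {
			n = send(fd, buf, len, 0);
			if (n >= 0 || errno != EAGAIN)
				return n;

			if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
				return -1;
		}
	}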

Rainer Weikusat

unread,
Nov 18, 2015, 1:15:48 PM11/18/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
David Miller <da...@davemloft.net> writes:
> From: Rainer Weikusat <rwei...@mobileactivedefense.com>
> Date: Mon, 16 Nov 2015 22:28:40 +0000
>
>> An AF_UNIX datagram socket being the client in an n:1

[...]

> So because of a corner case of epoll handling and sender socket release,
> every single datagram sendmsg has to do a double lock now?
>
> I do not dispute the correctness of your fix at this point, but that
> added cost in the fast path is really too high.

Some more information on this: Running the test program included below
on my 'work' system (otherwise idle, after logging in via VT with no GUI
running), a quadcore AMD A10-5700 at 3393.984 MHz, 20 times on a patched
4.3 resulted in the following throughput statistics[*]:

avg 13.617 M/s
median 13.393 M/s
max 17.14 M/s
min 13.047 M/s
deviation 0.85

I'll try to post the results for 'unpatched' later as I'm also working
on a couple of other things.

[*] I do not use my fingers for counting, hence, these are binary and
not decimal units.
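
(So, eg, the 13.617 M/s average is 13.617 * 2^20, ie, roughly 14.28
million bytes per second in decimal terms.)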

------------
#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

enum {
	MSG_SZ = 16,
	MSGS = 1000000
};

static char msg[MSG_SZ];

/* convert a struct timeval to microseconds */
static uint64_t tv2u(struct timeval *tv)
{
	uint64_t u;

	u = tv->tv_sec;
	u *= 1000000;
	return u + tv->tv_usec;
}

int main(void)
{
	struct timeval start, stop;
	uint64_t t_diff;
	double rate;
	int sks[2];
	unsigned remain;
	char buf[MSG_SZ];

	socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sks);

	if (fork() == 0) {
		/* child: read until EOF, timing the transfer */
		close(*sks);

		gettimeofday(&start, 0);
		while (read(sks[1], buf, sizeof(buf)) > 0);
		gettimeofday(&stop, 0);

		t_diff = tv2u(&stop);
		t_diff -= tv2u(&start);
		rate = MSG_SZ * MSGS;
		rate /= t_diff;		/* bytes per microsecond */
		rate *= 1000000;	/* bytes per second */
		printf("rate %fM/s\n", rate / (1 << 20));

		fflush(stdout);
		_exit(0);
	}

	/* parent: send MSGS fixed-size messages as fast as possible */
	close(sks[1]);

	remain = MSGS;
	do write(*sks, msg, sizeof(msg)); while (--remain);
	close(*sks);

	wait(NULL);
	return 0;
}
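
(Assuming the usual toolchain, this builds with a plain cc -O2 and needs
no special privileges; the return values of socketpair/fork/write are
left unchecked for brevity.)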

Rainer Weikusat

unread,
Nov 18, 2015, 6:39:53 PM11/18/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes:

[...]

> Some more information on this: Running the test program included below
> on my 'work' system (otherwise idle, after logging in via VT with no GUI
> running), a quadcore AMD A10-5700 at 3393.984 MHz, 20 times on a patched
> 4.3 resulted in the following throughput statistics[*]:

Since the results were too variable with only 20 runs, I've also tested
this with 100 runs for three kernels: stock 4.3, 4.3 plus the published
patch, and 4.3 plus the published patch plus the "just return EAGAIN"
modification. The 1st and the 3rd perform about identically for the
test program I used (slightly modified version included below), the 2nd
is markedly slower. This is most easily visible when grouping the
printed data rates (B/s) 'by millions':

stock 4.3
---------
13000000.000-13999999.000 3 (3%)
14000000.000-14999999.000 82 (82%)
15000000.000-15999999.000 15 (15%)


4.3 + patch
-----------
13000000.000-13999999.000 54 (54%)
14000000.000-14999999.000 35 (35%)
15000000.000-15999999.000 7 (7%)
16000000.000-16999999.000 1 (1%)
18000000.000-18999999.000 1 (1%)
22000000.000-22999999.000 2 (2%)


4.3 + modified patch
--------------------
13000000.000-13999999.000 3 (3%)
14000000.000-14999999.000 82 (82%)
15000000.000-15999999.000 14 (14%)
24000000.000-24999999.000 1 (1%)


IMHO, the 3rd option would be the way to go if it was considered
acceptable (ie, despite the fact that it returns spurious errors in
rare cases).


modified test program
=====================
printf("%f\n", rate);

Rainer Weikusat

unread,
Nov 19, 2015, 6:49:04 PM11/19/15
to Rainer Weikusat, David Miller, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
Rainer Weikusat <rwei...@mobileactivedefense.com> writes:
> Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes:
>
> [...]
>
>> The basic options would be
>>
>> - return EAGAIN even if sending became possible (Jason's most
>> recent suggestions)
>>
>> - retry sending a limited number of times, eg, once, before
>> returning EAGAIN, on the grounds that this is nicer to the
>> application and that redoing all the stuff up to the _lock in
>> dgram_sendmsg can possibly/ likely be avoided
>
> A third option:

A fourth option, and even one that's reasonably simple to implement: In
case other became ready during the checks, drop the other lock, do a
double-lock of sk and other, set a flag variable indicating this and
restart the procedure after the unix_state_lock_other[*], using the
value of the flag to lock/unlock sk as needed. Should other still be
ready to receive data, execution can then continue with the 'queue it'
code, as the other lock was held all the time in this case. Combined
with a few unlikely annotations in places where they're IMHO
appropriate, this is speed-wise comparable to the stock kernel.
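
For reference, unix_state_double_lock is the existing helper that makes
such a double-lock safe: it always takes the two state locks in address
order, so no lock-order inversion between two sockets is possible.
Roughly - paraphrased from the mainline source, not part of any patch
posted here:

	static void unix_state_double_lock(struct sock *sk1, struct sock *sk2)
	{
		/* always lock the lower-addressed socket first */
		if (unlikely(sk1 == sk2) || !sk2) {
			unix_state_lock(sk1);
			return;
		}

		if (sk1 < sk2) {
			unix_state_lock(sk1);
			unix_state_lock_nested(sk2);
		} else {
			unix_state_lock(sk2);
			unix_state_lock_nested(sk1);
		}
	}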

Rainer Weikusat

unread,
Nov 19, 2015, 6:53:11 PM11/19/15
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
This has been created around midnight at the end of a work day. I've
read through it a couple of times and found no errors. But I'm not
"ideally awake and attentive" right now.


diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..6962ff1 100644
@@ -1470,6 +1586,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	struct scm_cookie scm;
 	int max_level;
 	int data_len = 0;
+	int sk_locked;
 
 	wait_for_unix_gc();
 	err = scm_send(sock, msg, &scm, false);
@@ -1548,12 +1665,14 @@ restart:
 		goto out_free;
 	}
 
+	sk_locked = 0;
 	unix_state_lock(other);
+restart_locked:
 	err = -EPERM;
 	if (!unix_may_send(sk, other))
 		goto out_unlock;
 
-	if (sock_flag(other, SOCK_DEAD)) {
+	if (unlikely(sock_flag(other, SOCK_DEAD))) {
 		/*
 		 * Check with 1003.1g - what should
 		 * datagram error
@@ -1561,10 +1680,19 @@ restart:
 		unix_state_unlock(other);
 		sock_put(other);
 
+		if (!sk_locked)
+			unix_state_lock(sk);
+
 		err = 0;
-		unix_state_lock(sk);
 		if (unix_peer(sk) == other) {
 			unix_peer(sk) = NULL;
+
+			if (unix_dgram_peer_wake_disconnect(sk, other))
+				wake_up_interruptible_poll(sk_sleep(sk),
+							   POLLOUT |
+							   POLLWRNORM |
+							   POLLWRBAND);
+
 			unix_state_unlock(sk);
 
 			unix_dgram_disconnected(sk, other);
@@ -1590,21 +1718,35 @@ restart:
 		goto out_unlock;
 	}
 
-	if (unix_peer(other) != sk && unix_recvq_full(other)) {
-		if (!timeo) {
+	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+		if (timeo) {
+			timeo = unix_wait_for_peer(other, timeo);
+
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
+
+			goto restart;
+		}
+
+		if (unix_peer(sk) != other ||
+		    unix_dgram_peer_wake_me(sk, other)) {
 			err = -EAGAIN;
 			goto out_unlock;
 		}
 
-		timeo = unix_wait_for_peer(other, timeo);
+		if (!sk_locked) {
+			unix_state_unlock(other);
+			unix_state_double_lock(sk, other);
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
-
-		goto restart;
+			sk_locked = 1;
+			goto restart_locked;
+		}
 	}
 
+	if (unlikely(sk_locked))
+		unix_state_unlock(sk);
+
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
@@ -1618,6 +1760,8 @@ restart:
 	return len;
 
 out_unlock:
+	if (sk_locked)
+		unix_state_unlock(sk);
 	unix_state_unlock(other);
 out_free:
 	kfree_skb(skb);
@@ -2453,14 +2597,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,

Jason Baron

unread,
Nov 20, 2015, 11:03:39 AM11/20/15
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
On 11/19/2015 06:52 PM, Rainer Weikusat wrote:

[...]

> @@ -1590,21 +1718,35 @@ restart:
> 		goto out_unlock;
> 	}
>
> -	if (unix_peer(other) != sk && unix_recvq_full(other)) {
> -		if (!timeo) {
> +	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
> +		if (timeo) {
> +			timeo = unix_wait_for_peer(other, timeo);
> +
> +			err = sock_intr_errno(timeo);
> +			if (signal_pending(current))
> +				goto out_free;
> +
> +			goto restart;
> +		}
> +
> +		if (unix_peer(sk) != other ||
> +		    unix_dgram_peer_wake_me(sk, other)) {
> 			err = -EAGAIN;
> 			goto out_unlock;
> 		}

Hi,

So here we are calling unix_dgram_peer_wake_me() without the sk lock the
first time through - right? In that case, we can end up registering on
the queue of other for the callback even though we might have already
connected to a different remote, and the wakeup will then crash if 'sk'
has been freed in the meantime.
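
Spelled out as an interleaving - an assumed illustration of the
scenario described above, not taken from the thread:

	/* thread A: sendmsg on sk         thread B
	 * -------------------------       --------------------------------
	 * finds other's queue full,
	 * registers sk's peer_wake on
	 * other->peer_wait (sk not locked)
	 *                                 reconnects sk to other2; the
	 *                                 disconnect logic only unhooks sk
	 *                                 from other2's queue, not other's
	 *                                 close(sk): sk is freed, the stale
	 *                                 entry stays on other->peer_wait
	 * a datagram is read from other:
	 * the wakeup walks other->peer_wait
	 * and dereferences the freed entry
	 */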

Thanks,

-Jason

Rainer Weikusat

unread,
Nov 20, 2015, 11:21:40 AM11/20/15
to Jason Baron, Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Yes. And this is obviously wrong. I spent most of the 'evening time'
(some people would call that 'night time') testing this and didn't get
to read through it again yet. Thank you for pointing this out. I'll
send an updated patch shortly.

Rainer Weikusat

unread,
Nov 20, 2015, 5:07:56 PM11/20/15
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
- uninvert the lock/check code in _dgram_sendmsg

- introduce a unix_dgram_peer_wake_disconnect_wakeup helper
  function, as there were two calls with a wakeup immediately
  following and two without

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b36d837..2a91a05 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		peer_wake;
 };
 
 static inline struct unix_sock *unix_sk(const struct sock *sk)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 94f6582..3d93b0d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,6 +326,118 @@ found:
+static void unix_dgram_peer_wake_disconnect(struct sock *sk,
+					    struct sock *other)
+{
+	struct unix_sock *u, *u_other;
+
+	u = unix_sk(sk);
+	u_other = unix_sk(other);
+	spin_lock(&u_other->peer_wait.lock);
+
+	if (u->peer_wake.private == other) {
+		__remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
+		u->peer_wake.private = NULL;
+	}
+
+	spin_unlock(&u_other->peer_wait.lock);
+}
+
+static void unix_dgram_peer_wake_disconnect_wakeup(struct sock *sk,
+						   struct sock *other)
+{
+	unix_dgram_peer_wake_disconnect(sk, other);
+	wake_up_interruptible_poll(sk_sleep(sk),
+				   POLLOUT |
+				   POLLWRNORM |
+				   POLLWRBAND);
+}
+
+/* preconditions:
+ * - unix_peer(sk) == other
+ * - association is stable
+ */
+static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
+{
+	int connected;
+
+	connected = unix_dgram_peer_wake_connect(sk, other);
+
+	if (unix_recvq_full(other))
+		return 1;
+
+	if (connected)
+		unix_dgram_peer_wake_disconnect(sk, other);
+
+	return 0;
+}
+
 static inline int unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
@@ -430,6 +542,8 @@ static void unix_release_sock(struct sock *sk, int embrion)
 			skpair->sk_state_change(skpair);
 			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
 		}
+
+		unix_dgram_peer_wake_disconnect(sk, skpair);
 		sock_put(skpair); /* It may now die */
 		unix_peer(sk) = NULL;
 	}
@@ -664,6 +778,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1031,6 +1146,8 @@ restart:
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
 		unix_peer(sk) = other;
+		unix_dgram_peer_wake_disconnect_wakeup(sk, old_peer);
+
 		unix_state_double_unlock(sk, other);
 
 		if (other != old_peer)
@@ -1470,6 +1587,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	struct scm_cookie scm;
 	int max_level;
 	int data_len = 0;
+	int sk_locked;
 
 	wait_for_unix_gc();
 	err = scm_send(sock, msg, &scm, false);
@@ -1548,12 +1666,14 @@ restart:
 		goto out_free;
 	}
 
+	sk_locked = 0;
 	unix_state_lock(other);
+restart_locked:
 	err = -EPERM;
 	if (!unix_may_send(sk, other))
 		goto out_unlock;
 
-	if (sock_flag(other, SOCK_DEAD)) {
+	if (unlikely(sock_flag(other, SOCK_DEAD))) {
 		/*
 		 * Check with 1003.1g - what should
 		 * datagram error
@@ -1561,10 +1681,14 @@ restart:
 		unix_state_unlock(other);
 		sock_put(other);
 
+		if (!sk_locked)
+			unix_state_lock(sk);
+
 		err = 0;
-		unix_state_lock(sk);
 		if (unix_peer(sk) == other) {
 			unix_peer(sk) = NULL;
+			unix_dgram_peer_wake_disconnect_wakeup(sk, other);
+
 			unix_state_unlock(sk);
 
 			unix_dgram_disconnected(sk, other);
@@ -1590,21 +1714,38 @@ restart:
 		goto out_unlock;
 	}
 
-	if (unix_peer(other) != sk && unix_recvq_full(other)) {
-		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
+	if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+		if (timeo) {
+			timeo = unix_wait_for_peer(other, timeo);
+
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
+
+			goto restart;
 		}
 
-		timeo = unix_wait_for_peer(other, timeo);
+		if (!sk_locked) {
+			unix_state_unlock(other);
+			unix_state_double_lock(sk, other);
+		}
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+		if (unix_peer(sk) != other ||
+		    unix_dgram_peer_wake_me(sk, other)) {
+			err = -EAGAIN;
+			sk_locked = 1;
+			goto out_unlock;
+		}
 
-		goto restart;
+		if (!sk_locked) {
+			sk_locked = 1;
+			goto restart_locked;
+		}
 	}
 
+	if (unlikely(sk_locked))
+		unix_state_unlock(sk);
+
 	if (sock_flag(other, SOCK_RCVTSTAMP))
 		__net_timestamp(skb);
 	maybe_add_creds(skb, sock, other);
@@ -1618,6 +1759,8 @@ restart:
 	return len;
 
 out_unlock:
+	if (sk_locked)
+		unix_state_unlock(sk);
 	unix_state_unlock(other);
 out_free:
 	kfree_skb(skb);
@@ -2453,14 +2596,16 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,

Rainer Weikusat

unread,
Nov 22, 2015, 4:44:17 PM11/22/15
to Rainer Weikusat, Jason Baron, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Rainer Weikusat <rwei...@mobileactivedefense.com> writes:

[AF_UNIX SOCK_DGRAM throughput]

> It may be possible to improve this by tuning/ changing the flow
> control mechanism. Out of my head, I'd suggest making the queue longer
> (the default value is 10) and delaying wake ups until the server
> actually did catch up, IOW, the receive queue is empty or almost
> empty. But this ought to be done with a different patch.

Because I was curious about the effects, I implemented this using a
design slightly modified from the one I originally suggested, to account
for the different uses of the 'is the receive queue full' check. The
code uses a datagram-specific checking function,

static int unix_dgram_recvq_full(struct sock const *sk)
{
	struct unix_sock *u;

	u = unix_sk(sk);
	if (test_bit(UNIX_DG_FULL, &u->flags))
		return 1;

	if (!unix_recvq_full(sk))
		return 0;

	__set_bit(UNIX_DG_FULL, &u->flags);
	return 1;
}

which gets called instead of the other for the n:1 datagram checks and a

	if (test_bit(UNIX_DG_FULL, &u->flags) &&
	    !skb_queue_len(&sk->sk_receive_queue)) {
		__clear_bit(UNIX_DG_FULL, &u->flags);
		wake_up_interruptible_sync_poll(&u->peer_wait,
						POLLOUT | POLLWRNORM |
						POLLWRBAND);
	}

in unix_dgram_recvmsg to delay wakeups until the queued datagrams have
been consumed if the queue overflowed before. This has the additional,
nice side effect that wakeups won't ever be done for 1:1 connected
datagram sockets (both SOCK_DGRAM and SOCK_SEQPACKET) where they're of
no use, anyway.
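
One assumption the snippets make that isn't shown: a flag word to hold
UNIX_DG_FULL. In 4.3, the existing bits in struct unix_sock live in
gc_flags, so this would need something like a new member - my guess at
the supporting declarations, not posted code:

	/* assumed supporting declarations, not shown in the snippets above */
	#define UNIX_DG_FULL	0	/* queue overflowed; senders are owed a wakeup */

	struct unix_sock {
		/* ... existing members, including gc_flags ... */
		unsigned long	flags;	/* for test_bit/__set_bit/__clear_bit */
	};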

Compared to a 'stock' 4.3 running the test program I posted (supposed to
make the overhead noticeable by sending lots of small messages), the
average number of bytes sent per second increased by about 782,961.79
(ca 764.61K), about 5.32% of the 4.3 number (14,714,579.91), with a
fairly simple code change.

Jason Baron

unread,
Nov 23, 2015, 11:21:02 AM11/23/15
to Rainer Weikusat, Dmitry Vyukov, syzkaller, Michal Kubecek, Al Viro, linux-...@vger.kernel.org, LKML, David Miller, Hannes Frederic Sowa, David Howells, Paul Moore, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, netdev, Kostya Serebryany, Alexander Potapenko, Andrey Konovalov, Sasha Levin, Julien Tinnes, Kees Cook, Mathias Krause
Looks good to me.
Reviewed-by: Jason Baron <jba...@akamai.com>

Thanks,

-Jason

David Miller

unread,
Nov 23, 2015, 12:30:37 PM11/23/15
to rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
From: Rainer Weikusat <rwei...@mobileactivedefense.com>
Date: Fri, 20 Nov 2015 22:07:23 +0000
Applied and queued up for -stable, thanks to you and Jason for all of
your hard work on this fix.

Rainer Weikusat

unread,
Nov 23, 2015, 4:37:40 PM11/23/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
David Miller <da...@davemloft.net> writes:
> From: Rainer Weikusat <rwei...@mobileactivedefense.com>
>> Rainer Weikusat <rwei...@mobileactivedefense.com> writes:
>> An AF_UNIX datagram socket being the client in an n:1 association

[...]

> Applied and queued up for -stable,

I'm sorry for this 13th hour request/suggestion, but while thinking
about a reply to Dmitry, it occurred to me that the restart_locked/
sk_locked logic could be avoided by moving the test for this condition
in front of all the others while leaving the 'act on it' code at the
back, ie, by reorganizing unix_dgram_sendmsg such that it looks like
this:

	unix_state_lock(other);

	if (unix_peer(other) != sk && unix_recvq_full(other)) {
		need_wait = 1;

		if (!timeo) {
			unix_state_unlock(other);
			unix_state_double_lock(sk, other);

			if (unix_peer(other) == sk ||
			    (unix_peer(sk) == other &&
			     !unix_dgram_peer_wake_me(sk, other)))
				need_wait = 0;

			unix_state_unlock(sk);
		}
	}

	/* original code here */

	if (need_wait) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_free;

			goto restart;
		}

		err = -EAGAIN;
		goto out_unlock;
	}

	/* original tail here */

This might cause a socket to be enqueued to the peer despite not being
allowed to send to it, but I don't think this matters much. This is a
less conservative modification, but one which results in simpler code
overall. The kernel I'm currently running has been modified like this
and 'survived' the usual tests.

Rainer Weikusat

unread,
Nov 23, 2015, 6:06:36 PM11/23/15
to David Miller, rwei...@mobileactivedefense.com, jba...@akamai.com, dvy...@google.com, syzk...@googlegroups.com, mkub...@suse.cz, vi...@zeniv.linux.org.uk, linux-...@vger.kernel.org, linux-...@vger.kernel.org, han...@stressinduktion.org, dhow...@redhat.com, pa...@paul-moore.com, sal...@android.com, s...@tycho.nsa.gov, ying...@windriver.com, net...@vger.kernel.org, k...@google.com, gli...@google.com, andre...@google.com, sasha...@oracle.com, j...@google.com, kees...@google.com, min...@googlemail.com
Rainer Weikusat <r...@doppelsaurus.mobileactivedefense.com> writes:
> David Miller <da...@davemloft.net> writes:

[...]

> I'm sorry for this 13th hour request/ suggestion but while thinking
> about a reply to Dmitry, it occurred to me that the restart_locked/
> sk_locked logic could be avoided by moving the test for this condition
> in front of all the others while leaving the 'act on it' code at its
> back, ie, reorganize unix_dgram_sendmsg such that it looks like this:

[...]

Just in case this is unclear on its own: If this was considered an
improvement by someone other than me, I could supply either a "complete"
patch with this re-arrangement or a cleanup delta patch on top of the
previous change.