WARNING in __mmdrop

syzbot

Jul 18, 2019, 11:35:06 PM
to aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, kees...@chromium.org, l...@altlinux.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Hello,

syzbot found the following crash on:

HEAD commit: 6d21a41b Add linux-next specific files for 20190718
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=134920f0600000
kernel config: https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+e58112...@syzkaller.appspotmail.com

WARNING: CPU: 0 PID: 9142 at kernel/fork.c:677 __mmdrop+0x26a/0x320
/kernel/fork.c:677
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9142 Comm: syz-executor.0 Not tainted 5.2.0-next-20190718 #41
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack /lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
panic+0x2dc/0x755 /kernel/panic.c:219
__warn.cold+0x20/0x4c /kernel/panic.c:576
report_bug+0x263/0x2b0 /lib/bug.c:186
fixup_bug /arch/x86/kernel/traps.c:179 [inline]
fixup_bug /arch/x86/kernel/traps.c:174 [inline]
do_error_trap+0x11b/0x200 /arch/x86/kernel/traps.c:272
do_invalid_op+0x37/0x50 /arch/x86/kernel/traps.c:291
invalid_op+0x14/0x20 /arch/x86/entry/entry_64.S:1008
RIP: 0010:__mmdrop+0x26a/0x320 /kernel/fork.c:677
Code: 5f 5d c3 e8 18 7e 2f 00 4c 89 ef e8 60 d7 2b 00 eb d2 e8 09 7e 2f 00
0f 0b e8 02 7e 2f 00 0f 0b e9 fa fd ff ff e8 f6 7d 2f 00 <0f> 0b e9 2b fe
ff ff e8 ea 7d 2f 00 4c 89 e7 e8 a2 f7 67 00 e9 85
RSP: 0018:ffff8880a176fac0 EFLAGS: 00010293
RAX: ffff888098f02600 RBX: ffff888098f02600 RCX: ffffffff81430bab
RDX: 0000000000000000 RSI: ffffffff814306fa RDI: ffff888098f02a30
RBP: ffff8880a176fae8 R08: ffff888098f02600 R09: ffffed10148dd4ec
R10: ffffed10148dd4eb R11: ffff8880a46ea75f R12: ffff8880a46ea700
R13: ffff8880a46ea828 R14: ffff8880a46eac50 R15: 0000000000000000
mmdrop /./include/linux/sched/mm.h:49 [inline]
__mmput /kernel/fork.c:1074 [inline]
mmput+0x3f0/0x4d0 /kernel/fork.c:1085
exit_mm /kernel/exit.c:547 [inline]
do_exit+0x84e/0x2ea0 /kernel/exit.c:864
do_group_exit+0x135/0x360 /kernel/exit.c:981
get_signal+0x47c/0x2500 /kernel/signal.c:2728
do_signal+0x87/0x1700 /arch/x86/kernel/signal.c:815
exit_to_usermode_loop+0x286/0x380 /arch/x86/entry/common.c:159
prepare_exit_to_usermode /arch/x86/entry/common.c:194 [inline]
syscall_return_slowpath /arch/x86/entry/common.c:274 [inline]
do_syscall_64+0x5a9/0x6a0 /arch/x86/entry/common.c:299
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459819
Code: Bad RIP value.
RSP: 002b:00007f75afccbc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000459819
RDX: 00000000200023c0 RSI: 000000004028af11 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f75afccc6d4
R13: 00000000004c4722 R14: 00000000004d87d0 R15: 00000000ffffffff
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

syzbot

Jul 20, 2019, 6:08:01 AM
to aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, m...@redhat.com, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
syzbot has bisected this bug to:

commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
Author: Jason Wang <jaso...@redhat.com>
Date: Fri May 24 08:12:18 2019 +0000

vhost: access vq metadata through kernel virtual address

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
start commit: 6d21a41b Add linux-next specific files for 20190718
git tree: linux-next
final crash: https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000

Reported-by: syzbot+e58112...@syzkaller.appspotmail.com
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
address")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

Michael S. Tsirkin

Jul 21, 2019, 6:03:05 AM
to syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
OK, I poked at this for a bit and see several things that
we need to fix, though I'm not yet sure any of them is the reason for
the failures:


1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr.
That's just a bad hack; in particular, I don't think the device
mutex is taken there, so poking at two VQs concurrently will corrupt
memory.
So what to do? How about a per-VQ notifier?
Of course we also have synchronize_rcu
in the notifier, which is slow and is now going to be called twice.
I think call_rcu would be more appropriate here.
We then need rcu_barrier on module unload.
OTOH if we make the pages array linear with the map, then we are good
with kfree_rcu, which is even nicer (see the sketch below).

2. Doesn't the map leak after vhost_map_unprefetch?
And why does it poke at the contents of the map?
No one should be using it at that point, right?

3. The notifier unregister happens last in vhost_dev_cleanup,
but the register happens first. This ordering looks wrong to me.

4. OK, so we use the invalidate count to try and detect that
some invalidate is in progress.
I am not 100% sure why we care.
Assuming we do, uaddr can change between the start and end callbacks,
so the overlap check passes in one but not the other; the counter then
misses an increment or a decrement and can go negative, or generally
get out of sync.

So what to do about all this?
I am inclined to say let's just drop the uaddr overlap optimization
for now; KVM, for example, invalidates unconditionally.
Item 3 should be fixed independently.
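
For item 1, here is roughly what the kfree_rcu direction would look
like, assuming we embed an rcu_head in the map and make the pages
array a flexible member (untested sketch; vhost_drop_map is a made-up
helper name):

struct vhost_map {
        int npages;
        void *addr;
        struct rcu_head head;
        struct page *pages[]; /* linear with the map itself */
};

/* Caller holds vq->mmu_lock: unpublish the map and free it without
 * synchronize_rcu(); readers still inside their RCU read-side
 * critical section are covered by the grace period that kfree_rcu
 * waits for.
 */
static void vhost_drop_map(struct vhost_virtqueue *vq, int index)
{
        struct vhost_map *map;

        map = rcu_dereference_protected(vq->maps[index],
                                        lockdep_is_held(&vq->mmu_lock));
        if (map) {
                rcu_assign_pointer(vq->maps[index], NULL);
                kfree_rcu(map, head); /* one allocation, freed after a GP */
        }
}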


--
MST

syzbot

Jul 21, 2019, 8:11:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

les configured (established 65536 bind 65536)
[ 6.808609][ T1] UDP hash table entries: 4096 (order: 7, 655360
bytes, vmalloc)
[ 6.810813][ T1] UDP-Lite hash table entries: 4096 (order: 7, 655360
bytes, vmalloc)
[ 6.813604][ T1] NET: Registered protocol family 1
[ 6.816831][ T1] RPC: Registered named UNIX socket transport module.
[ 6.818527][ T1] RPC: Registered udp transport module.
[ 6.819544][ T1] RPC: Registered tcp transport module.
[ 6.820501][ T1] RPC: Registered tcp NFSv4.1 backchannel transport
module.
[ 6.822944][ T1] NET: Registered protocol family 44
[ 6.823738][ T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[ 6.825254][ T1] PCI: CLS 0 bytes, default 64
[ 6.829091][ T1] PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)
[ 6.830088][ T1] software IO TLB: mapped [mem 0xaa800000-0xae800000]
(64MB)
[ 6.833252][ T1] RAPL PMU: API unit is 2^-32 Joules, 0 fixed
counters, 10737418240 ms ovfl timer
[ 6.838785][ T1] kvm: already loaded the other module
[ 6.839685][ T1] clocksource: tsc: mask: 0xffffffffffffffff
max_cycles: 0x212735223b2, max_idle_ns: 440795277976 ns
[ 6.841374][ T1] clocksource: Switched to clocksource tsc
[ 6.842536][ T1] mce: Machine check injector initialized
[ 6.847292][ T1] check: Scanning for low memory corruption every 60
seconds
[ 6.937780][ T1] Initialise system trusted keyrings
[ 6.940843][ T1] workingset: timestamp_bits=40 max_order=21
bucket_order=0
[ 6.942419][ T1] zbud: loaded
[ 6.948151][ T1] DLM installed
[ 6.950370][ T1] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 6.955181][ T1] FS-Cache: Netfs 'nfs' registered for caching
[ 6.958514][ T1] NFS: Registering the id_resolver key type
[ 6.959782][ T1] Key type id_resolver registered
[ 6.960872][ T1] Key type id_legacy registered
[ 6.961793][ T1] nfs4filelayout_init: NFSv4 File Layout Driver
Registering...
[ 6.963047][ T1] Installing knfsd (copyright (C) 1996
ok...@monad.swb.de).
[ 6.966739][ T1] ntfs: driver 2.1.32 [Flags: R/W].
[ 6.968466][ T1] fuse: init (API version 7.31)
[ 6.971220][ T1] JFS: nTxBlock = 8192, nTxLock = 65536
[ 6.980736][ T1] SGI XFS with ACLs, security attributes, realtime, no
debug enabled
[ 6.985971][ T1] 9p: Installing v9fs 9p2000 file system support
[ 6.987761][ T1] FS-Cache: Netfs '9p' registered for caching
[ 6.992344][ T1] gfs2: GFS2 installed
[ 6.995258][ T1] FS-Cache: Netfs 'ceph' registered for caching
[ 6.996351][ T1] ceph: loaded (mds proto 32)
[ 7.002881][ T1] NET: Registered protocol family 38
[ 7.004563][ T1] async_tx: api initialized (async)
[ 7.005934][ T1] Key type asymmetric registered
[ 7.006720][ T1] Asymmetric key parser 'x509' registered
[ 7.007658][ T1] Asymmetric key parser 'pkcs8' registered
[ 7.008684][ T1] Key type pkcs7_test registered
[ 7.009480][ T1] Asymmetric key parser 'tpm_parser' registered
[ 7.010703][ T1] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 246)
[ 7.012951][ T1] io scheduler mq-deadline registered
[ 7.013822][ T1] io scheduler kyber registered
[ 7.014898][ T1] io scheduler bfq registered
[ 7.019871][ T1] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 7.021981][ T1] ACPI: Power Button [PWRF]
[ 7.023566][ T1] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 7.025183][ T1] ACPI: Sleep Button [SLPF]
[ 7.030616][ T1] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 7.042672][ T1] PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 7.043670][ T1] virtio-pci 0000:00:03.0: virtio_pci: leaving for
legacy driver
[ 7.056046][ T1] PCI Interrupt Link [LNKD] enabled at IRQ 10
[ 7.057009][ T1] virtio-pci 0000:00:04.0: virtio_pci: leaving for
legacy driver
[ 7.322858][ T1] HDLC line discipline maxframe=4096
[ 7.324054][ T1] N_HDLC line discipline registered.
[ 7.324788][ T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing
enabled
[ 7.349558][ T1] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
115200) is a 16550A
[ 7.376566][ T1] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
115200) is a 16550A
[ 7.401675][ T1] 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud =
115200) is a 16550A
[ 7.426911][ T1] 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud =
115200) is a 16550A
[ 7.435838][ T1] Non-volatile memory driver v1.3
[ 7.437455][ T1] Linux agpgart interface v0.103
[ 7.446446][ T1] [drm] Initialized vgem 1.0.0 20120112 for vgem on
minor 0
[ 7.448571][ T1] [drm] Supports vblank timestamp caching Rev 2
(21.10.2013).
[ 7.450254][ T1] [drm] Driver supports precise vblank timestamp query.
[ 7.453895][ T1] [drm] Initialized vkms 1.0.0 20180514 for vkms on
minor 1
[ 7.455890][ T1] usbcore: registered new interface driver udl
[ 7.500541][ T1] brd: module loaded
[ 7.531107][ T1] loop: module loaded
[ 7.592794][ T1] zram: Added device: zram0
[ 7.599273][ T1] null: module loaded
[ 7.604504][ T1] nfcsim 0.2 initialized
[ 7.606991][ T1] Loading iSCSI transport class v2.0-870.
[ 7.629140][ T1] scsi host0: Virtio SCSI HBA
[ 7.663976][ T1] st: Version 20160209, fixed bufsize 32768, s/g segs
256
[ 7.667390][ T7] kasan: CONFIG_KASAN_INLINE enabled
[ 7.668816][ T7] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 7.670661][ T7] general protection fault: 0000 [#1] SMP KASAN
[ 7.672001][ T7] CPU: 1 PID: 7 Comm: kworker/u4:0 Not tainted 5.2.0+
#1
[ 7.673495][ T7] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 7.675799][ T7] Workqueue: events_unbound async_run_entry_fn
[ 7.676896][ T1] kobject: 'nvme-wq' (000000007e9a1e3d):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.676124][ T7] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.679656][ T1] kobject: 'nvme-wq' (000000007e9a1e3d):
kobject_uevent_env
[ 7.676124][ T7] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.682435][ T1] kobject: 'nvme-wq' (000000007e9a1e3d):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.676124][ T7] RSP: 0000:ffff8880a989f768 EFLAGS: 00010246
[ 7.676124][ T7] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff816007b1
[ 7.676124][ T7] RDX: 0000000000000000 RSI: ffffffff816007d0 RDI:
ffff8880a51b54b8
[ 7.676124][ T7] RBP: ffff8880a989f780 R08: ffff8880a98921c0 R09:
ffffed10146fb4dc
[ 7.676124][ T7] R10: ffffed10146fb4db R11: ffff8880a37da6df R12:
ffff8880a51b5180
[ 7.676124][ T7] R13: ffff8880a51b5180 R14: ffff8880a329aa30 R15:
0000000000000200
[ 7.676124][ T7] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.676124][ T7] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.676124][ T7] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.676124][ T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.676124][ T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.676124][ T7] Call Trace:
[ 7.676124][ T7] dma_max_mapping_size+0xba/0x100
[ 7.676124][ T7] __scsi_init_queue+0x1cb/0x580
[ 7.676124][ T7] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 7.676124][ T7] scsi_mq_alloc_queue+0xd2/0x180
[ 7.676124][ T7] scsi_alloc_sdev+0x837/0xc60
[ 7.676124][ T7] scsi_probe_and_add_lun+0x2440/0x39f0
[ 7.676124][ T7] ? __kasan_check_read+0x11/0x20
[ 7.676124][ T7] ? mark_lock+0xc0/0x11e0
[ 7.689057][ T1] kobject: 'nvme-wq' (000000007e9a1e3d):
kobject_uevent_env
[ 7.676124][ T7] ? scsi_alloc_sdev+0xc60/0xc60
[ 7.691857][ T1] kobject: 'nvme-wq' (000000007e9a1e3d):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-wq'
[ 7.676124][ T7] ? mark_held_locks+0xa4/0xf0
[ 7.696116][ T1] kobject: 'nvme-reset-wq' (00000000fdfb290b):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.676124][ T7] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.698770][ T1] kobject: 'nvme-reset-wq' (00000000fdfb290b):
kobject_uevent_env
[ 7.676124][ T7] ? __pm_runtime_resume+0x11b/0x180
[ 7.702018][ T1] kobject: 'nvme-reset-wq' (00000000fdfb290b):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.676124][ T7] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.705278][ T1] kobject: 'nvme-reset-wq' (00000000fdfb290b):
kobject_uevent_env
[ 7.676124][ T7] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.676124][ T7] ? trace_hardirqs_on+0x67/0x220
[ 7.676124][ T7] ? __kasan_check_read+0x11/0x20
[ 7.707608][ T1] kobject: 'nvme-reset-wq' (00000000fdfb290b):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-reset-wq'
[ 7.676124][ T7] ? __pm_runtime_resume+0x11b/0x180
[ 7.710173][ T1] kobject: 'nvme-delete-wq' (00000000a55404d0):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.676124][ T7] __scsi_scan_target+0x29a/0xfa0
[ 7.712104][ T1] kobject: 'nvme-delete-wq' (00000000a55404d0):
kobject_uevent_env
[ 7.676124][ T7] ? __pm_runtime_resume+0x11b/0x180
[ 7.714038][ T1] kobject: 'nvme-delete-wq' (00000000a55404d0):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.676124][ T7] ? __kasan_check_read+0x11/0x20
[ 7.716055][ T1] kobject: 'nvme-delete-wq' (00000000a55404d0):
kobject_uevent_env
[ 7.676124][ T7] ? mark_lock+0xc0/0x11e0
[ 7.718611][ T1] kobject: 'nvme-delete-wq' (00000000a55404d0):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-delete-wq'
[ 7.676124][ T7] ? scsi_probe_and_add_lun+0x39f0/0x39f0
[ 7.721775][ T1] kobject: 'nvme' (00000000f1fa4168):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.676124][ T7] ? mark_held_locks+0xa4/0xf0
[ 7.725467][ T1] kobject: 'nvme' (00000000f1fa4168):
kobject_uevent_env
[ 7.676124][ T7] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.728076][ T1] kobject: 'nvme' (00000000f1fa4168): fill_kobj_path:
path = '/class/nvme'
[ 7.676124][ T7] ? __pm_runtime_resume+0x11b/0x180
[ 7.731943][ T1] kobject: 'nvme-subsystem' (000000006185cd1c):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.676124][ T7] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.734789][ T1] kobject: 'nvme-subsystem' (000000006185cd1c):
kobject_uevent_env
[ 7.676124][ T7] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.676124][ T7] ? trace_hardirqs_on+0x67/0x220
[ 7.676124][ T7] scsi_scan_channel.part.0+0x11a/0x190
[ 7.737024][ T1] kobject: 'nvme-subsystem' (000000006185cd1c):
fill_kobj_path: path = '/class/nvme-subsystem'
[ 7.676124][ T7] scsi_scan_host_selected+0x313/0x450
[ 7.740736][ T1] kobject: 'nvme' (00000000ed171e99):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.676124][ T7] ? scsi_scan_host+0x450/0x450
[ 7.744307][ T1] kobject: 'drivers' (0000000000dc76ca):
kobject_add_internal: parent: 'nvme', set: '<NULL>'
[ 7.676124][ T7] do_scsi_scan_host+0x1ef/0x260
[ 7.676124][ T7] ? scsi_scan_host+0x450/0x450
[ 7.676124][ T7] do_scan_async+0x41/0x500
[ 7.747175][ T1] kobject: 'nvme' (00000000ed171e99):
kobject_uevent_env
[ 7.676124][ T7] ? scsi_scan_host+0x450/0x450
[ 7.750808][ T1] kobject: 'nvme' (00000000ed171e99): fill_kobj_path:
path = '/bus/pci/drivers/nvme'
[ 7.676124][ T7] async_run_entry_fn+0x124/0x570
[ 7.753429][ T1] kobject: 'ahci' (000000007d94d297):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.676124][ T7] process_one_work+0x9af/0x16d0
[ 7.676124][ T7] ? pwq_dec_nr_in_flight+0x320/0x320
[ 7.757152][ T1] kobject: 'drivers' (00000000f2654fc1):
kobject_add_internal: parent: 'ahci', set: '<NULL>'
[ 7.676124][ T7] ? lock_acquire+0x190/0x400
[ 7.760120][ T1] kobject: 'ahci' (000000007d94d297):
kobject_uevent_env
[ 7.676124][ T7] worker_thread+0x98/0xe40
[ 7.762876][ T1] kobject: 'ahci' (000000007d94d297): fill_kobj_path:
path = '/bus/pci/drivers/ahci'
[ 7.676124][ T7] kthread+0x361/0x430
[ 7.765779][ T1] kobject: 'ata_piix' (000000005a7b5c04):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.676124][ T7] ? process_one_work+0x16d0/0x16d0
[ 7.769294][ T1] kobject: 'drivers' (00000000c9a0712e):
kobject_add_internal: parent: 'ata_piix', set: '<NULL>'
[ 7.676124][ T7] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 7.772020][ T1] kobject: 'ata_piix' (000000005a7b5c04):
kobject_uevent_env
[ 7.676124][ T7] ret_from_fork+0x24/0x30
[ 7.774433][ T1] kobject: 'ata_piix' (000000005a7b5c04):
fill_kobj_path: path = '/bus/pci/drivers/ata_piix'
[ 7.676124][ T7] Modules linked in:
[ 7.776889][ T7] ---[ end trace f5779f5637a8c0ef ]---
[ 7.777844][ T1] kobject: 'pata_amd' (00000000de91832b):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.779953][ T7] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.781045][ T1] kobject: 'drivers' (000000003216327f):
kobject_add_internal: parent: 'pata_amd', set: '<NULL>'
[ 7.783158][ T7] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.784233][ T1] kobject: 'pata_amd' (00000000de91832b):
kobject_uevent_env
[ 7.785245][ T7] RSP: 0000:ffff8880a989f768 EFLAGS: 00010246
[ 7.786345][ T1] kobject: 'pata_amd' (00000000de91832b):
fill_kobj_path: path = '/bus/pci/drivers/pata_amd'
[ 7.787828][ T7] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff816007b1
[ 7.788932][ T1] kobject: 'pata_oldpiix' (000000003918be48):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.790813][ T7] RDX: 0000000000000000 RSI: ffffffff816007d0 RDI:
ffff8880a51b54b8
[ 7.792042][ T1] kobject: 'drivers' (000000006caab76b):
kobject_add_internal: parent: 'pata_oldpiix', set: '<NULL>'
[ 7.794119][ T7] RBP: ffff8880a989f780 R08: ffff8880a98921c0 R09:
ffffed10146fb4dc
[ 7.795266][ T1] kobject: 'pata_oldpiix' (000000003918be48):
kobject_uevent_env
[ 7.796321][ T7] R10: ffffed10146fb4db R11: ffff8880a37da6df R12:
ffff8880a51b5180
[ 7.796328][ T7] R13: ffff8880a51b5180 R14: ffff8880a329aa30 R15:
0000000000000200
[ 7.796344][ T7] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.798513][ T1] kobject: 'pata_oldpiix' (000000003918be48):
fill_kobj_path: path = '/bus/pci/drivers/pata_oldpiix'
[ 7.799458][ T7] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.800870][ T1] kobject: 'pata_sch' (00000000c82df7fd):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.801778][ T7] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.803715][ T1] kobject: 'drivers' (000000008f06f99a):
kobject_add_internal: parent: 'pata_sch', set: '<NULL>'
[ 7.804546][ T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.806913][ T1] kobject: 'pata_sch' (00000000c82df7fd):
kobject_uevent_env
[ 7.808080][ T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.810380][ T1] kobject: 'pata_sch' (00000000c82df7fd):
fill_kobj_path: path = '/bus/pci/drivers/pata_sch'
[ 7.811731][ T7] Kernel panic - not syncing: Fatal exception
[ 7.813265][ T1] kobject: 'ata_generic' (000000008108b9b4):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.816827][ T7] Kernel Offset: disabled
[ 7.816827][ T7] Rebooting in 86400 seconds..


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=156e5e3fa00000


Tested on:

commit: f1a3b43c Merge branch 'for-linus' of git://git.kernel.org/..
git tree: upstream
kernel config: https://syzkaller.appspot.com/x/.config?x=19dd7cf81d8c8469

Michael S. Tsirkin

Jul 21, 2019, 8:19:10 AM
to syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
The patch below implements this but is only build-tested.
Jason, pls take a look. If you like the approach, feel
free to take it from here.

One thing the patch below does not have is any kind of rate-limiting.
Given how easy it is for userspace to trigger this repeatedly, I'm
thinking it makes sense to add generic infrastructure for it.
That can be a separate patch, I guess.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..1d89715af89d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -299,53 +299,30 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
}

#if VHOST_ARCH_CAN_ACCEL_UACCESS
-static void vhost_map_unprefetch(struct vhost_map *map)
-{
- kfree(map->pages);
- map->pages = NULL;
- map->npages = 0;
- map->addr = NULL;
-}
-
-static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
+static void __vhost_cleanup_vq_maps(struct vhost_virtqueue *vq)
{
struct vhost_map *map[VHOST_NUM_ADDRS];
- int i;
+ int i, j;

- spin_lock(&vq->mmu_lock);
for (i = 0; i < VHOST_NUM_ADDRS; i++) {
map[i] = rcu_dereference_protected(vq->maps[i],
lockdep_is_held(&vq->mmu_lock));
- if (map[i])
+ if (map[i]) {
+ if (vq->uaddrs[i].write) {
+ for (j = 0; j < map[i]->npages; j++)
+ set_page_dirty(map[i]->pages[j]);
+ }
rcu_assign_pointer(vq->maps[i], NULL);
+ kfree_rcu(map[i], head);
+ }
}
+}
+
+static void vhost_cleanup_vq_maps(struct vhost_virtqueue *vq)
+{
+ spin_lock(&vq->mmu_lock);
+ __vhost_cleanup_vq_maps(vq);
spin_unlock(&vq->mmu_lock);
-
- synchronize_rcu();
-
- for (i = 0; i < VHOST_NUM_ADDRS; i++)
- if (map[i])
- vhost_map_unprefetch(map[i]);
-
-}
-
-static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
-{
- int i;
-
- vhost_uninit_vq_maps(vq);
- for (i = 0; i < VHOST_NUM_ADDRS; i++)
- vq->uaddrs[i].size = 0;
-}
-
-static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
- unsigned long start,
- unsigned long end)
-{
- if (unlikely(!uaddr->size))
- return false;
-
- return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
}

static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
@@ -353,31 +330,11 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
unsigned long start,
unsigned long end)
{
- struct vhost_uaddr *uaddr = &vq->uaddrs[index];
- struct vhost_map *map;
- int i;
-
- if (!vhost_map_range_overlap(uaddr, start, end))
- return;
-
spin_lock(&vq->mmu_lock);
++vq->invalidate_count;

- map = rcu_dereference_protected(vq->maps[index],
- lockdep_is_held(&vq->mmu_lock));
- if (map) {
- if (uaddr->write) {
- for (i = 0; i < map->npages; i++)
- set_page_dirty(map->pages[i]);
- }
- rcu_assign_pointer(vq->maps[index], NULL);
- }
+ __vhost_cleanup_vq_maps(vq);
spin_unlock(&vq->mmu_lock);
-
- if (map) {
- synchronize_rcu();
- vhost_map_unprefetch(map);
- }
}

static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
@@ -385,9 +342,6 @@ static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
unsigned long start,
unsigned long end)
{
- if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
- return;
-
spin_lock(&vq->mmu_lock);
--vq->invalidate_count;
spin_unlock(&vq->mmu_lock);
@@ -483,7 +437,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->invalidate_count = 0;
__vhost_vq_meta_reset(vq);
#if VHOST_ARCH_CAN_ACCEL_UACCESS
- vhost_reset_vq_maps(vq);
+ vhost_cleanup_vq_maps(vq);
#endif
}

@@ -841,6 +796,8 @@ static void vhost_setup_uaddr(struct vhost_virtqueue *vq,

static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
{
+ spin_lock(&vq->mmu_lock);
+
vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
(unsigned long)vq->desc,
vhost_get_desc_size(vq, vq->num),
@@ -853,6 +810,8 @@ static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
(unsigned long)vq->used,
vhost_get_used_size(vq, vq->num),
true);
+
+ spin_unlock(&vq->mmu_lock);
}

static int vhost_map_prefetch(struct vhost_virtqueue *vq,
@@ -874,13 +833,11 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
goto err;

err = -ENOMEM;
- map = kmalloc(sizeof(*map), GFP_ATOMIC);
+ map = kmalloc(sizeof(*map) + sizeof(*map->pages) * npages, GFP_ATOMIC);
if (!map)
goto err;

- pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
- if (!pages)
- goto err_pages;
+ pages = map->pages;

err = EFAULT;
npinned = __get_user_pages_fast(uaddr->uaddr, npages,
@@ -907,7 +864,6 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,

map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
map->npages = npages;
- map->pages = pages;

rcu_assign_pointer(vq->maps[index], map);
/* No need for a synchronize_rcu(). This function should be
@@ -919,8 +875,6 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
return 0;

err_gup:
- kfree(pages);
-err_pages:
kfree(map);
err:
spin_unlock(&vq->mmu_lock);
@@ -942,6 +896,10 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
vhost_vq_reset(dev, dev->vqs[i]);
}
vhost_dev_free_iovecs(dev);
+#if VHOST_ARCH_CAN_ACCEL_UACCESS
+ if (dev->mm)
+ mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
+#endif
if (dev->log_ctx)
eventfd_ctx_put(dev->log_ctx);
dev->log_ctx = NULL;
@@ -957,16 +915,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
kthread_stop(dev->worker);
dev->worker = NULL;
}
- if (dev->mm) {
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
- mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
-#endif
+ if (dev->mm)
mmput(dev->mm);
- }
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
- for (i = 0; i < dev->nvqs; i++)
- vhost_uninit_vq_maps(dev->vqs[i]);
-#endif
dev->mm = NULL;
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
@@ -1426,7 +1376,7 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
if (likely(map)) {
avail = map->addr;
- *event = (__virtio16)avail->ring[vq->num];
+ *event = avail->ring[vq->num];
rcu_read_unlock();
return 0;
}
@@ -1830,6 +1780,8 @@ static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
struct vhost_map __rcu *map;
int i;

+ vhost_setup_vq_uaddr(vq);
+
for (i = 0; i < VHOST_NUM_ADDRS; i++) {
rcu_read_lock();
map = rcu_dereference(vq->maps[i]);
@@ -1838,6 +1790,10 @@ static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
vhost_map_prefetch(vq, i);
}
}
+#else
+static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
+{
+}
#endif

int vq_meta_prefetch(struct vhost_virtqueue *vq)
@@ -1845,9 +1801,7 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
unsigned int num = vq->num;

if (!vq->iotlb) {
-#if VHOST_ARCH_CAN_ACCEL_UACCESS
vhost_vq_map_prefetch(vq);
-#endif
return 1;
}

@@ -2060,16 +2014,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,

mutex_lock(&vq->mutex);

-#if VHOST_ARCH_CAN_ACCEL_UACCESS
- /* Unregister MMU notifer to allow invalidation callback
- * can access vq->uaddrs[] without holding a lock.
- */
- if (d->mm)
- mmu_notifier_unregister(&d->mmu_notifier, d->mm);
-
- vhost_uninit_vq_maps(vq);
-#endif
-
switch (ioctl) {
case VHOST_SET_VRING_NUM:
r = vhost_vring_set_num(d, vq, argp);
@@ -2081,13 +2025,6 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
BUG();
}

-#if VHOST_ARCH_CAN_ACCEL_UACCESS
- vhost_setup_vq_uaddr(vq);
-
- if (d->mm)
- mmu_notifier_register(&d->mmu_notifier, d->mm);
-#endif
-
mutex_unlock(&vq->mutex);

return r;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 819296332913..584bb13c4d6d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -86,7 +86,8 @@ enum vhost_uaddr_type {
struct vhost_map {
int npages;
void *addr;
- struct page **pages;
+ struct rcu_head head;
+ struct page *pages[];
};

struct vhost_uaddr {
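
A side note on the map allocation above: with pages[] now a flexible
array member, struct_size() could express the same size computation
with overflow checking. A sketch, assuming struct_size() is available
in this tree:

        map = kmalloc(struct_size(map, pages, npages), GFP_ATOMIC);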

Michael S. Tsirkin

Jul 21, 2019, 8:28:16 AM
to pau...@linux.vnet.ibm.com, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Hi Paul, others,

So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
is what happens if userspace starts cycling through lots of these
ioctls. Given we actually use RCU as an optimization, we could just
disable the optimization temporarily - but the question would be how to
detect an excessive rate without working too hard :).

I guess we could define as excessive any rate at which a callback is
still outstanding at the time the new structure is allocated. I have
very little understanding of RCU internals - so I wanted to check that
the following more or less implements this heuristic before I spend
time actually testing it.
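
To be concrete, the intended use on the vhost side would be something
like the following. This is a hypothetical sketch, not part of the
patch below, and vhost_free_map_limited is a made-up name:

static void vhost_free_map_limited(struct vhost_virtqueue *vq, int index,
                                   struct vhost_map *map)
{
        rcu_assign_pointer(vq->maps[index], NULL);
        if (call_rcu_outstanding()) {
                /* Userspace is cycling too fast: fall back to a
                 * synchronous grace period so it cannot queue up
                 * unbounded amounts of memory.
                 */
                synchronize_rcu();
                kfree(map);
        } else {
                kfree_rcu(map, head);
        }
}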

Could others pls take a look and let me know?

Thanks!

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>


diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 477b4eb44af5..067909521d72 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -125,6 +125,25 @@ void synchronize_rcu(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu);

+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+ unsigned long flags;
+ bool outstanding;
+
+ local_irq_save(flags);
+ /* Tiny RCU has no per-CPU rcu_data; check the global ctrlblk. */
+ outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
+ local_irq_restore(flags);
+
+ return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
/*
* Post an RCU callback to be invoked after the end of an RCU grace
* period. But since we have but one CPU, that would be after any
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..d4b9d61e637d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
{
}

+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+ unsigned long flags;
+ struct rcu_data *rdp;
+ bool outstanding;
+
+ local_irq_save(flags);
+ rdp = this_cpu_ptr(&rcu_data);
+ /* True iff this CPU still has callbacks queued. */
+ outstanding = !rcu_segcblist_empty(&rdp->cblist);
+ local_irq_restore(flags);
+
+ return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
/*
* Helper function for call_rcu() and friends. The cpu argument will
* normally be -1, indicating "currently running CPU". It may specify

Paul E. McKenney

Jul 21, 2019, 9:18:36 AM
to Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> Hi Paul, others,
>
> So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> is what happens if userspace starts cycling through lots of these
> ioctls. Given we actually use RCU as an optimization, we could just
> disable the optimization temporarily - but the question would be how to
> detect an excessive rate without working too hard :).
>
> I guess we could define as excessive any rate at which a callback is
> still outstanding at the time the new structure is allocated. I have
> very little understanding of RCU internals - so I wanted to check that
> the following more or less implements this heuristic before I spend
> time actually testing it.
>
> Could others pls take a look and let me know?

These look good as a way of seeing if there are any outstanding callbacks,
but in the case of Tree RCU, call_rcu_outstanding() would almost never
return false on a busy system.

Here are some alternatives:

o RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
The idea is to make kfree_rcu() locally buffer requests into
batches of (say) 1,000, processing smaller batches when RCU
is idle, or when some smallish amount of time has passed with
no more kfree_rcu() requests from that CPU. RCU then takes in
the batch using not call_rcu(), but rather queue_rcu_work().
The resulting batch of kfree() calls would therefore execute in
workqueue context rather than in softirq context, which should
be much easier on the system.

In theory, this would allow people to use kfree_rcu() without
worrying quite so much about overload. It would also not be
that hard to implement.

o Subsystems vulnerable to user-induced kfree_rcu() flooding use
call_rcu() instead of kfree_rcu(). Keep a count of the number
of things waiting for a grace period, and when this gets too
large, disable the optimization. It will then drain down, at
which point the optimization can be re-enabled.

But please note that callbacks are -not- guaranteed to run on
the CPU that queued them. So yes, you would need a per-CPU
counter, but you would need to periodically sum it up to check
against the global state. Or keep track of the CPU that
did the call_rcu() so that you can atomically decrement in
the callback the same counter that was atomically incremented
just before the call_rcu(). Or any number of other approaches;
one variant is sketched below.
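
For concreteness, here is a rough shape of that counter-based scheme.
This is an illustrative sketch only: the threshold and the function
names are made up, and a summed per-CPU counter would scale better
than the single global atomic used here.

static atomic_t nr_rcu_pending = ATOMIC_INIT(0);
#define NR_RCU_PENDING_MAX 1000 /* arbitrary threshold */

static void map_free_rcu(struct rcu_head *head)
{
        struct vhost_map *map = container_of(head, struct vhost_map, head);

        kfree(map);
        atomic_dec(&nr_rcu_pending); /* pairs with the inc at queue time */
}

static void map_free(struct vhost_map *map)
{
        if (atomic_inc_return(&nr_rcu_pending) > NR_RCU_PENDING_MAX) {
                /* Too much memory is waiting for a grace period:
                 * disable the optimization and drain synchronously.
                 */
                atomic_dec(&nr_rcu_pending);
                synchronize_rcu();
                kfree(map);
        } else {
                call_rcu(&map->head, map_free_rcu);
        }
}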

Also, the overhead is important. For example, as far as I know,
current RCU gracefully handles close(open(...)) in a tight userspace
loop. But there might be trouble due to tight userspace loops around
lighter-weight operations.

So an important question is "Just how fast is your ioctl?" If it takes
(say) 100 microseconds to execute, there should be absolutely no problem.
On the other hand, if it can execute in 50 nanoseconds, this very likely
does need serious attention.

Other thoughts?

Thanx, Paul

syzbot

Jul 21, 2019, 1:00:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

Netfs 'nfs' registered for caching
[ 6.654718][ T1] NFS: Registering the id_resolver key type
[ 6.655664][ T1] Key type id_resolver registered
[ 6.656411][ T1] Key type id_legacy registered
[ 6.657095][ T1] nfs4filelayout_init: NFSv4 File Layout Driver
Registering...
[ 6.658167][ T1] Installing knfsd (copyright (C) 1996
ok...@monad.swb.de).
[ 6.661334][ T1] ntfs: driver 2.1.32 [Flags: R/W].
[ 6.663024][ T1] *** VALIDATE autofs ***
[ 6.663836][ T1] fuse: init (API version 7.31)
[ 6.664614][ T1] *** VALIDATE fuse ***
[ 6.665185][ T1] *** VALIDATE fuse ***
[ 6.667772][ T1] JFS: nTxBlock = 8192, nTxLock = 65536
[ 6.679202][ T1] SGI XFS with ACLs, security attributes, realtime, no
debug enabled
[ 6.683709][ T1] 9p: Installing v9fs 9p2000 file system support
[ 6.684800][ T1] FS-Cache: Netfs '9p' registered for caching
[ 6.687600][ T1] *** VALIDATE gfs2 ***
[ 6.688988][ T1] gfs2: GFS2 installed
[ 6.692715][ T1] FS-Cache: Netfs 'ceph' registered for caching
[ 6.693644][ T1] ceph: loaded (mds proto 32)
[ 6.700660][ T1] NET: Registered protocol family 38
[ 6.702319][ T1] async_tx: api initialized (async)
[ 6.703042][ T1] Key type asymmetric registered
[ 6.703792][ T1] Asymmetric key parser 'x509' registered
[ 6.704623][ T1] Asymmetric key parser 'pkcs8' registered
[ 6.705453][ T1] Key type pkcs7_test registered
[ 6.706133][ T1] Asymmetric key parser 'tpm_parser' registered
[ 6.707154][ T1] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 246)
[ 6.708727][ T1] io scheduler mq-deadline registered
[ 6.709565][ T1] io scheduler kyber registered
[ 6.710470][ T1] io scheduler bfq registered
[ 6.715471][ T1] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 6.717158][ T1] ACPI: Power Button [PWRF]
[ 6.718718][ T1] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 6.720211][ T1] ACPI: Sleep Button [SLPF]
[ 6.725221][ T1] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 6.736250][ T1] PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 6.737233][ T1] virtio-pci 0000:00:03.0: virtio_pci: leaving for
legacy driver
[ 6.748497][ T1] PCI Interrupt Link [LNKD] enabled at IRQ 10
[ 6.749516][ T1] virtio-pci 0000:00:04.0: virtio_pci: leaving for
legacy driver
[ 7.049735][ T1] HDLC line discipline maxframe=4096
[ 7.050611][ T1] N_HDLC line discipline registered.
[ 7.051325][ T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing
enabled
[ 7.074489][ T1] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
115200) is a 16550A
[ 7.100213][ T1] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
115200) is a 16550A
[ 7.125187][ T1] 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud =
115200) is a 16550A
[ 7.150176][ T1] 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud =
115200) is a 16550A
[ 7.158672][ T1] Non-volatile memory driver v1.3
[ 7.160306][ T1] Linux agpgart interface v0.103
[ 7.169098][ T1] [drm] Initialized vgem 1.0.0 20120112 for vgem on
minor 0
[ 7.171162][ T1] [drm] Supports vblank timestamp caching Rev 2
(21.10.2013).
[ 7.172477][ T1] [drm] Driver supports precise vblank timestamp query.
[ 7.175846][ T1] [drm] Initialized vkms 1.0.0 20180514 for vkms on
minor 1
[ 7.177381][ T1] usbcore: registered new interface driver udl
[ 7.225746][ T1] brd: module loaded
[ 7.255819][ T1] loop: module loaded
[ 7.319754][ T1] zram: Added device: zram0
[ 7.325808][ T1] null: module loaded
[ 7.331371][ T1] nfcsim 0.2 initialized
[ 7.334100][ T1] Loading iSCSI transport class v2.0-870.
[ 7.354172][ T1] scsi host0: Virtio SCSI HBA
[ 7.388398][ T1] st: Version 20160209, fixed bufsize 32768, s/g segs
256
[ 7.391105][ T395] kasan: CONFIG_KASAN_INLINE enabled
[ 7.392521][ T395] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 7.392541][ T395] general protection fault: 0000 [#1] SMP KASAN
[ 7.394608][ T1] kobject: 'scsi_disk' (000000007d66f221):
kobject_uevent_env
[ 7.396476][ T395] CPU: 1 PID: 395 Comm: kworker/u4:5 Not tainted
5.2.0-next-20190719+ #1
[ 7.398521][ T1] kobject: 'scsi_disk' (000000007d66f221):
fill_kobj_path: path = '/class/scsi_disk'
[ 7.399249][ T395] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 7.399249][ T395] Workqueue: events_unbound async_run_entry_fn
[ 7.403495][ T1] kobject: 'sd' (000000004f1d2418):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.399249][ T395] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.407441][ T1] kobject: 'sd' (000000004f1d2418): kobject_uevent_env
[ 7.399249][ T395] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.399249][ T395] RSP: 0000:ffff8880a8927768 EFLAGS: 00010246
[ 7.399249][ T395] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.399249][ T395] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff88821954afb8
[ 7.411735][ T1] kobject: 'sd' (000000004f1d2418): fill_kobj_path:
path = '/bus/scsi/drivers/sd'
[ 7.399249][ T395] RBP: ffff8880a8927780 R08: ffff8880a891c180 R09:
ffffed101440da8d
[ 7.399249][ T395] R10: ffffed101440da8c R11: ffff8880a206d467 R12:
ffff88821954ac80
[ 7.418568][ T1] kobject: 'sr' (000000007ac3aa80):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.399249][ T395] R13: ffff88821954ac80 R14: ffff88821935ac70 R15:
0000000000000200
[ 7.399249][ T395] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.399249][ T395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.399249][ T395] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.422194][ T1] kobject: 'sr' (000000007ac3aa80): kobject_uevent_env
[ 7.399249][ T395] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.426716][ T1] kobject: 'sr' (000000007ac3aa80): fill_kobj_path:
path = '/bus/scsi/drivers/sr'
[ 7.399249][ T395] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.431033][ T1] kobject: 'scsi_generic' (00000000efa5177b):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.399249][ T395] Call Trace:
[ 7.435514][ T1] kobject: 'scsi_generic' (00000000efa5177b):
kobject_uevent_env
[ 7.399249][ T395] dma_max_mapping_size+0xba/0x100
[ 7.439555][ T1] kobject: 'scsi_generic' (00000000efa5177b):
fill_kobj_path: path = '/class/scsi_generic'
[ 7.399249][ T395] __scsi_init_queue+0x1cb/0x580
[ 7.445688][ T1] kobject: 'nvme-wq' (00000000321104f1):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.444765][ T395] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 7.450078][ T1] kobject: 'nvme-wq' (00000000321104f1):
kobject_uevent_env
[ 7.444765][ T395] scsi_mq_alloc_queue+0xd2/0x180
[ 7.453398][ T1] kobject: 'nvme-wq' (00000000321104f1):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.444765][ T395] scsi_alloc_sdev+0x837/0xc60
[ 7.456693][ T1] kobject: 'nvme-wq' (00000000321104f1):
kobject_uevent_env
[ 7.444765][ T395] scsi_probe_and_add_lun+0x2440/0x39f0
[ 7.460577][ T1] kobject: 'nvme-wq' (00000000321104f1):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-wq'
[ 7.444765][ T395] ? __kasan_check_read+0x11/0x20
[ 7.465574][ T1] kobject: 'nvme-reset-wq' (000000009e377536):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.444765][ T395] ? mark_lock+0xc0/0x11e0
[ 7.468184][ T1] kobject: 'nvme-reset-wq' (000000009e377536):
kobject_uevent_env
[ 7.444765][ T395] ? scsi_alloc_sdev+0xc60/0xc60
[ 7.444765][ T395] ? mark_held_locks+0xa4/0xf0
[ 7.444765][ T395] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.472101][ T1] kobject: 'nvme-reset-wq' (000000009e377536):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.444765][ T395] ? __pm_runtime_resume+0x11b/0x180
[ 7.475508][ T1] kobject: 'nvme-reset-wq' (000000009e377536):
kobject_uevent_env
[ 7.444765][ T395] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.479539][ T1] kobject: 'nvme-reset-wq' (000000009e377536):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-reset-wq'
[ 7.444765][ T395] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.484215][ T1] kobject: 'nvme-delete-wq' (0000000094b8a735):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.444765][ T395] ? trace_hardirqs_on+0x67/0x220
[ 7.486960][ T1] kobject: 'nvme-delete-wq' (0000000094b8a735):
kobject_uevent_env
[ 7.444765][ T395] ? __kasan_check_read+0x11/0x20
[ 7.444765][ T395] ? __pm_runtime_resume+0x11b/0x180
[ 7.489597][ T1] kobject: 'nvme-delete-wq' (0000000094b8a735):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.444765][ T395] __scsi_scan_target+0x29a/0xfa0
[ 7.493919][ T1] kobject: 'nvme-delete-wq' (0000000094b8a735):
kobject_uevent_env
[ 7.444765][ T395] ? __pm_runtime_resume+0x11b/0x180
[ 7.497555][ T1] kobject: 'nvme-delete-wq' (0000000094b8a735):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-delete-wq'
[ 7.444765][ T395] ? __kasan_check_read+0x11/0x20
[ 7.444765][ T395] ? mark_lock+0xc0/0x11e0
[ 7.444765][ T395] ? scsi_probe_and_add_lun+0x39f0/0x39f0
[ 7.502112][ T1] kobject: 'nvme' (00000000f0455acf):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.444765][ T395] ? mark_held_locks+0xa4/0xf0
[ 7.506227][ T1] kobject: 'nvme' (00000000f0455acf):
kobject_uevent_env
[ 7.444765][ T395] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.509858][ T1] kobject: 'nvme' (00000000f0455acf): fill_kobj_path:
path = '/class/nvme'
[ 7.444765][ T395] ? __pm_runtime_resume+0x11b/0x180
[ 7.514297][ T1] kobject: 'nvme-subsystem' (00000000800549c9):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.444765][ T395] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.444765][ T395] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.517742][ T1] kobject: 'nvme-subsystem' (00000000800549c9):
kobject_uevent_env
[ 7.444765][ T395] ? trace_hardirqs_on+0x67/0x220
[ 7.522297][ T1] kobject: 'nvme-subsystem' (00000000800549c9):
fill_kobj_path: path = '/class/nvme-subsystem'
[ 7.444765][ T395] scsi_scan_channel.part.0+0x11a/0x190
[ 7.524740][ T1] kobject: 'nvme' (000000006ea6dfe8):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.444765][ T395] scsi_scan_host_selected+0x313/0x450
[ 7.528775][ T1] kobject: 'drivers' (00000000b9ba5e93):
kobject_add_internal: parent: 'nvme', set: '<NULL>'
[ 7.444765][ T395] ? scsi_scan_host+0x450/0x450
[ 7.444765][ T395] do_scsi_scan_host+0x1ef/0x260
[ 7.444765][ T395] ? scsi_scan_host+0x450/0x450
[ 7.444765][ T395] do_scan_async+0x41/0x500
[ 7.531749][ T1] kobject: 'nvme' (000000006ea6dfe8):
kobject_uevent_env
[ 7.444765][ T395] ? scsi_scan_host+0x450/0x450
[ 7.535306][ T1] kobject: 'nvme' (000000006ea6dfe8): fill_kobj_path:
path = '/bus/pci/drivers/nvme'
[ 7.444765][ T395] async_run_entry_fn+0x124/0x570
[ 7.539461][ T1] kobject: 'ahci' (00000000fc4e0c75):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.444765][ T395] process_one_work+0x9af/0x16d0
[ 7.542331][ T1] kobject: 'drivers' (00000000d24510ad):
kobject_add_internal: parent: 'ahci', set: '<NULL>'
[ 7.444765][ T395] ? pwq_dec_nr_in_flight+0x320/0x320
[ 7.545597][ T1] kobject: 'ahci' (00000000fc4e0c75):
kobject_uevent_env
[ 7.444765][ T395] ? lock_acquire+0x190/0x400
[ 7.549605][ T1] kobject: 'ahci' (00000000fc4e0c75): fill_kobj_path:
path = '/bus/pci/drivers/ahci'
[ 7.444765][ T395] worker_thread+0x98/0xe40
[ 7.553619][ T1] kobject: 'ata_piix' (00000000c23155a5):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.444765][ T395] kthread+0x361/0x430
[ 7.557492][ T1] kobject: 'drivers' (00000000df9eabf0):
kobject_add_internal: parent: 'ata_piix', set: '<NULL>'
[ 7.444765][ T395] ? process_one_work+0x16d0/0x16d0
[ 7.560039][ T1] kobject: 'ata_piix' (00000000c23155a5):
kobject_uevent_env
[ 7.444765][ T395] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 7.562915][ T1] kobject: 'ata_piix' (00000000c23155a5):
fill_kobj_path: path = '/bus/pci/drivers/ata_piix'
[ 7.444765][ T395] ret_from_fork+0x24/0x30
[ 7.566617][ T1] kobject: 'pata_amd' (00000000435e0c38):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.444765][ T395] Modules linked in:
[ 7.570569][ T1] kobject: 'drivers' (000000007865a296):
kobject_add_internal: parent: 'pata_amd', set: '<NULL>'
[ 7.571858][ T395] ---[ end trace a17f906cd582e7cf ]---
[ 7.574402][ T1] kobject: 'pata_amd' (00000000435e0c38):
kobject_uevent_env
[ 7.575705][ T395] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.577512][ T1] kobject: 'pata_amd' (00000000435e0c38):
fill_kobj_path: path = '/bus/pci/drivers/pata_amd'
[ 7.578683][ T395] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.581254][ T1] kobject: 'pata_oldpiix' (00000000b98538bc):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.582303][ T395] RSP: 0000:ffff8880a8927768 EFLAGS: 00010246
[ 7.585158][ T1] kobject: 'drivers' (000000003f316dc1):
kobject_add_internal: parent: 'pata_oldpiix', set: '<NULL>'
[ 7.586138][ T395] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.588896][ T1] kobject: 'pata_oldpiix' (00000000b98538bc):
kobject_uevent_env
[ 7.590235][ T395] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff88821954afb8
[ 7.592141][ T1] kobject: 'pata_oldpiix' (00000000b98538bc):
fill_kobj_path: path = '/bus/pci/drivers/pata_oldpiix'
[ 7.594418][ T395] RBP: ffff8880a8927780 R08: ffff8880a891c180 R09:
ffffed101440da8d
[ 7.597094][ T1] kobject: 'pata_sch' (00000000b4da326b):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.598175][ T395] R10: ffffed101440da8c R11: ffff8880a206d467 R12:
ffff88821954ac80
[ 7.601007][ T1] kobject: 'drivers' (00000000b4cfdb65):
kobject_add_internal: parent: 'pata_sch', set: '<NULL>'
[ 7.601963][ T395] R13: ffff88821954ac80 R14: ffff88821935ac70 R15:
0000000000000200
[ 7.604700][ T1] kobject: 'pata_sch' (00000000b4da326b):
kobject_uevent_env
[ 7.606034][ T395] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.607901][ T1] kobject: 'pata_sch' (00000000b4da326b):
fill_kobj_path: path = '/bus/pci/drivers/pata_sch'
[ 7.609567][ T395] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.612295][ T1] kobject: 'ata_generic' (000000004e5121cc):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.617266][ T395] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.620137][ T1] kobject: 'drivers' (00000000cdd8f23c):
kobject_add_internal: parent: 'ata_generic', set: '<NULL>'
[ 7.621632][ T395] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.624496][ T1] kobject: 'ata_generic' (000000004e5121cc):
kobject_uevent_env
[ 7.626457][ T395] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.628459][ T1] kobject: 'ata_generic' (000000004e5121cc):
fill_kobj_path: path = '/bus/pci/drivers/ata_generic'
[ 7.630533][ T395] Kernel panic - not syncing: Fatal exception
[ 7.633507][ T1] kobject: 'mtd' (000000005acb8710):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.639034][ T395] Kernel Offset: disabled
[ 7.640511][ T395] Rebooting in 86400 seconds..


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=125eaa78600000


Tested on:

commit: 54efad20 Add linux-next specific files for 20190719
git tree: linux-next
kernel config: https://syzkaller.appspot.com/x/.config?x=94c3de539954a651
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=10aa8e6c600000

syzbot

Jul 21, 2019, 1:21:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

0] (64MB)
[ 6.707030][ T1] RAPL PMU: API unit is 2^-32 Joules, 0 fixed
counters, 10737418240 ms ovfl timer
[ 6.710999][ T1] kvm: already loaded the other module
[ 6.711966][ T1] clocksource: tsc: mask: 0xffffffffffffffff
max_cycles: 0x212735223b2, max_idle_ns: 440795277976 ns
[ 6.713738][ T1] clocksource: Switched to clocksource tsc
[ 6.714745][ T1] mce: Machine check injector initialized
[ 6.719577][ T1] check: Scanning for low memory corruption every 60
seconds
[ 6.812455][ T1] Initialise system trusted keyrings
[ 6.814532][ T1] workingset: timestamp_bits=40 max_order=21
bucket_order=0
[ 6.816183][ T1] zbud: loaded
[ 6.819051][ T1] *** VALIDATE devpts ***
[ 6.822576][ T1] DLM installed
[ 6.824693][ T1] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 6.827989][ T1] FS-Cache: Netfs 'nfs' registered for caching
[ 6.830833][ T1] NFS: Registering the id_resolver key type
[ 6.832156][ T1] Key type id_resolver registered
[ 6.832951][ T1] Key type id_legacy registered
[ 6.834512][ T1] nfs4filelayout_init: NFSv4 File Layout Driver
Registering...
[ 6.835705][ T1] Installing knfsd (copyright (C) 1996
ok...@monad.swb.de).
[ 6.841659][ T1] ntfs: driver 2.1.32 [Flags: R/W].
[ 6.843507][ T1] *** VALIDATE autofs ***
[ 6.845080][ T1] fuse: init (API version 7.31)
[ 6.846068][ T1] *** VALIDATE fuse ***
[ 6.846843][ T1] *** VALIDATE fuse ***
[ 6.849812][ T1] JFS: nTxBlock = 8192, nTxLock = 65536
[ 6.860866][ T1] SGI XFS with ACLs, security attributes, realtime, no
debug enabled
[ 6.866231][ T1] 9p: Installing v9fs 9p2000 file system support
[ 6.867565][ T1] FS-Cache: Netfs '9p' registered for caching
[ 6.870922][ T1] *** VALIDATE gfs2 ***
[ 6.872482][ T1] gfs2: GFS2 installed
[ 6.876268][ T1] FS-Cache: Netfs 'ceph' registered for caching
[ 6.877167][ T1] ceph: loaded (mds proto 32)
[ 6.884283][ T1] NET: Registered protocol family 38
[ 6.885993][ T1] async_tx: api initialized (async)
[ 6.887090][ T1] Key type asymmetric registered
[ 6.887798][ T1] Asymmetric key parser 'x509' registered
[ 6.888638][ T1] Asymmetric key parser 'pkcs8' registered
[ 6.889637][ T1] Key type pkcs7_test registered
[ 6.890466][ T1] Asymmetric key parser 'tpm_parser' registered
[ 6.891510][ T1] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 246)
[ 6.893385][ T1] io scheduler mq-deadline registered
[ 6.894206][ T1] io scheduler kyber registered
[ 6.895137][ T1] io scheduler bfq registered
[ 6.900130][ T1] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 6.904354][ T1] ACPI: Power Button [PWRF]
[ 6.906168][ T1] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 6.907525][ T1] ACPI: Sleep Button [SLPF]
[ 6.912563][ T1] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 6.924655][ T1] PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 6.925802][ T1] virtio-pci 0000:00:03.0: virtio_pci: leaving for
legacy driver
[ 6.939404][ T1] PCI Interrupt Link [LNKD] enabled at IRQ 10
[ 6.940474][ T1] virtio-pci 0000:00:04.0: virtio_pci: leaving for
legacy driver
[ 7.204838][ T1] HDLC line discipline maxframe=4096
[ 7.206194][ T1] N_HDLC line discipline registered.
[ 7.206960][ T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing
enabled
[ 7.230121][ T1] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
115200) is a 16550A
[ 7.255874][ T1] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
115200) is a 16550A
[ 7.281343][ T1] 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud =
115200) is a 16550A
[ 7.306898][ T1] 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud =
115200) is a 16550A
[ 7.314925][ T1] Non-volatile memory driver v1.3
[ 7.316912][ T1] Linux agpgart interface v0.103
[ 7.325997][ T1] [drm] Initialized vgem 1.0.0 20120112 for vgem on
minor 0
[ 7.328314][ T1] [drm] Supports vblank timestamp caching Rev 2
(21.10.2013).
[ 7.330543][ T1] [drm] Driver supports precise vblank timestamp query.
[ 7.334747][ T1] [drm] Initialized vkms 1.0.0 20180514 for vkms on
minor 1
[ 7.336954][ T1] usbcore: registered new interface driver udl
[ 7.381278][ T1] brd: module loaded
[ 7.411789][ T1] loop: module loaded
[ 7.477784][ T1] zram: Added device: zram0
[ 7.485350][ T1] null: module loaded
[ 7.491400][ T1] nfcsim 0.2 initialized
[ 7.493672][ T1] Loading iSCSI transport class v2.0-870.
[ 7.520458][ T1] scsi host0: Virtio SCSI HBA
[ 7.555704][ T1] st: Version 20160209, fixed bufsize 32768, s/g segs
256
[ 7.558211][ T757] kasan: CONFIG_KASAN_INLINE enabled
[ 7.559783][ T757] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 7.561796][ T757] general protection fault: 0000 [#1] SMP KASAN
[ 7.562012][ T1] kobject: 'scsi_disk' (0000000073ab792a):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.563390][ T757] CPU: 1 PID: 757 Comm: kworker/u4:4 Not tainted
5.2.0-next-20190719+ #1
[ 7.565985][ T1] kobject: 'scsi_disk' (0000000073ab792a):
kobject_uevent_env
[ 7.568031][ T757] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 7.569916][ T1] kobject: 'scsi_disk' (0000000073ab792a):
fill_kobj_path: path = '/class/scsi_disk'
[ 7.569737][ T757] Workqueue: events_unbound async_run_entry_fn
[ 7.574817][ T1] kobject: 'sd' (00000000c8f0f90f):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.569737][ T757] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.578715][ T1] kobject: 'sd' (00000000c8f0f90f): kobject_uevent_env
[ 7.569737][ T757] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.569737][ T757] RSP: 0000:ffff8880a97cf768 EFLAGS: 00010246
[ 7.569737][ T757] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.569737][ T757] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8882194de778
[ 7.569737][ T757] RBP: ffff8880a97cf780 R08: ffff8880a8688200 R09:
ffffed101461b8ec
[ 7.569737][ T757] R10: ffffed101461b8eb R11: ffff8880a30dc75f R12:
ffff8882194de440
[ 7.582238][ T1] kobject: 'sd' (00000000c8f0f90f): fill_kobj_path:
path = '/bus/scsi/drivers/sd'
[ 7.569737][ T757] R13: ffff8882194de440 R14: ffff8882192e3030 R15:
0000000000000200
[ 7.569737][ T757] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.590000][ T1] kobject: 'sr' (0000000088460190):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.569737][ T757] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.593863][ T1] kobject: 'sr' (0000000088460190): kobject_uevent_env
[ 7.569737][ T757] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.597652][ T1] kobject: 'sr' (0000000088460190): fill_kobj_path:
path = '/bus/scsi/drivers/sr'
[ 7.569737][ T757] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.569737][ T757] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.569737][ T757] Call Trace:
[ 7.569737][ T757] dma_max_mapping_size+0xba/0x100
[ 7.569737][ T757] __scsi_init_queue+0x1cb/0x580
[ 7.601826][ T1] kobject: 'scsi_generic' (00000000b0750853):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.569737][ T757] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 7.606294][ T1] kobject: 'scsi_generic' (00000000b0750853):
kobject_uevent_env
[ 7.569737][ T757] scsi_mq_alloc_queue+0xd2/0x180
[ 7.609877][ T1] kobject: 'scsi_generic' (00000000b0750853):
fill_kobj_path: path = '/class/scsi_generic'
[ 7.569737][ T757] scsi_alloc_sdev+0x837/0xc60
[ 7.616903][ T1] kobject: 'nvme-wq' (00000000a9cd1e73):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.616103][ T757] scsi_probe_and_add_lun+0x2440/0x39f0
[ 7.619314][ T1] kobject: 'nvme-wq' (00000000a9cd1e73):
kobject_uevent_env
[ 7.616103][ T757] ? __kasan_check_read+0x11/0x20
[ 7.621843][ T1] kobject: 'nvme-wq' (00000000a9cd1e73):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.616103][ T757] ? mark_lock+0xc0/0x11e0
[ 7.626271][ T1] kobject: 'nvme-wq' (00000000a9cd1e73):
kobject_uevent_env
[ 7.616103][ T757] ? scsi_alloc_sdev+0xc60/0xc60
[ 7.629545][ T1] kobject: 'nvme-wq' (00000000a9cd1e73):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-wq'
[ 7.616103][ T757] ? mark_held_locks+0xa4/0xf0
[ 7.633636][ T1] kobject: 'nvme-reset-wq' (00000000e4ef4b55):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.616103][ T757] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.637309][ T1] kobject: 'nvme-reset-wq' (00000000e4ef4b55):
kobject_uevent_env
[ 7.616103][ T757] ? __pm_runtime_resume+0x11b/0x180
[ 7.640205][ T1] kobject: 'nvme-reset-wq' (00000000e4ef4b55):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.616103][ T757] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.643837][ T1] kobject: 'nvme-reset-wq' (00000000e4ef4b55):
kobject_uevent_env
[ 7.616103][ T757] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.646755][ T1] kobject: 'nvme-reset-wq' (00000000e4ef4b55):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-reset-wq'
[ 7.616103][ T757] ? trace_hardirqs_on+0x67/0x220
[ 7.616103][ T757] ? __kasan_check_read+0x11/0x20
[ 7.616103][ T757] ? __pm_runtime_resume+0x11b/0x180
[ 7.616103][ T757] __scsi_scan_target+0x29a/0xfa0
[ 7.651434][ T1] kobject: 'nvme-delete-wq' (0000000011b1ce89):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.616103][ T757] ? __pm_runtime_resume+0x11b/0x180
[ 7.655091][ T1] kobject: 'nvme-delete-wq' (0000000011b1ce89):
kobject_uevent_env
[ 7.616103][ T757] ? __kasan_check_read+0x11/0x20
[ 7.658095][ T1] kobject: 'nvme-delete-wq' (0000000011b1ce89):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.616103][ T757] ? mark_lock+0xc0/0x11e0
[ 7.616103][ T757] ? scsi_probe_and_add_lun+0x39f0/0x39f0
[ 7.616103][ T757] ? mark_held_locks+0xa4/0xf0
[ 7.616103][ T757] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.616103][ T757] ? __pm_runtime_resume+0x11b/0x180
[ 7.662483][ T1] kobject: 'nvme-delete-wq' (0000000011b1ce89):
kobject_uevent_env
[ 7.616103][ T757] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.616103][ T757] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.665824][ T1] kobject: 'nvme-delete-wq' (0000000011b1ce89):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-delete-wq'
[ 7.616103][ T757] ? trace_hardirqs_on+0x67/0x220
[ 7.670036][ T1] kobject: 'nvme' (00000000ea56c0f8):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.616103][ T757] scsi_scan_channel.part.0+0x11a/0x190
[ 7.672512][ T1] kobject: 'nvme' (00000000ea56c0f8):
kobject_uevent_env
[ 7.616103][ T757] scsi_scan_host_selected+0x313/0x450
[ 7.676555][ T1] kobject: 'nvme' (00000000ea56c0f8): fill_kobj_path:
path = '/class/nvme'
[ 7.616103][ T757] ? scsi_scan_host+0x450/0x450
[ 7.680659][ T1] kobject: 'nvme-subsystem' (00000000649c1507):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.680591][ T757] do_scsi_scan_host+0x1ef/0x260
[ 7.684800][ T1] kobject: 'nvme-subsystem' (00000000649c1507):
kobject_uevent_env
[ 7.680591][ T757] ? scsi_scan_host+0x450/0x450
[ 7.687396][ T1] kobject: 'nvme-subsystem' (00000000649c1507):
fill_kobj_path: path = '/class/nvme-subsystem'
[ 7.680591][ T757] do_scan_async+0x41/0x500
[ 7.690153][ T1] kobject: 'nvme' (0000000080c668b3):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.680591][ T757] ? scsi_scan_host+0x450/0x450
[ 7.693347][ T1] kobject: 'drivers' (00000000f5883e13):
kobject_add_internal: parent: 'nvme', set: '<NULL>'
[ 7.680591][ T757] async_run_entry_fn+0x124/0x570
[ 7.695991][ T1] kobject: 'nvme' (0000000080c668b3):
kobject_uevent_env
[ 7.680591][ T757] process_one_work+0x9af/0x16d0
[ 7.680591][ T757] ? pwq_dec_nr_in_flight+0x320/0x320
[ 7.680591][ T757] ? lock_acquire+0x190/0x400
[ 7.680591][ T757] worker_thread+0x98/0xe40
[ 7.700652][ T1] kobject: 'nvme' (0000000080c668b3): fill_kobj_path:
path = '/bus/pci/drivers/nvme'
[ 7.680591][ T757] kthread+0x361/0x430
[ 7.704433][ T1] kobject: 'ahci' (0000000045788b07):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.680591][ T757] ? process_one_work+0x16d0/0x16d0
[ 7.707872][ T1] kobject: 'drivers' (000000002fb03785):
kobject_add_internal: parent: 'ahci', set: '<NULL>'
[ 7.680591][ T757] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 7.680591][ T757] ret_from_fork+0x24/0x30
[ 7.680591][ T757] Modules linked in:
[ 7.709973][ T757] ---[ end trace 92321c352aac498b ]---
[ 7.711148][ T1] kobject: 'ahci' (0000000045788b07):
kobject_uevent_env
[ 7.713733][ T757] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.715077][ T1] kobject: 'ahci' (0000000045788b07): fill_kobj_path:
path = '/bus/pci/drivers/ahci'
[ 7.716929][ T757] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.718080][ T1] kobject: 'ata_piix' (000000008ffd5e31):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.720833][ T757] RSP: 0000:ffff8880a97cf768 EFLAGS: 00010246
[ 7.722002][ T1] kobject: 'drivers' (00000000b211c7ff):
kobject_add_internal: parent: 'ata_piix', set: '<NULL>'
[ 7.724430][ T757] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.725716][ T1] kobject: 'ata_piix' (000000008ffd5e31):
kobject_uevent_env
[ 7.728073][ T757] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8882194de778
[ 7.729431][ T1] kobject: 'ata_piix' (000000008ffd5e31):
fill_kobj_path: path = '/bus/pci/drivers/ata_piix'
[ 7.731119][ T757] RBP: ffff8880a97cf780 R08: ffff8880a8688200 R09:
ffffed101461b8ec
[ 7.732363][ T1] kobject: 'pata_amd' (00000000bd8a1d2b):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.733575][ T757] R10: ffffed101461b8eb R11: ffff8880a30dc75f R12:
ffff8882194de440
[ 7.734732][ T1] kobject: 'drivers' (0000000046eb38bd):
kobject_add_internal: parent: 'pata_amd', set: '<NULL>'
[ 7.735860][ T757] R13: ffff8882194de440 R14: ffff8882192e3030 R15:
0000000000000200
[ 7.738236][ T1] kobject: 'pata_amd' (00000000bd8a1d2b):
kobject_uevent_env
[ 7.739169][ T757] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.741526][ T1] kobject: 'pata_amd' (00000000bd8a1d2b):
fill_kobj_path: path = '/bus/pci/drivers/pata_amd'
[ 7.742822][ T757] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.745606][ T1] kobject: 'pata_oldpiix' (000000002c9f2e63):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.747007][ T757] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.747018][ T757] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.748074][ T1] kobject: 'drivers' (00000000560e7e70):
kobject_add_internal: parent: 'pata_oldpiix', set: '<NULL>'
[ 7.748938][ T757] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.750585][ T1] kobject: 'pata_oldpiix' (000000002c9f2e63):
kobject_uevent_env
[ 7.752233][ T757] Kernel panic - not syncing: Fatal exception
[ 7.753850][ T1] kobject: 'pata_oldpiix' (000000002c9f2e63):
fill_kobj_path: path = '/bus/pci/drivers/pata_oldpiix'
[ 7.760477][ T757] Kernel Offset: disabled
[ 7.760477][ T757] Rebooting in 86400 seconds..


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=16757158600000


Tested on:

commit: 54efad20 Add linux-next specific files for 20190719
git tree: linux-next
kernel config: https://syzkaller.appspot.com/x/.config?x=94c3de539954a651
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=1535cd1fa00000

syzbot

Jul 21, 2019, 1:39:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

T1] PCI: CLS 0 bytes, default 64
[ 6.581155][ T1] PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)
[ 6.582293][ T1] software IO TLB: mapped [mem 0xaa800000-0xae800000]
(64MB)
[ 6.585662][ T1] RAPL PMU: API unit is 2^-32 Joules, 0 fixed
counters, 10737418240 ms ovfl timer
[ 6.589173][ T1] kvm: already loaded the other module
[ 6.590139][ T1] clocksource: tsc: mask: 0xffffffffffffffff
max_cycles: 0x212735223b2, max_idle_ns: 440795277976 ns
[ 6.593533][ T1] clocksource: Switched to clocksource tsc
[ 6.594618][ T1] mce: Machine check injector initialized
[ 6.599092][ T1] check: Scanning for low memory corruption every 60
seconds
[ 6.694280][ T1] Initialise system trusted keyrings
[ 6.696024][ T1] workingset: timestamp_bits=40 max_order=21
bucket_order=0
[ 6.697628][ T1] zbud: loaded
[ 6.701140][ T1] *** VALIDATE devpts ***
[ 6.703782][ T1] DLM installed
[ 6.705927][ T1] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 6.710137][ T1] FS-Cache: Netfs 'nfs' registered for caching
[ 6.712197][ T1] NFS: Registering the id_resolver key type
[ 6.713261][ T1] Key type id_resolver registered
[ 6.714251][ T1] Key type id_legacy registered
[ 6.715014][ T1] nfs4filelayout_init: NFSv4 File Layout Driver
Registering...
[ 6.716173][ T1] Installing knfsd (copyright (C) 1996
ok...@monad.swb.de).
[ 6.719398][ T1] ntfs: driver 2.1.32 [Flags: R/W].
[ 6.721289][ T1] *** VALIDATE autofs ***
[ 6.722006][ T1] fuse: init (API version 7.31)
[ 6.722754][ T1] *** VALIDATE fuse ***
[ 6.723325][ T1] *** VALIDATE fuse ***
[ 6.725776][ T1] JFS: nTxBlock = 8192, nTxLock = 65536
[ 6.735274][ T1] SGI XFS with ACLs, security attributes, realtime, no
debug enabled
[ 6.741151][ T1] 9p: Installing v9fs 9p2000 file system support
[ 6.742357][ T1] FS-Cache: Netfs '9p' registered for caching
[ 6.745954][ T1] *** VALIDATE gfs2 ***
[ 6.747460][ T1] gfs2: GFS2 installed
[ 6.750833][ T1] FS-Cache: Netfs 'ceph' registered for caching
[ 6.752125][ T1] ceph: loaded (mds proto 32)
[ 6.759724][ T1] NET: Registered protocol family 38
[ 6.761491][ T1] async_tx: api initialized (async)
[ 6.762401][ T1] Key type asymmetric registered
[ 6.763245][ T1] Asymmetric key parser 'x509' registered
[ 6.764127][ T1] Asymmetric key parser 'pkcs8' registered
[ 6.765044][ T1] Key type pkcs7_test registered
[ 6.765778][ T1] Asymmetric key parser 'tpm_parser' registered
[ 6.766786][ T1] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 246)
[ 6.768537][ T1] io scheduler mq-deadline registered
[ 6.769387][ T1] io scheduler kyber registered
[ 6.770721][ T1] io scheduler bfq registered
[ 6.775381][ T1] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 6.776978][ T1] ACPI: Power Button [PWRF]
[ 6.778580][ T1] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 6.780182][ T1] ACPI: Sleep Button [SLPF]
[ 6.784928][ T1] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 6.797404][ T1] PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 6.798498][ T1] virtio-pci 0000:00:03.0: virtio_pci: leaving for
legacy driver
[ 6.814359][ T1] PCI Interrupt Link [LNKD] enabled at IRQ 10
[ 6.815380][ T1] virtio-pci 0000:00:04.0: virtio_pci: leaving for
legacy driver
[ 7.080746][ T1] HDLC line discipline maxframe=4096
[ 7.081927][ T1] N_HDLC line discipline registered.
[ 7.082666][ T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing
enabled
[ 7.105786][ T1] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
115200) is a 16550A
[ 7.132356][ T1] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
115200) is a 16550A
[ 7.157590][ T1] 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud =
115200) is a 16550A
[ 7.183130][ T1] 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud =
115200) is a 16550A
[ 7.193008][ T1] Non-volatile memory driver v1.3
[ 7.194444][ T1] Linux agpgart interface v0.103
[ 7.202953][ T1] [drm] Initialized vgem 1.0.0 20120112 for vgem on
minor 0
[ 7.205071][ T1] [drm] Supports vblank timestamp caching Rev 2
(21.10.2013).
[ 7.206646][ T1] [drm] Driver supports precise vblank timestamp query.
[ 7.210631][ T1] [drm] Initialized vkms 1.0.0 20180514 for vkms on
minor 1
[ 7.212553][ T1] usbcore: registered new interface driver udl
[ 7.257119][ T1] brd: module loaded
[ 7.287268][ T1] loop: module loaded
[ 7.349166][ T1] zram: Added device: zram0
[ 7.355467][ T1] null: module loaded
[ 7.361389][ T1] nfcsim 0.2 initialized
[ 7.365442][ T1] Loading iSCSI transport class v2.0-870.
[ 7.390378][ T1] scsi host0: Virtio SCSI HBA
[ 7.426601][ T1] st: Version 20160209, fixed bufsize 32768, s/g segs
256
[ 7.429094][ T315] kasan: CONFIG_KASAN_INLINE enabled
[ 7.430455][ T315] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 7.432483][ T315] general protection fault: 0000 [#1] SMP KASAN
[ 7.432727][ T1] kobject: 'sd' (00000000bb2758a3):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.434151][ T315] CPU: 1 PID: 315 Comm: kworker/u4:5 Not tainted
5.2.0-next-20190719+ #1
[ 7.436914][ T1] kobject: 'sd' (00000000bb2758a3): kobject_uevent_env
[ 7.438882][ T315] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 7.440676][ T1] kobject: 'sd' (00000000bb2758a3): fill_kobj_path:
path = '/bus/scsi/drivers/sd'
[ 7.440416][ T315] Workqueue: events_unbound async_run_entry_fn
[ 7.445448][ T1] kobject: 'sr' (00000000766fd886):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.440416][ T315] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.449555][ T1] kobject: 'sr' (00000000766fd886): kobject_uevent_env
[ 7.440416][ T315] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.440416][ T315] RSP: 0000:ffff8880a9377768 EFLAGS: 00010246
[ 7.440416][ T315] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.440416][ T315] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8880a51ed578
[ 7.440416][ T315] RBP: ffff8880a9377780 R08: ffff8880a95fa400 R09:
ffffed1014490a8d
[ 7.440416][ T315] R10: ffffed1014490a8c R11: ffff8880a2485467 R12:
ffff8880a51ed240
[ 7.440416][ T315] R13: ffff8880a51ed240 R14: ffff888219319030 R15:
0000000000000200
[ 7.453062][ T1] kobject: 'sr' (00000000766fd886): fill_kobj_path:
path = '/bus/scsi/drivers/sr'
[ 7.440416][ T315] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.459503][ T1] kobject: 'scsi_generic' (0000000080fd1008):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.440416][ T315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.440416][ T315] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.440416][ T315] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.440416][ T315] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.440416][ T315] Call Trace:
[ 7.463432][ T1] kobject: 'scsi_generic' (0000000080fd1008):
kobject_uevent_env
[ 7.440416][ T315] dma_max_mapping_size+0xba/0x100
[ 7.467570][ T1] kobject: 'scsi_generic' (0000000080fd1008):
fill_kobj_path: path = '/class/scsi_generic'
[ 7.440416][ T315] __scsi_init_queue+0x1cb/0x580
[ 7.473510][ T315] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 7.490094][ T315] scsi_mq_alloc_queue+0xd2/0x180
[ 7.491366][ T1] kobject: 'nvme-wq' (00000000294b3437):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.490576][ T315] scsi_alloc_sdev+0x837/0xc60
[ 7.494421][ T1] kobject: 'nvme-wq' (00000000294b3437):
kobject_uevent_env
[ 7.490576][ T315] scsi_probe_and_add_lun+0x2440/0x39f0
[ 7.497408][ T1] kobject: 'nvme-wq' (00000000294b3437):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.490576][ T315] ? __kasan_check_read+0x11/0x20
[ 7.501772][ T1] kobject: 'nvme-wq' (00000000294b3437):
kobject_uevent_env
[ 7.490576][ T315] ? mark_lock+0xc0/0x11e0
[ 7.504737][ T1] kobject: 'nvme-wq' (00000000294b3437):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-wq'
[ 7.490576][ T315] ? scsi_alloc_sdev+0xc60/0xc60
[ 7.508923][ T1] kobject: 'nvme-reset-wq' (00000000b2bddd49):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.490576][ T315] ? mark_held_locks+0xa4/0xf0
[ 7.512583][ T1] kobject: 'nvme-reset-wq' (00000000b2bddd49):
kobject_uevent_env
[ 7.490576][ T315] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.515735][ T1] kobject: 'nvme-reset-wq' (00000000b2bddd49):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.490576][ T315] ? __pm_runtime_resume+0x11b/0x180
[ 7.520092][ T1] kobject: 'nvme-reset-wq' (00000000b2bddd49):
kobject_uevent_env
[ 7.490576][ T315] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.523343][ T1] kobject: 'nvme-reset-wq' (00000000b2bddd49):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-reset-wq'
[ 7.490576][ T315] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.528303][ T1] kobject: 'nvme-delete-wq' (00000000e37696ea):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.490576][ T315] ? trace_hardirqs_on+0x67/0x220
[ 7.532120][ T1] kobject: 'nvme-delete-wq' (00000000e37696ea):
kobject_uevent_env
[ 7.490576][ T315] ? __kasan_check_read+0x11/0x20
[ 7.535118][ T1] kobject: 'nvme-delete-wq' (00000000e37696ea):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.490576][ T315] ? __pm_runtime_resume+0x11b/0x180
[ 7.539384][ T1] kobject: 'nvme-delete-wq' (00000000e37696ea):
kobject_uevent_env
[ 7.490576][ T315] __scsi_scan_target+0x29a/0xfa0
[ 7.490576][ T315] ? __pm_runtime_resume+0x11b/0x180
[ 7.490576][ T315] ? __kasan_check_read+0x11/0x20
[ 7.490576][ T315] ? mark_lock+0xc0/0x11e0
[ 7.490576][ T315] ? scsi_probe_and_add_lun+0x39f0/0x39f0
[ 7.490576][ T315] ? mark_held_locks+0xa4/0xf0
[ 7.542992][ T1] kobject: 'nvme-delete-wq' (00000000e37696ea):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-delete-wq'
[ 7.490576][ T315] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.545638][ T1] kobject: 'nvme' (000000008eeaa443):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.490576][ T315] ? __pm_runtime_resume+0x11b/0x180
[ 7.547851][ T1] kobject: 'nvme' (000000008eeaa443):
kobject_uevent_env
[ 7.490576][ T315] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.550392][ T1] kobject: 'nvme' (000000008eeaa443): fill_kobj_path:
path = '/class/nvme'
[ 7.490576][ T315] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.554796][ T1] kobject: 'nvme-subsystem' (000000008ab090ed):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.490576][ T315] ? trace_hardirqs_on+0x67/0x220
[ 7.558678][ T1] kobject: 'nvme-subsystem' (000000008ab090ed):
kobject_uevent_env
[ 7.490576][ T315] scsi_scan_channel.part.0+0x11a/0x190
[ 7.490576][ T315] scsi_scan_host_selected+0x313/0x450
[ 7.490576][ T315] ? scsi_scan_host+0x450/0x450
[ 7.490576][ T315] do_scsi_scan_host+0x1ef/0x260
[ 7.560780][ T315] ? scsi_scan_host+0x450/0x450
[ 7.560780][ T315] do_scan_async+0x41/0x500
[ 7.560780][ T315] ? scsi_scan_host+0x450/0x450
[ 7.562271][ T1] kobject: 'nvme-subsystem' (000000008ab090ed):
fill_kobj_path: path = '/class/nvme-subsystem'
[ 7.560780][ T315] async_run_entry_fn+0x124/0x570
[ 7.565621][ T1] kobject: 'nvme' (00000000f7f113b9):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.560780][ T315] process_one_work+0x9af/0x16d0
[ 7.569583][ T1] kobject: 'drivers' (000000002a478149):
kobject_add_internal: parent: 'nvme', set: '<NULL>'
[ 7.560780][ T315] ? pwq_dec_nr_in_flight+0x320/0x320
[ 7.560780][ T315] ? lock_acquire+0x190/0x400
[ 7.560780][ T315] worker_thread+0x98/0xe40
[ 7.560780][ T315] kthread+0x361/0x430
[ 7.573151][ T1] kobject: 'nvme' (00000000f7f113b9):
kobject_uevent_env
[ 7.560780][ T315] ? process_one_work+0x16d0/0x16d0
[ 7.575836][ T1] kobject: 'nvme' (00000000f7f113b9): fill_kobj_path:
path = '/bus/pci/drivers/nvme'
[ 7.560780][ T315] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 7.578399][ T1] kobject: 'ahci' (0000000044dbe570):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.560780][ T315] ret_from_fork+0x24/0x30
[ 7.580806][ T1] kobject: 'drivers' (000000001b791347):
kobject_add_internal: parent: 'ahci', set: '<NULL>'
[ 7.560780][ T315] Modules linked in:
[ 7.584574][ T1] kobject: 'ahci' (0000000044dbe570):
kobject_uevent_env
[ 7.587379][ T315] ---[ end trace 97fa41584581bdc5 ]---
[ 7.588471][ T1] kobject: 'ahci' (0000000044dbe570): fill_kobj_path:
path = '/bus/pci/drivers/ahci'
[ 7.591175][ T315] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.591192][ T315] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.592727][ T1] kobject: 'ata_piix' (000000002662371a):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.593895][ T315] RSP: 0000:ffff8880a9377768 EFLAGS: 00010246
[ 7.594973][ T1] kobject: 'drivers' (000000000d1374b8):
kobject_add_internal: parent: 'ata_piix', set: '<NULL>'
[ 7.596026][ T315] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.597814][ T1] kobject: 'ata_piix' (000000002662371a):
kobject_uevent_env
[ 7.599039][ T315] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8880a51ed578
[ 7.601473][ T1] kobject: 'ata_piix' (000000002662371a):
fill_kobj_path: path = '/bus/pci/drivers/ata_piix'
[ 7.603162][ T315] RBP: ffff8880a9377780 R08: ffff8880a95fa400 R09:
ffffed1014490a8d
[ 7.605812][ T1] kobject: 'pata_amd' (000000005b6648f4):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.607021][ T315] R10: ffffed1014490a8c R11: ffff8880a2485467 R12:
ffff8880a51ed240
[ 7.609871][ T1] kobject: 'drivers' (00000000f893eb55):
kobject_add_internal: parent: 'pata_amd', set: '<NULL>'
[ 7.610828][ T315] R13: ffff8880a51ed240 R14: ffff888219319030 R15:
0000000000000200
[ 7.612683][ T1] kobject: 'pata_amd' (000000005b6648f4):
kobject_uevent_env
[ 7.613932][ T315] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.616403][ T1] kobject: 'pata_amd' (000000005b6648f4):
fill_kobj_path: path = '/bus/pci/drivers/pata_amd'
[ 7.618083][ T315] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.623013][ T1] kobject: 'pata_oldpiix' (000000004d80e1f2):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.625604][ T315] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.627277][ T1] kobject: 'drivers' (000000005ebcb50f):
kobject_add_internal: parent: 'pata_oldpiix', set: '<NULL>'
[ 7.629719][ T315] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.631811][ T1] kobject: 'pata_oldpiix' (000000004d80e1f2):
kobject_uevent_env
[ 7.633577][ T315] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.635870][ T1] kobject: 'pata_oldpiix' (000000004d80e1f2):
fill_kobj_path: path = '/bus/pci/drivers/pata_oldpiix'
[ 7.638660][ T315] Kernel panic - not syncing: Fatal exception
[ 7.640767][ T1] kobject: 'pata_sch' (00000000a24ef220):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.641695][ T315] Kernel Offset: disabled
[ 7.641695][ T315] Rebooting in 86400 seconds..


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=16300934600000


Tested on:

commit: 54efad20 Add linux-next specific files for 20190719
git tree: linux-next
kernel config: https://syzkaller.appspot.com/x/.config?x=94c3de539954a651
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=1335cd1fa00000

Michael S. Tsirkin

Jul 21, 2019, 1:53:34 PM
to Paul E. McKenney, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > Hi Paul, others,
> >
> > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > is what happens if userspace starts cycling through lots of these
> > ioctls. Given we actually use rcu as an optimization, we could just
> > disable the optimization temporarily - but the question would be how to
> > detect an excessive rate without working too hard :) .
> >
> > I guess we could define as excessive any rate where callback is
> > outstanding at the time when new structure is allocated. I have very
> > little understanding of rcu internals - so I wanted to check that the
> > following more or less implements this heuristic before I spend time
> > actually testing it.
> >
> > Could others pls take a look and let me know?
>
> These look good as a way of seeing if there are any outstanding callbacks,
> but in the case of Tree RCU, call_rcu_outstanding() would almost never
> return false on a busy system.


Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
I'm really looking for something we can do in this merge window
without adding too much code, since kfree_rcu is intended to
fix a bug. Adding call_rcu with careful accounting is not
something I'm happy to do with the merge window already open.
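
(For concreteness, a sketch of roughly what that renamed predicate could
look like for Tree RCU follows; the function body and the per-CPU
plumbing are assumptions for illustration, not the actual patch under
discussion.)

/*
 * Sketch only: report whether this CPU has "too many" RCU callbacks
 * queued, using the 1000-callback threshold suggested above.
 * rcu_segcblist_n_lazy_cbs() is the counter named in this thread;
 * the surrounding plumbing is assumed.
 */
static bool call_rcu_busy(void)
{
	struct rcu_data *rdp;
	bool busy;

	preempt_disable();		/* stabilize the per-CPU reference */
	rdp = this_cpu_ptr(&rcu_data);
	busy = rcu_segcblist_n_lazy_cbs(&rdp->cblist) > 1000;
	preempt_enable();

	return busy;
}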

>
> Also, the overhead is important. For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop. But there might be trouble due to tight userspace loops around
> lighter-weight operations.
>
> So an important question is "Just how fast is your ioctl?" If it takes
> (say) 100 microseconds to execute, there should be absolutely no problem.
> On the other hand, if it can execute in 50 nanoseconds, this very likely
> does need serious attention.
>
> Other thoughts?
>
> Thanx, Paul

Hmm, the answer to this would be: I'm not sure.
It's setup-time stuff; we never tested it.

Paul E. McKenney

Jul 21, 2019, 3:28:52 PM
to Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > Hi Paul, others,
> > >
> > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > is what happens if userspace starts cycling through lots of these
> > > ioctls. Given we actually use rcu as an optimization, we could just
> > > disable the optimization temporarily - but the question would be how to
> > > detect an excessive rate without working too hard :) .
> > >
> > > I guess we could define as excessive any rate where callback is
> > > outstanding at the time when new structure is allocated. I have very
> > > little understanding of rcu internals - so I wanted to check that the
> > > following more or less implements this heuristic before I spend time
> > > actually testing it.
> > >
> > > Could others pls take a look and let me know?
> >
> > These look good as a way of seeing if there are any outstanding callbacks,
> > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > return false on a busy system.
>
> Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

Or the function could simply return the number of callbacks queued
on the current CPU, and let the caller decide how many is too many.
In other words, I suggest having the interface return the number of
callbacks, which lets you experiment with the cutoff.

Give or take the ioctl overhead...
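
(A sketch of how a counter-returning interface might be used on the
vhost side; the helper name, the cutoff, and the rcu_head member are
hypothetical, for illustration only.)

	/*
	 * Hypothetical caller-side check: RCU reports how many callbacks
	 * this CPU currently has queued, and the caller applies its own
	 * tunable cutoff.
	 */
	if (call_rcu_queued() > 1000) {	/* hypothetical helper and cutoff */
		synchronize_rcu();	/* temporarily disable the optimization */
		kfree(old);		/* free synchronously instead */
	} else {
		kfree_rcu(old, rcu);	/* assumes a struct rcu_head member "rcu" */
	}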

> > Also, the overhead is important. For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop. But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
> >
> > So an important question is "Just how fast is your ioctl?" If it takes
> > (say) 100 microseconds to execute, there should be absolutely no problem.
> > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > does need serious attention.
> >
> > Other thoughts?
> >
> > Thanx, Paul
>
> Hmm, the answer to this would be: I'm not sure.
> It's setup-time stuff; we never tested it.

Is it possible to measure it easily?

Thanx, Paul

Matthew Wilcox

Jul 21, 2019, 5:09:13 PM
to Paul E. McKenney, Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> Also, the overhead is important. For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop. But there might be trouble due to tight userspace loops around
> lighter-weight operations.

I thought you believed that RCU was antifragile, in that it would scale
better as it was used more heavily?

Would it make sense to have call_rcu() check to see if there are many
outstanding requests on this CPU and if so process them before returning?
That would ensure that frequent callers usually ended up doing their
own processing.

Paul E. McKenney

Jul 21, 2019, 7:31:26 PM
to Matthew Wilcox, Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > Also, the overhead is important. For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop. But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
>
> I thought you believed that RCU was antifragile, in that it would scale
> better as it was used more heavily?

You are referring to this? https://paulmck.livejournal.com/47933.html

If so, the last few paragraphs might be worth re-reading. ;-)

And in this case, the heuristics RCU uses to decide when to schedule
invocation of the callbacks need some help. One component of that help
is a time-based limit to the number of consecutive callback invocations
(see my crude prototype and Eric Dumazet's more polished patch). Another
component is an overload warning.

Why would an overload warning be needed if RCU's callback-invocation
scheduling heuristics were upgraded? Because someone could boot a
100-CPU system with rcu_nocbs=0-99, bind all of the resulting
rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
on all of the CPUs. Given the constraints, CPU 0 cannot keep up.

So warnings are required as well.

> Would it make sense to have call_rcu() check to see if there are many
> outstanding requests on this CPU and if so process them before returning?
> That would ensure that frequent callers usually ended up doing their
> own processing.

Unfortunately, no. Here is a code fragment illustrating why:

void my_cb(struct rcu_head *rhp)
{
	unsigned long flags;

	/* The callback itself acquires my_lock... */
	spin_lock_irqsave(&my_lock, flags);
	handle_cb(rhp);
	spin_unlock_irqrestore(&my_lock, flags);
}

. . .

	/*
	 * ...and the updater invokes call_rcu() while still holding
	 * my_lock.  If call_rcu() processed pending callbacks directly
	 * here, my_cb() would try to reacquire my_lock and self-deadlock.
	 */
	spin_lock_irqsave(&my_lock, flags);
	p = look_something_up();
	remove_that_something(p);
	call_rcu(p, my_cb);
	spin_unlock_irqrestore(&my_lock, flags);

Invoking the extra callbacks directly from call_rcu() would thus result
in self-deadlock. Documentation/RCU/UP.txt contains a few more examples
along these lines.
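
(If a caller did want to process pending work itself once the count got
too high, the standard deadlock-free shape is to drop the lock around
the grace-period wait; a sketch reusing the names from the fragment
above.)

	spin_lock_irqsave(&my_lock, flags);
	p = look_something_up();
	remove_that_something(p);
	spin_unlock_irqrestore(&my_lock, flags);	/* drop my_lock first */

	synchronize_rcu();	/* wait for readers instead of queueing a callback */

	spin_lock_irqsave(&my_lock, flags);		/* now safe to take my_lock */
	handle_cb(p);
	spin_unlock_irqrestore(&my_lock, flags);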

syzbot

Jul 21, 2019, 10:47:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

traffic allowed by default
[ 4.615570][ T1] nfc: nfc_init: NFC Core ver 0.1
[ 4.615570][ T1] NET: Registered protocol family 39
[ 4.615570][ T1] clocksource: Switched to clocksource kvm-clock
[ 6.584394][ T1] *** VALIDATE bpf ***
[ 6.586332][ T1] VFS: Disk quotas dquot_6.6.0
[ 6.587384][ T1] VFS: Dquot-cache hash table entries: 512 (order 0,
4096 bytes)
[ 6.589492][ T1] FS-Cache: Loaded
[ 6.590119][ T1] *** VALIDATE ramfs ***
[ 6.591110][ T1] *** VALIDATE hugetlbfs ***
[ 6.593535][ T1] CacheFiles: Loaded
[ 6.594833][ T1] TOMOYO: 2.6.0
[ 6.595689][ T1] Mandatory Access Control activated.
[ 6.598054][ T1] AppArmor: AppArmor Filesystem Enabled
[ 6.599413][ T1] pnp: PnP ACPI init
[ 6.610825][ T1] pnp: PnP ACPI: found 7 devices
[ 6.654409][ T1] thermal_sys: Registered thermal governor 'step_wise'
[ 6.654415][ T1] thermal_sys: Registered thermal governor 'user_space'
[ 6.665592][ T1] clocksource: acpi_pm: mask: 0xffffff max_cycles:
0xffffff, max_idle_ns: 2085701024 ns
[ 6.667023][ T1] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7
window]
[ 6.668055][ T1] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff
window]
[ 6.669159][ T1] pci_bus 0000:00: resource 6 [mem
0x000a0000-0x000bffff window]
[ 6.670308][ T1] pci_bus 0000:00: resource 7 [mem
0xc0000000-0xfebfffff window]
[ 6.672686][ T1] NET: Registered protocol family 2
[ 6.675694][ T1] tcp_listen_portaddr_hash hash table entries: 4096
(order: 6, 294912 bytes, vmalloc)
[ 6.677691][ T1] TCP established hash table entries: 65536 (order: 7,
524288 bytes, vmalloc)
[ 6.682898][ T1] TCP bind hash table entries: 65536 (order: 10,
4194304 bytes, vmalloc)
[ 6.687483][ T1] TCP: Hash tables configured (established 65536 bind
65536)
[ 6.689546][ T1] UDP hash table entries: 4096 (order: 7, 655360
bytes, vmalloc)
[ 6.691374][ T1] UDP-Lite hash table entries: 4096 (order: 7, 655360
bytes, vmalloc)
[ 6.694498][ T1] NET: Registered protocol family 1
[ 6.698894][ T1] RPC: Registered named UNIX socket transport module.
[ 6.699995][ T1] RPC: Registered udp transport module.
[ 6.700741][ T1] RPC: Registered tcp transport module.
[ 6.701521][ T1] RPC: Registered tcp NFSv4.1 backchannel transport
module.
[ 6.704430][ T1] NET: Registered protocol family 44
[ 6.705371][ T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[ 6.706408][ T1] PCI: CLS 0 bytes, default 64
[ 6.709974][ T1] PCI-DMA: Using software bounce buffering for IO
(SWIOTLB)
[ 6.711128][ T1] software IO TLB: mapped [mem 0xaa800000-0xae800000]
(64MB)
[ 6.714445][ T1] RAPL PMU: API unit is 2^-32 Joules, 0 fixed
counters, 10737418240 ms ovfl timer
[ 6.722363][ T1] kvm: already loaded the other module
[ 6.723474][ T1] clocksource: tsc: mask: 0xffffffffffffffff
max_cycles: 0x212735223b2, max_idle_ns: 440795277976 ns
[ 6.725079][ T1] clocksource: Switched to clocksource tsc
[ 6.726058][ T1] mce: Machine check injector initialized
[ 6.730584][ T1] check: Scanning for low memory corruption every 60
seconds
[ 6.822395][ T1] Initialise system trusted keyrings
[ 6.824665][ T1] workingset: timestamp_bits=40 max_order=21
bucket_order=0
[ 6.826283][ T1] zbud: loaded
[ 6.829913][ T1] *** VALIDATE devpts ***
[ 6.832874][ T1] DLM installed
[ 6.835702][ T1] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 6.839161][ T1] FS-Cache: Netfs 'nfs' registered for caching
[ 6.841307][ T1] NFS: Registering the id_resolver key type
[ 6.842691][ T1] Key type id_resolver registered
[ 6.843653][ T1] Key type id_legacy registered
[ 6.844742][ T1] nfs4filelayout_init: NFSv4 File Layout Driver
Registering...
[ 6.846209][ T1] Installing knfsd (copyright (C) 1996
ok...@monad.swb.de).
[ 6.849565][ T1] ntfs: driver 2.1.32 [Flags: R/W].
[ 6.851272][ T1] *** VALIDATE autofs ***
[ 6.852303][ T1] fuse: init (API version 7.31)
[ 6.853221][ T1] *** VALIDATE fuse ***
[ 6.853958][ T1] *** VALIDATE fuse ***
[ 6.856493][ T1] JFS: nTxBlock = 8192, nTxLock = 65536
[ 6.868350][ T1] SGI XFS with ACLs, security attributes, realtime, no
debug enabled
[ 6.875705][ T1] 9p: Installing v9fs 9p2000 file system support
[ 6.876983][ T1] FS-Cache: Netfs '9p' registered for caching
[ 6.879940][ T1] *** VALIDATE gfs2 ***
[ 6.882315][ T1] gfs2: GFS2 installed
[ 6.885477][ T1] FS-Cache: Netfs 'ceph' registered for caching
[ 6.886500][ T1] ceph: loaded (mds proto 32)
[ 6.893886][ T1] NET: Registered protocol family 38
[ 6.895527][ T1] async_tx: api initialized (async)
[ 6.896586][ T1] Key type asymmetric registered
[ 6.897445][ T1] Asymmetric key parser 'x509' registered
[ 6.898278][ T1] Asymmetric key parser 'pkcs8' registered
[ 6.899197][ T1] Key type pkcs7_test registered
[ 6.899924][ T1] Asymmetric key parser 'tpm_parser' registered
[ 6.900935][ T1] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 246)
[ 6.902926][ T1] io scheduler mq-deadline registered
[ 6.903832][ T1] io scheduler kyber registered
[ 6.904825][ T1] io scheduler bfq registered
[ 6.909765][ T1] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 6.911332][ T1] ACPI: Power Button [PWRF]
[ 6.913086][ T1] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 6.914642][ T1] ACPI: Sleep Button [SLPF]
[ 6.919524][ T1] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 6.931493][ T1] PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 6.932937][ T1] virtio-pci 0000:00:03.0: virtio_pci: leaving for
legacy driver
[ 6.946447][ T1] PCI Interrupt Link [LNKD] enabled at IRQ 10
[ 6.947514][ T1] virtio-pci 0000:00:04.0: virtio_pci: leaving for
legacy driver
[ 7.205802][ T1] HDLC line discipline maxframe=4096
[ 7.206887][ T1] N_HDLC line discipline registered.
[ 7.207680][ T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing
enabled
[ 7.230918][ T1] 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud =
115200) is a 16550A
[ 7.257951][ T1] 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud =
115200) is a 16550A
[ 7.283359][ T1] 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud =
115200) is a 16550A
[ 7.308497][ T1] 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud =
115200) is a 16550A
[ 7.318043][ T1] Non-volatile memory driver v1.3
[ 7.319754][ T1] Linux agpgart interface v0.103
[ 7.328087][ T1] [drm] Initialized vgem 1.0.0 20120112 for vgem on
minor 0
[ 7.330092][ T1] [drm] Supports vblank timestamp caching Rev 2
(21.10.2013).
[ 7.331658][ T1] [drm] Driver supports precise vblank timestamp query.
[ 7.335338][ T1] [drm] Initialized vkms 1.0.0 20180514 for vkms on
minor 1
[ 7.337206][ T1] usbcore: registered new interface driver udl
[ 7.381642][ T1] brd: module loaded
[ 7.412784][ T1] loop: module loaded
[ 7.475435][ T1] zram: Added device: zram0
[ 7.481552][ T1] null: module loaded
[ 7.487511][ T1] nfcsim 0.2 initialized
[ 7.490791][ T1] Loading iSCSI transport class v2.0-870.
[ 7.510895][ T1] scsi host0: Virtio SCSI HBA
[ 7.546912][ T1] st: Version 20160209, fixed bufsize 32768, s/g segs
256
[ 7.548422][ T356] kasan: CONFIG_KASAN_INLINE enabled
[ 7.550108][ T356] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 7.550123][ T356] general protection fault: 0000 [#1] SMP KASAN
[ 7.552194][ T1] kobject: 'scsi_tape' (0000000062d1f350):
kobject_uevent_env
[ 7.553667][ T356] CPU: 1 PID: 356 Comm: kworker/u4:3 Not tainted
5.2.0-next-20190719+ #1
[ 7.555450][ T1] kobject: 'scsi_tape' (0000000062d1f350):
fill_kobj_path: path = '/class/scsi_tape'
[ 7.555721][ T356] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 7.559776][ T1] kobject: 'st' (00000000828497fa):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.555721][ T356] Workqueue: events_unbound async_run_entry_fn
[ 7.555721][ T356] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.564741][ T1] kobject: 'st' (00000000828497fa): kobject_uevent_env
[ 7.555721][ T356] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.567640][ T1] kobject: 'st' (00000000828497fa): fill_kobj_path:
path = '/bus/scsi/drivers/st'
[ 7.555721][ T356] RSP: 0000:ffff8880a92ff768 EFLAGS: 00010246
[ 7.574255][ T1] kobject: 'scsi_disk' (00000000690eda97):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.555721][ T356] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.577804][ T1] kobject: 'scsi_disk' (00000000690eda97):
kobject_uevent_env
[ 7.555721][ T356] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8880a5178cf8
[ 7.582133][ T1] kobject: 'scsi_disk' (00000000690eda97):
fill_kobj_path: path = '/class/scsi_disk'
[ 7.555721][ T356] RBP: ffff8880a92ff780 R08: ffff8880a8ed26c0 R09:
ffffed1014496a8d
[ 7.555721][ T356] R10: ffffed1014496a8c R11: ffff8880a24b5467 R12:
ffff8880a51789c0
[ 7.555721][ T356] R13: ffff8880a51789c0 R14: ffff8882192f6ff0 R15:
0000000000000200
[ 7.555721][ T356] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.586107][ T1] kobject: 'sd' (00000000d2701516):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.555721][ T356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.590046][ T1] kobject: 'sd' (00000000d2701516): kobject_uevent_env
[ 7.555721][ T356] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.593834][ T1] kobject: 'sd' (00000000d2701516): fill_kobj_path:
path = '/bus/scsi/drivers/sd'
[ 7.555721][ T356] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.598437][ T1] kobject: 'sr' (00000000994b3124):
kobject_add_internal: parent: 'drivers', set: 'drivers'
[ 7.555721][ T356] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.601667][ T1] kobject: 'sr' (00000000994b3124): kobject_uevent_env
[ 7.555721][ T356] Call Trace:
[ 7.555721][ T356] dma_max_mapping_size+0xba/0x100
[ 7.555721][ T356] __scsi_init_queue+0x1cb/0x580
[ 7.605779][ T1] kobject: 'sr' (00000000994b3124): fill_kobj_path:
path = '/bus/scsi/drivers/sr'
[ 7.555721][ T356] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 7.555721][ T356] scsi_mq_alloc_queue+0xd2/0x180
[ 7.610170][ T1] kobject: 'scsi_generic' (000000002202da53):
kobject_add_internal: parent: 'class', set: 'class'
[ 7.555721][ T356] scsi_alloc_sdev+0x837/0xc60
[ 7.613687][ T1] kobject: 'scsi_generic' (000000002202da53):
kobject_uevent_env
[ 7.555721][ T356] scsi_probe_and_add_lun+0x2440/0x39f0
[ 7.615692][ T1] kobject: 'scsi_generic' (000000002202da53):
fill_kobj_path: path = '/class/scsi_generic'
[ 7.555721][ T356] ? __kasan_check_read+0x11/0x20
[ 7.619253][ T356] ? mark_lock+0xc0/0x11e0
[ 7.619253][ T356] ? scsi_alloc_sdev+0xc60/0xc60
[ 7.619253][ T356] ? mark_held_locks+0xa4/0xf0
[ 7.619253][ T356] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.619253][ T356] ? __pm_runtime_resume+0x11b/0x180
[ 7.619253][ T356] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.619253][ T356] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.619253][ T356] ? trace_hardirqs_on+0x67/0x220
[ 7.619253][ T356] ? __kasan_check_read+0x11/0x20
[ 7.619253][ T356] ? __pm_runtime_resume+0x11b/0x180
[ 7.619253][ T356] __scsi_scan_target+0x29a/0xfa0
[ 7.619253][ T356] ? __pm_runtime_resume+0x11b/0x180
[ 7.619253][ T356] ? __kasan_check_read+0x11/0x20
[ 7.619253][ T356] ? mark_lock+0xc0/0x11e0
[ 7.619253][ T356] ? scsi_probe_and_add_lun+0x39f0/0x39f0
[ 7.619253][ T356] ? mark_held_locks+0xa4/0xf0
[ 7.619253][ T356] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.619253][ T356] ? __pm_runtime_resume+0x11b/0x180
[ 7.619253][ T356] ? _raw_spin_unlock_irqrestore+0x67/0xd0
[ 7.619253][ T356] ? lockdep_hardirqs_on+0x418/0x5d0
[ 7.619253][ T356] ? trace_hardirqs_on+0x67/0x220
[ 7.619253][ T356] scsi_scan_channel.part.0+0x11a/0x190
[ 7.619253][ T356] scsi_scan_host_selected+0x313/0x450
[ 7.619253][ T356] ? scsi_scan_host+0x450/0x450
[ 7.619253][ T356] do_scsi_scan_host+0x1ef/0x260
[ 7.619253][ T356] ? scsi_scan_host+0x450/0x450
[ 7.619253][ T356] do_scan_async+0x41/0x500
[ 7.619253][ T356] ? scsi_scan_host+0x450/0x450
[ 7.619253][ T356] async_run_entry_fn+0x124/0x570
[ 7.619253][ T356] process_one_work+0x9af/0x16d0
[ 7.619253][ T356] ? pwq_dec_nr_in_flight+0x320/0x320
[ 7.619253][ T356] ? lock_acquire+0x190/0x400
[ 7.619253][ T356] worker_thread+0x98/0xe40
[ 7.619253][ T356] kthread+0x361/0x430
[ 7.619253][ T356] ? process_one_work+0x16d0/0x16d0
[ 7.619253][ T356] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 7.619253][ T356] ret_from_fork+0x24/0x30
[ 7.619253][ T356] Modules linked in:
[ 7.669181][ T356] ---[ end trace 05b6fa33a01bd0da ]---
[ 7.670248][ T356] RIP: 0010:dma_direct_max_mapping_size+0x7c/0x1a7
[ 7.671637][ T356] Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 23 01
00 00 49 8b 9c 24 38 03 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1
ea 03 <80> 3c 02 00 0f 85 0a 01 00 00 49 8d bc 24 48 03 00 00 48 8b 1b 48
[ 7.674369][ T1] kobject: 'nvme-wq' (00000000f0114d9a):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.676560][ T356] RSP: 0000:ffff8880a92ff768 EFLAGS: 00010246
[ 7.679345][ T1] kobject: 'nvme-wq' (00000000f0114d9a):
kobject_uevent_env
[ 7.680767][ T356] RAX: dffffc0000000000 RBX: 0000000000000000 RCX:
ffffffff815ff7d1
[ 7.682598][ T1] kobject: 'nvme-wq' (00000000f0114d9a):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.684477][ T356] RDX: 0000000000000000 RSI: ffffffff815ff7f0 RDI:
ffff8880a5178cf8
[ 7.687276][ T1] kobject: 'nvme-wq' (00000000f0114d9a):
kobject_uevent_env
[ 7.689176][ T356] RBP: ffff8880a92ff780 R08: ffff8880a8ed26c0 R09:
ffffed1014496a8d
[ 7.690963][ T1] kobject: 'nvme-wq' (00000000f0114d9a):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-wq'
[ 7.692900][ T356] R10: ffffed1014496a8c R11: ffff8880a24b5467 R12:
ffff8880a51789c0
[ 7.692912][ T356] R13: ffff8880a51789c0 R14: ffff8882192f6ff0 R15:
0000000000000200
[ 7.696286][ T1] kobject: 'nvme-reset-wq' (00000000ebc7a02c):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.697623][ T356] FS: 0000000000000000(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 7.699658][ T1] kobject: 'nvme-reset-wq' (00000000ebc7a02c):
kobject_uevent_env
[ 7.702207][ T356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.702219][ T356] CR2: 0000000000000000 CR3: 0000000008c6d000 CR4:
00000000001406e0
[ 7.704512][ T1] kobject: 'nvme-reset-wq' (00000000ebc7a02c):
kobject_uevent_env: uevent_suppress caused the event to drop!
[ 7.706333][ T356] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 7.708017][ T1] kobject: 'nvme-reset-wq' (00000000ebc7a02c):
kobject_uevent_env
[ 7.709910][ T356] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 7.712769][ T1] kobject: 'nvme-reset-wq' (00000000ebc7a02c):
fill_kobj_path: path = '/devices/virtual/workqueue/nvme-reset-wq'
[ 7.714623][ T356] Kernel panic - not syncing: Fatal exception
[ 7.717100][ T1] kobject: 'nvme-delete-wq' (0000000027abd3c7):
kobject_add_internal: parent: 'workqueue', set: 'devices'
[ 7.722559][ T356] Kernel Offset: disabled
[ 7.722703][ T356] Rebooting in 86400 seconds..


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=16f0f064600000


Tested on:

commit: 54efad20 Add linux-next specific files for 20190719
git tree: linux-next
kernel config: https://syzkaller.appspot.com/x/.config?x=94c3de539954a651
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14126e6c600000

syzbot

Jul 21, 2019, 10:57:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered a
crash:
BUG: Bad rss-counter state

BUG: Bad rss-counter state mm:000000001e045fe2 idx:0 val:241
BUG: Bad rss-counter state mm:000000001e045fe2 idx:1 val:546
BUG: non-zero pgtables_bytes on freeing mm: 73728


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=1630016c600000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=1425ae3fa00000

syzbot

Jul 21, 2019, 11:20:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger a
crash:

Reported-and-tested-by:
syzbot+e58112...@syzkaller.appspotmail.com

Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=111f2934600000

Note: testing is done by a robot and is best-effort only.

syzbot

Jul 21, 2019, 11:31:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered a
crash:
BUG: Bad rss-counter state

BUG: Bad rss-counter state mm:00000000a5632dc1 idx:0 val:241
BUG: Bad rss-counter state mm:00000000a5632dc1 idx:1 val:542
BUG: non-zero pgtables_bytes on freeing mm: 73728


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=17816d00600000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=10a7d6afa00000

syzbot

Jul 21, 2019, 11:55:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger a
crash:

Reported-and-tested-by:
syzbot+e58112...@syzkaller.appspotmail.com

Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=117b9934600000

syzbot

Jul 22, 2019, 12:13:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

2011
[ 40.174932][ T1] Call Trace:
[ 40.174932][ T1] dump_stack+0x172/0x1f0
[ 40.174932][ T1] __schedule+0xd03/0x1580
[ 40.174932][ T1] ? __sched_text_start+0x8/0x8
[ 40.174932][ T1] ? try_to_wake_up+0xc8/0x13f0
[ 40.174932][ T1] ? preempt_schedule+0x4b/0x60
[ 40.174932][ T1] preempt_schedule_common+0x4f/0xe0
[ 40.174932][ T1] preempt_schedule+0x4b/0x60
[ 40.174932][ T1] ___preempt_schedule+0x16/0x18
[ 40.174932][ T1] _raw_spin_unlock_irqrestore+0xbd/0xe0
[ 40.174932][ T1] try_to_wake_up+0xc8/0x13f0
[ 40.174932][ T1] ? migrate_swap_stop+0x920/0x920
[ 40.174932][ T1] ? kasan_check_read+0x11/0x20
[ 40.174932][ T1] wake_up_process+0x10/0x20
[ 40.174932][ T1] __kthread_create_on_node+0x281/0x460
[ 40.174932][ T1] ? kthread_parkme+0xb0/0xb0
[ 40.174932][ T1] ? find_held_lock+0x35/0x130
[ 40.174932][ T1] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[ 40.174932][ T1] ? cancel_delayed_work+0x2d0/0x2d0
[ 40.174932][ T1] kthread_create_on_node+0xbb/0xf0
[ 40.174932][ T1] ? __kthread_create_on_node+0x460/0x460
[ 40.174932][ T1] ? kmem_cache_alloc_node_trace+0x34f/0x720
[ 40.174932][ T1] ? __mutex_unlock_slowpath+0xf8/0x6b0
[ 40.174932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.174932][ T1] init_rescuer.part.0+0x7d/0x190
[ 40.174932][ T1] alloc_workqueue+0x669/0xf00
[ 40.174932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.174932][ T1] ? workqueue_sysfs_register+0x3f0/0x3f0
[ 40.174932][ T1] ? cpumask_next+0x41/0x50
[ 40.174932][ T1] ? __sanitizer_cov_trace_cmp4+0x16/0x20
[ 40.174932][ T1] ? wq_watchdog_reset_touched+0x116/0x180
[ 40.174932][ T1] init_mm_internals+0x24/0x3c9
[ 40.174932][ T1] kernel_init_freeable+0x2c5/0x5c3
[ 40.174932][ T1] ? rest_init+0x37b/0x37b
[ 40.174932][ T1] kernel_init+0x12/0x1c5
[ 40.174932][ T1] ret_from_fork+0x24/0x30
[ 40.174961][ T7] mmdrop (____ptrval____) before 4
[ 40.180248][ T7] CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted
5.2.0-rc2+ #1
[ 40.184932][ T7] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.184932][ T7] Workqueue: 0x0 (events)
[ 40.184932][ T7] Call Trace:
[ 40.184932][ T7] dump_stack+0x172/0x1f0
[ 40.184932][ T7] finish_task_switch+0x239/0x790
[ 40.184932][ T7] ? dump_stack+0x1de/0x1f0
[ 40.184932][ T7] __schedule+0x7d3/0x1580
[ 40.184932][ T7] ? __sched_text_start+0x8/0x8
[ 40.184932][ T7] ? _raw_spin_unlock_irq+0x28/0x90
[ 40.184932][ T7] ? _raw_spin_unlock_irq+0x28/0x90
[ 40.184932][ T7] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[ 40.184932][ T7] ? kthread_data+0x58/0xc0
[ 40.184932][ T7] schedule+0xa8/0x260
[ 40.184932][ T7] worker_thread+0x248/0xe40
[ 40.184932][ T7] ? trace_hardirqs_on+0x67/0x220
[ 40.184932][ T7] kthread+0x354/0x420
[ 40.184932][ T7] ? process_one_work+0x1790/0x1790
[ 40.184932][ T7] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 40.184932][ T7] ret_from_fork+0x24/0x30
[ 40.184975][ T7] mmgrab (____ptrval____) to 4
[ 40.189838][ T7] CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted
5.2.0-rc2+ #1
[ 40.194932][ T7] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.194932][ T7] Workqueue: 0x0 (events)
[ 40.194932][ T7] Call Trace:
[ 40.194932][ T7] dump_stack+0x172/0x1f0
[ 40.194932][ T7] __schedule+0xd03/0x1580
[ 40.194932][ T7] ? __sched_text_start+0x8/0x8
[ 40.194932][ T7] ? _raw_spin_unlock_irq+0x28/0x90
[ 40.194932][ T7] ? _raw_spin_unlock_irq+0x28/0x90
[ 40.194932][ T7] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[ 40.194932][ T7] ? kthread_data+0x58/0xc0
[ 40.194932][ T7] schedule+0xa8/0x260
[ 40.194932][ T7] worker_thread+0x248/0xe40
[ 40.194932][ T7] ? trace_hardirqs_on+0x67/0x220
[ 40.194932][ T7] kthread+0x354/0x420
[ 40.194932][ T7] ? process_one_work+0x1790/0x1790
[ 40.194932][ T7] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 40.194932][ T7] ret_from_fork+0x24/0x30
[ 40.194962][ T8] mmdrop (____ptrval____) before 4
[ 40.200371][ T8] CPU: 0 PID: 8 Comm: kworker/u4:0 Not tainted
5.2.0-rc2+ #1
[ 40.204932][ T8] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.204932][ T8] Call Trace:
[ 40.204932][ T8] dump_stack+0x172/0x1f0
[ 40.204932][ T8] finish_task_switch+0x239/0x790
[ 40.204932][ T8] ? dump_stack+0x1de/0x1f0
[ 40.204932][ T8] __schedule+0x7d3/0x1580
[ 40.204932][ T8] ? __sched_text_start+0x8/0x8
[ 40.204932][ T8] ? ___preempt_schedule+0x16/0x18
[ 40.204932][ T8] schedule+0xa8/0x260
[ 40.204932][ T8] kthread+0x27a/0x420
[ 40.204932][ T8] ? process_one_work+0x1790/0x1790
[ 40.204932][ T8] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 40.204932][ T8] ret_from_fork+0x24/0x30
[ 40.204976][ T8] mmgrab (____ptrval____) to 4
[ 40.209956][ T8] CPU: 0 PID: 8 Comm: kworker/u4:0 Not tainted
5.2.0-rc2+ #1
[ 40.214932][ T8] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.214932][ T8] Call Trace:
[ 40.214932][ T8] dump_stack+0x172/0x1f0
[ 40.214932][ T8] __schedule+0xd03/0x1580
[ 40.214932][ T8] ? __sched_text_start+0x8/0x8
[ 40.214932][ T8] ? ___preempt_schedule+0x16/0x18
[ 40.214932][ T8] schedule+0xa8/0x260
[ 40.214932][ T8] kthread+0x27a/0x420
[ 40.214932][ T8] ? process_one_work+0x1790/0x1790
[ 40.214932][ T8] ? kthread_cancel_delayed_work_sync+0x20/0x20
[ 40.214932][ T8] ret_from_fork+0x24/0x30
[ 40.214963][ T2] mmdrop (____ptrval____) before 4
[ 40.224976][ T2] CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.2.0-rc2+
#1
[ 40.232212][ T2] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.234932][ T2] Call Trace:
[ 40.234932][ T2] dump_stack+0x172/0x1f0
[ 40.234932][ T2] finish_task_switch+0x239/0x790
[ 40.234932][ T2] ? dump_stack+0x1de/0x1f0
[ 40.234932][ T2] __schedule+0x7d3/0x1580
[ 40.234932][ T2] ? __sched_text_start+0x8/0x8
[ 40.234932][ T2] schedule+0xa8/0x260
[ 40.234932][ T2] kthreadd+0x5cc/0x740
[ 40.234932][ T2] ? kthread_create_on_cpu+0x1f0/0x1f0
[ 40.234932][ T2] ? calculate_sigpending+0x87/0xa0
[ 40.234932][ T2] ? kthread_create_on_cpu+0x1f0/0x1f0
[ 40.234932][ T2] ret_from_fork+0x24/0x30
[ 40.235024][ T2] mmgrab (____ptrval____) to 4
[ 40.239926][ T2] CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.2.0-rc2+
#1
[ 40.244932][ T2] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.244932][ T2] Call Trace:
[ 40.244932][ T2] dump_stack+0x172/0x1f0
[ 40.244932][ T2] __schedule+0xd03/0x1580
[ 40.244932][ T2] ? __sched_text_start+0x8/0x8
[ 40.244932][ T2] schedule+0xa8/0x260
[ 40.244932][ T2] kthreadd+0x5cc/0x740
[ 40.244932][ T2] ? kthread_create_on_cpu+0x1f0/0x1f0
[ 40.244932][ T2] ? calculate_sigpending+0x87/0xa0
[ 40.244932][ T2] ? kthread_create_on_cpu+0x1f0/0x1f0
[ 40.244932][ T2] ret_from_fork+0x24/0x30
[ 40.244974][ T1] mmdrop (____ptrval____) before 4
[ 40.254967][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc2+
#1
[ 40.262365][ T1] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.264932][ T1] Call Trace:
[ 40.264932][ T1] dump_stack+0x172/0x1f0
[ 40.264932][ T1] finish_task_switch+0x239/0x790
[ 40.264932][ T1] ? dump_stack+0x1de/0x1f0
[ 40.264932][ T1] __schedule+0x7d3/0x1580
[ 40.264932][ T1] ? __sched_text_start+0x8/0x8
[ 40.264932][ T1] ? try_to_wake_up+0xc8/0x13f0
[ 40.264932][ T1] ? preempt_schedule+0x4b/0x60
[ 40.264932][ T1] preempt_schedule_common+0x4f/0xe0
[ 40.264932][ T1] preempt_schedule+0x4b/0x60
[ 40.264932][ T1] ___preempt_schedule+0x16/0x18
[ 40.264932][ T1] _raw_spin_unlock_irqrestore+0xbd/0xe0
[ 40.264932][ T1] try_to_wake_up+0xc8/0x13f0
[ 40.264932][ T1] ? migrate_swap_stop+0x920/0x920
[ 40.264932][ T1] ? kasan_check_read+0x11/0x20
[ 40.264932][ T1] wake_up_process+0x10/0x20
[ 40.264932][ T1] __kthread_create_on_node+0x281/0x460
[ 40.264932][ T1] ? kthread_parkme+0xb0/0xb0
[ 40.264932][ T1] ? find_held_lock+0x35/0x130
[ 40.264932][ T1] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[ 40.264932][ T1] ? cancel_delayed_work+0x2d0/0x2d0
[ 40.264932][ T1] kthread_create_on_node+0xbb/0xf0
[ 40.264932][ T1] ? __kthread_create_on_node+0x460/0x460
[ 40.264932][ T1] ? kmem_cache_alloc_node_trace+0x34f/0x720
[ 40.264932][ T1] ? __mutex_unlock_slowpath+0xf8/0x6b0
[ 40.264932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.264932][ T1] init_rescuer.part.0+0x7d/0x190
[ 40.264932][ T1] alloc_workqueue+0x669/0xf00
[ 40.264932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.264932][ T1] ? workqueue_sysfs_register+0x3f0/0x3f0
[ 40.264932][ T1] ? cpumask_next+0x41/0x50
[ 40.264932][ T1] ? __sanitizer_cov_trace_cmp4+0x16/0x20
[ 40.264932][ T1] ? wq_watchdog_reset_touched+0x116/0x180
[ 40.264932][ T1] init_mm_internals+0x24/0x3c9
[ 40.264932][ T1] kernel_init_freeable+0x2c5/0x5c3
[ 40.264932][ T1] ? rest_init+0x37b/0x37b
[ 40.264932][ T1] kernel_init+0x12/0x1c5
[ 40.264932][ T1] ret_from_fork+0x24/0x30
[ 40.264975][ T1] mmgrab (____ptrval____) to 4
[ 40.270037][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc2+
#1
[ 40.274932][ T1] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 40.274932][ T1] Call Trace:
[ 40.274932][ T1] dump_stack+0x172/0x1f0
[ 40.274932][ T1] __schedule+0xd03/0x1580
[ 40.274932][ T1] ? __sched_text_start+0x8/0x8
[ 40.274932][ T1] ? try_to_wake_up+0xc8/0x13f0
[ 40.274932][ T1] ? preempt_schedule+0x4b/0x60
[ 40.274932][ T1] preempt_schedule_common+0x4f/0xe0
[ 40.274932][ T1] preempt_schedule+0x4b/0x60
[ 40.274932][ T1] ___preempt_schedule+0x16/0x18
[ 40.274932][ T1] _raw_spin_unlock_irqrestore+0xbd/0xe0
[ 40.274932][ T1] try_to_wake_up+0xc8/0x13f0
[ 40.274932][ T1] ? migrate_swap_stop+0x920/0x920
[ 40.274932][ T1] ? kasan_check_read+0x11/0x20
[ 40.274932][ T1] wake_up_process+0x10/0x20
[ 40.274932][ T1] __kthread_create_on_node+0x281/0x460
[ 40.274932][ T1] ? kthread_parkme+0xb0/0xb0
[ 40.274932][ T1] ? find_held_lock+0x35/0x130
[ 40.274932][ T1] ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[ 40.274932][ T1] ? cancel_delayed_work+0x2d0/0x2d0
[ 40.274932][ T1] kthread_create_on_node+0xbb/0xf0
[ 40.274932][ T1] ? __kthread_create_on_node+0x460/0x460
[ 40.274932][ T1] ? kmem_cache_alloc_node_trace+0x34f/0x720
[ 40.274932][ T1] ? __mutex_unlock_slowpath+0xf8/0x6b0
[ 40.274932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.274932][ T1] init_rescuer.part.0+0x7d/0x190
[ 40.274932][ T1] alloc_workqueue+0x669/0xf00
[ 40.274932][ T1] ? _raw_spin_unlock_irqrestore+0xa4/0xe0
[ 40.274932][ T1] ? workqueue_sysfs_register+0x3f0/0x3f0
[ 40.274932][ T1] ? cpumask_next+0x41/0x50
[ 40.274932][ T1] ? __sanitizer_cov_trace_cmp4+0x16/0x20
[ 40.274932][ T1] ? wq_watchdog_reset_touched+0x116/0x180
[ 40.274932][ T1] init_mm_internals+0x24/0x3c9
[ 40.274932][ T1] kernel_init_freeable+0x2c5/0x5c3
[ 40.274932][ T1] ? rest_init+0x37b/0x37b
[ 40.274932][ T1] kernel_init+0x12/0x1c5
[ 40.274932][ T1] ret_from_fork+0x24/0x30
[... the same mmdrop/mmgrab stack dumps repeat for T7, T8 and T2 until the log is cut off ...]

Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=1020e6afa00000


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=10941810600000

syzbot

Jul 22, 2019, 12:42:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
BUG: Bad rss-counter state

BUG: Bad rss-counter state mm:00000000249f007e idx:0 val:241
BUG: Bad rss-counter state mm:00000000249f007e idx:1 val:546
BUG: non-zero pgtables_bytes on freeing mm: 73728


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=11c58d00600000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=1220e6afa00000

syzbot

Jul 22, 2019, 12:55:08 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
BUG: Bad rss-counter state

8021q: adding VLAN 0 to HW filter on device batadv0
BUG: Bad rss-counter state mm:00000000a8bba24f idx:0 val:241
BUG: Bad rss-counter state mm:00000000a8bba24f idx:1 val:546
BUG: non-zero pgtables_bytes on freeing mm: 69632


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=15ef951fa00000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=13811958600000

Jason Wang

Jul 22, 2019, 1:22:18 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org

On 2019/7/21 6:02 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
>> syzbot has bisected this bug to:
>>
>> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
>> Author: Jason Wang <jaso...@redhat.com>
>> Date: Fri May 24 08:12:18 2019 +0000
>>
>> vhost: access vq metadata through kernel virtual address
>>
>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
>> start commit: 6d21a41b Add linux-next specific files for 20190718
>> git tree: linux-next
>> final crash: https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
>> dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
>>
>> Reported-by: syzbot+e58112...@syzkaller.appspotmail.com
>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>> address")
>>
>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
> OK, I poked at this for a bit. I see several things that
> we need to fix, though I'm not yet sure they're the reason for
> the failures:
>
>
> 1. mmu_notifier_register shouldn't be called from vhost_vring_set_num_addr
> That's just a bad hack,


This is used to avoid holding the lock while checking whether the
addresses overlap. Otherwise we would need to take the spinlock for
every invalidation request, even for VA ranges we are not interested in.
That would be very slow, e.g. during guest boot.
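
In pseudo-code, the pre-check looks roughly like this (a sketch only;
the per-vq notifier and the uaddr_start/uaddr_size/mmu_lock names are
illustrative assumptions, not the actual vhost fields):

        static int vhost_invalidate_range_start(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
        {
                struct vhost_virtqueue *vq =
                        container_of(mn, struct vhost_virtqueue, mn);

                /* Cheap, lock-free overlap test against the VA range we map. */
                if (range->end <= vq->uaddr_start ||
                    range->start >= vq->uaddr_start + vq->uaddr_size)
                        return 0;       /* not our range: skip the spinlock */

                spin_lock(&vq->mmu_lock);
                /* ... invalidate the prefetched map ... */
                spin_unlock(&vq->mmu_lock);
                return 0;
        }

This only works if the uaddr fields are stable while the notifier is
registered, which is presumably why registration was moved into
vhost_vring_set_num_addr() in the first place.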


> in particular I don't think device
> mutex is taken and so poking at two VQs will corrupt
> memory.


The caller, vhost_net_ioctl() (or its scsi and vsock counterparts),
holds the device mutex before calling us.


> So what to do? How about a per vq notifier?
> Of course we also have synchronize_rcu
> in the notifier which is slow and is now going to be called twice.
> I think call_rcu would be more appropriate here.
> We then need rcu_barrier on module unload.


So this seems unnecessary.


> OTOH if we make pages linear with map then we are good
> with kfree_rcu which is even nicer.


It could be an optimization on top.


>
> 2. Doesn't map leak after vhost_map_unprefetch?
> And why does it poke at contents of the map?
> No one should use it right?


Yes, it's not hard to fix: just kfree() the map in this function.
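
Something like this sketch (assuming the map keeps a pages array, as
the discussion implies; the field name is an assumption):

        static void vhost_map_unprefetch(struct vhost_map *map)
        {
                /* Nobody can use the map at this point: don't poke at it. */
                kfree(map->pages);      /* assumed pages array */
                kfree(map);             /* plugs the leak */
        }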


>
> 3. notifier unregister happens last in vhost_dev_cleanup,
> but register happens first. This looks wrong to me.


I'm not sure I get the exact issue here.


>
> 4. OK so we use the invalidate count to try and detect that
> some invalidate is in progress.
> I am not 100% sure why we care.
> Assuming we do, uaddr can change between start and end
> and then the counter can get negative, or generally
> out of sync.


Yes, so the fix is as simple as zeroing invalidate_count after
unregistering the MMU notifier in vhost_set_vring_num_addr().
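
Roughly, as a sketch (the member names are assumed from the discussion
above):

        mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
        /*
         * No invalidation can be in flight between unregister and the
         * re-register below, so any start/end imbalance left over from
         * a uaddr change can safely be discarded here.
         */
        vq->invalidate_count = 0;
        /* ... update the addresses, then mmu_notifier_register() again ... */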


>
> So what to do about all this?
> I am inclined to say let's just drop the uaddr optimization
> for now. E.g. kvm invalidates unconditionally.
> 3 should be fixed independently.


Maybe it's better to first try a fix that keeps the existing uaddr optimization.

I did spot two other issues:

1) we don't check the return value of mmu_notifier_register() in
vhost_set_vring_num() (see the sketch below)

2) we try to set up the vq address even if set_vring_addr() fails
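
For (1), a hedged sketch of the fix; the rollback label is a
placeholder for whatever cleanup the real code needs:

        err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
        if (err) {
                /* undo any partial vq address setup done above */
                goto err_unset;         /* hypothetical label */
        }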


As for the bug itself, it looks to me like the mm refcount gets messed
up because we register and unregister the MMU notifier. But I haven't
figured out why; I will do more investigation.

Thanks


>
>

Jason Wang

Jul 22, 2019, 1:24:40 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
I don't get why we must use kfree_rcu() instead of synchronize_rcu() here.


>
> Signed-off-by: Michael S. Tsirkin<m...@redhat.com>


Let me try to figure out the root cause, then decide whether or not to
go this way.

Thanks


Michael S. Tsirkin

Jul 22, 2019, 3:52:17 AM
to Paul E. McKenney, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
We could add an option that simply fails if overloaded, right?
Have the caller recover...

--
MST

Michael S. Tsirkin

Jul 22, 2019, 3:56:33 AM
to Paul E. McKenney, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
OK - and for tiny just assume 1 is too much?

Michael S. Tsirkin

Jul 22, 2019, 4:02:26 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
KVM seems to do exactly that.
I tried, and the guest does not seem to boot any slower.
Do you observe any slowdown?

Now that I've taken a hard look at the uaddr hackery, it really makes
me nervous. So I think for this release we want something safe, with
optimizations on top. As an alternative, revert the optimization and
try again for the next merge window.


--
MST

Michael S. Tsirkin

Jul 22, 2019, 4:08:56 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
synchronize_rcu has very high latency on busy systems.
It is not something that should be used on a syscall path.
KVM had to switch to SRCU to keep it sane.
Otherwise one guest can trivially slow down another one.
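
The usual way off the synchronize_rcu() path is to embed an rcu_head in
the object and defer the free with call_rcu(), with rcu_barrier() on
module unload as mentioned earlier. A generic sketch (the struct layout
here is an assumption, not the actual vhost code):

        struct vhost_map {
                struct page **pages;
                /* ... */
                struct rcu_head rcu;    /* assumed: added for deferred free */
        };

        static void vhost_map_free_rcu(struct rcu_head *head)
        {
                struct vhost_map *map =
                        container_of(head, struct vhost_map, rcu);

                kfree(map->pages);
                kfree(map);
        }

        /* syscall path: queue the free, never wait for a grace period */
        call_rcu(&map->rcu, vhost_map_free_rcu);

        /* module exit: make sure all queued callbacks have run */
        rcu_barrier();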

>
> >
> > Signed-off-by: Michael S. Tsirkin<m...@redhat.com>
>
>
> Let me try to figure out the root cause then decide whether or not to go for
> this way.
>
> Thanks

The root cause of the crash is relevant, but we still need
to fix issues 1-4.

More issues (my patch tries to fix them too):

5. pages are not dirtied when mappings are torn down outside
of the invalidate callback

6. potential cross-VM DoS by one guest keeping the system busy
and increasing synchronize_rcu latency to the point where
another guest starts timing out and crashes



--
MST

syzbot

Jul 22, 2019, 4:26:00 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
syzbot+e58112...@syzkaller.appspotmail.com

Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=12616d00600000

syzbot

Jul 22, 2019, 4:51:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
syzbot+e58112...@syzkaller.appspotmail.com

Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=10e00620600000

syzbot

Jul 22, 2019, 5:13:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

[... several hundred lines of routine kbuild CC/AR output omitted; the actual compile error was lost to truncation ...]
Makefile:1071: recipe for target 'drivers' failed
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....


Error text is too large and was truncated, full error text is at:
https://syzkaller.appspot.com/x/error.txt?x=10fda9d0600000


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=130a6e6c600000

syzbot

Jul 22, 2019, 5:23:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
possible deadlock in down_trylock

======================================================
WARNING: possible circular locking dependency detected
5.2.0-rc2+ #1 Not tainted
------------------------------------------------------
syz-executor.0/8912 is trying to acquire lock:
0000000020035c1a ((console_sem).lock){-.-.}, at: down_trylock+0x13/0x70
/kernel/locking/semaphore.c:136

but task is already holding lock:
0000000038667bb4 (&rq->lock){-.-.}, at: rq_lock /kernel/sched/sched.h:1168
[inline]
0000000038667bb4 (&rq->lock){-.-.}, at: __schedule+0x1f5/0x15c0
/kernel/sched/core.c:3397

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->lock){-.-.}:
__raw_spin_lock /./include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 /kernel/locking/spinlock.c:151
rq_lock /kernel/sched/sched.h:1168 [inline]
task_fork_fair+0x6a/0x520 /kernel/sched/fair.c:10208
sched_fork+0x3af/0x900 /kernel/sched/core.c:2347
copy_process.part.0+0x1a25/0x67d0 /kernel/fork.c:2010
copy_process /kernel/fork.c:1808 [inline]
_do_fork+0x25d/0xfe0 /kernel/fork.c:2377
kernel_thread+0x34/0x40 /kernel/fork.c:2436
rest_init+0x28/0x37b /init/main.c:417
arch_call_rest_init+0xe/0x1b
start_kernel+0x854/0x893 /init/main.c:761
x86_64_start_reservations+0x29/0x2b /arch/x86/kernel/head64.c:470
x86_64_start_kernel+0x77/0x7b /arch/x86/kernel/head64.c:451
secondary_startup_64+0xa4/0xb0 /arch/x86/kernel/head_64.S:243

-> #1 (&p->pi_lock){-.-.}:
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110
[inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
try_to_wake_up+0x90/0x13f0 /kernel/sched/core.c:2000
wake_up_process+0x10/0x20 /kernel/sched/core.c:2114
__up.isra.0+0x136/0x1a0 /kernel/locking/semaphore.c:262
up+0x9c/0xe0 /kernel/locking/semaphore.c:187
__up_console_sem+0xb7/0x1c0 /kernel/printk/printk.c:244
console_unlock+0x663/0xec0 /kernel/printk/printk.c:2481
vprintk_emit+0x2a0/0x700 /kernel/printk/printk.c:1986
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
check_stack_usage /kernel/exit.c:765 [inline]
do_exit.cold+0x123/0x264 /kernel/exit.c:927
do_group_exit+0x135/0x370 /kernel/exit.c:981
__do_sys_exit_group /kernel/exit.c:992 [inline]
__se_sys_exit_group /kernel/exit.c:990 [inline]
__x64_sys_exit_group+0x44/0x50 /kernel/exit.c:990
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 ((console_sem).lock){-.-.}:
lock_acquire+0x16f/0x3f0 /kernel/locking/lockdep.c:4303
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110
[inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
down_trylock+0x13/0x70 /kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xa8/0x210 /kernel/printk/printk.c:227
console_trylock+0x15/0xa0 /kernel/printk/printk.c:2297
console_trylock_spinning /kernel/printk/printk.c:1706 [inline]
vprintk_emit+0x283/0x700 /kernel/printk/printk.c:1985
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
mmgrab /./include/linux/sched/mm.h:38 [inline]
context_switch /kernel/sched/core.c:2803 [inline]
__schedule+0x15a9/0x15c0 /kernel/sched/core.c:3445
preempt_schedule_common+0x4f/0xe0 /kernel/sched/core.c:3590
preempt_schedule+0x4b/0x60 /kernel/sched/core.c:3616
___preempt_schedule+0x16/0x18
vprintk_emit+0x2cd/0x700 /kernel/printk/printk.c:1987
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
vhost_dev_set_owner+0x15a/0xa30 /drivers/vhost/vhost.c:713
vhost_net_set_owner /drivers/vhost/net.c:1693 [inline]
vhost_net_ioctl+0xca9/0x1900 /drivers/vhost/net.c:1742
vfs_ioctl /fs/ioctl.c:46 [inline]
file_ioctl /fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 /fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 /fs/ioctl.c:713
__do_sys_ioctl /fs/ioctl.c:720 [inline]
__se_sys_ioctl /fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 /fs/ioctl.c:718
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
(console_sem).lock --> &p->pi_lock --> &rq->lock

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock((console_sem).lock);

*** DEADLOCK ***

2 locks held by syz-executor.0/8912:
#0: 000000006817fb74 (&dev->mutex#4){+.+.}, at: vhost_net_set_owner
/drivers/vhost/net.c:1685 [inline]
#0: 000000006817fb74 (&dev->mutex#4){+.+.}, at:
vhost_net_ioctl+0x469/0x1900 /drivers/vhost/net.c:1742
#1: 0000000038667bb4 (&rq->lock){-.-.}, at: rq_lock
/kernel/sched/sched.h:1168 [inline]
#1: 0000000038667bb4 (&rq->lock){-.-.}, at: __schedule+0x1f5/0x15c0
/kernel/sched/core.c:3397

stack backtrace:
CPU: 0 PID: 8912 Comm: syz-executor.0 Not tainted 5.2.0-rc2+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack /lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
print_circular_bug.cold+0x1cc/0x28f /kernel/locking/lockdep.c:1565
check_prev_add /kernel/locking/lockdep.c:2310 [inline]
check_prevs_add /kernel/locking/lockdep.c:2418 [inline]
validate_chain /kernel/locking/lockdep.c:2800 [inline]
__lock_acquire+0x3755/0x5490 /kernel/locking/lockdep.c:3793
lock_acquire+0x16f/0x3f0 /kernel/locking/lockdep.c:4303
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
down_trylock+0x13/0x70 /kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xa8/0x210 /kernel/printk/printk.c:227
console_trylock+0x15/0xa0 /kernel/printk/printk.c:2297
console_trylock_spinning /kernel/printk/printk.c:1706 [inline]
vprintk_emit+0x283/0x700 /kernel/printk/printk.c:1985
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
mmgrab /./include/linux/sched/mm.h:38 [inline]
context_switch /kernel/sched/core.c:2803 [inline]
__schedule+0x15a9/0x15c0 /kernel/sched/core.c:3445
preempt_schedule_common+0x4f/0xe0 /kernel/sched/core.c:3590
preempt_schedule+0x4b/0x60 /kernel/sched/core.c:3616
___preempt_schedule+0x16/0x18
vprintk_emit+0x2cd/0x700 /kernel/printk/printk.c:1987
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
vhost_dev_set_owner+0x15a/0xa30 /drivers/vhost/vhost.c:713
vhost_net_set_owner /drivers/vhost/net.c:1693 [inline]
vhost_net_ioctl+0xca9/0x1900 /drivers/vhost/net.c:1742
vfs_ioctl /fs/ioctl.c:46 [inline]
file_ioctl /fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 /fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 /fs/ioctl.c:713
__do_sys_ioctl /fs/ioctl.c:720 [inline]
__se_sys_ioctl /fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 /fs/ioctl.c:718
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459819
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f29622e6c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000459819
RDX: 0000000000000000 RSI: 000000000000af01 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f29622e76d4
R13: 00000000004c46a9 R14: 00000000004d8758 R15: 00000000ffffffff


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=10780934600000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14b30eafa00000

syzbot

Jul 22, 2019, 6:55:01 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer still triggered
crash:
possible deadlock in down_trylock

======================================================
WARNING: possible circular locking dependency detected
5.2.0-rc2+ #1 Not tainted
------------------------------------------------------
syz-executor.2/8765 is trying to acquire lock:
000000001aec7bd9 ((console_sem).lock){-.-.}, at: down_trylock+0x13/0x70
/kernel/locking/semaphore.c:136

but task is already holding lock:
00000000d206df9f (&rq->lock){-.-.}, at: rq_lock /kernel/sched/sched.h:1168
[inline]
00000000d206df9f (&rq->lock){-.-.}, at: __schedule+0x1f5/0x1600
/kernel/sched/core.c:3397

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->lock){-.-.}:
__raw_spin_lock /./include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2f/0x40 /kernel/locking/spinlock.c:151
rq_lock /kernel/sched/sched.h:1168 [inline]
task_fork_fair+0x6a/0x520 /kernel/sched/fair.c:10208
sched_fork+0x3af/0x900 /kernel/sched/core.c:2347
copy_process.part.0+0x1a25/0x6840 /kernel/fork.c:2012
copy_process /kernel/fork.c:1810 [inline]
_do_fork+0x25d/0xfe0 /kernel/fork.c:2379
kernel_thread+0x34/0x40 /kernel/fork.c:2438
rest_init+0x28/0x37b /init/main.c:417
arch_call_rest_init+0xe/0x1b
start_kernel+0x854/0x893 /init/main.c:761
x86_64_start_reservations+0x29/0x2b /arch/x86/kernel/head64.c:470
x86_64_start_kernel+0x77/0x7b /arch/x86/kernel/head64.c:451
secondary_startup_64+0xa4/0xb0 /arch/x86/kernel/head_64.S:243

-> #1 (&p->pi_lock){-.-.}:
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110
[inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
try_to_wake_up+0x90/0x13f0 /kernel/sched/core.c:2000
wake_up_process+0x10/0x20 /kernel/sched/core.c:2114
__up.isra.0+0x136/0x1a0 /kernel/locking/semaphore.c:262
up+0x9c/0xe0 /kernel/locking/semaphore.c:187
__up_console_sem+0xb7/0x1c0 /kernel/printk/printk.c:244
console_unlock+0x663/0xec0 /kernel/printk/printk.c:2481
vprintk_emit+0x2a0/0x700 /kernel/printk/printk.c:1986
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
check_stack_usage /kernel/exit.c:765 [inline]
do_exit.cold+0x5d/0x254 /kernel/exit.c:927
do_group_exit+0x135/0x370 /kernel/exit.c:981
__do_sys_exit_group /kernel/exit.c:992 [inline]
__se_sys_exit_group /kernel/exit.c:990 [inline]
__x64_sys_exit_group+0x44/0x50 /kernel/exit.c:990
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 ((console_sem).lock){-.-.}:
lock_acquire+0x16f/0x3f0 /kernel/locking/lockdep.c:4303
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110
[inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
down_trylock+0x13/0x70 /kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xa8/0x210 /kernel/printk/printk.c:227
console_trylock+0x15/0xa0 /kernel/printk/printk.c:2297
console_trylock_spinning /kernel/printk/printk.c:1706 [inline]
vprintk_emit+0x283/0x700 /kernel/printk/printk.c:1985
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
mmgrab /./include/linux/sched/mm.h:38 [inline]
context_switch /kernel/sched/core.c:2803 [inline]
__schedule+0x1562/0x1600 /kernel/sched/core.c:3445
preempt_schedule_common+0x4f/0xe0 /kernel/sched/core.c:3590
preempt_schedule+0x4b/0x60 /kernel/sched/core.c:3616
___preempt_schedule+0x16/0x18
vprintk_emit+0x2cd/0x700 /kernel/printk/printk.c:1987
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
vhost_debug_mm+0xd5/0x110 /drivers/vhost/vhost.c:52
vhost_dev_set_owner+0x167/0xa90 /drivers/vhost/vhost.c:720
vhost_net_set_owner /drivers/vhost/net.c:1693 [inline]
vhost_net_ioctl+0xca9/0x1900 /drivers/vhost/net.c:1742
vfs_ioctl /fs/ioctl.c:46 [inline]
file_ioctl /fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 /fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 /fs/ioctl.c:713
__do_sys_ioctl /fs/ioctl.c:720 [inline]
__se_sys_ioctl /fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 /fs/ioctl.c:718
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
(console_sem).lock --> &p->pi_lock --> &rq->lock

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock((console_sem).lock);

*** DEADLOCK ***

2 locks held by syz-executor.2/8765:
#0: 00000000948e0edd (&dev->mutex#4){+.+.}, at: vhost_net_set_owner
/drivers/vhost/net.c:1685 [inline]
#0: 00000000948e0edd (&dev->mutex#4){+.+.}, at:
vhost_net_ioctl+0x469/0x1900 /drivers/vhost/net.c:1742
#1: 00000000d206df9f (&rq->lock){-.-.}, at: rq_lock
/kernel/sched/sched.h:1168 [inline]
#1: 00000000d206df9f (&rq->lock){-.-.}, at: __schedule+0x1f5/0x1600
/kernel/sched/core.c:3397

stack backtrace:
CPU: 1 PID: 8765 Comm: syz-executor.2 Not tainted 5.2.0-rc2+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack /lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
print_circular_bug.cold+0x1cc/0x28f /kernel/locking/lockdep.c:1565
check_prev_add /kernel/locking/lockdep.c:2310 [inline]
check_prevs_add /kernel/locking/lockdep.c:2418 [inline]
validate_chain /kernel/locking/lockdep.c:2800 [inline]
__lock_acquire+0x3755/0x5490 /kernel/locking/lockdep.c:3793
lock_acquire+0x16f/0x3f0 /kernel/locking/lockdep.c:4303
__raw_spin_lock_irqsave /./include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x95/0xcd /kernel/locking/spinlock.c:159
down_trylock+0x13/0x70 /kernel/locking/semaphore.c:136
__down_trylock_console_sem+0xa8/0x210 /kernel/printk/printk.c:227
console_trylock+0x15/0xa0 /kernel/printk/printk.c:2297
console_trylock_spinning /kernel/printk/printk.c:1706 [inline]
vprintk_emit+0x283/0x700 /kernel/printk/printk.c:1985
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
mmgrab /./include/linux/sched/mm.h:38 [inline]
context_switch /kernel/sched/core.c:2803 [inline]
__schedule+0x1562/0x1600 /kernel/sched/core.c:3445
preempt_schedule_common+0x4f/0xe0 /kernel/sched/core.c:3590
preempt_schedule+0x4b/0x60 /kernel/sched/core.c:3616
___preempt_schedule+0x16/0x18
vprintk_emit+0x2cd/0x700 /kernel/printk/printk.c:1987
vprintk_default+0x28/0x30 /kernel/printk/printk.c:2013
vprintk_func+0x7e/0x189 /kernel/printk/printk_safe.c:386
printk+0xba/0xed /kernel/printk/printk.c:2046
vhost_debug_mm+0xd5/0x110 /drivers/vhost/vhost.c:52
vhost_dev_set_owner+0x167/0xa90 /drivers/vhost/vhost.c:720
vhost_net_set_owner /drivers/vhost/net.c:1693 [inline]
vhost_net_ioctl+0xca9/0x1900 /drivers/vhost/net.c:1742
vfs_ioctl /fs/ioctl.c:46 [inline]
file_ioctl /fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 /fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 /fs/ioctl.c:713
__do_sys_ioctl /fs/ioctl.c:720 [inline]
__se_sys_ioctl /fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 /fs/ioctl.c:718
do_syscall_64+0xfd/0x680 /arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459819
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fbe1f8bdc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000459819
RDX: 0000000000000000 RSI: 000000000000af01 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fbe1f8be6d4
R13: 00000000004c46a9 R14: 00000000004d8758 R15: 00000000ffffffff


Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=14cb2620600000
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14182678600000

Paul E. McKenney

Jul 22, 2019, 7:51:58 AM
to Michael S. Tsirkin, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
For example, return EBUSY from your ioctl? That should work. You could
also sleep for a jiffy or two to let things catch up in this BUSY (or
similar) case. Or try three times, waiting a jiffy between each try,
and return EBUSY if all three tries failed.

Or just keep it simple and return EBUSY on the first try. ;-)
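In code, the three-tries variant might look roughly like this (a sketch only;
vhost_cleanup_done() is a hypothetical helper standing in for whatever
"things have caught up" test the driver would actually use):

	int tries;

	for (tries = 0; tries < 3; tries++) {
		if (vhost_cleanup_done(dev))		/* hypothetical check */
			break;				/* caught up, proceed */
		schedule_timeout_uninterruptible(1);	/* wait one jiffy */
	}
	if (tries == 3)
		return -EBUSY;	/* still overloaded; userspace may retry */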

All of this assumes that this ioctl is the cause of the overload, which
during early boot seems to me to be a safe assumption.

Thanx, Paul

Paul E. McKenney

Jul 22, 2019, 7:57:54 AM
to Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
I bet that for tiny you won't need to rate-limit at all. The reason
is that grace periods are quite short.

In fact, for TINY (that is, !SMP && !PREEMPT), synchronize_rcu() is a
no-op. So in TINY, given that your ioctl is executing at process level,
you could just invoke synchronize_rcu() and then kfree():

#ifdef CONFIG_TINY_RCU
synchronize_rcu(); /* No other CPUs, so a QS is a GP! */
kfree(whatever);
return; /* Or whatever control flow is appropriate. */
#endif
/* More complicated stuff for !TINY. */

Thanx, Paul

Jason Gunthorpe

Jul 22, 2019, 9:41:54 AM
to Paul E. McKenney, Michael S. Tsirkin, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:

> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > >
> > > Unfortunately, no. Here is a code fragment illustrating why:

That is only true in the general case though, kfree_rcu() doesn't have
this problem since we know what the callback is doing. In general a
caller of kfree_rcu() should not need to hold any locks while calling
it.

We could apply the same idea more generally and have some
'call_immediate_or_rcu()' which has restrictions on the caller's
context.

I think if we have some kind of problem here it would be better to
handle it inside the core code and only require that callers use the
correct RCU API.

I can think of many places where kfree_rcu() is being used under user
control..

Jason

Jason Gunthorpe

Jul 22, 2019, 10:11:54 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > syzbot has bisected this bug to:
> >
> > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > Author: Jason Wang <jaso...@redhat.com>
> > Date: Fri May 24 08:12:18 2019 +0000
> >
> > vhost: access vq metadata through kernel virtual address
> >
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > start commit: 6d21a41b Add linux-next specific files for 20190718
> > git tree: linux-next
> > final crash: https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> >
> > Reported-by: syzbot+e58112...@syzkaller.appspotmail.com
> > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > address")
> >
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
>
> OK I poked at this for a bit, I see several things that
> we need to fix, though I'm not yet sure it's the reason for
> the failures:

This stuff looks quite similar to the hmm_mirror use model and other
places in the kernel. I'm still hoping we can share this code a bit more.

There is another bug, this sequence here:

vhost_vring_set_num_addr()
mmu_notifier_unregister()
[..]
mmu_notifier_register()

Which I think is trying to create a lock to protect dev->vqs..

Has the problem that mmu_notifier_unregister() doesn't guarantee that
invalidate_start/end are fully paired.

So after any unregister the code has to clean up any resulting
unbalanced invalidate_count before it can call mmu_notifier_register
again, i.e. zero the invalidate_count.
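In other words, something like this before re-registering (a sketch; the
field names follow the vhost patch under discussion, and the locking shown
is illustrative only):

	mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);

	/* unregister may leave a _start without its matching _end */
	spin_lock(&vq->mmu_lock);
	vq->invalidate_count = 0;	/* rebalance before re-registering */
	spin_unlock(&vq->mmu_lock);

	mmu_notifier_register(&dev->mmu_notifier, dev->mm);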

It also seems really weird that vhost_map_prefetch() can fail, i.e. due
to __get_user_pages_fast needing to block, but that just silently
(permanently?) disables the optimization?? At least the usage here
would be better done with a seqcount lock and a normal blocking call
to get_user_pages_fast()...
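The pattern being suggested would look roughly like this (a sketch only;
vq->map_seq is a hypothetical seqcount_t, with the invalidate callbacks
doing write_seqcount_begin/end on it under vq->mmu_lock):

	unsigned int seq;
	long npinned;

	do {
		seq = read_seqcount_begin(&vq->map_seq);
		npinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE,
					      pages);
		if (npinned <= 0)
			return npinned ? npinned : -EFAULT;
		if (!read_seqcount_retry(&vq->map_seq, seq))
			break;			/* no invalidation raced */
		release_pages(pages, npinned);	/* raced: drop and retry */
	} while (1);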

Jason

Joel Fernandes

Jul 22, 2019, 11:14:43 AM
to Paul E. McKenney, Matthew Wilcox, Michael S. Tsirkin, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
[snip]
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.

Other than what Paul already mentioned about deadlocks, I am not sure if this
would even work for all cases since call_rcu() has to wait for a grace
period.

So, if the number of outstanding requests is higher than a certain amount,
then for some RCU configurations you *still* have to wait for the
grace-period duration and cannot just execute the callback in-line. Did I
miss something?

Can waiting in-line for a grace period duration be tolerated in the vhost case?

thanks,

- Joel

Michael S. Tsirkin

Jul 22, 2019, 11:47:37 AM
to Joel Fernandes, Paul E. McKenney, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
No, but it has many other ways to recover (try again later, drop a
packet, use a slower copy to/from user).

--
MST

Paul E. McKenney

Jul 22, 2019, 11:52:50 AM
to Jason Gunthorpe, Michael S. Tsirkin, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Mon, Jul 22, 2019 at 10:41:52AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:
>
> > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > That would ensure that frequent callers usually ended up doing their
> > > > > own processing.
> > > >
> > > > Unfortunately, no. Here is a code fragment illustrating why:
>
> That is only true in the general case though, kfree_rcu() doesn't have
> this problem since we know what the callback is doing. In general a
> caller of kfree_rcu() should not need to hold any locks while calling
> it.

Good point, at least as long as the slab allocators don't call kfree_rcu()
while holding any of the slab locks.

However, that would require a separate list for the kfree_rcu() callbacks,
and concurrent access to those lists of kfree_rcu() callbacks. So this
might work, but would add some complexity and also yet another restriction
between RCU and another kernel subsystem. So I would like to try the
other approaches first, for example, the time-based approach in my
prototype and Eric Dumazet's more polished patch.

But the immediate-invocation possibility is still there if needed.

> We could apply the same idea more generally and have some
> 'call_immediate_or_rcu()' which has restrictions on the caller's
> context.
>
> I think if we have some kind of problem here it would be better to
> handle it inside the core code and only require that callers use the
> correct RCU API.

Agreed. Especially given that there are a number of things that can
be done within RCU.

> I can think of many places where kfree_rcu() is being used under user
> control..

And same for call_rcu().

And this is not the first time we have run into this. The last time
was about 15 years ago, if I remember correctly, and that one led to
some of the quiescent-state forcing and callback-invocation batch size
tricks still in use today. My only real surprise is that it took so
long for this to come up again. ;-)

Please note also that in the common case on default configurations,
callback invocation is done on the CPU that posted the callback.
This means that callback invocation normally applies backpressure
to the callback-happy workload.

So why then is there a problem?

The problem is not the lack of backpressure, but rather that the
scheduling of callback invocation needs to be a bit more considerate
of the needs of the rest of the system. In the common case, that is.
Except that the uncommon case is real-time configurations, in which care
is needed anyway. But I am in the midst of helping those out as well,
details on the "dev" branch of -rcu.

Thanx, Paul

Paul E. McKenney

Jul 22, 2019, 11:55:47 AM
to Michael S. Tsirkin, Joel Fernandes, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
True enough! And your idea of taking recovery action based on the number
of callbacks seems like a good one while we are getting RCU's callback
scheduling improved.

By the way, was this a real problem that you could make happen on real
hardware? If not, I would suggest just letting RCU get improved over
the next couple of releases.

If it is something that you actually made happen, please let me know
what (if anything) you need from me for your callback-counting EBUSY
scheme.

Thanx, Paul

Jason Gunthorpe

Jul 22, 2019, 12:04:51 PM
to Paul E. McKenney, Michael S. Tsirkin, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> So why then is there a problem?

I'm not sure there is a real problem, I thought Michael was just
asking how to design with RCU in the case where the user controls the
kfree_rcu??

Sounds like the answer is "don't worry about it" ?

Thanks,
Jason

Michael S. Tsirkin

Jul 22, 2019, 12:13:53 PM
to Paul E. McKenney, Joel Fernandes, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
So basically use kfree_rcu but add a comment saying e.g. "WARNING:
in the future callers of kfree_rcu might need to check that
not too many callbacks get queued. In that case, we can
disable the optimization, or recover in some other way.
Watch this space."


> If it is something that you actually made happen, please let me know
> what (if anything) you need from me for your callback-counting EBUSY
> scheme.
>
> Thanx, Paul

If you mean kfree_rcu causing OOM then no, it's all theoretical.
If you mean synchronize_rcu stalling to the point where guest will OOPs,
then yes, that's not too hard to trigger.

Michael S. Tsirkin

Jul 22, 2019, 12:15:34 PM
to Jason Gunthorpe, Paul E. McKenney, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> > So why then is there a problem?
>
> I'm not sure there is a real problem, I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??


Right it's all based on documentation saying we should worry :)

Paul E. McKenney

Jul 22, 2019, 12:16:05 PM
to Jason Gunthorpe, Michael S. Tsirkin, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
Unless you can force failures, you should be good.

And either way, improvements to RCU's handling of this sort of situation
are in the works. And rcutorture has gained tests of this stuff in the
last year or so as well, see its "fwd_progress" module parameter and
the related code.

Thanx, Paul

Paul E. McKenney

Jul 22, 2019, 12:26:02 PM
to Michael S. Tsirkin, Joel Fernandes, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
That sounds fair.

> > If it is something that you actually made happen, please let me know
> > what (if anything) you need from me for your callback-counting EBUSY
> > scheme.
> >
> > Thanx, Paul
>
> If you mean kfree_rcu causing OOM then no, it's all theoretical.
> If you mean synchronize_rcu stalling to the point where guest will OOPs,
> then yes, that's not too hard to trigger.

Is synchronize_rcu() being stalled by the userspace loop that is invoking
your ioctl that does kfree_rcu()? Or instead by the resulting callback
invocation?

Thanx, Paul

Michael S. Tsirkin

Jul 22, 2019, 12:32:31 PM
to Paul E. McKenney, Joel Fernandes, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Sorry, let me clarify. We currently have synchronize_rcu in a userspace
loop. I have a patch replacing that with kfree_rcu. This isn't the
first time synchronize_rcu is stalling a VM for a long while so I didn't
investigate further.

--
MST

Paul E. McKenney

Jul 22, 2019, 2:58:47 PM
to Michael S. Tsirkin, Joel Fernandes, Matthew Wilcox, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jaso...@redhat.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
> > > If you mean kfree_rcu causing OOM then no, it's all theoretical.
> > > If you mean synchronize_rcu stalling to the point where guest will OOPs,
> > > then yes, that's not too hard to trigger.
> >
> > Is synchronize_rcu() being stalled by the userspace loop that is invoking
> > your ioctl that does kfree_rcu()? Or instead by the resulting callback
> > invocation?
>
> Sorry, let me clarify. We currently have synchronize_rcu in a userspace
> loop. I have a patch replacing that with kfree_rcu. This isn't the
> first time synchronize_rcu is stalling a VM for a long while so I didn't
> investigate further.

Ah, so a bunch of synchronize_rcu() calls within a single system call
inside the host is stalling the guest, correct?

If so, one straightforward approach is to do an rcu_barrier() every
(say) 1000 kfree_rcu() calls within that loop in the system call.
This will decrease the overhead by almost a factor of 1000 compared to
a synchronize_rcu() on each trip through that loop, and will prevent
callback overload.
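For instance (a sketch of the batching described above; struct obj and its
list/rcu members are hypothetical stand-ins for whatever the ioctl frees):

	struct obj {
		struct list_head list;
		struct rcu_head rcu;
	};

	static void free_all(struct list_head *head)
	{
		struct obj *o, *tmp;
		int n = 0;

		list_for_each_entry_safe(o, tmp, head, list) {
			list_del(&o->list);
			kfree_rcu(o, rcu);		/* fire and forget */
			if (++n % 1000 == 0)
				rcu_barrier();	/* bound outstanding callbacks */
		}
	}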

Or if the situation is different (for example, the guest does a long
sequence of system calls, each of which does a single kfree_rcu() or
some such), please let me know what the situation is.

Thanx, Paul

syzbot

Jul 22, 2019, 7:51:01 PM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot tried to test the proposed patch but build/boot failed:

KASAN: null-ptr-deref Read in vhost_debug_mm

==================================================================
BUG: KASAN: null-ptr-deref in atomic_read
/./include/asm-generic/atomic-instrumented.h:26 [inline]
BUG: KASAN: null-ptr-deref in vhost_debug_mm+0x45/0x110
/drivers/vhost/vhost.c:52
Read of size 4 at addr 0000000000000058 by task syz-fuzzer/8693

CPU: 0 PID: 8693 Comm: syz-fuzzer Not tainted 5.2.0-rc2+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack /lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
__kasan_report.cold+0x5/0x40 /mm/kasan/report.c:321
kasan_report+0x12/0x20 /mm/kasan/common.c:614
check_memory_region_inline /mm/kasan/generic.c:185 [inline]
check_memory_region+0x123/0x190 /mm/kasan/generic.c:191
kasan_check_read+0x11/0x20 /mm/kasan/common.c:94
atomic_read /./include/asm-generic/atomic-instrumented.h:26 [inline]
vhost_debug_mm+0x45/0x110 /drivers/vhost/vhost.c:52
vhost_dev_cleanup+0x1e8/0xcd0 /drivers/vhost/vhost.c:962
vhost_vsock_dev_release+0x324/0x470 /drivers/vhost/vsock.c:628
__fput+0x2ff/0x890 /fs/file_table.c:280
____fput+0x16/0x20 /fs/file_table.c:313
task_work_run+0x145/0x1c0 /kernel/task_work.c:113
tracehook_notify_resume /./include/linux/tracehook.h:188 [inline]
exit_to_usermode_loop+0x273/0x2c0 /arch/x86/entry/common.c:168
prepare_exit_to_usermode /arch/x86/entry/common.c:199 [inline]
syscall_return_slowpath /arch/x86/entry/common.c:279 [inline]
do_syscall_64+0x58e/0x680 /arch/x86/entry/common.c:304
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x47fcb4
Code: ff ff cc cc cc cc e8 2b 41 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b
54 24 20 45 31 d2 45 31 c0 45 31 c9 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff
ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
RSP: 002b:000000c4201173e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000047fcb4
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 000000c420117428 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000c4203e2cb9
R13: 000000c4203e2cbf R14: 000000c4203e2cb8 R15: 000000c4203e2cc8
==================================================================


[ ok ] Starting enhanced syslogd: rsyslogd.
[ 65.063118][ T24] audit: type=1800 audit(1563839374.387:25): pid=8533
uid=0 auid=4294967295 ses=4294967295 subj==unconfined op=collect_data
cause=failed(directio) comm="startpar" name="cron" dev="sda1" ino=2414 res=0
[ 65.104240][ T24] audit: type=1800 audit(1563839374.397:26): pid=8533
uid=0 auid=4294967295 ses=4294967295 subj==unconfined op=collect_data
cause=failed(directio) comm="startpar" name="mcstrans" dev="sda1" ino=2457
res=0
[ 65.137449][ T24] audit: type=1800 audit(1563839374.397:27): pid=8533
uid=0 auid=4294967295 ses=4294967295 subj==unconfined op=collect_data
cause=failed(directio) comm="startpar" name="restorecond" dev="sda1"
ino=2436 res=0
[ ok ] Starting periodic command scheduler: cron.
[ ok ] Starting OpenBSD Secure Shell server: sshd.

Debian GNU/Linux 7 syzkaller ttyS0

Warning: Permanently added '10.128.1.53' (ECDSA) to the list of known hosts.
2019/07/22 23:49:45 fuzzer started
2019/07/22 23:49:47 connecting to host at 10.128.0.26:36975
2019/07/22 23:49:47 checking machine...
2019/07/22 23:49:47 checking revisions...
2019/07/22 23:49:47 testing simple program...
syzkaller login: [ 78.547300][ T8703] IPVS: ftp: loaded support on
port[0] = 21
2019/07/22 23:49:47 building call list...
[ 79.944624][ T8693]
==================================================================
[ 79.952882][ T8693] BUG: KASAN: null-ptr-deref in
vhost_debug_mm+0x45/0x110
[ 79.960008][ T8693] Read of size 4 at addr 0000000000000058 by task
syz-fuzzer/8693
[ 79.967815][ T8693]
[ 79.970184][ T8693] CPU: 0 PID: 8693 Comm: syz-fuzzer Not tainted
5.2.0-rc2+ #1
[ 79.977646][ T8693] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 79.987746][ T8693] Call Trace:
[ 79.991064][ T8693] dump_stack+0x172/0x1f0
[ 79.995416][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.000206][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.005026][ T8693] __kasan_report.cold+0x5/0x40
[ 80.009906][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.014685][ T8693] kasan_report+0x12/0x20
[ 80.019056][ T8693] check_memory_region+0x123/0x190
[ 80.024174][ T8693] kasan_check_read+0x11/0x20
[ 80.028886][ T8693] vhost_debug_mm+0x45/0x110
[ 80.033491][ T8693] vhost_dev_cleanup+0x1e8/0xcd0
[ 80.038444][ T8693] vhost_vsock_dev_release+0x324/0x470
[ 80.043916][ T8693] ? __sanitizer_cov_trace_const_cmp2+0x18/0x20
[ 80.050170][ T8693] __fput+0x2ff/0x890
[ 80.054162][ T8693] ? vhost_vsock_dev_open+0x330/0x330
[ 80.059539][ T8693] ____fput+0x16/0x20
[ 80.063506][ T8693] task_work_run+0x145/0x1c0
[ 80.068129][ T8693] exit_to_usermode_loop+0x273/0x2c0
[ 80.073402][ T8693] do_syscall_64+0x58e/0x680
[ 80.078022][ T8693] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 80.083900][ T8693] RIP: 0033:0x47fcb4
[ 80.087782][ T8693] Code: ff ff cc cc cc cc e8 2b 41 fb ff 48 8b 7c 24
10 48 8b 74 24 18 48 8b 54 24 20 45 31 d2 45 31 c0 45 31 c9 48 8b 44 24 08
0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 80.107385][ T8693] RSP: 002b:000000c4201173e0 EFLAGS: 00000246
ORIG_RAX: 0000000000000003
[ 80.115800][ T8693] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
000000000047fcb4
[ 80.123760][ T8693] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000003
[ 80.131718][ T8693] RBP: 000000c420117428 R08: 0000000000000000 R09:
0000000000000000
[ 80.139675][ T8693] R10: 0000000000000000 R11: 0000000000000246 R12:
000000c4203e2cb9
[ 80.147637][ T8693] R13: 000000c4203e2cbf R14: 000000c4203e2cb8 R15:
000000c4203e2cc8
[ 80.155614][ T8693]
==================================================================
[ 80.163659][ T8693] Disabling lock debugging due to kernel taint
[ 80.170251][ T8693] Kernel panic - not syncing: panic_on_warn set ...
[ 80.176849][ T8693] CPU: 0 PID: 8693 Comm: syz-fuzzer Tainted: G
B 5.2.0-rc2+ #1
[ 80.185670][ T8693] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 80.195708][ T8693] Call Trace:
[ 80.198989][ T8693] dump_stack+0x172/0x1f0
[ 80.203302][ T8693] panic+0x2cb/0x744
[ 80.207177][ T8693] ? __warn_printk+0xf3/0xf3
[ 80.211759][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.216528][ T8693] ? preempt_schedule+0x4b/0x60
[ 80.221369][ T8693] ? ___preempt_schedule+0x16/0x18
[ 80.226474][ T8693] ? trace_hardirqs_on+0x5e/0x220
[ 80.231489][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.236254][ T8693] end_report+0x47/0x4f
[ 80.247913][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.252664][ T8693] __kasan_report.cold+0xe/0x40
[ 80.257512][ T8693] ? vhost_debug_mm+0x45/0x110
[ 80.262259][ T8693] kasan_report+0x12/0x20
[ 80.266576][ T8693] check_memory_region+0x123/0x190
[ 80.271948][ T8693] kasan_check_read+0x11/0x20
[ 80.276697][ T8693] vhost_debug_mm+0x45/0x110
[ 80.281274][ T8693] vhost_dev_cleanup+0x1e8/0xcd0
[ 80.286199][ T8693] vhost_vsock_dev_release+0x324/0x470
[ 80.291645][ T8693] ? __sanitizer_cov_trace_const_cmp2+0x18/0x20
[ 80.297869][ T8693] __fput+0x2ff/0x890
[ 80.301855][ T8693] ? vhost_vsock_dev_open+0x330/0x330
[ 80.307236][ T8693] ____fput+0x16/0x20
[ 80.314132][ T8693] task_work_run+0x145/0x1c0
[ 80.318741][ T8693] exit_to_usermode_loop+0x273/0x2c0
[ 80.324010][ T8693] do_syscall_64+0x58e/0x680
[ 80.328586][ T8693] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 80.334495][ T8693] RIP: 0033:0x47fcb4
[ 80.338378][ T8693] Code: ff ff cc cc cc cc e8 2b 41 fb ff 48 8b 7c 24
10 48 8b 74 24 18 48 8b 54 24 20 45 31 d2 45 31 c0 45 31 c9 48 8b 44 24 08
0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 80.357972][ T8693] RSP: 002b:000000c4201173e0 EFLAGS: 00000246
ORIG_RAX: 0000000000000003
[ 80.366382][ T8693] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
000000000047fcb4
[ 80.374363][ T8693] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000003
[ 80.382331][ T8693] RBP: 000000c420117428 R08: 0000000000000000 R09:
0000000000000000
[ 80.390290][ T8693] R10: 0000000000000000 R11: 0000000000000246 R12:
000000c4203e2cb9
[ 80.398247][ T8693] R13: 000000c4203e2cbf R14: 000000c4203e2cb8 R15:
000000c4203e2cc8
[ 80.407403][ T8693] Kernel Offset: disabled
[ 80.411733][ T8693] Rebooting in 86400 seconds..



Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=14dc213fa00000

Jason Wang

Jul 22, 2019, 11:55:49 PM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Yes I do.


>
> Now I took a hard look at the uaddr hackery and it really makes
> me nervous. So I think for this release we want something
> safe, and optimizations on top. As an alternative, revert the
> optimization and try again for the next merge window.


Will post a series of fixes, let me know if you're ok with that.

Thanks


>
>

Jason Wang

Jul 23, 2019, 12:02:00 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
I think you mean the synchronize_rcu_expedited()? Rethinking the code,
the synchronize_rcu() in ioctl() could be removed, since it is
serialized with the memory accessors.

Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
(just a little bit more hard to trigger):

	case KVM_RUN: {
		...
		if (unlikely(oldpid != task_pid(current))) {
			/* The thread running this VCPU changed. */
			struct pid *newpid;

			r = kvm_arch_vcpu_run_pid_change(vcpu);
			if (r)
				break;

			newpid = get_task_pid(current, PIDTYPE_PID);
			rcu_assign_pointer(vcpu->pid, newpid);
			if (oldpid)
				synchronize_rcu();
			put_pid(oldpid);
		}
		...
		break;

>
>>> Signed-off-by: Michael S. Tsirkin<m...@redhat.com>
>>
>> Let me try to figure out the root cause then decide whether or not to go for
>> this way.
>>
>> Thanks
> The root cause of the crash is relevant, but we still need
> to fix issues 1-4.
>
> More issues (my patch tries to fix them too):
>
> 5. page not dirtied when mappings are torn down outside
> of invalidate callback


Yes.


>
> 6. potential cross-VM DOS by one guest keeping system busy
> and increasing synchronize_rcu latency to the point where
> another guest starts timing out and crashes
>
>
>

This will be addressed after I remove the synchronize_rcu() from the ioctl path.

Thanks

Michael S. Tsirkin

Jul 23, 2019, 1:01:55 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Really let's just use kfree_rcu. It's way cleaner: fire and forget.
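That is, instead of a synchronize_rcu()/kfree() pair, just (assuming the
map structure embeds a struct rcu_head member named rcu, as the patch
under discussion appears to do):

	kfree_rcu(map, rcu);	/* queue the free; no waiting in the caller */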

>
> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> (just a little bit more hard to trigger):


AFAIK these never run in response to guest events.
So they can take very long and guests still won't crash.

Michael S. Tsirkin

Jul 23, 2019, 1:02:45 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
I'd prefer you to take a hard look at the patch I posted,
which makes the code cleaner, and add optimizations on top.
But other ways could be OK too.

>
> >
> >

Jason Wang

Jul 23, 2019, 1:47:19 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Looks not, you need to rate-limit the firing as you've figured out? And in
fact, the synchronization is not even needed; does it help if I leave a
comment to explain?


>
>> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
>> (just a little bit more hard to trigger):
>
> AFAIK these never run in response to guest events.
> So they can take very long and guests still won't crash.


What if guest manages to escape to qemu?

Thanks

Jason Wang

Jul 23, 2019, 1:49:05 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
I did. But it looks to me like a series of only about 60 lines of code
can fix all the issues we found without reverting the uaddr optimization.


> and ad optimizations on top.
> But other ways could be ok too.


I'm waiting for the test result from syzbot and will post. Let's see if
you are OK with that.

Thanks


>>>

Michael S. Tsirkin

Jul 23, 2019, 3:23:29 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
See the discussion that followed. Basically no, it's good enough
already and is only going to be better.

> And in fact,
> the synchronization is not even needed, does it help if I leave a comment to
> explain?

Let's try to figure it out in the mail first. I'm pretty sure the
current logic is wrong.

>
> >
> > > Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> > > (just a little bit more hard to trigger):
> >
> > AFAIK these never run in response to guest events.
> > So they can take very long and guests still won't crash.
>
>
> What if guest manages to escape to qemu?
>
> Thanks

Then it's going to be slow. Why do we care?
What we do not want is synchronize_rcu that guest is blocked on.

Michael S. Tsirkin

Jul 23, 2019, 3:25:18 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Oh I didn't know one can push a test to syzbot and get back
a result. How does one do that?


>
> > > >

syzbot

Jul 23, 2019, 3:34:02 AM
to jaso...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger
crash:

Reported-and-tested-by:
syzbot+e58112...@syzkaller.appspotmail.com

Tested on:

commit: 7f466032 vhost: access vq metadata through kernel virtual ..
git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
kernel config: https://syzkaller.appspot.com/x/.config?x=a63910e9628f3db1
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
patch: https://syzkaller.appspot.com/x/patch.diff?x=177ae87c600000

Note: testing is done by a robot and is best-effort only.

Jason Wang

Jul 23, 2019, 3:53:17 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org

On 2019/7/23 3:23 PM, Michael S. Tsirkin wrote:
>>> Really let's just use kfree_rcu. It's way cleaner: fire and forget.
>> Looks not, you need to rate-limit the firing as you've figured out?
> See the discussion that followed. Basically no, it's good enough
> already and is only going to be better.
>
>> And in fact, the synchronization is not even needed; does it help if I
>> leave a comment to explain?
> Let's try to figure it out in the mail first. I'm pretty sure the
> current logic is wrong.


Here is what the code wants to achieve:

- The map is protected by RCU

- Writers are: MMU notifier invalidation callbacks, file operations
(ioctls etc), meta_prefetch (datapath)

- Readers are: memory accessors

Writers are synchronized through mmu_lock. RCU is used to synchronize
between writers and readers.

The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronize
it with readers (memory accessors) in the path of file operations. But
in this case, vq->mutex is already held, which means it has been
serialized with the memory accessors. That's why I think it can be
removed safely.

Anything I miss here?
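In code form, the scheme described above is roughly the following (a
sketch; the names follow the patch under discussion, but the actual code
differs in detail):

	/* Reader (memory accessor): */
	rcu_read_lock();
	map = rcu_dereference(vq->maps[type]);
	if (map) {
		/* ... access vq metadata through the map ... */
	}
	rcu_read_unlock();

	/* Writer (notifier / ioctl / prefetch), serialized by mmu_lock: */
	spin_lock(&vq->mmu_lock);
	old = rcu_dereference_protected(vq->maps[type],
					lockdep_is_held(&vq->mmu_lock));
	rcu_assign_pointer(vq->maps[type], NULL);
	spin_unlock(&vq->mmu_lock);
	synchronize_rcu();	/* or kfree_rcu(), per this thread */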


>
>>>> Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
>>>> (just a little bit more hard to trigger):
>>> AFAIK these never run in response to guest events.
>>> So they can take very long and guests still won't crash.
>> What if guest manages to escape to qemu?
>>
>> Thanks
> Then it's going to be slow. Why do we care?
> What we do not want is synchronize_rcu that guest is blocked on.
>

OK, it looks like I have some misunderstanding here of the reason why
synchronize_rcu() is not preferable in the ioctl path. But in the kvm
case, if rcu_expedited is set, it can trigger IPIs AFAIK.

Thanks


Jason Wang

Jul 23, 2019, 3:56:00 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
See here https://github.com/google/syzkaller/blob/master/docs/syzbot.md

Just reply to this thread, attaching a fix, with a command like: "#syz test:
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
7f466032dc9e5a61217f22ea34b2df932786bbfc"

Btw, I've let syzbot test your patch, and it passes.

Thanks


>

Michael S. Tsirkin

Jul 23, 2019, 3:57:06 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Another thing I like about the patch I posted is that
it removes 60 lines of code, instead of adding more :)
Mostly because of unifying everything into
a single cleanup function and using kfree_rcu.

So how about this: do exactly what you propose but as a 2 patch series:
start with the slow safe patch, and then add the uaddr optimizations
on top. We can then more easily reason about whether they are safe.

Basically you are saying this:
- notifiers are only needed to invalidate maps
- we make sure any uaddr change invalidates maps anyway
- thus it's ok not to have notifiers since we do
not have maps

All this looks ok but the question is why do we
bother unregistering them. And the answer seems to
be that this is so we can start with a balanced
counter: otherwise we can be between _start and
_end calls.

I also wonder about ordering. kvm has this:
/*
* Used to check for invalidations in progress, of the pfn that is
* returned by pfn_to_pfn_prot below.
*/
mmu_seq = kvm->mmu_notifier_seq;
/*
* Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
* gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
* risk the page we get a reference to getting unmapped before we have a
* chance to grab the mmu_lock without mmu_notifier_retry() noticing.
*
* This smp_rmb() pairs with the effective smp_wmb() of the combination
* of the pte_unmap_unlock() after the PTE is zapped, and the
* spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
* mmu_notifier_seq is incremented.
*/
smp_rmb();

does this apply to us? Can't we use a seqlock instead so we do
not need to worry?

--
MST

Michael S. Tsirkin

Jul 23, 2019, 4:10:37 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Tue, Jul 23, 2019 at 03:53:06PM +0800, Jason Wang wrote:
>
> On 2019/7/23 3:23 PM, Michael S. Tsirkin wrote:
> > > > Really let's just use kfree_rcu. It's way cleaner: fire and forget.
> > > Looks not, you need to rate-limit the firing as you've figured out?
> > See the discussion that followed. Basically no, it's good enough
> > already and is only going to be better.
> >
> > > And in fact, the synchronization is not even needed; does it help if I
> > > leave a comment to explain?
> > Let's try to figure it out in the mail first. I'm pretty sure the
> > current logic is wrong.
>
>
> Here is what the code wants to achieve:
>
> - The map is protected by RCU
>
> - Writers are: MMU notifier invalidation callbacks, file operations (ioctls
> etc), meta_prefetch (datapath)
>
> - Readers are: memory accessors
>
> Writers are synchronized through mmu_lock. RCU is used to synchronize
> between writers and readers.
>
> The synchronize_rcu() in vhost_reset_vq_maps() was used to synchronize it
> with readers (memory accessors) in the path of file operations. But in this
> case, vq->mutex is already held, which means it has been serialized with
> the memory accessors. That's why I think it can be removed safely.
>
> Anything I miss here?
>

So invalidate callbacks need to reset the map, and they do
not have vq mutex. How can they do this and free
the map safely? They need synchronize_rcu or kfree_rcu right?

And I worry somewhat that synchronize_rcu in an MMU notifier
is a problem, MMU notifiers are supposed to be quick:
they are on a read side critical section of SRCU.

If we could get rid of RCU that would be even better.

But now I wonder:
invalidate_start has to mark page as dirty
(this is what my patch added, current code misses this).

at that point kernel can come and make the page clean again.

At that point VQ handlers can keep a copy of the map
and change the page again.


At this point I don't understand how we can mark page dirty
safely.

> >
> > > > > Btw, for kvm ioctl it still uses synchronize_rcu() in kvm_vcpu_ioctl(),
> > > > > (just a little bit more hard to trigger):
> > > > AFAIK these never run in response to guest events.
> > > > So they can take very long and guests still won't crash.
> > > What if guest manages to escape to qemu?
> > >
> > > Thanks
> > Then it's going to be slow. Why do we care?
> > What we do not want is synchronize_rcu that guest is blocked on.
> >
>
> OK, it looks like I have some misunderstanding here of the reason why
> synchronize_rcu() is not preferable in the ioctl path. But in the kvm
> case, if rcu_expedited is set, it can trigger IPIs AFAIK.
>
> Thanks
>

Yes, expedited is not good for something guest can trigger.
Let's just use kfree_rcu if we can. Paul said even though
documentation still says it needs to be rate-limited, that
part is basically stale and will get updated.

--
MST

Jason Wang

Jul 23, 2019, 4:42:32 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Yes.


>
> So how about this: do exactly what you propose but as a 2 patch series:
> start with the slow safe patch, and add then return uaddr optimizations
> on top. We can then more easily reason about whether they are safe.


If you stick, I can do this.


> Basically you are saying this:
> - notifiers are only needed to invalidate maps
> - we make sure any uaddr change invalidates maps anyway
> - thus it's ok not to have notifiers since we do
> not have maps
>
> All this looks ok but the question is why do we
> bother unregistering them. And the answer seems to
> be that this is so we can start with a balanced
> counter: otherwise we can be between _start and
> _end calls.


Yes, since there could be multiple concurrent invalidation requests. We
need to count them to make sure we don't pin the wrong pages.
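For example (a sketch of the counting described above; the callback
signatures match the 5.2-era mmu_notifier API, but the body is simplified
to a single vq):

	static int vhost_invalidate_range_start(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
	{
		struct vhost_dev *dev = container_of(mn, struct vhost_dev,
						     mmu_notifier);
		struct vhost_virtqueue *vq = dev->vqs[0];	/* simplified */

		spin_lock(&vq->mmu_lock);
		++vq->invalidate_count;	/* blocks prefetch from pinning */
		/* ... tear down maps overlapping [range->start, range->end) ... */
		spin_unlock(&vq->mmu_lock);
		return 0;
	}

	static void vhost_invalidate_range_end(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
	{
		struct vhost_dev *dev = container_of(mn, struct vhost_dev,
						     mmu_notifier);
		struct vhost_virtqueue *vq = dev->vqs[0];

		spin_lock(&vq->mmu_lock);
		--vq->invalidate_count;	/* balanced with _start */
		spin_unlock(&vq->mmu_lock);
	}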


>
> I also wonder about ordering. kvm has this:
> /*
> * Used to check for invalidations in progress, of the pfn that is
> * returned by pfn_to_pfn_prot below.
> */
> mmu_seq = kvm->mmu_notifier_seq;
> /*
> * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in
> * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't
> * risk the page we get a reference to getting unmapped before we have a
> * chance to grab the mmu_lock without mmu_notifier_retry() noticing.
> *
> * This smp_rmb() pairs with the effective smp_wmb() of the combination
> * of the pte_unmap_unlock() after the PTE is zapped, and the
> * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before
> * mmu_notifier_seq is incremented.
> */
> smp_rmb();
>
> does this apply to us? Can't we use a seqlock instead so we do
> not need to worry?


I'm not familiar with kvm MMU internals, but we do everything under the
mmu_lock.

Thanks


Jason Wang

Jul 23, 2019, 4:49:12 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Invalidation callbacks need it, but file operations (e.g. ioctl) do not.


>
> And I worry somewhat that synchronize_rcu in an MMU notifier
> is a problem, MMU notifiers are supposed to be quick:


Looks not, since they are allowed to block and lots of drivers depend
on this (e.g. mmu_notifier_range_blockable()).


> they are on a read side critical section of SRCU.
>
> If we could get rid of RCU that would be even better.
>
> But now I wonder:
> invalidate_start has to mark page as dirty
> (this is what my patch added, current code misses this).


Nope, the current code does this, but not in the case when the map needs
to be invalidated in the vhost control path (ioctl etc.).


>
> at that point kernel can come and make the page clean again.
>
> At that point VQ handlers can keep a copy of the map
> and change the page again.


We will increase invalidate_count, which prevents the page from being used
by the map.

Thanks

Michael S. Tsirkin

Jul 23, 2019, 5:27:01 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Right, they can block. So why don't we take a VQ mutex and be
done with it then? No RCU tricks.

>
> > they are on a read side critical section of SRCU.
> >
> > If we could get rid of RCU that would be even better.
> >
> > But now I wonder:
> > invalidate_start has to mark page as dirty
> > (this is what my patch added, current code misses this).
>
>
> Nope, current code did this but not the case when map need to be invalidated
> in the vhost control path (ioctl etc).
>
>
> >
> > at that point kernel can come and make the page clean again.
> >
> > At that point VQ handlers can keep a copy of the map
> > and change the page again.
>
>
> We will increase invalidate_count, which prevents the page from being used
> by the map.
>
> Thanks

OK I think I got it, thanks!

Michael S. Tsirkin

Jul 23, 2019, 6:27:58 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
Given I realized my patch is buggy in that
it does not wait for outstanding maps, I don't
insist.
I don't think this helps at all.

There's no lock between checking the invalidate counter and the
__get_user_pages_fast call within vhost_map_prefetch. So it's possible
that __get_user_pages_fast reads PTEs speculatively before the
invalidate counter is read.

--
MST

Michael S. Tsirkin

Jul 23, 2019, 6:42:48 AM
to Jason Wang, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
On Tue, Jul 23, 2019 at 04:42:19PM +0800, Jason Wang wrote:
> > So how about this: do exactly what you propose but as a 2 patch series:
> > start with the slow safe patch, and then add the uaddr optimizations
> > on top. We can then more easily reason about whether they are safe.
>
>
> If you stick, I can do this.

So I definitely don't insist but I'd like us to get back to where
we know existing code is very safe (if not super fast) and
optimizing from there. Bugs happen but I'd like to see a bisect
giving us "oh it's because of XYZ optimization" and not the
general "it's somewhere within this driver" that we are getting
now.

Maybe the way to do this is to revert for this release cycle
and target the next one. What do you think?

--
MST

Jason Wang

Jul 23, 2019, 9:31:49 AM
to Michael S. Tsirkin, syzbot, aarc...@redhat.com, ak...@linux-foundation.org, chri...@brauner.io, da...@davemloft.net, ebie...@xmission.com, elena.r...@intel.com, gu...@fb.com, h...@infradead.org, james.b...@hansenpartnership.com, jgl...@redhat.com, kees...@chromium.org, l...@altlinux.org, linux-ar...@lists.infradead.org, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, lu...@amacapital.net, mho...@suse.com, mi...@kernel.org, na...@vmware.com, pet...@infradead.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, w...@chromium.org
This is how I wanted to go with the RFC and V1. But I ended up with
deadlocks between the vq locks and some MM-internal locks. So I decided
to use RCU, which is 100% under the control of vhost.

Thanks

Jason Wang

Jul 23, 2019, 9:34:39 AM
In vhost_map_prefetch() we do:

        spin_lock(&vq->mmu_lock);

        ...

        err = -EFAULT;
        if (vq->invalidate_count)
                goto err;

        ...

        npinned = __get_user_pages_fast(uaddr->uaddr, npages,
                                        uaddr->write, pages);

        ...

        spin_unlock(&vq->mmu_lock);

Is this not sufficient?

Thanks

Jason Wang

Jul 23, 2019, 9:37:50 AM

On 2019/7/23 6:42 PM, Michael S. Tsirkin wrote:
> On Tue, Jul 23, 2019 at 04:42:19PM +0800, Jason Wang wrote:
>>> So how about this: do exactly what you propose but as a 2 patch series:
>>> start with the slow safe patch, and then add the return-uaddr optimizations
>>> on top. We can then more easily reason about whether they are safe.
>>
>> If you insist, I can do this.
> So I definitely don't insist but I'd like us to get back to where
> we know existing code is very safe (if not super fast) and
> optimize from there. Bugs happen but I'd like to see a bisect
> giving us "oh it's because of XYZ optimization" and not the
> general "it's somewhere within this driver" that we are getting
> now.


Syzbot has in fact bisected this to the metadata acceleration commit :)


>
> Maybe the way to do this is to revert for this release cycle
> and target the next one. What do you think?


I would rather try to fix the issues, considering that packed virtqueue
may use this for a good performance number. But if you insist, I'm ok
with reverting. Or maybe introduce a config option to disable it by
default (so the optimization could be ruled out in almost all cases).

Thanks

Michael S. Tsirkin

Jul 23, 2019, 11:02:48 AM
So what orders __get_user_pages_fast() wrt the invalidate_count read?

--
MST

Jason Wang

Jul 23, 2019, 10:17:30 PM
So in the invalidate_end() callback we have:

        spin_lock(&vq->mmu_lock);
        --vq->invalidate_count;
        spin_unlock(&vq->mmu_lock);


So even if the PTE is read speculatively before reading invalidate_count
(which only matters in the case where invalidate_count is zero), the
spinlock guarantees that we won't read any stale PTEs.

Thanks



Michael S. Tsirkin

Jul 24, 2019, 4:05:27 AM
I'm sorry I just do not get the argument.
If you want to order two reads you need an smp_rmb
or stronger between them executed on the same CPU.

Executing any kind of barrier on another CPU
will have no ordering effect on the 1st one.


So if CPU1 runs the prefetch and CPU2 runs the invalidate callback, the
read of the invalidate counter on CPU1 can bypass the read of the PTE on
CPU1 unless there's a barrier in between, and nothing CPU2 does can
affect that outcome.
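
In a picture, the interleaving I worry about is roughly this (illustrative
only; no lock is assumed to be held across the two reads on CPU1):

        /*
         * CPU1 (prefetch)              CPU2 (invalidate)
         * ---------------              -----------------
         * reads PTE speculatively
         *                              invalidate_count++
         *                              clears/updates PTE
         *                              invalidate_count--
         * reads invalidate_count == 0
         * uses the stale PTE value
         *
         * Ordering the two reads on CPU1 needs an smp_rmb() (or
         * stronger) executed on CPU1; a barrier executed only on
         * CPU2 cannot provide that ordering.
         */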


What did I miss?


Jason Wang

Jul 24, 2019, 6:08:27 AM
It does no harm if the PTE is read before invalidate_count, because:

1) This speculation is serialized with invalidate_range_end() by the
spinlock.

2) This speculation can only take effect when we read invalidate_count
as zero.

3) This means the speculation happens after the last
invalidate_range_end(), and because of the spinlock, when we enter the
spinlock critical section in prefetch we cannot see any stale PTE that
was unmapped before.

Am I wrong?

Thanks

Jason Gunthorpe

Jul 24, 2019, 12:53:20 PM
On Wed, Jul 24, 2019 at 04:05:17AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 10:17:14AM +0800, Jason Wang wrote:
> > So even if the PTE is read speculatively before reading invalidate_count
> > (which only matters in the case where invalidate_count is zero), the
> > spinlock guarantees that we won't read any stale PTEs.
>
> I'm sorry I just do not get the argument.
> If you want to order two reads you need an smp_rmb
> or stronger between them executed on the same CPU.

No, that is only for unlocked algorithms.

In this case the spinlock provides all the 'or stronger' ordering
required.

For invalidate_count going 0->1 the spin_lock ensures that any
following PTE update during invalidation does not order before the
spin_lock().

While holding the lock and observing 1 in invalidate_count the PTE
values might be changing, but are ignored. C's rules about sequencing
make this safe.

For invalidate_count going 1->0 the spin_unlock ensures that any
preceding PTE update during invalidation does not order after the
spin_unlock().

While holding the lock and observing 0 in invalidate_count the PTE
values cannot be changing.
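
Putting the two sides together, a minimal sketch of the protocol being
described (the structure and names here are simplified, not the actual
vhost code):

        struct vq_sketch {
                spinlock_t mmu_lock;
                int invalidate_count;
        };

        static void invalidate_range_start(struct vq_sketch *vq)
        {
                spin_lock(&vq->mmu_lock);
                vq->invalidate_count++;
                spin_unlock(&vq->mmu_lock);
                /* the MM updates PTEs after this point ... */
        }

        static void invalidate_range_end(struct vq_sketch *vq)
        {
                /* ... and the spin_unlock() below orders those
                 * updates before any later acquisition of mmu_lock */
                spin_lock(&vq->mmu_lock);
                --vq->invalidate_count;
                spin_unlock(&vq->mmu_lock);
        }

        static int prefetch(struct vq_sketch *vq)
        {
                int err = -EFAULT;

                spin_lock(&vq->mmu_lock);
                if (vq->invalidate_count)
                        goto out;       /* PTEs may be changing: ignore them */
                /*
                 * invalidate_count == 0 under the lock: the last
                 * invalidate_range_end()'s unlock ordered its PTE
                 * updates before our spin_lock(), and the acquire keeps
                 * even a speculative PTE read inside the critical
                 * section, so no stale PTE can be observed here.
                 */
                err = 0;
        out:
                spin_unlock(&vq->mmu_lock);
                return err;
        }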

Jason

Michael S. Tsirkin

Jul 24, 2019, 2:25:25 PM
Oh right. So prefetch holds the spinlock the whole time.
Sorry about the noise.

--
MST

Michael S. Tsirkin

Jul 24, 2019, 2:26:04 PM
OK I think you are right. Sorry it took me a while to figure out.

--
MST

Jason Wang

Jul 24, 2019, 11:45:02 PM
No problem. So do you want me to send a V2 of the fixes (e.g. with the
conversion from synchronize_rcu() to kfree_rcu())? Or do you want
something else (e.g. a revert or a config option)?

Thanks

Michael S. Tsirkin

Jul 25, 2019, 1:09:19 AM
Pls post V2 and I'll do my best to do a thorough review. We can then
decide: if we find more issues, then a revert makes more sense IMHO.
If we don't, let's keep it in, and if issues surface close to the
release we can flip the config option.



--
MST

Michael S. Tsirkin

Jul 25, 2019, 1:53:03 AM
And I guess the deadlock is because GUP is taking mmu locks which are
taken on the mmu notifier path, right? How about we add a seqlock and
take that in the invalidate callbacks? We can then drop the VQ lock
before GUP, and take it again immediately after.

something like

        if (!vq_meta_mapped(vq)) {
                vq_meta_setup(&uaddrs);
                mutex_unlock(&vq->mutex);
                vq_meta_map(&uaddrs);
                mutex_lock(&vq->mutex);

                /* recheck both sock->private_data and the seqlock
                 * count; if either changed - bail out */
        }
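
The seqlock side could look something like this (a sketch only; the
seqcount placement and helper names are hypothetical):

        static seqcount_t invalidate_seq = SEQCNT_ZERO(invalidate_seq);
        /* per-vq in practice, not a global like here */

        /* write side, from the invalidate callback (writers are
         * serialized elsewhere) */
        static void invalidate_cb(void)
        {
                write_seqcount_begin(&invalidate_seq);
                /* invalidate/replace the affected uaddrs here */
                write_seqcount_end(&invalidate_seq);
        }

        /* read side, wrapped around the map done with vq->mutex
         * dropped; returns false if we raced and must bail out */
        static bool map_unlocked(void)
        {
                unsigned int seq;

                seq = read_seqcount_begin(&invalidate_seq);
                /* vq_meta_map(&uaddrs) would run here, unlocked */
                return !read_seqcount_retry(&invalidate_seq, seq);
        }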

And this also requires that access to the VQ uaddrs is defined like this:
- writers must have both the vq mutex and the dev mutex
- readers must have either the vq mutex or the dev mutex


That's a big change though. For now, how about switching to a per-vq SRCU?
That is only a little bit more expensive than RCU, and we
can use synchronize_srcu_expedited.
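
For illustration, the per-vq SRCU variant could look roughly like this
(field and helper names here are hypothetical; the srcu_struct would be
set up with init_srcu_struct() at vq init time):

        struct vq_srcu_sketch {
                struct srcu_struct map_srcu;
                struct vhost_map __rcu *map;
        };

        /* reader side, e.g. in the datapath */
        static void use_map(struct vq_srcu_sketch *vq)
        {
                struct vhost_map *map;
                int idx;

                idx = srcu_read_lock(&vq->map_srcu);
                map = srcu_dereference(vq->map, &vq->map_srcu);
                if (map)
                        ; /* access the mapped metadata */
                srcu_read_unlock(&vq->map_srcu, idx);
        }

        /* invalidation side */
        static void invalidate_map(struct vq_srcu_sketch *vq)
        {
                RCU_INIT_POINTER(vq->map, NULL);
                /* expedited to shorten the grace period; unlike
                 * synchronize_rcu_expedited() this does not IPI
                 * all CPUs */
                synchronize_srcu_expedited(&vq->map_srcu);
        }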

--
MST

Michael S. Tsirkin

Jul 25, 2019, 2:02:52 AM
On Mon, Jul 22, 2019 at 11:11:52AM -0300, Jason Gunthorpe wrote:
> On Sun, Jul 21, 2019 at 06:02:52AM -0400, Michael S. Tsirkin wrote:
> > On Sat, Jul 20, 2019 at 03:08:00AM -0700, syzbot wrote:
> > > syzbot has bisected this bug to:
> > >
> > > commit 7f466032dc9e5a61217f22ea34b2df932786bbfc
> > > Author: Jason Wang <jaso...@redhat.com>
> > > Date: Fri May 24 08:12:18 2019 +0000
> > >
> > > vhost: access vq metadata through kernel virtual address
> > >
> > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=149a8a20600000
> > > start commit: 6d21a41b Add linux-next specific files for 20190718
> > > git tree: linux-next
> > > final crash: https://syzkaller.appspot.com/x/report.txt?x=169a8a20600000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=129a8a20600000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=e58112d71f77113ddb7b
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10139e68600000
> > >
> > > Reported-by: syzbot+e58112...@syzkaller.appspotmail.com
> > > Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address")
> > >
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> >
> >
> > OK I poked at this for a bit, I see several things that
> > we need to fix, though I'm not yet sure it's the reason for
> > the failures:
>
> This stuff looks quite similar to the hmm_mirror use model and other
> places in the kernel. I'm still hoping we can share this code a bit more.

Right. I think hmm is something we should look at.

--
MST

Jason Wang

Jul 25, 2019, 3:44:23 AM
Yes, but it's not the only lock. I don't remember the details, but I can
confirm I ran into issues with one or two other locks.


> How about we add a seqlock and take
> that in invalidate callbacks? We can then drop the VQ lock before GUP,
> and take it again immediately after.
>
> something like
>         if (!vq_meta_mapped(vq)) {
>                 vq_meta_setup(&uaddrs);
>                 mutex_unlock(&vq->mutex);
>                 vq_meta_map(&uaddrs);


The problem is that the vq addresses could be changed at this point.


>                 mutex_lock(&vq->mutex);
>
>                 /* recheck both sock->private_data and the seqlock
>                  * count; if either changed - bail out */
>         }
>
> And also requires that VQ uaddrs is defined like this:
> - writers must have both vq mutex and dev mutex
> - readers must have either vq mutex or dev mutex
>
>
> That's a big change though. For now, how about switching to a per-vq SRCU?
> That is only a little bit more expensive than RCU, and we
> can use synchronize_srcu_expedited.
>

Considering we switch to kfree_rcu(), what's the advantage of per-vq SRCU?

Thanks

Jason Wang

Jul 25, 2019, 3:45:45 AM
Exactly. I plan to do that.

Thanks

Michael S. Tsirkin

Jul 25, 2019, 4:28:50 AM
I thought we established that notifiers must wait for
all readers to finish before they mark the page dirty, to
prevent the page from becoming dirty after the address
has been invalidated.
Right?

Jason Wang

Jul 25, 2019, 9:21:59 AM
Exactly, and that's actually the reason I use synchronize_rcu() there.

So the concern is still the possible synchronize_rcu_expedited()? Can I
do this through another series on top of the incoming V2?

Thanks


Michael S. Tsirkin

Jul 25, 2019, 9:26:56 AM
I think synchronize_srcu_expedited.

synchronize_rcu_expedited sends lots of IPIs and is bad for realtime VMs.

> Can I do this
> through another series on top of the incoming V2?
>
> Thanks
>

The question is this: is this still a gain if we switch to the
more expensive SRCU? If yes, then we can keep the feature on;
if not, we'll put it off until the next release and think
of better solutions. rcu->srcu is just a find and replace,
I don't see why we need to defer that. It can be a separate patch
for sure, but we need to know how well it works.

--
MST

Jason Wang

Jul 25, 2019, 10:26:00 AM

On 2019/7/25 9:26 PM, Michael S. Tsirkin wrote:
>> Exactly, and that's actually the reason I use synchronize_rcu() there.
>>
>> So the concern is still the possible synchronize_rcu_expedited()?
> I think synchronize_srcu_expedited.
>
> synchronize_rcu_expedited sends lots of IPIs and is bad for realtime VMs.
>
>> Can I do this
>> through another series on top of the incoming V2?
>>
>> Thanks
>>
> The question is this: is this still a gain if we switch to the
> more expensive srcu? If yes then we can keep the feature on,


I think we only care about the cost of srcu_read_lock(), which looks
pretty tiny from my point of view. It's basically a READ_ONCE() +
WRITE_ONCE().

Of course I can benchmark to see the difference.


> if not we'll put it off until next release and think
> of better solutions. rcu->srcu is just a find and replace,
> don't see why we need to defer that. can be a separate patch
> for sure, but we need to know how well it works.


I think I get it; let me try to do that in V2 and let's see the numbers.

Thanks

Michael S. Tsirkin

Jul 26, 2019, 7:49:29 AM
There's one other thing that bothers me, and that is that
for large rings which are not physically contiguous
we don't implement the optimization.

For sure, that can wait, but I think eventually we should
vmap large rings.

--
MST

Jason Wang

Jul 26, 2019, 8:01:10 AM

On 2019/7/26 7:49 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
>> On 2019/7/25 9:26 PM, Michael S. Tsirkin wrote:
>>>> Exactly, and that's actually the reason I use synchronize_rcu() there.
>>>>
>>>> So the concern is still the possible synchronize_rcu_expedited()?
>>> I think synchronize_srcu_expedited.
>>>
>>> synchronize_rcu_expedited sends lots of IPIs and is bad for realtime VMs.
>>>
>>>> Can I do this
>>>> through another series on top of the incoming V2?
>>>>
>>>> Thanks
>>>>
>>> The question is this: is this still a gain if we switch to the
>>> more expensive srcu? If yes then we can keep the feature on,
>>
>> I think we only care about the cost of srcu_read_lock(), which looks pretty
>> tiny from my point of view. It's basically a READ_ONCE() + WRITE_ONCE().
>>
>> Of course I can benchmark to see the difference.
>>
>>
>>> if not we'll put it off until next release and think
>>> of better solutions. rcu->srcu is just a find and replace,
>>> don't see why we need to defer that. can be a separate patch
>>> for sure, but we need to know how well it works.
>>
>> I think I get it; let me try to do that in V2 and let's see the numbers.
>>
>> Thanks


It looks to me that for tree RCU, srcu_read_lock() has a mb(), which is
too expensive for us.

If we just worry about the IPI, can we do something like in
vhost_invalidate_vq_start()?

        if (map) {
                /* In order to avoid possible IPIs with
                 * synchronize_rcu_expedited() we use call_rcu() +
                 * completion.
                 */
                init_completion(&c.completion);
                call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
                wait_for_completion(&c.completion);
                vhost_set_map_dirty(vq, map, index);
                vhost_map_unprefetch(map);
        }

?
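
where vhost_finish_vq_invalidation() and its container would presumably
be something along these lines (a sketch; the struct here is made up):

        struct vq_inv_wait {
                struct rcu_head rcu_head;
                struct completion completion;
        };

        static void vhost_finish_vq_invalidation(struct rcu_head *head)
        {
                struct vq_inv_wait *w =
                        container_of(head, struct vq_inv_wait, rcu_head);

                /* runs from the RCU callback once a grace period has
                 * elapsed, without forcing an expedited one */
                complete(&w->completion);
        }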


> There's one other thing that bothers me, and that is that
> for large rings which are not physically contiguous
> we don't implement the optimization.
>
> For sure, that can wait, but I think eventually we should
> vmap large rings.


Yes, worth a try. But using the direct map has its own advantage: it can
use hugepages, which vmap can't.

Thanks

Michael S. Tsirkin

Jul 26, 2019, 8:38:24 AM
On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
>
> On 2019/7/26 7:49 PM, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
> > > On 2019/7/25 9:26 PM, Michael S. Tsirkin wrote:
> > > > > Exactly, and that's actually the reason I use synchronize_rcu() there.
> > > > >
> > > > > So the concern is still the possible synchronize_rcu_expedited()?
> > > > I think synchronize_srcu_expedited.
> > > >
> > > > synchronize_rcu_expedited sends lots of IPIs and is bad for realtime VMs.
> > > >
> > > > > Can I do this
> > > > > through another series on top of the incoming V2?
> > > > >
> > > > > Thanks
> > > > >
> > > > The question is this: is this still a gain if we switch to the
> > > > more expensive srcu? If yes then we can keep the feature on,
> > >
> > > I think we only care about the cost of srcu_read_lock(), which looks pretty
> > > tiny from my point of view. It's basically a READ_ONCE() + WRITE_ONCE().
> > >
> > > Of course I can benchmark to see the difference.
> > >
> > >
> > > > if not we'll put it off until next release and think
> > > > of better solutions. rcu->srcu is just a find and replace,
> > > > don't see why we need to defer that. can be a separate patch
> > > > for sure, but we need to know how well it works.
> > >
> > > I think I get it; let me try to do that in V2 and let's see the numbers.
> > >
> > > Thanks
>
>
> It looks to me that for tree RCU, srcu_read_lock() has a mb(), which is too
> expensive for us.

I will try to ponder using vq lock in some way.
Maybe with trylock somehow ...


> If we just worry about the IPI,

With synchronize_rcu what I would worry about is that the guest is stalled
because the system is busy with other guests.
With expedited it's the IPIs...


> can we do something like in
> vhost_invalidate_vq_start()?
>
>         if (map) {
>                 /* In order to avoid possible IPIs with
>                  * synchronize_rcu_expedited() we use call_rcu() +
>                  * completion.
>                  */
>                 init_completion(&c.completion);
>                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
>                 wait_for_completion(&c.completion);
>                 vhost_set_map_dirty(vq, map, index);
>                 vhost_map_unprefetch(map);
>         }
>
> ?

Why would that be faster than synchronize_rcu?



>
> > There's one other thing that bothers me, and that is that
> > for large rings which are not physically contiguous
> > we don't implement the optimization.
> >
> > For sure, that can wait, but I think eventually we should
> > vmap large rings.
>
>
> Yes, worth a try. But using the direct map has its own advantage: it can
> use hugepages, which vmap can't.
>
> Thanks

Sure, so we can do that for small rings.

--
MST

Jason Wang

Jul 26, 2019, 8:53:36 AM
Ok, let me retry if necessary (but I do remember I ended up with deadlocks
on my last try).


>
>
>> If we just worry about the IPI,
> With synchronize_rcu what I would worry about is that the guest is stalled


Can this synchronize_rcu() be triggered by the guest? If yes, there are
several other MMU notifiers that can block. Is vhost something special here?


> because the system is busy with other guests.
> With expedited it's the IPIs...
>

The current synchronize_rcu() can force an expedited grace period:

void synchronize_rcu(void)
{
        ...
        if (rcu_blocking_is_gp())
                return;
        if (rcu_gp_is_expedited())
                synchronize_rcu_expedited();
        else
                wait_rcu_gp(call_rcu);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);


>> can we do something like in
>> vhost_invalidate_vq_start()?
>>
>>         if (map) {
>>                 /* In order to avoid possible IPIs with
>>                  * synchronize_rcu_expedited() we use call_rcu() +
>>                  * completion.
>> */
>> init_completion(&c.completion);
>>                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
>> wait_for_completion(&c.completion);
>>                 vhost_set_map_dirty(vq, map, index);
>> vhost_map_unprefetch(map);
>>         }
>>
>> ?
> Why would that be faster than synchronize_rcu?


No faster but no IPI.


>
>
>>> There's one other thing that bothers me, and that is that
>>> for large rings which are not physically contiguous
>>> we don't implement the optimization.
>>>
>>> For sure, that can wait, but I think eventually we should
>>> vmap large rings.
>>
>> Yes, worth to try. But using direct map has its own advantage: it can use
>> hugepage that vmap can't
>>
>> Thanks
> Sure, so we can do that for small rings.


Yes, that's possible but should be done on top.

Thanks

Jason Wang

Jul 26, 2019, 9:36:37 AM
Ok, I played a little with this. And it works so far. Will do more testing
tomorrow.

One reason could be that I switched from get_user_pages_fast() to
__get_user_pages_fast(), which doesn't need mmap_sem.
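
For reference, the distinction being relied on, as I understand it
(variable names follow the earlier vhost_map_prefetch() snippet):

        /*
         * __get_user_pages_fast() walks the page tables with IRQs
         * disabled, never takes mmap_sem, and returns the number of
         * pages it managed to pin (possibly 0), so it is safe to call
         * under a spinlock; get_user_pages_fast() may fall back to
         * the slow path and acquire mmap_sem.
         */
        npinned = __get_user_pages_fast(uaddr->uaddr, npages,
                                        uaddr->write, pages);
        if (npinned != npages)
                goto err;       /* bail out and retry outside the lock */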

Thanks

Michael S. Tsirkin

Jul 26, 2019, 9:47:23 AM
Sorry, let me explain: guests (and tasks in general)
can trigger activity that will
make synchronize_rcu take a long time. Thus blocking
an mmu notifier until synchronize_rcu finishes
is a bad idea.

>
> > because the system is busy with other guests.
> > With expedited it's the IPIs...
> >
>
> The current synchronize_rcu() can force an expedited grace period:
>
> void synchronize_rcu(void)
> {
>         ...
>         if (rcu_blocking_is_gp())
>                 return;
>         if (rcu_gp_is_expedited())
>                 synchronize_rcu_expedited();
>         else
>                 wait_rcu_gp(call_rcu);
> }
> EXPORT_SYMBOL_GPL(synchronize_rcu);


An admin can force rcu to finish faster, trading
interrupts for responsiveness.

>
> > > can we do something like in
> > > vhost_invalidate_vq_start()?
> > >
> > >         if (map) {
> > >                 /* In order to avoid possible IPIs with
> > >                  * synchronize_rcu_expedited() we use call_rcu() +
> > >                  * completion.
> > >                  */
> > >                 init_completion(&c.completion);
> > >                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
> > >                 wait_for_completion(&c.completion);
> > >                 vhost_set_map_dirty(vq, map, index);
> > >                 vhost_map_unprefetch(map);
> > >         }
> > >
> > > ?
> > Why would that be faster than synchronize_rcu?
>
>
> No faster but no IPI.
>

Sorry I still don't see the point.
synchronize_rcu doesn't normally do an IPI either.


> >
> >
> > > > There's one other thing that bothers me, and that is that
> > > > for large rings which are not physically contiguous
> > > > we don't implement the optimization.
> > > >
> > > > For sure, that can wait, but I think eventually we should
> > > > vmap large rings.
> > >
> > > Yes, worth a try. But using the direct map has its own advantage: it can
> > > use hugepages, which vmap can't.
> > >
> > > Thanks
> > Sure, so we can do that for small rings.
>
>
> Yes, that's possible but should be done on top.
>
> Thanks

Absolutely. Need to fix up the bugs first.

--
MST

Michael S. Tsirkin

Jul 26, 2019, 9:49:49 AM
OK that sounds good. If we also set a flag to make
vhost_exceeds_weight exit, then I think it will be all good.

--
MST

Jason Wang

Jul 26, 2019, 10:00:51 AM
Yes, I get this.


> Thus blocking
> an mmu notifier until synchronize_rcu finishes
> is a bad idea.


The question is, MMU notifiers are allowed to block in
invalidate_range_start(), which could take much longer than
synchronize_rcu() to finish.

Looking at amdgpu_mn_invalidate_range_start_gfx(), which calls
amdgpu_mn_invalidate_node(), which does:

                r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
                        true, false, MAX_SCHEDULE_TIMEOUT);

...


>>> because the system is busy with other guests.
>>> With expedited it's the IPIs...
>>>
>> The current synchronize_rcu() can force an expedited grace period:
>>
>> void synchronize_rcu(void)
>> {
>>         ...
>>         if (rcu_blocking_is_gp())
>>                 return;
>>         if (rcu_gp_is_expedited())
>>                 synchronize_rcu_expedited();
>>         else
>>                 wait_rcu_gp(call_rcu);
>> }
>> EXPORT_SYMBOL_GPL(synchronize_rcu);
>
> An admin can force rcu to finish faster, trading
> interrupts for responsiveness.


Yes, so when it is set, each synchronize_rcu() will go for
synchronize_rcu_expedited().


>
>>>> can we do something like in
>>>> vhost_invalidate_vq_start()?
>>>>
>>>>         if (map) {
>>>>                 /* In order to avoid possible IPIs with
>>>>                  * synchronize_rcu_expedited() we use call_rcu() +
>>>>                  * completion.
>>>>                  */
>>>>                 init_completion(&c.completion);
>>>>                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
>>>>                 wait_for_completion(&c.completion);
>>>>                 vhost_set_map_dirty(vq, map, index);
>>>>                 vhost_map_unprefetch(map);
>>>>         }
>>>>
>>>> ?
>>> Why would that be faster than synchronize_rcu?
>>
>> No faster but no IPI.
>>
> Sorry I still don't see the point.
> synchronize_rcu doesn't normally do an IPI either.
>

Not in the case when rcu_expedited is set. This can just 100% make sure
there's no IPI.


>>>
>>>>> There's one other thing that bothers me, and that is that
>>>>> for large rings which are not physically contiguous
>>>>> we don't implement the optimization.
>>>>>
>>>>> For sure, that can wait, but I think eventually we should
>>>>> vmap large rings.
>>>> Yes, worth a try. But using the direct map has its own advantage: it can
>>>> use hugepages, which vmap can't.
>>>>
>>>> Thanks
>>> Sure, so we can do that for small rings.
>>
>> Yes, that's possible but should be done on top.
>>
>> Thanks
> Absolutely. Need to fix up the bugs first.
>

Yes.

Thanks

Michael S. Tsirkin

Jul 26, 2019, 10:11:03 AM
Right. And the result will probably be VMs freezing/timing out, too.
It's just that we care about VMs more than the GPU guys :)


> > > > because the system is busy with other guests.
> > > > With expedited it's the IPIs...
> > > >
> > > The current synchronize_rcu() can force an expedited grace period:
> > >
> > > void synchronize_rcu(void)
> > > {
> > >         ...
> > >         if (rcu_blocking_is_gp())
> > >                 return;
> > >         if (rcu_gp_is_expedited())
> > >                 synchronize_rcu_expedited();
> > >         else
> > >                 wait_rcu_gp(call_rcu);
> > > }
> > > EXPORT_SYMBOL_GPL(synchronize_rcu);
> >
> > An admin can force rcu to finish faster, trading
> > interrupts for responsiveness.
>
>
> Yes, so when it is set, each synchronize_rcu() will go for
> synchronize_rcu_expedited().

And that's bad for realtime things. I understand what you are saying:
the host admin can set this and VMs won't time out. What I'm saying is we
should not make admins choose between two types of bugs. Tuning for
performance is fine.

>
> >
> > > > > can we do something like in
> > > > > vhost_invalidate_vq_start()?
> > > > >
> > > > >         if (map) {
> > > > >                 /* In order to avoid possible IPIs with
> > > > >                  * synchronize_rcu_expedited() we use call_rcu() +
> > > > >                  * completion.
> > > > >                  */
> > > > >                 init_completion(&c.completion);
> > > > >                 call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
> > > > >                 wait_for_completion(&c.completion);
> > > > >                 vhost_set_map_dirty(vq, map, index);
> > > > >                 vhost_map_unprefetch(map);
> > > > >         }
> > > > >
> > > > > ?
> > > > Why would that be faster than synchronize_rcu?
> > >
> > > No faster but no IPI.
> > >
> > Sorry I still don't see the point.
> > synchronize_rcu doesn't normally do an IPI either.
> >
>
> Not in the case when rcu_expedited is set. This can just 100% make sure
> there's no IPI.

Right but then the latency can be pretty big.

Jason Gunthorpe

Jul 26, 2019, 11:03:26 AM
On Fri, Jul 26, 2019 at 10:00:20PM +0800, Jason Wang wrote:
> The question is, MMU notifiers are allowed to block in
> invalidate_range_start(), which could take much longer than
> synchronize_rcu() to finish.
>
> Looking at amdgpu_mn_invalidate_range_start_gfx(), which calls
> amdgpu_mn_invalidate_node(), which does:
>
>                 r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
>                         true, false, MAX_SCHEDULE_TIMEOUT);
>
> ...

The general guidance has been that invalidate_start should block
minimally, if at all.

I would say synchronize_rcu is outside that guidance.

BTW, always returning EAGAIN when mmu_notifier_range_blockable() is
false is not good either; it should instead only return EAGAIN if
vhost_map_range_overlap() is true for some map.
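
I.e., something along these lines (a sketch; vhost_map_range_overlap()
as named above, the rest follows the usual notifier boilerplate):

        static int vhost_invalidate_range_start(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
        {
                struct vhost_dev *dev =
                        container_of(mn, struct vhost_dev, mmu_notifier);

                /* in a non-blockable context, only refuse ranges we
                 * would actually have to wait on */
                if (!mmu_notifier_range_blockable(range) &&
                    vhost_map_range_overlap(dev, range->start, range->end))
                        return -EAGAIN;

                /* otherwise do the (possibly blocking) invalidation */
                return 0;
        }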

Jason

Jason Wang

Jul 29, 2019, 1:55:05 AM

On 2019/7/26 9:49 PM, Michael S. Tsirkin wrote:
>>> Ok, let me retry if necessary (but I do remember I ended up with deadlocks
>>> on my last try).
>> Ok, I played a little with this. And it works so far. Will do more testing
>> tomorrow.
>>
>> One reason could be that I switched from get_user_pages_fast() to
>> __get_user_pages_fast(), which doesn't need mmap_sem.
>>
>> Thanks
> OK that sounds good. If we also set a flag to make
> vhost_exceeds_weight exit, then I think it will be all good.


After some experiments, I came up with two methods:

1) switch to using vq->mutex; then we must take the vq lock during range
checking (but I don't see an obvious slowdown for 16 vcpus + 16 queues).
Setting a flag during the weight check should work, but it still can't
address the worst case: waiting for the page to be swapped in. Is this
acceptable?

2) keep the current RCU but replace synchronize_rcu() with
vhost_work_flush(). The worst case is the same as in 1), but we can check
the range without holding any locks.

Which one do you prefer?

Thanks

Jason Wang

Jul 29, 2019, 1:56:28 AM

On 2019/7/26 11:03 PM, Jason Gunthorpe wrote:
> On Fri, Jul 26, 2019 at 10:00:20PM +0800, Jason Wang wrote:
>> The question is, MMU notifiers are allowed to block in
>> invalidate_range_start(), which could take much longer than
>> synchronize_rcu() to finish.
>>
>> Looking at amdgpu_mn_invalidate_range_start_gfx(), which calls
>> amdgpu_mn_invalidate_node(), which does:
>>
>>                 r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
>>                         true, false, MAX_SCHEDULE_TIMEOUT);
>>
>> ...
> The general guidance has been that invalidate_start should block
> minimally, if at all.
>
> I would say synchronize_rcu is outside that guidance.


Yes, I get this.


>
> BTW, always returning EAGAIN when mmu_notifier_range_blockable() is
> false is not good either; it should instead only return EAGAIN if
> vhost_map_range_overlap() is true for some map.


Right, let me optimize that.

Thanks


>
> Jason

Michael S. Tsirkin

Jul 29, 2019, 4:59:38 AM
I would rather we start with 1 and switch to 2 after we
can show some gain.

But the worst case needs to be addressed. How about sending a signal to
the vhost thread? We will need to fix up error handling (I think that
at the moment it will error out in that case, handling this as EFAULT -
and we don't want to drop packets if we can help it, and we surely don't
want to enter any error states. In particular it might be especially
tricky if we wrote into userspace memory and are now trying to log the
write. I guess we can disable the optimization if logging is enabled?).

--
MST