some clang miscompilation again?

Dmitry Vyukov

Mar 18, 2020, 7:26:11 AM
to clang-built-linux, Alexander Potapenko
Hi,

We started seeing massive crashes on one of the syzbot instances. You can
see 2 examples below. The rest are collected here:
https://syzkaller.appspot.com/bug?id=d5bc3e0c66d200d72216ab343a67c4327e4a3452
(search for "ci-upstream-kasan-gce-smack-root").

This happens only on the smack instance. It's the only instance that uses clang.
The previous weird crash spike we observed on that instance was caused
by a clang miscompilation:
https://groups.google.com/d/msg/clang-built-linux/LUIT7csFWas/wEd-p6FKDQAJ

Does this ring any bells for anybody?

The clang we use is:
clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)


[ 202.652969][ T9969] BUG: kernel NULL pointer dereference, address:
0000000000000086
[ 202.660811][ T9969] #PF: supervisor instruction fetch in kernel mode
[ 202.667314][ T9969] #PF: error_code(0x0010) - not-present page
[ 202.673292][ T9969] PGD 42d21067 P4D 42d21067 PUD a442d067 PMD 0
[ 202.679547][ T9969] Oops: 0010 [#1] PREEMPT SMP KASAN
[ 202.684751][ T9969] CPU: 1 PID: 9969 Comm: syz-executor.0 Not
tainted 5.6.0-rc6-syzkaller #0
[ 202.685601][ T9967] ubi0: scanning is finished
[ 202.693464][ T9969] Hardware name: Google Google Compute
Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 202.693481][ T9969] RIP: 0010:0x86
[ 202.693499][ T9969] Code: Bad RIP value.
[ 202.693508][ T9969] RSP: 0018:ffffc90001b9f998 EFLAGS: 00010086
[ 202.693515][ T9969] RAX: ffffc90001b9f9c8 RBX: fffffe0000000000
RCX: 0000000000040000
[ 202.693520][ T9969] RDX: ffffc90002121000 RSI: 00000000000042cc
RDI: 00000000000042cd
[ 202.693525][ T9969] RBP: 0000000000000ec0 R08: ffffffff839870a3
R09: ffffffff811c7eca
[ 202.693530][ T9969] R10: ffff88804b338000 R11: 0000000000000002
R12: dffffc0000000000
[ 202.693535][ T9969] R13: fffffe0000000ec8 R14: ffffffff880016f0
R15: fffffe0000000ecb
[ 202.693547][ T9969] FS: 00007f70cf831700(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 202.693552][ T9969] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.693558][ T9969] CR2: 000000000000005c CR3: 0000000098245000
CR4: 00000000001426e0
[ 202.693564][ T9969] Call Trace:
[ 202.693582][ T9969] ? handle_external_interrupt_irqoff+0x154/0x280
[ 202.693597][ T9969] ? handle_external_interrupt_irqoff+0x132/0x280
[ 202.693606][ T9969] ? __irqentry_text_start+0x8/0x8
[ 202.693625][ T9969] ? vcpu_enter_guest+0x6c77/0x9290
[ 202.811509][ T9969] ? __kasan_slab_free+0x12e/0x1e0
[ 202.816609][ T9969] ? kfree+0x10a/0x220
[ 202.820667][ T9969] ? tomoyo_path_number_perm+0x525/0x690
[ 202.826289][ T9969] ? security_file_ioctl+0x55/0xb0
[ 202.831397][ T9969] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 202.837465][ T9969] ? __lock_acquire+0xc5a/0x1bc0
[ 202.842409][ T9969] ? mark_lock+0x107/0x1650
[ 202.846912][ T9969] ? lock_acquire+0x154/0x250
[ 202.851580][ T9969] ? rcu_lock_acquire+0x9/0x30
[ 202.856335][ T9969] ? kvm_check_async_pf_completion+0x34e/0x360
[ 202.862486][ T9969] ? vcpu_run+0x3a3/0xd50
[ 202.866823][ T9969] ? kvm_arch_vcpu_ioctl_run+0x419/0x880
[ 202.872449][ T9969] ? kvm_vcpu_ioctl+0x67c/0xa80
[ 202.877303][ T9969] ? kvm_vm_release+0x50/0x50
[ 202.881990][ T9969] ? __se_sys_ioctl+0xf9/0x160
[ 202.886873][ T9969] ? do_syscall_64+0xf3/0x1b0
[ 202.891570][ T9969] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 202.897636][ T9969] Modules linked in:
[ 202.901521][ T9969] CR2: 0000000000000086
[ 202.905670][ T9969] ---[ end trace e25748bb637f10e1 ]---
[ 202.911117][ T9969] RIP: 0010:0x86
[ 202.914666][ T9969] Code: Bad RIP value.
[ 202.918737][ T9969] RSP: 0018:ffffc90001b9f998 EFLAGS: 00010086
[ 202.924791][ T9969] RAX: ffffc90001b9f9c8 RBX: fffffe0000000000
RCX: 0000000000040000
[ 202.932770][ T9969] RDX: ffffc90002121000 RSI: 00000000000042cc
RDI: 00000000000042cd
[ 202.940749][ T9969] RBP: 0000000000000ec0 R08: ffffffff839870a3
R09: ffffffff811c7eca
[ 202.948727][ T9969] R10: ffff88804b338000 R11: 0000000000000002
R12: dffffc0000000000
[ 202.956700][ T9969] R13: fffffe0000000ec8 R14: ffffffff880016f0
R15: fffffe0000000ecb
[ 202.964675][ T9969] FS: 00007f70cf831700(0000)
GS:ffff8880ae900000(0000) knlGS:0000000000000000
[ 202.973600][ T9969] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 202.980175][ T9969] CR2: 000000000000005c CR3: 0000000098245000
CR4: 00000000001426e0
[ 202.988141][ T9969] Kernel panic - not syncing: Fatal exception
[ 202.995457][ T9969] Kernel Offset: disabled
[ 202.999782][ T9969] Rebooting in 86400 seconds..



[ 490.564553][T16898] BUG: unable to handle page fault for address:
0000000000ffff88
[ 490.572415][T16898] #PF: supervisor read access in kernel mode
[ 490.578422][T16898] #PF: error_code(0x0000) - not-present page
[ 490.584378][T16898] PGD 862e4067 P4D 862e4067 PUD 9961c067 PMD 0
[ 490.590606][T16898] Oops: 0000 [#1] PREEMPT SMP
[ 490.595264][T16898] CPU: 1 PID: 16898 Comm: syz-executor.3 Not
tainted 5.6.0-rc1-syzkaller #0
[ 490.604044][T16898] Hardware name: Google Google Compute
Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 490.616101][T16898] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 490.622083][T16898] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 490.641675][T16898] RSP: 0018:ffffc9000902f7d0 EFLAGS: 00010202
[ 490.647729][T16898] RAX: 0000000000000afb RBX: ffff888120ab9e00
RCX: ffffffff86d9d6b0
[ 490.655685][T16898] RDX: 0000000000000000 RSI: ffff88808aa81a01
RDI: 0000000000ffff88
[ 490.663645][T16898] RBP: ffff88808aa81a05 R08: 0000000000000000
R09: 0000000000ffff88
[ 490.671622][T16898] R10: 0000888120ab9e48 R11: 0000000000ffff8f
R12: 80ae4b8ef8ffff88
[ 490.679590][T16898] R13: 0000000000ffff88 R14: ffff88808aa81a00
R15: ffff8880ae4b8100
[ 490.687557][T16898] FS: 00007f0364420700(0000)
GS:ffff88812c100000(0000) knlGS:0000000000000000
[ 490.696473][T16898] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 490.703106][T16898] CR2: 0000000000ffff88 CR3: 000000008b30d000
CR4: 00000000001406e0
[ 490.711077][T16898] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 490.719062][T16898] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 490.727070][T16898] Call Trace:
[ 490.730505][T16898] ? __list_add+0x40/0xd0
[ 490.734877][T16898] ? nf_tables_newflowtable+0xcee/0xf00
[ 490.740427][T16898] ? nft_trans_alloc_gfp+0xc0/0xc0
[ 490.745550][T16898] ? nfnetlink_rcv_batch+0x528/0xbd0
[ 490.750871][T16898] ? __nla_validate_parse+0xa8/0x11d0
[ 490.756239][T16898] ? security_capable+0x8a/0xa0
[ 490.761195][T16898] ? ns_capable_common+0xad/0xc0
[ 490.766142][T16898] ? __nla_parse+0x4b/0x60
[ 490.770548][T16898] ? nfnetlink_rcv+0x269/0x290
[ 490.775300][T16898] ? netlink_unicast+0x390/0x4c0
[ 490.780291][T16898] ? netlink_sendmsg+0x4cf/0x8a0
[ 490.785216][T16898] ? netlink_unicast+0x4c0/0x4c0
[ 490.790131][T16898] ? sock_sendmsg+0x98/0xc0
[ 490.794634][T16898] ? ____sys_sendmsg+0x493/0x4c0
[ 490.799730][T16898] ? ___sys_sendmsg+0xb5/0x100
[ 490.804505][T16898] ? __rcu_read_unlock+0x66/0x2f0
[ 490.809583][T16898] ? __fget_files+0xa2/0x1c0
[ 490.814237][T16898] ? __fget_light+0xc0/0x1a0
[ 490.818811][T16898] ? __fdget+0x29/0x30
[ 490.822862][T16898] ? sockfd_lookup_light+0xa5/0x100
[ 490.828065][T16898] ? __sys_sendmsg+0x9b/0x150
[ 490.832740][T16898] ? __x64_sys_sendmsg+0x4c/0x60
[ 490.837671][T16898] ? do_syscall_64+0xc7/0x390
[ 490.842344][T16898] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 490.848485][T16898] Modules linked in:
[ 490.852420][T16898] CR2: 0000000000ffff88
[ 490.856659][T16896] BUG: unable to handle page fault for address:
0000000100ffffc9
[ 490.864378][T16896] #PF: supervisor read access in kernel mode
[ 490.870346][T16896] #PF: error_code(0x0000) - not-present page
[ 490.876346][T16896] PGD 0 P4D 0
[ 490.879714][T16896] Oops: 0000 [#2] PREEMPT SMP
[ 490.884377][T16896] CPU: 0 PID: 16896 Comm: sh Tainted: G D
5.6.0-rc1-syzkaller #0
[ 490.893399][T16896] Hardware name: Google Google Compute
Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 490.903531][T16896] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 490.909499][T16896] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 490.929309][T16896] RSP: 0018:ffffc90008fffb70 EFLAGS: 00010006
[ 490.935750][T16896] RAX: 000000000000036a RBX: 0000000000000000
RCX: ffffffff86d9d6b0
[ 490.943714][T16896] RDX: 0000000000000000 RSI: ffffc90008fffbd9
RDI: 0000000100ffffc9
[ 490.951762][T16896] RBP: ffffc90008fffbdd R08: 0000000000000000
R09: 0000000100ffffc9
[ 490.959728][T16896] R10: 0000ffffffffffff R11: 0000000100ffffd0
R12: 0008fffbd8ffffc9
[ 490.968083][T16896] R13: 0000000100ffffc9 R14: ffffc90008fffbd8
R15: ffffea00043d5c80
[ 490.976052][T16896] FS: 0000000000000000(0000)
GS:ffff88812c000000(0000) knlGS:0000000000000000
[ 490.985008][T16896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 490.991683][T16896] CR2: 0000000100ffffc9 CR3: 0000000005a23000
CR4: 00000000001406f0
[ 490.999644][T16896] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 491.007609][T16896] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 491.015569][T16896] Call Trace:
[ 491.018893][T16896] ? release_pages+0x7a3/0x9b0
[ 491.023651][T16896] ? free_pages_and_swap_cache+0x231/0x2a0
[ 491.029443][T16896] ? tlb_flush_mmu+0x76/0x390
[ 491.034566][T16896] ? tlb_finish_mmu+0x7f/0x230
[ 491.039518][T16896] ? exit_mmap+0x15e/0x2f0
[ 491.044021][T16896] ? mmput+0xe2/0x260
[ 491.048012][T16896] ? do_exit+0x640/0x1880
[ 491.052543][T16896] ? recalc_sigpending+0x4f/0xe0
[ 491.058572][T16896] ? do_sigaltstack.constprop.0+0x2b5/0x390
[ 491.064513][T16896] ? _copy_from_user+0x93/0xf0
[ 491.069330][T16896] ? do_group_exit+0xae/0x1a0
[ 491.074018][T16896] ? __x64_sys_exit_group+0x2b/0x30
[ 491.079224][T16896] ? do_syscall_64+0xc7/0x390
[ 491.083977][T16896] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 491.090448][T16896] Modules linked in:
[ 491.094324][T16896] CR2: 0000000100ffffc9
[ 491.098573][T16896] ---[ end trace a5ad8ea8946e7e64 ]---
[ 491.098603][ C1] BUG: unable to handle page fault for address:
0000000000ffff88
[ 491.104175][T16896] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 491.111928][ C1] #PF: supervisor read access in kernel mode
[ 491.111935][ C1] #PF: error_code(0x0000) - not-present page
[ 491.111941][ C1] PGD 862e4067 P4D 862e4067 PUD 9961c067 PMD 0
[ 491.111963][ C1] Oops: 0000 [#3] PREEMPT SMP
[ 491.111978][ C1] CPU: 1 PID: 16898 Comm: syz-executor.3 Tainted:
G D 5.6.0-rc1-syzkaller #0
[ 491.111996][ C1] Hardware name: Google Google Compute
Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 491.117985][T16896] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 491.118003][T16896] RSP: 0018:ffffc9000902f7d0 EFLAGS: 00010202
[ 491.124542][ C1] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 491.130507][T16896] RAX: 0000000000000afb RBX: ffff888120ab9e00
RCX: ffffffff86d9d6b0
[ 491.136850][ C1] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 491.141504][T16896] RDX: 0000000000000000 RSI: ffff88808aa81a01
RDI: 0000000000ffff88
[ 491.152402][ C1] RSP: 0018:ffffc90000d08d18 EFLAGS: 00010002
[ 491.163177][T16896] RBP: ffff88808aa81a05 R08: 0000000000000000
R09: 0000000000ffff88
[ 491.183561][ C1] RAX: 0000000000000985 RBX: ffff8880a6a21a00
RCX: ffffffff86d9d6b0
[ 491.189647][T16896] R10: 0000888120ab9e48 R11: 0000000000ffff8f
R12: 80ae4b8ef8ffff88
[ 491.195641][ C1] RDX: 0000000000000000 RSI: ffff88812c12d311
RDI: 0000000000ffff88
[ 491.203601][T16896] R13: 0000000000ffff88 R14: ffff88808aa81a00
R15: ffff8880ae4b8100
[ 491.223211][ C1] RBP: ffff88812c12d315 R08: 0000000000000000
R09: 0000000000ffff88
[ 491.231595][T16896] FS: 0000000000000000(0000)
GS:ffff88812c000000(0000) knlGS:0000000000000000
[ 491.237646][ C1] R10: 0000000000000000 R11: 0000000000ffff8f
R12: 808a92f0f0ffff88
[ 491.245640][T16896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 491.253729][ C1] R13: 0000000000ffff88 R14: ffff88812c12d310
R15: ffff88812c12d310
[ 491.261774][T16896] CR2: 0000000100ffffc9 CR3: 0000000005a23000
CR4: 00000000001406f0
[ 491.261801][T16896] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 491.269817][ C1] FS: 00007f0364420700(0000)
GS:ffff88812c100000(0000) knlGS:0000000000000000
[ 491.278022][T16896] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 491.286501][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 491.295634][T16896] Kernel panic - not syncing: Fatal exception
[ 491.303559][ C1] CR2: 0000000000ffff88 CR3: 000000008b30d000
CR4: 00000000001406e0
[ 491.371566][ C1] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 491.381001][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 491.388963][ C1] Call Trace:
[ 491.392229][ C1] <IRQ>
[ 491.395101][ C1] ? account_entity_enqueue+0x97/0xc0
[ 491.400496][ C1] ? enqueue_entity+0x96/0x5a0
[ 491.405267][ C1] ? enqueue_task_fair+0xa6/0x400
[ 491.410298][ C1] ? activate_task+0x67/0x100
[ 491.414964][ C1] ? ttwu_do_activate.isra.0+0x3b/0x50
[ 491.420410][ C1] ? try_to_wake_up+0x3b5/0x6d0
[ 491.425252][ C1] ? hrtimer_wakeup+0x48/0x60
[ 491.429918][ C1] ? __hrtimer_run_queues+0x271/0x600
[ 491.435281][ C1] ? hrtimer_active+0x1b0/0x1b0
[ 491.440135][ C1] ? hrtimer_interrupt+0x226/0x490
[ 491.445250][ C1] ? kvm_clock_read+0x14/0x30
[ 491.449926][ C1] ? smp_apic_timer_interrupt+0xd8/0x270
[ 491.455551][ C1] ? apic_timer_interrupt+0xf/0x20
[ 491.460641][ C1] </IRQ>
[ 491.463582][ C1] ? add_taint+0x2b/0x60
[ 491.467830][ C1] ? oops_end+0x5c/0xe0
[ 491.471987][ C1] ? no_context+0x2ce/0x5e0
[ 491.476522][ C1] ? add_nops+0xa0/0xa0
[ 491.480696][ C1] ? __bad_area_nosemaphore+0x7d/0x310
[ 491.486158][ C1] ? do_page_fault+0x708/0xa52
[ 491.490934][ C1] ? page_fault+0x34/0x40
[ 491.495274][ C1] ? __list_del_entry_valid+0x59/0x8e
[ 491.500638][ C1] ? __list_add+0x40/0xd0
[ 491.504960][ C1] ? nf_tables_newflowtable+0xcee/0xf00
[ 491.510544][ C1] ? nft_trans_alloc_gfp+0xc0/0xc0
[ 491.515659][ C1] ? nfnetlink_rcv_batch+0x528/0xbd0
[ 491.520979][ C1] ? __nla_validate_parse+0xa8/0x11d0
[ 491.526350][ C1] ? security_capable+0x8a/0xa0
[ 491.531193][ C1] ? ns_capable_common+0xad/0xc0
[ 491.536134][ C1] ? __nla_parse+0x4b/0x60
[ 491.540562][ C1] ? nfnetlink_rcv+0x269/0x290
[ 491.545330][ C1] ? netlink_unicast+0x390/0x4c0
[ 491.550265][ C1] ? netlink_sendmsg+0x4cf/0x8a0
[ 491.555209][ C1] ? netlink_unicast+0x4c0/0x4c0
[ 491.560138][ C1] ? sock_sendmsg+0x98/0xc0
[ 491.564640][ C1] ? ____sys_sendmsg+0x493/0x4c0
[ 491.569580][ C1] ? ___sys_sendmsg+0xb5/0x100
[ 491.574352][ C1] ? __rcu_read_unlock+0x66/0x2f0
[ 491.579389][ C1] ? __fget_files+0xa2/0x1c0
[ 491.583993][ C1] ? __fget_light+0xc0/0x1a0
[ 491.588613][ C1] ? __fdget+0x29/0x30
[ 491.592682][ C1] ? sockfd_lookup_light+0xa5/0x100
[ 491.597879][ C1] ? __sys_sendmsg+0x9b/0x150
[ 491.602559][ C1] ? __x64_sys_sendmsg+0x4c/0x60
[ 491.607508][ C1] ? do_syscall_64+0xc7/0x390
[ 491.612205][ C1] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 491.618273][ C1] Modules linked in:
[ 491.622160][ C1] CR2: 0000000000ffff88
[ 491.627088][ C1] ---[ end trace a5ad8ea8946e7e65 ]---
[ 491.632562][ C1] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 491.638548][ C1] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 491.658155][ C1] RSP: 0018:ffffc9000902f7d0 EFLAGS: 00010202
[ 491.664223][ C1] RAX: 0000000000000afb RBX: ffff888120ab9e00
RCX: ffffffff86d9d6b0
[ 491.672190][ C1] RDX: 0000000000000000 RSI: ffff88808aa81a01
RDI: 0000000000ffff88
[ 491.680164][ C1] RBP: ffff88808aa81a05 R08: 0000000000000000
R09: 0000000000ffff88
[ 491.688139][ C1] R10: 0000888120ab9e48 R11: 0000000000ffff8f
R12: 80ae4b8ef8ffff88
[ 491.696109][ C1] R13: 0000000000ffff88 R14: ffff88808aa81a00
R15: ffff8880ae4b8100
[ 491.704086][ C1] FS: 00007f0364420700(0000)
GS:ffff88812c100000(0000) knlGS:0000000000000000
[ 491.713021][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 491.719605][ C1] CR2: 0000000000ffff88 CR3: 000000008b30d000
CR4: 00000000001406e0
[ 491.727579][ C1] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 491.736171][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 492.459000][T16896] BUG: unable to handle page fault for address:
000000000000ffff
[ 492.466825][T16896] #PF: supervisor write access in kernel mode
[ 492.472995][T16896] #PF: error_code(0x0002) - not-present page
[ 492.478964][T16896] PGD 0 P4D 0
[ 492.482342][T16896] Oops: 0002 [#4] PREEMPT SMP
[ 492.487107][T16896] CPU: 0 PID: 16896 Comm: sh Tainted: G D
5.6.0-rc1-syzkaller #0
[ 492.496489][T16896] Hardware name: Google Google Compute
Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 492.506675][T16896] RIP: 0010:__list_add_valid+0x6a/0x70
[ 492.512182][T16896] Code: e2 0f 85 e5 00 00 00 4c 39 ed 0f 84 c5 00
00 00 48 39 d5 0f 84 bc 00 00 00 e8 32 40 00 ff b8 01 00 00 00 5d 41
5c 58 ab ee a6 <80> 88 ff ff 00 00 41 55 41 54 55 48 89 fd 48 8b 7c 24
18 e8 fe 3f
[ 492.531896][T16896] RSP: 0018:ffffc90008fff888 EFLAGS: 00010013
[ 492.537953][T16896] RAX: 0000000000000000 RBX: ffffffff85a59580
RCX: ffffffff86d9d568
[ 492.545918][T16896] RDX: ffffffff85a59c20 RSI: ffffffff85a56429
RDI: ffffffff85a5642d
[ 492.553992][T16896] RBP: ffffffff85a56428 R08: 0000000000000000
R09: 0000ffff85a56428
[ 492.562088][T16896] R10: 0000c90008fff7a0 R11: 0000ffff85a5642f
R12: ffffffff85a59c20
[ 492.570106][T16896] R13: ffffffff85a56428 R14: ffffffff85a56428
R15: ffffffff85a56420
[ 492.578078][T16896] FS: 0000000000000000(0000)
GS:ffff88812c000000(0000) knlGS:0000000000000000
[ 492.586995][T16896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 492.593671][T16896] CR2: 000000000000ffff CR3: 0000000005a23000
CR4: 00000000001406f0
[ 492.601792][T16896] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 492.609813][T16896] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 492.617810][T16896] Call Trace:
[ 492.621097][T16896] ? __register_nmi_handler+0xd7/0x120
[ 492.626553][T16896] ? native_stop_other_cpus+0x148/0x180
[ 492.632164][T16896] ? panic+0x249/0x640
[ 492.636352][T16896] ? vprintk_func+0x89/0x13a
[ 492.640970][T16896] ? oops_end.cold+0x18/0x18
[ 492.645552][T16896] ? no_context+0x2ce/0x5e0
[ 492.650055][T16896] ? __bad_area_nosemaphore+0x7d/0x310
[ 492.655558][T16896] ? do_page_fault+0x3e9/0xa52
[ 492.660451][T16896] ? __rcu_read_unlock+0x66/0x2f0
[ 492.665475][T16896] ? page_fault+0x34/0x40
[ 492.669809][T16896] ? __list_del_entry_valid+0x59/0x8e
[ 492.675404][T16896] ? release_pages+0x7a3/0x9b0
[ 492.680201][T16896] ? free_pages_and_swap_cache+0x231/0x2a0
[ 492.686061][T16896] ? tlb_flush_mmu+0x76/0x390
[ 492.690758][T16896] ? tlb_finish_mmu+0x7f/0x230
[ 492.695592][T16896] ? exit_mmap+0x15e/0x2f0
[ 492.700009][T16896] ? mmput+0xe2/0x260
[ 492.704060][T16896] ? do_exit+0x640/0x1880
[ 492.708624][T16896] ? recalc_sigpending+0x4f/0xe0
[ 492.713642][T16896] ? do_sigaltstack.constprop.0+0x2b5/0x390
[ 492.719533][T16896] ? _copy_from_user+0x93/0xf0
[ 492.724302][T16896] ? do_group_exit+0xae/0x1a0
[ 492.728997][T16896] ? __x64_sys_exit_group+0x2b/0x30
[ 492.734242][T16896] ? do_syscall_64+0xc7/0x390
[ 492.738925][T16896] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 492.745030][T16896] Modules linked in:
[ 492.748945][T16896] CR2: 000000000000ffff
[ 492.753088][T16896] ---[ end trace a5ad8ea8946e7e66 ]---
[ 492.758542][T16896] RIP: 0010:__list_del_entry_valid+0x59/0x8e
[ 492.764661][T16896] Code: 00 00 00 00 ad de 49 39 c4 0f 84 92 00 00
00 48 b8 22 01 00 00 00 00 ad de 49 39 c5 0f 84 b8 00 00 00 4c 89 ef
e8 77 4c 00 ff <4d> 8b 6d 00 49 39 ed 0f 85 8f 00 00 00 49 8d 7c 24 08
e8 60 4c 00
[ 492.784401][T16896] RSP: 0018:ffffc9000902f7d0 EFLAGS: 00010202
[ 492.790535][T16896] RAX: 0000000000000afb RBX: ffff888120ab9e00
RCX: ffffffff86d9d6b0
[ 492.798498][T16896] RDX: 0000000000000000 RSI: ffff88808aa81a01
RDI: 0000000000ffff88
[ 492.806482][T16896] RBP: ffff88808aa81a05 R08: 0000000000000000
R09: 0000000000ffff88
[ 492.814450][T16896] R10: 0000888120ab9e48 R11: 0000000000ffff8f
R12: 80ae4b8ef8ffff88
[ 492.822415][T16896] R13: 0000000000ffff88 R14: ffff88808aa81a00
R15: ffff8880ae4b8100
[ 492.830394][T16896] FS: 0000000000000000(0000)
GS:ffff88812c000000(0000) knlGS:0000000000000000
[ 492.839314][T16896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 492.845895][T16896] CR2: 000000000000ffff CR3: 0000000005a23000
CR4: 00000000001406f0
[ 492.853893][T16896] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 492.861856][T16896] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 492.869878][T16896] Kernel panic - not syncing: Fatal exception
[ 492.876789][T16896] Kernel Offset: disabled
[ 492.881320][T16896] Rebooting in 86400 seconds..

Nick Desaulniers

Mar 18, 2020, 3:45:57 PM
to Dmitry Vyukov, clang-built-linux, Alexander Potapenko, Tom Roeder
Thanks for the reports.

On Wed, Mar 18, 2020 at 4:26 AM 'Dmitry Vyukov' via Clang Built Linux
<clang-bu...@googlegroups.com> wrote:
> This happens only on the smack instance. It's the only instance that uses clang.

Can you please enable more bots to test with Clang?

> [ 202.652969][ T9969] BUG: kernel NULL pointer dereference, address:
> 0000000000000086

All of the reports I looked at had this trace, not the second one
(I didn't read all 30+ logs). Can you give me the timestamp of the
report for the second case?
handle_external_interrupt_irqoff(), defined in
arch/x86/kvm/vmx/vmx.c, has some tricky inline assembly and is
annotated with STACK_FRAME_NON_STANDARD, which tells objtool (the Linux
kernel's custom object file validator) to skip validating its stack
frame (see the comments in include/linux/frame.h). Let's see if we can
find historical context that explains why
handle_external_interrupt_irqoff is marked STACK_FRAME_NON_STANDARD.
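To make the annotation concrete, here is a minimal userspace sketch of the mechanism, assuming GCC or Clang targeting ELF (e.g. Linux). It is modeled loosely on the 5.6-era definition in include/linux/frame.h rather than copied from it, and `tricky_asm_function` is a hypothetical stand-in. The macro generates no code inside the annotated function; it only records the function's address in a section that objtool scans when deciding which functions to skip. In the kernel, the linker script discards `.discard.*` sections, so this table never reaches the final image.

```c
#include <assert.h> /* for the usage check described below */

/*
 * Userspace approximation of the kernel's STACK_FRAME_NON_STANDARD(func)
 * (assumption: modeled on the CONFIG_STACK_VALIDATION=y variant). It emits
 * no code; it only stores a pointer to `func` in a dedicated ELF section
 * that an offline checker (objtool, in the kernel) can scan to learn which
 * functions to skip. `used` keeps the otherwise-unreferenced variable.
 * The kernel stores a plain `void *`; this sketch keeps the real
 * function-pointer type via __typeof__ to avoid a function-to-object cast.
 */
#define STACK_FRAME_NON_STANDARD(func)                                     \
	static __typeof__(&(func)) __func_stack_frame_non_standard_##func   \
	__attribute__((used,                                                \
		       section(".discard.func_stack_frame_non_standard"))) = \
		&(func)

/* Hypothetical stand-in for a function with unusual stack handling. */
static int tricky_asm_function(int x)
{
	return x + 1;
}

/* Mark it: this adds a data entry; the function's code is unchanged. */
STACK_FRAME_NON_STANDARD(tricky_asm_function);
```

Compiling this with `-c` and running `readelf -S` on the object should list a `.discard.func_stack_frame_non_standard` section, and `__func_stack_frame_non_standard_tricky_asm_function == tricky_asm_function` should hold.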

It looks like handle_external_interrupt_irqoff was renamed from
vmx_handle_external_intr in
commit 95b5a48c4f2b ("KVM: VMX: Handle NMIs, #MCs and async #PFs in
common irqs-disabled fn")

STACK_FRAME_NON_STANDARD was added to vmx_handle_external_intr in
commit c207aee48037 ("objtool, x86: Add several functions and files to
the objtool whitelist")

Hmm... so it looks like there's no info on why vmx_handle_external_intr was
annotated STACK_FRAME_NON_STANDARD, other than that it otherwise caused
problems for objtool. Maybe it's time to revisit "why does
handle_external_interrupt_irqoff have a non-standard call frame?"

Looks like vmx_handle_external_intr was added in:
commit a547c6db4d2f ("KVM: VMX: Enable acknowledge interupt on vmexit")
Maybe "Intel SDM volum 3, chapter 33.2" has more info?

handle_external_interrupt_irqoff is qualified as `static inline`, but
is not inlined into its lone call site, vmx_handle_exit_irqoff. None
of the other functions called from there are marked
STACK_FRAME_NON_STANDARD, which is curious.
handle_external_interrupt_irqoff pushes four 64-bit values, then calls
through a function pointer, `entry`. I assume the thunk also has to
pop 4 extra 64-bit values off the stack; otherwise
handle_external_interrupt_irqoff's ret will return somewhere
nonsensical, like 0x86?
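The hypothesis above is about stack balance: whatever the callee does, the stack pointer must be back at the caller's return address when `ret` executes. A toy C model makes the arithmetic concrete. It is not real x86 semantics, and the `toy_*` names, the thunk return address, and the pushed values are all hypothetical. If the thunk consumes its own return address plus the four pushed slots, the final `ret` lands back in the caller; if it only consumes its own return address, `ret` jumps to whatever value was pushed last instead of a code address, which would be consistent with a nonsensical RIP such as 0x86.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stack of 64-bit slots; slot[top - 1] is the most recent push. */
#define TOY_STACK_SLOTS 16

struct toy_stack {
	uint64_t slot[TOY_STACK_SLOTS];
	int top;
};

static void toy_push(struct toy_stack *s, uint64_t v)
{
	assert(s->top < TOY_STACK_SLOTS);
	s->slot[s->top++] = v;
}

static uint64_t toy_pop(struct toy_stack *s)
{
	assert(s->top > 0);
	return s->slot[--s->top];
}

/*
 * The caller's `call` has pushed `ret_addr`. The function pushes `npushed`
 * extra values, then calls a thunk, whose `call` pushes the thunk's return
 * address. When the thunk returns it consumes `nconsumed` slots in total.
 * The function then executes its own `ret`, which pops the next slot into
 * the instruction pointer; that popped value is returned here.
 */
static uint64_t toy_final_ret_target(uint64_t ret_addr, const uint64_t *extra,
				     int npushed, int nconsumed)
{
	struct toy_stack s = { .top = 0 };

	toy_push(&s, ret_addr);
	for (int i = 0; i < npushed; i++)
		toy_push(&s, extra[i]);
	toy_push(&s, UINT64_C(0xffffffff81000000)); /* hypothetical thunk return address */

	for (int i = 0; i < nconsumed; i++)
		(void)toy_pop(&s);

	return toy_pop(&s); /* the function's own `ret` */
}
```

With four pushed values, `toy_final_ret_target(addr, extra, 4, 5)` returns `addr`, while `toy_final_ret_target(addr, extra, 4, 1)` returns `extra[3]`.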

When I compile your config with GCC, I see:
arch/x86/kvm/vmx/vmx.o: warning: objtool:
vmx_handle_exit_irqoff()+0x1ef: unreachable instruction
which is curious, but maybe a red herring.

Comparing the disassembly between GCC and Clang of
handle_external_interrupt_irqoff, the inline asm looks similar. One
thing I don't understand is that the disassembly of
handle_external_interrupt_irqoff from GCC has no `ret` instruction...

Are there any more steps to reliably reproduce? Do we suspect this is
a recent regression in clang-10?



--
Thanks,
~Nick Desaulniers

Dmitry Vyukov

Mar 19, 2020, 3:31:05 AM
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
On Wed, Mar 18, 2020 at 8:45 PM Nick Desaulniers
<ndesau...@google.com> wrote:
>
> Thanks for the reports.
>
> On Wed, Mar 18, 2020 at 4:26 AM 'Dmitry Vyukov' via Clang Built Linux
> <clang-bu...@googlegroups.com> wrote:
> > This happens only on the smack instance. It's the only instance that uses clang.
>
> Can you please enable more bots to test with Clang?

Which additional configurations are you interested in?
It's not exactly a unit-testing system; using it for unit testing is
expensive and breaks production. So far we've seen 2 breakages due to
clang and 0 due to gcc. If we switch more instances, we will also need
some dedicated people ensuring that they work. I think eventually we
will make half of the instances use clang and half gcc, but so far clang
has proven to be less stable for the kernel and we don't have these
dedicated people... If somebody volunteers? :)

Dmitry Vyukov

Mar 19, 2020, 3:41:22 AM
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
On Wed, Mar 18, 2020 at 8:45 PM Nick Desaulniers
<ndesau...@google.com> wrote:
>
> Thanks for the reports.
>
> On Wed, Mar 18, 2020 at 4:26 AM 'Dmitry Vyukov' via Clang Built Linux
> <clang-bu...@googlegroups.com> wrote:
> > [ 202.652969][ T9969] BUG: kernel NULL pointer dereference, address:
> > 0000000000000086
>
> So all of the reports I looked at had this trace, not the second one
> (I didn't read all 30+ logs), can you give me the timestamp of the
> report of the second case?

If it's not on the dashboard (and it's probably already gone from the
dashboard today), then unfortunately I did not save any additional info.
But it was among yesterday's batch and shared all the same properties.
Yes, I noticed that one was different from the rest. Either we will see
it again among new reports... or not.
But I think we can ignore it for now and base debugging on the
more frequent manifestation. Maybe that single one is a different
bug entirely, or maybe it was a combination of this bug and an earlier
memory corruption.
> Are there any more steps to reliably reproduce?

Well, run syzkaller locally on the kernel using the provided
revision/config, compiler, image, etc. On a beefy machine with lots of
VMs it should fire every minute or so.
(That's part of what I meant by "using syzbot as a unit-testing
system is expensive".)

> Do we suspect this is
> a recent regression in clang-10?

No.
Alex updated clang to this revision after we debugged the previous
miscompilation; the kernel then got back to normal and we have not
touched clang since.
So it must be a recent kernel change.
Additionally, the kernel produces a broken crash report with every frame
marked as questionable (" ? "), so syzbot classifies it as "corrupted"
and throws it into a single bucket with lots of other corrupted
reports. So we don't know exactly when this started...
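For context, the x86 kernel stack dumper prints "? " before addresses it found on the stack but could not confirm as part of a reliable unwind, so a trace in which every frame carries that marker gives no trustworthy frame to attribute the crash to. A hypothetical C sketch of such a check follows; `all_frames_questionable` is an illustrative name, and this is not syzkaller's actual logic (syzkaller is written in Go and its report parsing is more involved). It assumes the "[ time][ task]" log prefix has already been stripped from each line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical heuristic: return true if every call-trace frame line is
 * marked unreliable with a leading "? ". Lines starting with '<' (the
 * <IRQ> / </IRQ> markers) are not frames and are skipped. An empty trace
 * also returns true, since it has no reliable frame either.
 */
static bool all_frames_questionable(const char *const *frames, size_t n)
{
	assert(frames != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const char *f = frames[i];

		if (f[0] == '<')
			continue; /* context marker, not a frame */
		if (strncmp(f, "? ", 2) != 0)
			return false; /* found at least one reliable frame */
	}
	return true;
}
```

Fed the frames from the first report in this thread (all of the form "? vcpu_run+0x3a3/0xd50"), the sketch returns true.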

Dmitry Vyukov

Mar 19, 2020, 3:51:29 AM
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
Looking at some stats data we have for the instance, it seems to have started on Mar 6.
The initial zeros on Jan 9-10 are the previous breakage, and
around Mar 6 there is a new anomaly.

Date Corpus Cover Crashes Executions
20200109 0 0 0 0
20200110 0 0 0 0
20200113 50834 629433 536 1395118
20200114 56783 674044 209 9193960
20200115 59186 692183 256 8383961
20200116 51016 645884 327 6250395
20200117 53360 667066 238 5698928
20200118 54800 669470 196 5747800
20200119 55572 668463 170 6163455
20200120 58659 689599 194 5444768
20200121 55657 681862 253 4831752
20200122 64273 725220 276 5448551
20200123 59398 692608 258 4414910
20200124 54089 658262 292 5612887
20200125 52125 649080 191 5440661
20200126 56637 687494 367 5095477
20200127 59136 702062 415 5938377
20200128 58459 712660 353 4826737
20200129 60462 723094 349 4870841
20200130 65067 752818 291 6486425
20200131 65306 755543 338 5809624
20200201 56744 672614 375 4468147
20200202 60243 694134 260 5628242
20200203 49805 624049 256 5931740
20200204 52756 648054 230 5932072
20200205 40670 583719 322 5155979
20200206 45670 608468 347 4274670
20200207 48136 614452 323 4850050
20200208 50242 641225 295 5218742
20200209 45003 598594 272 5104881
20200210 49152 630923 379 7100143
20200211 53779 656252 298 5508205
20200212 56101 674183 340 5237191
20200213 60761 702222 263 6239786
20200214 58013 687933 284 5494024
20200215 58225 680698 271 7383022
20200216 58295 704614 290 4919207
20200217 60289 716586 268 4525219
20200218 45604 617967 263 5609495
20200219 48260 637681 274 5284118
20200220 54224 678450 311 4902636
20200221 61954 723951 287 4990487
20200222 59326 730434 302 5298617
20200223 65439 760776 250 4904851
20200224 63267 742702 291 4800550
20200225 69914 778741 237 4722249
20200226 75572 805909 271 4415733
20200227 79522 826667 357 3622346
20200228 82503 842398 424 3348351
20200229 84755 854530 529 2868372
20200301 84587 855053 320 4835048
20200302 82358 840318 273 4105833
20200303 85566 856249 315 3326516
20200304 87927 869043 432 2763706
20200305 89829 878743 507 2536747
20200306 90973 885239 812 2059721
20200307 71253 767589 2221 1088846
20200308 70813 766510 1181 1699390
20200309 65817 772257 1101 2821306
20200310 62675 759004 856 2140766
20200311 62829 750791 1089 2156522
20200312 61200 743232 1105 2105367
20200313 63360 750544 1227 2135191
20200314 63476 750896 885 4589837
20200315 55476 676559 1070 2264668
20200316 57210 690419 860 2763512
20200317 62560 723584 854 2570067
20200318 61662 714534 1032 2160078

Dmitry Vyukov

Mar 19, 2020, 4:04:31 AM3/19/20
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
Here are the builds around these dates; they include the kernel commit hashes:

name=3ed9001ea665f843e982dd20cd513bc503073eed amd64 clang version
10.0.0 (https://github.com/llvm/llvm-project/
c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
ee90f2feee99d6835c837046cf23cbfe7f267a1a master
fb279f4e238617417b132a550f24c1e86d922558 2020-03-01 (02:16:46.000) CET
Merge branch 'i2c/for-current-fixed' of
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
-8425196555570390697
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
ci-upstream-kasan-gce-smack-root upstream linux
fd2a5f28eb5e2b7c83b5e814f53e44e2a5dde24c 2020-03-06 (13:37:58.000) CET
2020-03-06 (23:04:41.053) CET 0 amd64

name=8aed02c6f126a5c64148561ea34799b9cc93a651 amd64 clang version
10.0.0 (https://github.com/llvm/llvm-project/
c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
3d4c0571a0cd3e4bb8e31564e5e160539ca6e39b master
fb279f4e238617417b132a550f24c1e86d922558 2020-03-01 (02:16:46.000) CET
Merge branch 'i2c/for-current-fixed' of
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
-8425196555570390697
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
ci-upstream-kasan-gce-smack-root upstream linux
c88c7b75a4e022b758f4b0f1bf3db8ebb2fb25e6 2020-02-27 (19:31:43.000) CET
2020-03-06 (21:34:28.851) CET 0 amd64

name=5d2dbc53da29896fd3ee7cb23e4050fa65936b6b amd64 clang version
10.0.0 (https://github.com/llvm/llvm-project/
c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
ea5f6a5d4c8df033e89c8a20ba1fbd5878288405 master
63623fd44972d1ed2bfb6e0fb631dfcf547fd1e7 2020-02-24 (20:48:17.000) CET
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
6714307646143599935
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
ci-upstream-kasan-gce-smack-root upstream linux
c88c7b75a4e022b758f4b0f1bf3db8ebb2fb25e6 2020-02-27 (19:31:43.000) CET
2020-03-01 (03:51:58.435) CET 0 amd64

name=0cbccdb029b57aee9e090f332c72344190a7b9af amd64 clang version
10.0.0 (https://github.com/llvm/llvm-project/
c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
66d0d464e7b075e7f170219eec7e6e275acb4074 master
63849c8f410717eb2e6662f3953ff674727303e7 2020-03-07 (00:03:37.000) CET
Merge tag 'linux-kselftest-5.6-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
8858458958893406001
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
ci-upstream-kasan-gce-smack-root upstream linux
2e9971bbbfb4df6ba0118353163a7703f3dbd6ec 2020-03-07 (07:46:22.000) CET
2020-03-07 (13:55:28.751) CET 0 amd64

This is the schema:
https://github.com/google/syzkaller/blob/2c31c529a9a44be5d99e769204b7a4b84b93eec1/dashboard/app/entities.go#L49-L67

But the order of fields is all messed up; here is the order GCE shows:
Name/ID
Arch
CompilerID
ID
KernelBranch
KernelCommit
KernelCommitDate
KernelCommitTitle
KernelConfig
KernelRepo
Manager
Namespace
OS
SyzkallerCommit
SyzkallerCommitDate
Time
Type
VMArch
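To make the flattened dumps above easier to read, here is a small sketch that pairs each value of a record with its field name, assuming the GCE display order listed above (with KernelCommitDate, KernelCommitTitle and KernelConfig as three separate fields). The values are copied from the first two records; the helpers `decode` and `changed_fields` are hypothetical names for this illustration, not part of syzkaller.

```python
"""Pair the values of a flattened syzbot build record with their field names.

Assumes, as stated in the thread, that GCE lists the fields in this order.
"""

GCE_FIELD_ORDER = [
    "Name/ID", "Arch", "CompilerID", "ID", "KernelBranch", "KernelCommit",
    "KernelCommitDate", "KernelCommitTitle", "KernelConfig", "KernelRepo",
    "Manager", "Namespace", "OS", "SyzkallerCommit", "SyzkallerCommitDate",
    "Time", "Type", "VMArch",
]

CLANG = ("clang version 10.0.0 (https://github.com/llvm/llvm-project/ "
         "c2443155a0fb245c8f17f2c1c72b6ea391e86e81)")
REPO = "git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
I2C_MERGE = ("Merge branch 'i2c/for-current-fixed' of "
             "git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux")

# Values copied from the first two build records above, in display order.
BUILD_A = [
    "3ed9001ea665f843e982dd20cd513bc503073eed", "amd64", CLANG,
    "ee90f2feee99d6835c837046cf23cbfe7f267a1a", "master",
    "fb279f4e238617417b132a550f24c1e86d922558", "2020-03-01 (02:16:46.000) CET",
    I2C_MERGE, "-8425196555570390697", REPO,
    "ci-upstream-kasan-gce-smack-root", "upstream", "linux",
    "fd2a5f28eb5e2b7c83b5e814f53e44e2a5dde24c", "2020-03-06 (13:37:58.000) CET",
    "2020-03-06 (23:04:41.053) CET", "0", "amd64",
]
BUILD_B = [
    "8aed02c6f126a5c64148561ea34799b9cc93a651", "amd64", CLANG,
    "3d4c0571a0cd3e4bb8e31564e5e160539ca6e39b", "master",
    "fb279f4e238617417b132a550f24c1e86d922558", "2020-03-01 (02:16:46.000) CET",
    I2C_MERGE, "-8425196555570390697", REPO,
    "ci-upstream-kasan-gce-smack-root", "upstream", "linux",
    "c88c7b75a4e022b758f4b0f1bf3db8ebb2fb25e6", "2020-02-27 (19:31:43.000) CET",
    "2020-03-06 (21:34:28.851) CET", "0", "amd64",
]


def decode(values):
    """Map each value to its field name; fail loudly on a length mismatch."""
    if len(values) != len(GCE_FIELD_ORDER):
        raise ValueError(f"expected {len(GCE_FIELD_ORDER)} values, got {len(values)}")
    return dict(zip(GCE_FIELD_ORDER, values))


def changed_fields(a, b):
    """Return the field names whose values differ between two decoded builds."""
    return [name for name in GCE_FIELD_ORDER if a[name] != b[name]]


if __name__ == "__main__":
    a, b = decode(BUILD_A), decode(BUILD_B)
    for name in changed_fields(a, b):
        print(f"{name}: {b[name]} -> {a[name]}")
```

Decoded this way, the first two records differ only in Name/ID, ID, SyzkallerCommit, SyzkallerCommitDate and Time; their KernelCommit and KernelConfig values are identical.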

Nick Desaulniers

Mar 19, 2020, 2:57:51 PM3/19/20
to Dmitry Vyukov, clang-built-linux, Alexander Potapenko, Tom Roeder, Oliver Upton
So it looks like you homed in on 20200306, because on 20200307 the
number of crashes goes up to 2221 from 812 the previous day?

Does that also correspond with the first day that you received a
report with this signature or just coincidence?

There are no smoking guns when looking at:
$ git log --pretty=fuller arch/x86/kvm/
for that timeframe.

If the version of clang was constant throughout this time period, then
we should pinpoint, via bisection, the kernel commit that introduced
this regression. Only then can we determine whether we're dealing with
a "miscompile" or with code that relies on one particular
implementation of undefined behavior (as is typically the case, IME).
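For readers less familiar with bisection, the idea is a binary search for the first bad commit between a known-good and a known-bad one. The sketch below is a generic illustration under stated assumptions (a linear, oldest-to-newest commit list and a deterministic `is_bad` test), not syzbot's or git's implementation; `git bisect` does the same job over the real commit graph, including merges. The test predicate `crashes` in the usage is hypothetical.

```python
"""A minimal sketch of bisection: binary search for the first bad commit.

Assumptions (hypothetical, for illustration only): `commits` is ordered oldest
to newest, commits[0] is known good, commits[-1] is known bad, and `is_bad` is
deterministic (e.g. boot the kernel built from that commit and run a
reproducer). A crash that only shows up statistically breaks this assumption.
"""

from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


if __name__ == "__main__":
    history = [f"commit{i}" for i in range(1, 101)]  # hypothetical linear history
    tested = []

    def crashes(c: str) -> bool:
        # Hypothetical test: pretend every commit from commit42 on is broken.
        tested.append(c)
        return int(c[len("commit"):]) >= 42

    print(first_bad_commit(history, crashes), "found after", len(tested), "tests")
```

Each test halves the remaining range, so a range of N commits needs roughly log2(N) builds and test runs, which is what makes bisection practical for kernel-sized histories.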

> 20200308 70813 766510 1181 1699390
> 20200309 65817 772257 1101 2821306
> 20200310 62675 759004 856 2140766
> 20200311 62829 750791 1089 2156522
> 20200312 61200 743232 1105 2105367
> 20200313 63360 750544 1227 2135191
> 20200314 63476 750896 885 4589837
> 20200315 55476 676559 1070 2264668
> 20200316 57210 690419 860 2763512
> 20200317 62560 723584 854 2570067
> 20200318 61662 714534 1032 2160078



--
Thanks,
~Nick Desaulniers

Nick Desaulniers

Mar 19, 2020, 3:35:50 PM3/19/20
to Dmitry Vyukov, clang-built-linux, Alexander Potapenko, Tom Roeder
On Thu, Mar 19, 2020 at 12:31 AM Dmitry Vyukov <dvy...@google.com> wrote:
>
> On Wed, Mar 18, 2020 at 8:45 PM Nick Desaulniers
> <ndesau...@google.com> wrote:
> >
> > Thanks for the reports.
> >
> > On Wed, Mar 18, 2020 at 4:26 AM 'Dmitry Vyukov' via Clang Built Linux
> > <clang-bu...@googlegroups.com> wrote:
> > >
> > > Hi,
> > >
> > > We started seeing massive crashes on one of syzbot instances. You can
> > > see 2 examples below. The rest are piled here:
> > > https://syzkaller.appspot.com/bug?id=d5bc3e0c66d200d72216ab343a67c4327e4a3452
> > > (search for "ci-upstream-kasan-gce-smack-root").
> > >
> > > This happens only on the smack instance. It's the only instance that uses clang.
> >
> > Can you please enable more bots to test with Clang?
>
> What are additional configurations you are interested in?

We're testing:
architectures: arm64, arm, x86_64, mips, ppc, (s390 and riscv in build
capacity, not yet booting, TODO)
trees+branches: -next, mainline, stable (back to 4.4)
configs: defconfig, allmodconfig, allyesconfig (defconfigs all the way
back to 4.4, allyesconfig only more recently)
clang: ToT, latest release

The more coverage, the better. I understand the limitations around capacity.

> It's not exactly a unit-testing system; using it for unit testing is
> expensive and breaks production. So far we've seen 2 breakages due to
> clang and 0 due to gcc. If we switch more instances, we will also need

Syzkaller has never found a compiler bug in GCC? That's surprising.

I thought the one clang bug you identified was due to using a
pre-release version of clang-9 (was ToT at some point) which had
already been previously identified and fixed?

What was the second bug? This report? Are you sure it's a Clang
miscompile at this point? Are you sure it's not undefined behavior in
the kernel somewhere? Feel free to say "told you so" when we get to
the bottom of it, but I wouldn't be so certain at this point, lest
someone tell you "told you so" otherwise.

> some dedicated people ensuring that they work. I think eventually we
> will make half of instances use clang/half gcc, but so far clang has
> proven to be less stable for the kernel and we don't have these
> dedicated people... If somebody volunteers? :)

Doesn't syzkaller generally have this problem? People working on
reporting bugs, not necessarily fixing them?

We're here to fix Clang bugs; if we can find bugs that appear only
with Clang and not GCC, then yes please send them to our list.

FWIW, I'm a big fan of go/kernel-disaster and go/fix-linux; I'm
drafting up a similar doc along the lines of "maybe we should spend
some money and fix this" but more specific to Clang+Linux. I don't
think phrases like "clang has proven to be less stable for the kernel"
are accurate when your sources are weak, or will win you a lot of
volunteers to fix bugs reported by your tool though.
--
Thanks,
~Nick Desaulniers

Dmitry Vyukov

Mar 20, 2020, 3:12:04 AM3/20/20
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder, Oliver Upton
On Thu, Mar 19, 2020 at 7:57 PM Nick Desaulniers
Yes, I think the 20200306 build is the bad one and it landed midday. So
812 is half the normal rate plus half the increased rate.
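A quick back-of-the-envelope check of that estimate, assuming (as Nick reads it) that the fourth column of the daily table is the crash count. The baseline window (20200301 to 20200305) and the elevated window (20200307 to 20200318) are arbitrary choices made for this illustration; the numbers are copied from the table.

```python
"""Check the "half normal rate + half increased rate" estimate for 20200306.

Assumption: the fourth column of the daily table is the crash count.
The two windows below are arbitrary choices for this illustration.
"""

from statistics import mean

# Crash counts copied from the daily table (fourth column).
BEFORE = {  # 2020-03-01 .. 2020-03-05
    "20200301": 320, "20200302": 273, "20200303": 315,
    "20200304": 432, "20200305": 507,
}
AFTER = {  # 2020-03-07 .. 2020-03-18
    "20200307": 2221, "20200308": 1181, "20200309": 1101, "20200310": 856,
    "20200311": 1089, "20200312": 1105, "20200313": 1227, "20200314": 885,
    "20200315": 1070, "20200316": 860, "20200317": 854, "20200318": 1032,
}
OBSERVED_0306 = 812

normal = mean(BEFORE.values())       # average crashes/day before the change
increased = mean(AFTER.values())     # average crashes/day after the change
half_and_half = 0.5 * normal + 0.5 * increased  # a day split evenly between the two

print(f"normal ~{normal:.0f}/day, increased ~{increased:.0f}/day")
print(f"half-day mix ~{half_and_half:.0f}, observed on 20200306: {OBSERVED_0306}")
```

With these windows the mix comes out at roughly 746 crashes, in the same range as the 812 recorded on 20200306. That is consistent with a midday change, though the large day-to-day variance means it is not proof of one.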

> Does that also correspond with the first day that you received a
> report with this signature or just coincidence?

Since these crashes don't have proper stacks, they were classified as
corrupted and thrown into the single trash bucket, so that info is
lost now. We don't have the capacity to save millions of crashes, and
usually it's just impossible to process them meaningfully.

Dmitry Vyukov

Mar 20, 2020, 1:00:35 PM3/20/20
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
On Thu, Mar 19, 2020 at 8:35 PM Nick Desaulniers
<ndesau...@google.com> wrote:
>
> On Thu, Mar 19, 2020 at 12:31 AM Dmitry Vyukov <dvy...@google.com> wrote:
> >
> > On Wed, Mar 18, 2020 at 8:45 PM Nick Desaulniers
> > <ndesau...@google.com> wrote:
> > >
> > > Thanks for the reports.
> > >
> > > On Wed, Mar 18, 2020 at 4:26 AM 'Dmitry Vyukov' via Clang Built Linux
> > > <clang-bu...@googlegroups.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > We started seeing massive crashes on one of syzbot instances. You can
> > > > see 2 examples below. The rest are piled here:
> > > > https://syzkaller.appspot.com/bug?id=d5bc3e0c66d200d72216ab343a67c4327e4a3452
> > > > (search for "ci-upstream-kasan-gce-smack-root").
> > > >
> > > > This happens only on the smack instance. It's the only instance that uses clang.
> > >
> > > Can you please enable more bots to test with Clang?
> >
> > What are additional configurations you are interested in?
>
> We're testing:
> architectures: arm64, arm, x86_64, mips, ppc, (s390 and riscv in build
> capacity, not yet booting, TODO)
> trees+branches: -next, mainline, stable (back to 4.4)
> configs: defconfig, allmodconfig, allyesconfig (defconfigs all the way
> back to 4.4, allyesconfig only more recently)
> clang: ToT, latest release
>
> The more coverage, the better. I understand the limitations around capacity.

We can only offer x86_64 with a subset of allyesconfig. We have a few
other branches (net, bpf), but that's it.


> > It's not exactly a unit-testing system; using it for unit testing is
> > expensive and breaks production. So far we've seen 2 breakages due to
> > clang and 0 due to gcc. If we switch more instances, we will also need
>
> Syzkaller has never found a compiler bug in GCC? That's surprising.

I don't remember any.

> I thought the one clang bug you identified was due to using a
> pre-release version of clang-9 (was ToT at some point) which had
> already been previously identified and fixed?

Yes.

> What was the second bug? This report? Are you sure it's a Clang
> miscompile at this point? Are you sure it's not undefined behavior in
> the kernel somewhere? Feel free to say "told you so" when we get to
> the bottom of it, but I wouldn't be so certain at this point, lest
> someone tell you "told you so" otherwise.

Yes, this one.
It does not matter if it's clang's bug or not. The system will go
down. Imagine you are testing clang on the Google search and it goes
down. It won't matter if it was a clang bug or a latent bug in the
code :)


> > some dedicated people ensuring that they work. I think eventually we
> > will make half of instances use clang/half gcc, but so far clang has
> > proven to be less stable for the kernel and we don't have these
> > dedicated people... If somebody volunteers? :)
>
> Doesn't syzkaller generally have this problem? People working on
> reporting bugs, not necessarily fixing them?
>
> We're here to fix Clang bugs; if we can find bugs that appear only
> with Clang and not GCC, then yes please send them to our list.

This information is not readily available at the moment. We have
hundreds of bugs per month + some special issues like this one. To
understand what causes each one of them, somebody needs to look at
them first...

> FWIW, I'm a big fan of go/kernel-disaster and go/fix-linux; I'm
> drafting up a similar doc along the lines of "maybe we should spend
> some money and fix this" but more specific to Clang+Linux. I don't
> think phrases like "clang has proven to be less stable for the kernel"
> are accurate when your sources are weak, or will win you a lot of
> volunteers to fix bugs reported by your tool though.

I mean only our very specific situation of running a large production
system with almost no resources for maintenance. I did not say that
clang is more buggy or something. The previous one was already fixed
by the time we hit it. And this one may well be a latent bug in the
code. This may well be happening with gcc as well, but maybe somebody
else fixes it before we even notice it.

Dmitry Vyukov

Mar 21, 2020, 10:49:15 AM3/21/20
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder
FTR, I've submitted
https://github.com/google/syzkaller/commit/a2d5b1c04d22c7db220cc795dc2b4d48b17437be
which should (1) make this crash pop up as a separate bug and (2) let
syzbot come up with a reproducer and bisect it (hopefully).

Dmitry Vyukov

Mar 22, 2020, 3:00:55 AM3/22/20
to Nick Desaulniers, clang-built-linux, Alexander Potapenko, Tom Roeder

Alexander Potapenko

Mar 23, 2020, 9:02:12 AM3/23/20
to Dmitry Vyukov, Nick Desaulniers, clang-built-linux, Tom Roeder
Do you have any understanding of how long this has been happening?
--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

Dmitry Vyukov

Mar 23, 2020, 9:09:23 AM3/23/20
to Alexander Potapenko, Nick Desaulniers, clang-built-linux, Tom Roeder
On Mon, Mar 23, 2020 at 2:02 PM Alexander Potapenko <gli...@google.com> wrote:
>
> Do you have any understanding of how long this has been happening?

Presumably since the Mar 6 syzbot build; the potential commit range is mentioned here:
https://groups.google.com/d/msg/clang-built-linux/Cm3VojRK69I/yvgZNsS6AwAJ
https://groups.google.com/d/msg/clang-built-linux/Pk0g-hIWal8/Sqh8h1J_BAAJ