[syzbot] [net?] possible deadlock in rtnl_newlink

syzbot

May 29, 2025, 6:32:34 AM
to da...@davemloft.net, edum...@google.com, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: b1427432d3b6 Merge tag 'iommu-fixes-v6.15-rc7' of git://gi..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=161ef5f4580000
kernel config: https://syzkaller.appspot.com/x/.config?x=9fd1c9848687d742
dashboard link: https://syzkaller.appspot.com/bug?extid=846bb38dc67fe62cc733
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12d21170580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17d9a8e8580000

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-b1427432.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/47b0c66c70d9/vmlinux-b1427432.xz
kernel image: https://storage.googleapis.com/syzbot-assets/a2df6bfabd3c/bzImage-b1427432.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+846bb3...@syzkaller.appspotmail.com

ifb0: entered allmulticast mode
ifb1: entered allmulticast mode
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc7-syzkaller-00144-gb1427432d3b6 #0 Not tainted
------------------------------------------------------
syz-executor216/5313 is trying to acquire lock:
ffff888033f496f0 ((work_completion)(&adapter->reset_task)){+.+.}-{0:0}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline]
ffff888033f496f0 ((work_completion)(&adapter->reset_task)){+.+.}-{0:0}, at: rcu_read_lock include/linux/rcupdate.h:841 [inline]
ffff888033f496f0 ((work_completion)(&adapter->reset_task)){+.+.}-{0:0}, at: start_flush_work kernel/workqueue.c:4150 [inline]
ffff888033f496f0 ((work_completion)(&adapter->reset_task)){+.+.}-{0:0}, at: __flush_work+0xd2/0xbc0 kernel/workqueue.c:4208

but task is already holding lock:
ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock net/core/rtnetlink.c:341 [inline]
ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x8db/0x1c70 net/core/rtnetlink.c:4064

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (rtnl_mutex){+.+.}-{4:4}:
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
__mutex_lock_common kernel/locking/mutex.c:601 [inline]
__mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746
e1000_reset_task+0x56/0xc0 drivers/net/ethernet/intel/e1000/e1000_main.c:3512
process_one_work kernel/workqueue.c:3238 [inline]
process_scheduled_works+0xadb/0x17a0 kernel/workqueue.c:3319
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3400
kthread+0x70e/0x8a0 kernel/kthread.c:464
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:153
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #0 ((work_completion)(&adapter->reset_task)){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3166 [inline]
check_prevs_add kernel/locking/lockdep.c:3285 [inline]
validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3909
__lock_acquire+0xaac/0xd20 kernel/locking/lockdep.c:5235
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
touch_work_lockdep_map kernel/workqueue.c:3922 [inline]
start_flush_work kernel/workqueue.c:4176 [inline]
__flush_work+0x6b8/0xbc0 kernel/workqueue.c:4208
__cancel_work_sync+0xbe/0x110 kernel/workqueue.c:4364
e1000_down+0x402/0x6b0 drivers/net/ethernet/intel/e1000/e1000_main.c:526
e1000_close+0x17b/0xa10 drivers/net/ethernet/intel/e1000/e1000_main.c:1448
__dev_close_many+0x361/0x6f0 net/core/dev.c:1702
__dev_close net/core/dev.c:1714 [inline]
__dev_change_flags+0x2c7/0x6d0 net/core/dev.c:9352
netif_change_flags+0x88/0x1a0 net/core/dev.c:9417
do_setlink+0xcb9/0x40d0 net/core/rtnetlink.c:3152
rtnl_group_changelink net/core/rtnetlink.c:3783 [inline]
__rtnl_newlink net/core/rtnetlink.c:3937 [inline]
rtnl_newlink+0x149f/0x1c70 net/core/rtnetlink.c:4065
rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x75b/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:727
____sys_sendmsg+0x505/0x830 net/socket.c:2566
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
__sys_sendmsg net/socket.c:2652 [inline]
__do_sys_sendmsg net/socket.c:2657 [inline]
__se_sys_sendmsg net/socket.c:2655 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock((work_completion)(&adapter->reset_task));
                               lock(rtnl_mutex);
  lock((work_completion)(&adapter->reset_task));

*** DEADLOCK ***

2 locks held by syz-executor216/5313:
#0: ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
#0: ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock net/core/rtnetlink.c:341 [inline]
#0: ffffffff8f2fab48 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x8db/0x1c70 net/core/rtnetlink.c:4064
#1: ffffffff8df3dee0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline]
#1: ffffffff8df3dee0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:841 [inline]
#1: ffffffff8df3dee0 (rcu_read_lock){....}-{1:3}, at: start_flush_work kernel/workqueue.c:4150 [inline]
#1: ffffffff8df3dee0 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xd2/0xbc0 kernel/workqueue.c:4208

stack backtrace:
CPU: 0 UID: 0 PID: 5313 Comm: syz-executor216 Not tainted 6.15.0-rc7-syzkaller-00144-gb1427432d3b6 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_circular_bug+0x2ee/0x310 kernel/locking/lockdep.c:2079
check_noncircular+0x134/0x160 kernel/locking/lockdep.c:2211
check_prev_add kernel/locking/lockdep.c:3166 [inline]
check_prevs_add kernel/locking/lockdep.c:3285 [inline]
validate_chain+0xb9b/0x2140 kernel/locking/lockdep.c:3909
__lock_acquire+0xaac/0xd20 kernel/locking/lockdep.c:5235
lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866
touch_work_lockdep_map kernel/workqueue.c:3922 [inline]
start_flush_work kernel/workqueue.c:4176 [inline]
__flush_work+0x6b8/0xbc0 kernel/workqueue.c:4208
__cancel_work_sync+0xbe/0x110 kernel/workqueue.c:4364
e1000_down+0x402/0x6b0 drivers/net/ethernet/intel/e1000/e1000_main.c:526
e1000_close+0x17b/0xa10 drivers/net/ethernet/intel/e1000/e1000_main.c:1448
__dev_close_many+0x361/0x6f0 net/core/dev.c:1702
__dev_close net/core/dev.c:1714 [inline]
__dev_change_flags+0x2c7/0x6d0 net/core/dev.c:9352
netif_change_flags+0x88/0x1a0 net/core/dev.c:9417
do_setlink+0xcb9/0x40d0 net/core/rtnetlink.c:3152
rtnl_group_changelink net/core/rtnetlink.c:3783 [inline]
__rtnl_newlink net/core/rtnetlink.c:3937 [inline]
rtnl_newlink+0x149f/0x1c70 net/core/rtnetlink.c:4065
rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
netlink_unicast+0x75b/0x8d0 net/netlink/af_netlink.c:1339
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883
sock_sendmsg_nosec net/socket.c:712 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:727
____sys_sendmsg+0x505/0x830 net/socket.c:2566
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620
__sys_sendmsg net/socket.c:2652 [inline]
__do_sys_sendmsg net/socket.c:2657 [inline]
__se_sys_sendmsg net/socket.c:2655 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f09c1caf4a9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f09c1c47198 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f09c1d39318 RCX: 00007f09c1caf4a9
RDX: 0000000000000000 RSI: 0000200000000140 RDI: 0000000000000005
RBP: 00007f09c1d39310 R08: 0000000000000008 R09: 0000000000000000
R10: 0000000000000004 R11: 0000000000000246 R12: 00007f09c1d060ac
R13: 000000000000006e R14: 0000200000000080 R15: 0000200000000150
</TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Stanislav Fomichev

May 29, 2025, 11:59:47 AM
to syzbot, da...@davemloft.net, edum...@google.com, ho...@kernel.org, ku...@kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
So this is internal WQ entry lock that is being reordered with rtnl
lock. But looking at process_one_work, I don't see actual locks, mostly
lock_map_acquire/lock_map_release calls to enforce some internal WQ
invariants. Not sure what to do with it, will try to read more.
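
A minimal userspace sketch of why annotation-only "locks" can still produce this kind of report. It assumes a toy model in which each acquisition made while another class is held adds an ordering edge, and a warning fires when a new edge closes a cycle. The names `record_acquire`, `path_exists`, `CLASS_RTNL` and `CLASS_RESET_WORK` are hypothetical and are not kernel APIs; the real lockdep implementation is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

enum { CLASS_RTNL, CLASS_RESET_WORK, MAX_CLASSES = 8 };

/* dep[a][b] is true once class b has been acquired while class a was held. */
static bool dep[MAX_CLASSES][MAX_CLASSES];

/* Depth-first search: is there a recorded path from 'from' to 'to'? */
static bool path_exists(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++)
		if (dep[from][next] && !seen[next] &&
		    path_exists(next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired 'taken' while holding 'held'".
 * Returns true if the new edge closes a cycle, i.e. the reverse
 * order taken -> ... -> held was already observed earlier.
 */
static bool record_acquire(int held, int taken)
{
	bool seen[MAX_CLASSES] = { false };
	bool cycle;

	assert(held >= 0 && held < MAX_CLASSES);
	assert(taken >= 0 && taken < MAX_CLASSES);

	cycle = path_exists(taken, held, seen);
	dep[held][taken] = true;
	return cycle;
}
```

In this model, recording the worker's order first with `record_acquire(CLASS_RESET_WORK, CLASS_RTNL)` returns false, and then recording the close path's order with `record_acquire(CLASS_RTNL, CLASS_RESET_WORK)` returns true. Those two calls correspond to chains #1 and #0 in the report above.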

Jakub Kicinski

May 29, 2025, 12:10:07 PM
to Stanislav Fomichev, syzbot, da...@davemloft.net, edum...@google.com, ho...@kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
On Thu, 29 May 2025 08:59:43 -0700 Stanislav Fomichev wrote:
> So this is internal WQ entry lock that is being reordered with rtnl
> lock. But looking at process_one_work, I don't see actual locks, mostly
> lock_map_acquire/lock_map_release calls to enforce some internal WQ
> invariants. Not sure what to do with it, will try to read more.

Basically a flush_work() happens while holding rtnl_lock,
but the work itself takes that lock. It's a driver bug.

Stanislav Fomichev

May 29, 2025, 12:45:13 PM
to Jakub Kicinski, syzbot, da...@davemloft.net, edum...@google.com, ho...@kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
e400c7444d84 ("e1000: Hold RTNL when e1000_down can be called") ?
I think similar things (but wrt netdev instance lock) are happening
with iavf: iavf_remove calls cancel_work_sync while holding the
instance lock and the work callbacks grab the instance lock as well :-/

syzbot

May 30, 2025, 4:18:48 PM
to linux-...@vger.kernel.org, syzkall...@googlegroups.com
For archival purposes, forwarding an incoming command email to
linux-...@vger.kernel.org, syzkall...@googlegroups.com.

***

Subject: Re: [syzbot] [net?] possible deadlock in rtnl_newlink
Author: jda...@fastly.com

#syz test

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 3f089c3d47b2..d8595e84326d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -477,10 +477,6 @@ static void e1000_down_and_stop(struct e1000_adapter *adapter)
 
 	cancel_delayed_work_sync(&adapter->phy_info_task);
 	cancel_delayed_work_sync(&adapter->fifo_stall_task);
-
-	/* Only kill reset task if adapter is not resetting */
-	if (!test_bit(__E1000_RESETTING, &adapter->flags))
-		cancel_work_sync(&adapter->reset_task);
 }
 
 void e1000_down(struct e1000_adapter *adapter)
@@ -1266,6 +1262,10 @@ static void e1000_remove(struct pci_dev *pdev)
 
 	unregister_netdev(netdev);
 
+	/* Only kill reset task if adapter is not resetting */
+	if (!test_bit(__E1000_RESETTING, &adapter->flags))
+		cancel_work_sync(&adapter->reset_task);
+
 	e1000_phy_hw_reset(hw);
 
 	kfree(adapter->tx_ring);
--
2.43.0


syzbot

May 30, 2025, 4:39:06 PM
to jda...@fastly.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
no output from test machine



Tested on:

commit: 8477ab14 Merge tag 'iommu-updates-v6.16' of git://git...
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1768400c580000
kernel config: https://syzkaller.appspot.com/x/.config?x=8a01551457d63a4b
dashboard link: https://syzkaller.appspot.com/bug?extid=846bb38dc67fe62cc733
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
patch: https://syzkaller.appspot.com/x/patch.diff?x=17b46970580000

Hillf Danton

Jun 1, 2025, 5:50:33 AM
to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
> Date: Thu, 29 May 2025 03:32:31 -0700
> syzbot found the following issue on:
>
> HEAD commit: b1427432d3b6 Merge tag 'iommu-fixes-v6.15-rc7' of git://gi..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=161ef5f4580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=9fd1c9848687d742
> dashboard link: https://syzkaller.appspot.com/bug?extid=846bb38dc67fe62cc733
> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12d21170580000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17d9a8e8580000

#syz test

--- l/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ e/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3509,7 +3509,11 @@ static void e1000_reset_task(struct work
 		container_of(work, struct e1000_adapter, reset_task);
 
 	e_err(drv, "Reset adapter\n");
+	while (test_and_set_bit(__E1000_RESETTING, &adapter->flags))
+		msleep(1);
 	rtnl_lock();
+	clear_bit(__E1000_RESETTING, &adapter->flags);
+	smp_mb();
 	e1000_reinit_locked(adapter);
 	rtnl_unlock();
 }
--

syzbot

Jun 1, 2025, 6:11:05 AM
to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
no output from test machine



Tested on:

commit: 7d4e49a7 Merge tag 'mm-nonmm-stable-2025-05-31-15-28' ..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10eb300c580000
kernel config: https://syzkaller.appspot.com/x/.config?x=2ea0d63949bc4278
dashboard link: https://syzkaller.appspot.com/bug?extid=846bb38dc67fe62cc733
compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
patch: https://syzkaller.appspot.com/x/patch.diff?x=11b6a00c580000

Joe Damato

Jun 4, 2025, 2:21:47 AM
to Stanislav Fomichev, Jakub Kicinski, syzbot, da...@davemloft.net, edum...@google.com, ho...@kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
I think this is probably the same thread as:

https://lore.kernel.org/netdev/CAP=Rh=OEsn4y_2LvkO3UtDWurKcGPnZ_NPSXK=FbgygN...@mail.gmail.com/

I posted a response there about how to possibly avoid the problem
(based on my rough reading of the driver code), but am still
thinking more on this.