Branch: refs/heads/gh-readonly-queue/master/pr-6482-d6526ea3e6ad9081c902859bbb80f9f840377cb4
Home:
https://github.com/google/syzkaller
Commit: e8331348a26e30c511b7bbbd25d071a1862cf6a8
https://github.com/google/syzkaller/commit/e8331348a26e30c511b7bbbd25d071a1862cf6a8
Author: Florent Revest <rev...@chromium.org>
Date: 2025-11-26 (Wed, 26 Nov 2025)
Changed paths:
M executor/common_linux.h
Log Message:
-----------
executor: improve startup time on machines with many CPUs
I observed that on machines with many CPUs (480 on my setup), fuzzing
with a handful of procs (8 on my setup) would consistently fail to start
because syz-executors would not respond within the default handshake
timeout of 1 minute. Reducing procs to 4 worked around it, but that
seems absurd on such a powerful machine.
As part of the default sandbox policy, a syz-executor creates a large
number of virtual network interfaces (16 on my kernel config, probably
more on other kernels). This step vastly dominates the executor startup
time and was clearly responsible for the timeout I observed that
prevented me from fuzzing.
When fuzzing or reproducing with procs > 1, all executors run their
sandbox setup in parallel. Network interfaces are created via socket
operations on the RTNL (routing netlink) subsystem. Unfortunately, all
RTNL operations in the kernel are serialized by the global "rtnl_mutex"
lock, so instead of parallelizing the creation of 8*16 interfaces, they
effectively get serialized, and the time it takes to set up the default
sandbox for one executor scales linearly with the number of executors
started "in parallel". This is currently inherent to the rtnl_mutex in
the kernel and as far as I can tell there's nothing we can do about it.
However, it makes it very important that each critical section guarded
by "rtnl_mutex" stays short and snappy, to avoid long waits on the lock.
Unfortunately, the default behavior when creating a virtual network
interface is to create one RX and one TX queue per CPU. Each queue is
associated with a sysfs file whose creation is quite slow and goes
through various sanitizer-instrumented paths that take a long time. This
means that each critical section scales linearly with the number of CPUs
on the host. For example, on my setup, starting fuzzing took 2 minutes
25 seconds. I found that I could bring this down to 10 seconds (a 15x
faster startup time!) by limiting the number of RX and TX queues created
per virtual interface to 2, using the IFLA_NUM_*X_QUEUES RTNL
attributes. I opportunistically chose 2 to try to keep coverage of the
code that exercises multiple queues, but I have no evidence that
choosing 1 here would actually reduce code coverage.
As far as I can tell, reducing the number of queues would be problematic
in a high-performance networking scenario but doesn't matter for fuzzing
in a namespace with only one process, so this seems like a fair
trade-off to me. Ultimately, this lets me start a lot more parallel
executors and take better advantage of my beefy machine.
Technical detail for review: creating a veth interface actually creates
two interfaces, one for each side of the virtual ethernet link, so both
sides need to be configured with a low number of queues.