Hi Zhiyu,
Sorry for the late reply.
On Mon, Jul 7, 2025 at 1:29 PM Zhiyu Zhang <zhiyuz...@gmail.com> wrote:
>
> Hi Aleksandr,
>
> Thank you for your feedback!
>
> > https://github.com/google/syzkaller/pull/6134 has been merged :)
>
> Thanks! :)
>
> > How many seeds did you generate? Syzkaller runs everything in syzkaller/sys/linux/test/ each time it starts, so it puts some limits on how much can be added there.
>
> Yes, I understand. Currently, we have generated around 1k-2k effective
> seeds (excluding their children variants) for those rarely covered
> syscalls during normal fuzzing.
> I can minimize them and select a smaller subset that fits within
> whatever limit syzkaller imposes.
1k-2k doesn't sound like a very big number, though it would still be
better to reduce it further. Currently, we focus our seed programs on
something that a coverage-guided fuzzer could not be realistically
expected to come up with: valid filesystem images, long and tricky
network communications, etc. Some seeds are used as tests for the
manually written descriptions (see e.g. [1]), but these do not always
contribute a lot to how far the fuzzer could reach.
So the seeds we would prefer to see in our repository should each be
cohesive and target specifically the areas that are most problematic
for the fuzzer.
Did you upload the ones you have generated to your public repository?
I am very curious to see what they look like now :)
[1] https://github.com/google/syzkaller/blob/master/sys/linux/test/landlock_fs_accesses
>
>
> > As I see from the paper and the repository (please correct me if I'm wrong), ... - query some detailed coverage data, inject seeds, etc.
>
> Your understanding is correct. The main interactions between our
> generator and syzkaller are:
> 1. Query the coverage status of EnabledCalls (i.e., CoveredCalls).
Please note that we also have a /coverprogs HTTP endpoint that returns
a JSONL stream of all corpus programs alongside the coverage triggered
by each of them. If you don't need the exact raw coverage, but rather
just its total size per syscall, I guess making a ?json=1 version of
/syscalls could also be an option.
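To illustrate what consuming such a stream on the generator side might
look like, here is a rough sketch. It assumes one JSON object per line;
the field names "program" and "coverage" are placeholders I made up for
the example, so please check the actual /coverprogs output before
relying on them.

```python
import json
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical field names: the real /coverprogs schema may differ.
PROG_FIELD = "program"
COVER_FIELD = "coverage"


def iter_cover_records(lines: Iterable[str]) -> Iterator[Tuple[str, Set[int]]]:
    """Yield (program text, set of covered PCs) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        yield record[PROG_FIELD], set(record[COVER_FIELD])


def total_coverage(lines: Iterable[str]) -> Set[int]:
    """Union of the PCs covered by all programs in the stream."""
    covered: Set[int] = set()
    for _, pcs in iter_cover_records(lines):
        covered |= pcs
    return covered
```

The lines can come from any iterable, e.g. a saved dump of the endpoint
or an HTTP response read line by line, so the stream never has to be
held in memory as a whole.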
>
> 2. Let syz-manager load our generated seeds under a specific directory
> (e.g., cfg.Workdir/generated_seeds/) periodically.
>
> There are several tools added or modified on top of syzkaller:
> 1. tools/syz-repair (added): repair the generated seeds based on our
> heuristic rules, which are sufficient and faster than LLM-based repair.
> 2. tools/syz-validator (added): parse the defined syscalls from
> target.Syscalls and check the validity of the generated seeds.
As you already noted below, these two indeed could become just one tool.
FWIW we also have https://github.com/google/syzkaller/issues/5877 in
our backlog that (when implemented, of course, but the implementation
would be rather straightforward) should hopefully make it much easier
to apply LLMs to syzlang.
> 3. tools/syz-db (modified): build the inverted index for our generator
> to conduct dependency-based reference-program retrieval.
If you plan to do it dynamically (i.e. not just once at start),
calling tools/syz-db on the corpus of a running syzkaller instance is
probably not the most straightforward approach, especially if you also
plan to share some information directly from the syz-manager.
Processing the data returned by syz-manager on the side of your
generator application sounds like a better option here.
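As a rough sketch of what I mean (not necessarily the retrieval scheme
from your paper), the generator could build the inverted index itself
from program texts it has already fetched. The program IDs and the
function names below are made up for the example, and the regular
expression only covers the common "r0 = call$variant(...)" and
"call(...)" line shapes of the program text format.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Matches an optional "rN = " result assignment followed by the call name.
CALL_RE = re.compile(r"^(?:r\d+ = )?([A-Za-z_][\w$]*)\(")


def calls_in_program(prog_text: str) -> List[str]:
    """Return the call names of a program in order, skipping comments."""
    names = []
    for line in prog_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = CALL_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def build_inverted_index(programs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each call name to the IDs of the programs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for prog_id, text in programs.items():
        for name in calls_in_program(text):
            index[name].add(prog_id)
    return dict(index)
```

Since the index is a plain dictionary, it can be updated incrementally
as new corpus programs arrive instead of being rebuilt from a database
file each time.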
>
> 4. tools/syz-execprog (modified): calculate the effectiveness of the
> generated seeds by execution.
> P.S. Tools 1 and 2 are required by our generator and can be merged into
> syzkaller as one tool. Tool 3 is required by our DRAG method. Tool 4
> is only for evaluation and does not need to be merged.
>
> > I wonder if we could take in those parts (ideally by making them more generic / usable in more contexts), so that the rest of your code may stay written in a different language, but will be able to interact with any new syzkaller revision.
>
> Yes, these parts are the modules that I would like to merge into
> syzkaller. Actually, I am trying to adapt our implementation to the
> newer syzkaller.
> (e.g., I had to collect the CoveredCalls through the rpc message when
> finding a new input, since the syz-fuzzer was located in the VM in the
> old days. Now I don't have to.)
>
> I will try to make them more generic and open a PR (probably in several
> weeks), where we can further discuss whether they are suitable.
> How does that sound?
Sounds good to me!
Looking forward to your PR :)
--
Aleksandr