Request to add our paper to research.md and merge our artifact into the mainline

Zhiyu Zhang

Jun 30, 2025, 11:05:08 AM
to syzkaller
Dear Syzkaller Maintainers,

I hope this email finds you well.

I'm writing to ask for our ISSTA'25 research paper, "Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG", to be added to research.md. (Please refer to pull request 6134.)

In this paper, we define the low frequency syscalls (LFS) problem in kernel fuzzing, which exists due to unresolved syscall dependencies and fuzzing uncertainty. To address the LFS problem, we propose SyzGPT, which leverages Dependency-RAG and periodically generates effective Syz-programs containing the target LFS. You can find more details at the following links:

Paper: https://dl.acm.org/doi/pdf/10.1145/3728913
Code: https://github.com/QGrain/SyzGPT

Additionally, I would like to try to merge our work into the Syzkaller project in a non-invasive manner. I have two preliminary proposals and look forward to your advice:

1. Merge the seeds generated by SyzGPT into syzkaller/sys/linux/test/
2. Merge the "runtime enriched seeds loading module" into syzkaller as a pkg (a few small code snippets may need to be added to syz-manager). In this case, one can enable syzkaller to periodically load LLM-generated or otherwise generated seeds from a specified directory by switching on some CLI options or setting mgrconfig (I don't know which way is preferred by the community?)
(3.) Migrating SyzGPT-generator, which is implemented in Python, to Golang is somewhat trivial and tedious. (So this is not included as a merging proposal)

Thank you again for providing such a great project and community. I would always be willing to help improve it.

Best regards,
Zhiyu Zhang


Aleksandr Nogikh

Jul 7, 2025, 6:14:19 AM
to Zhiyu Zhang, syzkaller
Hi Zhiyu,

Thanks for reaching out!

On Mon, Jun 30, 2025 at 5:05 PM Zhiyu Zhang <zhiyuz...@gmail.com> wrote:
>
> Dear Syzkaller Maintainers,
>
> I hope this email finds you well.
>
> I'm writing to ask for our ISSTA'25 research paper, "Unlocking Low Frequency Syscalls in Kernel Fuzzing with Dependency-Based RAG", to be added to research.md. (Please refer to pull request 6134.)

https://github.com/google/syzkaller/pull/6134 has been merged :)

>
> In this paper, we define the low frequency syscalls (LFS) problem in kernel fuzzing, which exists due to unresolved syscall dependencies and fuzzing uncertainty. To address the LFS problem, we propose SyzGPT, which leverages Dependency-RAG and periodically generates effective Syz-programs containing the target LFS. You can find more details at the following links:
>
> Paper: https://dl.acm.org/doi/pdf/10.1145/3728913
> Code: https://github.com/QGrain/SyzGPT
>
> Additionally, I would like to try to merge our work into the Syzkaller project in a non-invasive manner. I have two preliminary proposals and look forward to your advice:
>
> 1. Merge the seeds generated by SyzGPT into syzkaller/sys/linux/test/

That sounds like the most straightforward option.

How many seeds did you generate?
Syzkaller runs everything in syzkaller/sys/linux/test/ each time it
starts, so it puts some limits on how much can be added there.

> 2. Merge the "runtime enriched seeds loading module" into syzkaller as a pkg (a few small code snippets may need to be added to syz-manager). In this case, one can enable syzkaller to periodically load LLM-generated or otherwise generated seeds from a specified directory by switching on some CLI options or setting mgrconfig (I don't know which way is preferred by the community?)

As I see from the paper and the repository (please correct me if I'm
wrong), all the actual work on determining the low frequency syscalls
and generating seeds for them is done outside the syz-manager process.
So essentially what you need is some proper API to interact with a
running syzkaller - query some detailed coverage data, inject seeds,
etc.

I wonder if we could take in those parts (ideally by making them more
generic / usable in more contexts), so that the rest of your code may
stay written in a different language, but will be able to interact
with any new syzkaller revision.

> (3.) Migrating SyzGPT-generator, which is implemented in Python, to Golang is somewhat trivial and tedious. (So this is not included as a merging proposal)
>
> Thank you again for providing such a great project and community. I would always be willing to help improve it.
>
> Best regards,
> Zhiyu Zhang
>
>

--
Aleksandr

Zhiyu Zhang

Jul 7, 2025, 7:29:43 AM
to Aleksandr Nogikh, syzkaller
Hi Aleksandr,

Thank you for your feedback!

> https://github.com/google/syzkaller/pull/6134 has been merged :)

Thanks! :)

> How many seeds did you generate? Syzkaller runs everything in syzkaller/sys/linux/test/ each time it starts, so it puts some limits on how much can be added there.

Yes, I understand. Currently, we have generated around 1k-2k effective
seeds (excluding their child variants) for the syscalls that are rarely
covered during normal fuzzing.
I can minimize them and select a smaller subset that fits within
whatever limit Syzkaller imposes.

> As I see from the paper and the repository (please correct me if I'm wrong), ... - query some detailed coverage data, inject seeds, etc.

Your understanding is correct. The main interactions between our
generator and syzkaller are:
1. Query the coverage status of EnabledCalls (i.e., CoveredCalls).
2. Let syz-manager load our generated seeds under a specific directory
(e.g., cfg.Workdir/generated_seeds/) periodically.
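
As a rough illustration of interaction 2, periodic loading could be sketched as below. It is written in Python only for brevity (syz-manager itself is written in Go), and the class name `SeedDirWatcher` and its `poll` method are hypothetical, not part of syzkaller or SyzGPT.

```python
from pathlib import Path


class SeedDirWatcher:
    """Hypothetical helper: reports seed files that appeared since the last poll."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.seen = set()  # names of files already handed out

    def poll(self):
        """Return the text of every seed file not returned by an earlier poll."""
        fresh = []
        if not self.directory.is_dir():
            return fresh  # directory not created yet: nothing to load
        for path in sorted(self.directory.iterdir()):
            if path.is_file() and path.name not in self.seen:
                self.seen.add(path.name)
                fresh.append(path.read_text())
        return fresh
```

A caller would invoke `poll()` on a timer and pass each returned program on for triage. If the generator writes seeds elsewhere and moves them into the directory only once they are complete, the watcher should not pick up half-written files.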

There are several tools added or modified on top of syzkaller:
1. tools/syz-repair (added): repair the generated seeds based on our
heuristic rules, which are sufficient and faster than LLM-based repair.
2. tools/syz-validator (added): parse the defined syscalls from
target.Syscalls and check the validity of the generated seeds.
3. tools/syz-db (modified): build the inverted index for our generator
to conduct dependency-based reference-program retrieval.
4. tools/syz-execprog (modified): calculate the effectiveness of the
generated seeds by execution.
P.S. Tools 1 and 2 are required by our generator and can be merged into
syzkaller as one tool. Tool 3 is required by our DRAG method. Tool 4
is only for evaluation and does not need to be merged.
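
To make the inverted index behind tool 3 more concrete, here is a minimal Python sketch of the general idea: map each syscall name to the programs that contain it, then rank programs by how many requested dependency calls they include. The function names are hypothetical, the regular expression only approximates the syzkaller program text format, and the actual syz-db patch may work differently.

```python
import re
from collections import defaultdict

# Approximation of a syz-program line: optional "rN = " prefix, then the call name.
CALL_RE = re.compile(r"^(?:r\d+\s*=\s*)?([A-Za-z_][\w$]*)\(")


def calls_in_program(text):
    """Extract call names (e.g. "socket$inet") from a program, skipping comments."""
    calls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = CALL_RE.match(line)
        if match:
            calls.append(match.group(1))
    return calls


def build_inverted_index(programs):
    """programs: dict of program id -> program text. Returns call name -> set of ids."""
    index = defaultdict(set)
    for prog_id, text in programs.items():
        for call in calls_in_program(text):
            index[call].add(prog_id)
    return index


def retrieve(index, dependencies):
    """Rank program ids by how many of the dependency calls they contain."""
    hits = defaultdict(int)
    for call in dependencies:
        for prog_id in index.get(call, ()):
            hits[prog_id] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))
```

For a target syscall whose dependencies are, say, `socket$inet` and `bind$inet`, `retrieve` returns programs containing both calls ahead of those containing only one.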

> I wonder if we could take in those parts (ideally by making them more generic / usable in more contexts), so that the rest of your code may stay written in a different language, but will be able to interact with any new syzkaller revision.

Yes, these parts are the modules that I would like to merge into
syzkaller. Actually, I am trying to adapt our implementation to the
newer syzkaller.
(e.g., I had to collect the CoveredCalls through the rpc message when
finding a new input, since the syz-fuzzer was located in the VM in the
old days. Now I don't have to.)

I will try to make them more generic and open a PR (maybe several
weeks later), where we can further discuss whether they are appropriate.
How does that sound?


Best regards,
Zhiyu Zhang

On Mon, Jul 7, 2025 at 18:14, Aleksandr Nogikh <nog...@google.com> wrote:

Aleksandr Nogikh

Jul 14, 2025, 6:22:48 AM
to Zhiyu Zhang, syzkaller
Hi Zhiyu,

Sorry for the late reply.

On Mon, Jul 7, 2025 at 1:29 PM Zhiyu Zhang <zhiyuz...@gmail.com> wrote:
>
> Hi Aleksandr,
>
> Thank you for your feedback!
>
> > https://github.com/google/syzkaller/pull/6134 has been merged :)
>
> Thanks! :)
>
> > How many seeds did you generate? Syzkaller runs everything in syzkaller/sys/linux/test/ each time it starts, so it puts some limits on how much can be added there.
>
> Yes, I understand. Currently, we have generated around 1k-2k effective
> seeds (excluding their child variants) for the syscalls that are rarely
> covered during normal fuzzing.
> I can minimize them and select a smaller subset that fits within
> whatever limit Syzkaller imposes.

1k-2k doesn't sound like a very big number, though it would still be
better to reduce it further. Currently, we focus our seed programs on
something that a coverage-guided fuzzer could not be realistically
expected to come up with: valid filesystem images, long and tricky
network communications, etc. Some seeds are used as tests for the
manually written descriptions (see e.g. [1]), but these do not always
contribute a lot to how far the fuzzer could reach.

So the seeds we would prefer to see in our repository should each be
cohesive and target specifically the areas that are most problematic
for the fuzzer.

Did you upload the ones you have generated to your public repository?
I am very curious to see what they look like now :)

[1] https://github.com/google/syzkaller/blob/master/sys/linux/test/landlock_fs_accesses

>
>
> > As I see from the paper and the repository (please correct me if I'm wrong), ... - query some detailed coverage data, inject seeds, etc.
>
> Your understanding is correct. The main interactions between our
> generator and syzkaller are:
> 1. Query the coverage status of EnabledCalls (i.e., CoveredCalls).

Please note that we also have a /coverprogs HTTP endpoint that returns
a jsonl stream of all corpus programs alongside the coverage
triggered by each of them. If you don't need the exact raw coverage,
but rather just the total size of it per syscall, I guess making a
?json=1 version of /syscalls could also be an option.

>
> 2. Let syz-manager load our generated seeds under a specific directory
> (e.g., cfg.Workdir/generated_seeds/) periodically.
>
> There are several tools added or modified on top of syzkaller:
> 1. tools/syz-repair (added): repair the generated seeds based on our
> heuristic rules, which are sufficient and faster than LLM-based repair.
> 2. tools/syz-validator (added): parse the defined syscalls from
> target.Syscalls and check the validity of the generated seeds.

As you already noted below, these two indeed could become just one tool.

FWIW we also have https://github.com/google/syzkaller/issues/5877 in
our backlog that (when implemented, of course, but the implementation
would be rather straightforward) should hopefully make it much easier
to apply LLMs to syzlang.

> 3. tools/syz-db (modified): build the inverted index for our generator
> to conduct dependency-based reference-program retrieval.

If you plan to do it dynamically (i.e. not just once at start),
calling tools/syz-db on a corpus of a running syzkaller instance is
probably not the most straightforward approach. Especially if you also
plan to share some information directly from the syz-manager.

Processing the data returned by syz-manager on the side of your
generator application sounds like a better option here.

>
> 4. tools/syz-execprog (modified): calculate the effectiveness of the
> generated seeds by execution.
> P.S. Tools 1 and 2 are required by our generator and can be merged into
> syzkaller as one tool. Tool 3 is required by our DRAG method. Tool 4
> is only for evaluation and does not need to be merged.
>
> > I wonder if we could take in those parts (ideally by making them more generic / usable in more contexts), so that the rest of your code may stay written in a different language, but will be able to interact with any new syzkaller revision.
>
> Yes, these parts are the modules that I would like to merge into
> syzkaller. Actually, I am trying to adapt our implementation to the
> newer syzkaller.
> (e.g., I had to collect the CoveredCalls through the rpc message when
> finding a new input, since the syz-fuzzer was located in the VM in the
> old days. Now I don't have to.)
>
> I will try to make them more generic and open a PR (maybe several
> weeks later), where we can further discuss whether they are appropriate.
> How does that sound?

Sounds good to me!
Looking forward to your PR :)

--
Aleksandr

Zhiyu Zhang

Jul 15, 2025, 1:15:20 PM
to syzkaller
Hi Aleksandr,

Thanks for your reply.

> > Yes, I understand. Currently, we have generated around 1k-2k effective
> > seeds (excluding their children variants) for those rarely covered
> > syscalls during normal fuzzing.
> 1k-2k doesn't sound like a very big number, though it would still be better to reduce it further.

Sorry, I misremembered the scale of the generated seeds, since it has been a while. There should be ~20k (of which ~4k have length ≥5) before minimization.

> So the seeds we would prefer to see in our repository should each be
> cohesive and target specifically the areas that are most problematic
> for the fuzzer.

> Did you upload the ones you have generated to your public repository?
> I am very curious to see what they look like now :)

Understood. I haven't uploaded them so far. I would like to minimize them before uploading.
However, I find it difficult to minimize the seeds elegantly. There seems to be no existing standalone tool for prog minimization.  :(
So, it seems that I have to develop a syz-minimizer following the example of syz-execprog (prog.Minimize is easy to call, but execution and signal acquisition are not).

> > 1. Query the coverage status of EnabledCalls (i.e., CoveredCalls).
> If you don't need the exact raw coverage, but rather just the total size of it per syscall, 
> I guess making a ?json=1 version of /syscalls could also be an option.

Yes, what we need is just which syscalls are covered (a boolean per syscall), not the code coverage per syscall.
I find that there are tuples of <Inputs, Total, Coverage> for every enabled syscall in /syscalls. Could you tell me which non-zero value indicates that the syscall has been tested/covered by syzkaller? (I guess it's Inputs/Coverage?)

> FWIW we also have https://github.com/google/syzkaller/issues/5877 in
> our backlog that (when implemented, of course, but the implementation
> would be rather straightforward) should hopefully make it much easier
> to apply LLMs to syzlang.

"4. Add support for referencing const values by name." in issue 5877 looks good for LLM-generated programs, as LLMs tend to use macro names rather than numeric values.

> If you plan to do it dynamically (i.e. not just once at start),
> calling tools/syz-db on a corpus of a running syzkaller instance is
> probably not the most straightforward approach. ...

Actually, we won't have a running syzkaller call the modified tools/syz-db. We just added an argument to syz-db (you can refer to https://github.com/QGrain/SyzGPT/blob/main/fuzzer/SyzGPT-fuzzer_for_f1b6b00.patch#L524-L658).
And the "syz-db parse corpus.db dir" command would be dynamically called by SyzGPT-generator rather than syz-manager (it's more like providing a tool service for our generator).

> Sounds good to me!
> Looking forward to your PR :)

Thank you very much! I might be busy in the next month or two, but I will try to make progress  :)

Best regards,

Zhiyu

Aleksandr Nogikh

Jul 17, 2025, 6:06:55 AM
to Zhiyu Zhang, syzkaller
Hi Zhiyu,

On Tue, Jul 15, 2025 at 7:15 PM Zhiyu Zhang <zhiyuz...@gmail.com> wrote:
>
> Hi Aleksandr,
>
> Thanks for your reply.
>
> > > Yes, I understand. Currently, we have generated around 1k-2k effective
> > > seeds (excluding their children variants) for those rarely covered
> > > syscalls during normal fuzzing.
> >
> > 1k-2k doesn't sound like a very big number, though it would still be better to reduce it further.
>
> Sorry, I misremembered the scale of the generated seeds, since it has been a while. There should be ~20k (of which ~4k have length ≥5) before minimization.

Oh, 20k is already a very significant number.

I wonder if instead of merging it into sys/linux/* it'd be better to
share the corpus like it's done in this project:
https://github.com/cmu-pasta/linux-kernel-enriched-corpus

In this case, there's no critical need for minimizing it (it would
still be nice, though). Also, on the syzbot side, we could inject
those programs to the syzbot instances so that they would re-triage it
and learn something new from it.

>
> > So the seeds we would prefer to see in our repository should each be
> > cohesive and target specifically the areas that are most problematic
> > for the fuzzer.
> >
> > Did you upload the ones you have generated to your public repository?
> > I am very curious to see what they look like now :)
>
> Understood. I haven't uploaded them so far. I would like to minimize them before uploading.
> However, I find it difficult to minimize the seeds elegantly. There seems to be no existing standalone tool for prog minimization. :(
> So, it seems that I have to develop a syz-minimizer following the example of syz-execprog (prog.Minimize is easy to call, but execution and signal acquisition are not).

There's also an interesting question regarding the predicate against
which one could minimize a program. During fuzzing, the predicate is
clear - we know the delta of the new coverage that this specific
program has been able to trigger, so we minimize it as long as the
minimized program can still trigger it. But if we just take an
arbitrary program, it's not clear what signal/coverage is
significant and what is not.
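
The role of the predicate can be illustrated with a small Python sketch. This is a simplified stand-in that only removes whole calls and is not how prog.Minimize is implemented; `minimize_calls`, `keeps_signal` and the toy executor `toy_run` are hypothetical names, and a real predicate would execute the candidate in a VM and compare the collected signal.

```python
def minimize_calls(calls, predicate):
    """Greedily drop calls, last to first, while predicate(candidate) holds."""
    current = list(calls)
    for i in range(len(current) - 1, -1, -1):
        candidate = current[:i] + current[i + 1:]
        if candidate and predicate(candidate):
            current = candidate
    return current


def keeps_signal(run, required):
    """Predicate used during fuzzing: the candidate must still hit `required`."""
    return lambda calls: required <= run(calls)


def toy_run(calls):
    """Toy executor: signal 1 for open, signal 2 for a read after an open."""
    signal = set()
    if "open" in calls:
        signal.add(1)
        if "read" in calls[calls.index("open"):]:
            signal.add(2)
    return signal
```

With `required={2}` the program `["open", "getpid", "read", "close"]` shrinks to `["open", "read"]`, while with `required={1}` it shrinks to `["open"]`. The outcome depends entirely on which signal is declared significant, which is exactly what is undefined for an arbitrary program.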

It probably makes more sense to add a special mode to syz-manager to
only retriage the corpus and exit, without doing any extra fuzzing.
Then you could reuse all the already implemented functionality that
tears unrelated programs apart, minimizes against the new coverage
they are able to produce, etc.

https://github.com/google/syzkaller/blob/master/syz-manager/manager.go#L126

>
> > > 1. Query the coverage status of EnabledCalls (i.e., CoveredCalls).
> > If you don't need the exact raw coverage, but rather just the total size of it per syscall,
> > I guess making a ?json=1 version of /syscalls could also be an option.
>
> Yes, what we need is just which syscalls are covered (a boolean per syscall), not the code coverage per syscall.
> I find that there are tuples of <Inputs, Total, Coverage> for every enabled syscall in /syscalls. Could you tell me which non-zero value indicates that the syscall has been tested/covered by syzkaller? (I guess it's Inputs/Coverage?)

If the criterion is whether the value is zero or not, Inputs and
Coverage should be interchangeable.
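
On the generator side, that zero/non-zero check could look like the Python sketch below. Since /syscalls has no JSON form yet, the row layout (`name`, `inputs`, `coverage` keys) and both function names are purely hypothetical.

```python
def covered_calls(rows, field="inputs"):
    """Return the names of syscalls whose chosen counter is non-zero."""
    return {row["name"] for row in rows if row.get(field, 0) > 0}


def low_frequency_candidates(enabled, rows):
    """Enabled syscalls that no corpus program has covered yet, sorted by name."""
    return sorted(set(enabled) - covered_calls(rows))
```

Passing `field="coverage"` instead of the default should select the same set, per the note above that the two counters are interchangeable for this purpose.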

>
> > FWIW we also have https://github.com/google/syzkaller/issues/5877 in
> > our backlog that (when implemented, of course, but the implementation
> > would be rather straightforward) should hopefully make it much easier
> > to apply LLMs to syzlang.
>
> "4. Add support for referencing const values by name." in issue 5877 looks good for LLM-generated programs, as LLMs tend to use macro names rather than numeric values.
>
> > If you plan to do it dynamically (i.e. not just once at start),
> > calling tools/syz-db on a corpus of a running syzkaller instance is
> > probably not the most straightforward approach. ...
>
> Actually, we won't have a running syzkaller call the modified tools/syz-db. We just added an argument to syz-db (you can refer to https://github.com/QGrain/SyzGPT/blob/main/fuzzer/SyzGPT-fuzzer_for_f1b6b00.patch#L524-L658).
> And the "syz-db parse corpus.db dir" command would be dynamically called by SyzGPT-generator rather than syz-manager (it's more like providing a tool service for our generator).

I was talking specifically about your generator tool calling syz-db on
a corpus.db that is simultaneously updated by some running
syz-manager.

>
> > Sounds good to me!
> > Looking forward to your PR :)
>
> Thank you very much! I might be busy in the next month or two, but I will try to make progress :)
>
> Best regards,
>
> Zhiyu

--
Aleksandr