Allow multiple corpus locations?


thepudds

May 12, 2019, 9:44:18 AM
to Golang Fuzzing
This is the start of a thread specifically on item 11 "Allow multiple corpus locations?" from the bigger "List of possible modifications to the March 2017 proposal" thread:


In this post, I just:

 * Quoted what was said on item 11 in that bigger thread.
 * Grabbed a few quotes from multi-corpus related discussion in the "Should `go test` without `-fuzz` ever be non-deterministic?" thread (but did not try to grab everything).

In short, if you recently read those other threads, no real need to read this post, but wanted to at least extract some of the bigger comments to put here. 
 
From the bigger "List of possible modifications to the March 2017 proposal" thread:

    > -------------------------------------------------------------------
    > 11. Allow multiple corpus locations?
    > -------------------------------------------------------------------
    > 
    > This is a longer topic, but wanted to at least record the topic itself.
    > 
    > 
    >   "I wonder if it's a good idea to instead allow 2/2+ directories with input corpus?
    > For example, if we read inputs from testdata/something/something, but also from 
    > -fuzzdir/-workdir if provided. Then testdata/ could contain hand-written inputs and regression 
    > tests and is checked-in with the code (that's small number of higher-quality inputs with low 
    > churn, so no different from unit-tests and makes sense to check-in). The second dir can contain 
    > the random inputs, there are more of them and high churn. So that is preferably checked-in 
    > somewhere else (stored in an archive or something else). Then workflow would be to simply copy 
    > the crashing input from the second dir into testdata/ and run go test -run=file_name (if the 
    > auto-generated regression test uses t.Run then this will work auto-magically)."
    > 
    > 
    > And then the probably too enthusiastic response:
    > -------------------------------------------------------------------
 
Josh responded in that bigger thread:
    > -------------------------------------------------------------------
    > FWIW, I think multiple corpus locations is a great idea.
    > -------------------------------------------------------------------
    
Romain wrote in that bigger thread:
    > -------------------------------------------------------------------
    > I think we should not need multiple corpus locations: either there is a real need 
    > because both corpus are meant to be run differently, or have different purposes, in which 
    > case it also makes sense to run the tool twice, or there is none and why have them in the first place?
    > -------------------------------------------------------------------
 
Dmitry also responded in that bigger thread:
    > -------------------------------------------------------------------
    > We still did not come up with a complete plan for paths, right? I 
    > remember there was a bunch of scattered discussions about this. 
    > [...]
    > There are one-off runs, where I think should just work with zero 
    > configuration in some reasonable way. CI runs that may specify an 
    > explicit directory. Also some discussion about ability to specify 
    > several paths for corpus for merging or pointing to regression testing 
    > inputs in VCS.  There is also a use case of running corpus during 
    > regression testing. Also this is complicated by ability to run 
    > multiple fuzz functions/packages, how are paths overridden in this 
    > case? 
    > 
    > I remember there were some discussions re pkg/testdata and GOPATH/pkg. 
    > Go command now also uses $HOME/.cache (or what is that?) so we could 
    > too. 
    > 
    > I wonder if we could come up with some convention re corpus location 
    > (for both explicitly checked-in inputs for regression testing and 
    > actual live fuzzing corpus) that would work for all of these cases? In 
    > the end a CI could also copy-in/out fuzzing corpus somewhere into 
    > GOPATH/pkg/... rather than impose its own location on go-fuzz. 
    > -------------------------------------------------------------------

Dmitry also later wrote in that bigger thread:
    > -------------------------------------------------------------------
    > The question I have: can we design a solution that does not have 
    > -fuzzdir at all? 
    > That would automatically resolve all problems around -fuzzdir handling 
    > and meaning :) 
    > -------------------------------------------------------------------

In the "Should `go test` without `-fuzz` ever be non-deterministic?" thread, Dmitry also made some comments on multiple corpus locations. I won't quote everything he wrote there, but the biggest related comment there from Dmitry I think was:
    > -------------------------------------------------------------------
    > But what exactly is "inputs from the corpus" depends on how we 
    > organize and discover the corpus. 
    > 
    > Traditionally, for a fuzzer, the corpus is a single directory with files, all 
    > of which are under control of the fuzzer (i.e. it can 
    > add/remove/change/etc them as necessary). 
    > But I am thinking of splitting the corpus into a "fixed" part and a "dynamic" part. 
    > "Dynamic" part is the traditional definition of the corpus (lots of 
    > random inputs that fuzzer generates and changes). 
    > "Fixed" part is controlled by humans and is checked into VCS 
    > (generally) (say, this will be located in testdata/fuzz). There are 2 
    > ways how inputs appear in the "fixed" part: (1) a developer copies a 
    > single crashing input from the dynamic corpus, which now becomes a 
    > regression test; (2) a developer explicitly writes a new input (or 
    > copies them from some existing base of tests, e.g. different real jpeg 
    > pictures), this is also useful for so-called bootstrapping of the 
    > corpus. 
    > 
    > Now, fuzzer will also take, execute and mutate inputs from the "fixed" 
    > part, but only do any changes (add/remove/change) to the "dynamic" 
    > part. 
    > 
    > Now, go test may run only fixed inputs or both fixed and dynamic. 
    > Running only fixed is fully deterministic. Running both may lead to 
    > unwanted failures. Consider, you run a fuzzer, it produced a crashing 
    > input in the corpus. Now, you do some unrelated change to the code and 
    > re-run tests. Tests crash because you have that crashing input in the 
    > dynamic corpus now (other people don't, it's only on your disk for 
    > now). 
    > -------------------------------------------------------------------

Romain had a couple related comments in that same thread, including:
    > -------------------------------------------------------------------
    > In addition, using GOPATH/pkg/fuzz/xxx, I'm not sure about the way the user should promote 
    > a generated input to the checked-in corpus. Having to do a manual copy seems clumsy, 
    > error-prone at best and too much arcane for the "standard user" I imagine. We would
    > have to add tooling for this and I'm not convinced this would be better.
    > -------------------------------------------------------------------

So that is a summary of some of the main points made so far on the multi-corpus discussion...

Regards,
thepudds

thepudds

May 12, 2019, 9:56:03 AM
to Golang Fuzzing
FWIW, as I've said before, I think the idea of having more than one corpus location helps a great deal on multiple dimensions.

If we go that way, something plausible that is maybe on the simpler end of the spectrum:

------------------------------------------------------------------
Simple Baseline for Multiple Corpus Locations
------------------------------------------------------------------
 * -fuzzdir supports at most 1 directory path
 * Regardless of whether or not -fuzzdir is set:
     * Always read from <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus if it exists
 * If -fuzzdir is set:
     * New inputs written to <fuzzdir>/.../corpus
     * Crashes written to <fuzzdir>/.../crashers
 * If -fuzzdir is not set:
     * New inputs written to <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus
     * Crashes written to <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers
     
That I think might be close to what Dmitry is describing as a "Fixed" location and a "Dynamic" location.

(And for now, I'm ignoring the exact structure of what is under -fuzzdir, e.g., is it <fuzzdir>/github.com/some/import/path/FuzzFunc, or no additional structure under <fuzzdir>, or something else).
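
The baseline rules above can be written down as a small path-resolution function. This is only a minimal sketch, not part of any proposal text: the names `Dirs` and `baselineDirs` are made up for illustration, `fuzzDirSub` is a hypothetical stand-in for the still-undecided layout under -fuzzdir, and reading back from the -fuzzdir corpus is an assumption (the baseline only spells out where writes go).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Dirs describes where a fuzzing run reads and writes inputs.
type Dirs struct {
	Read     []string // corpus directories to read from, if they exist
	Corpus   string   // where newly generated inputs are written
	Crashers string   // where crashing inputs are written
}

// baselineDirs applies the "simple baseline" rules. fuzzDir is the value of
// -fuzzdir ("" when unset). fuzzDirSub is a hypothetical placeholder for
// whatever structure ends up under -fuzzdir; it may be empty.
func baselineDirs(pkgDir, fuzzFunc, fuzzDir, fuzzDirSub string) Dirs {
	fixed := filepath.Join(pkgDir, "testdata", "fuzz", fuzzFunc)

	// The checked-in corpus is always read, regardless of -fuzzdir.
	d := Dirs{Read: []string{filepath.Join(fixed, "corpus")}}

	out := fixed
	if fuzzDir != "" {
		out = filepath.Join(fuzzDir, fuzzDirSub)
		// Assumption: the dynamic corpus is also read back on later runs.
		d.Read = append(d.Read, filepath.Join(out, "corpus"))
	}
	d.Corpus = filepath.Join(out, "corpus")
	d.Crashers = filepath.Join(out, "crashers")
	return d
}

func main() {
	fmt.Printf("%+v\n", baselineDirs("pkg", "FuzzParse", "", ""))
	fmt.Printf("%+v\n", baselineDirs("pkg", "FuzzParse", "/local/big/corpus-repo", "FuzzParse"))
}
```

With -fuzzdir unset, the "Fixed" and "Dynamic" locations collapse into the same testdata directory; setting -fuzzdir is what separates them.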

Here are some possible variations on the above approach, where probably any of these could be independently added to the simpler baseline above, or possibly could add a combination of them, but each of these is arguably more complex than what's above:

a. perhaps crashes always go to <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers, even if -fuzzdir is set to point somewhere else. The argument for this might be that crashers are interesting, you don't want to lose them, and after fixing them, it might at least feel slightly easier to do 'cd testdata/fuzz/FuzzFunc; cp crashers/* corpus/' than to copy crashers from the right location underneath -fuzzdir into the right location under the package itself.  Semi-related: perhaps 'go test ./...' (without a '-fuzz' or '-fuzzdir') should always run anything in <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers as unit tests in addition to <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus -- in other words, 'go test ./...' (without a '-fuzz' or '-fuzzdir') would fail while you have actual crashers sitting in the crashers directory underneath testdata.

b. -fuzzdir could be a comma separated list of locations, with the first value treated as the "Fixed" location (e.g., read from it but don't put new generated inputs into it).

c. -fuzzdir could be a comma separated list of locations as in b., but a special value in the list like 'auto' or 'testdata' could mean <pkgdir>/testdata/fuzz/<fuzzfunc>. So a common setting might be something like '-fuzzdir=testdata,/local/big/corpus-repo'.  (In terms of modeling things here after things that already exist in the go tool: 'GOPROXY=direct' means "connect directly" (e.g., connect directly to github.com), and in 1.13 it can appear in a list like 'GOPROXY=proxy1.org,proxy2.org,direct').

d. I think that is much more complicated than what we need, especially for any type of first version, but -fuzzdir could be a comma separated list of wildcarded package patterns that map to corpus locations. So something like '-fuzzdir=github.com/my/proj1.*:/local/proj1-corpus,github.com/my/proj2.*:/local/proj2-corpus'.  This could mean a user could set the new 'GOFLAGS' environment variable and have it be meaningful as they switch projects, or use the new in 1.13 'go env -w' to set those patterns to a user-specific environmental config file. (In terms of modeling things here after things that already exist -- in 1.13, you will be able to use wildcarded patterns to prevent private packages from having their cryptographic checksums verified against the public sumdb.golang.org notary via things like 'GONOSUMDB=*.corp.google.com,rsc.io/private').
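
To make variations b. and c. concrete, here is a rough sketch of how a comma separated -fuzzdir value might be parsed. The function name `parseFuzzDirList` is hypothetical, and the handling of empty entries is an assumption; nothing here is an existing go tool API.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseFuzzDirList splits a comma separated -fuzzdir value. Following
// variation b., the first entry is treated as the read-only "Fixed" location
// and the rest as "Dynamic" locations. Following variation c., the special
// value "testdata" expands to <pkgdir>/testdata/fuzz/<fuzzfunc>.
func parseFuzzDirList(value, pkgDir, fuzzFunc string) (fixed string, dynamic []string) {
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // assumption: empty entries are ignored
		}
		if entry == "testdata" {
			entry = filepath.Join(pkgDir, "testdata", "fuzz", fuzzFunc)
		}
		if fixed == "" {
			fixed = entry
		} else {
			dynamic = append(dynamic, entry)
		}
	}
	return fixed, dynamic
}

func main() {
	fixed, dynamic := parseFuzzDirList("testdata,/local/big/corpus-repo", "pkg", "FuzzParse")
	fmt.Println("fixed:  ", fixed)
	fmt.Println("dynamic:", dynamic)
}
```

Variation d. would need an extra pattern-matching step on top of this, which is part of why it looks like the most complex option.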

Net, I think staying on the simpler end of the spectrum here is better, but worth at least thinking through some possible alternatives...

Regards,
thepudds

Dmitry Vyukov

May 16, 2019, 10:15:16 AM
to thepudds, Golang Fuzzing
On Sun, May 12, 2019 at 3:56 PM thepudds <thepud...@gmail.com> wrote:
>
> FWIW, as I've said before, I think the idea of having more than one corpus location helps a great deal on multiple dimensions.
>
> If we go that way, something plausible that is maybe on the simpler end of the spectrum:
>
> ------------------------------------------------------------------
> Simple Baseline for Multiple Corpus Locations
> ------------------------------------------------------------------
> * -fuzzdir supports at most 1 directory path
> * Regardless of whether or not -fuzzdir is set:
> * Always read from <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus if it exists
> * If -fuzzdir is set:
> * New inputs written to <fuzzdir>/.../corpus
> * Crashes written to <fuzzdir>/.../crashers
> * If -fuzzdir is not set:
> * New inputs written to <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus
> * Crashes written to <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers
>
> That I think might be close to what Dmitry is describing as a "Fixed" location and a "Dynamic" location.

I learned that one of our internal fuzzing systems independently
arrived at the idea of having 2 corpuses: one is smaller, manually
created, checked into VCS, and used during unit testing; the other
is larger, dynamic, and stored elsewhere.
So I would accept this idea of 2 corpuses as the solution for now, at
least until we try it and gather feedback.

I also think we should use <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus
always, without any flags or options to override. Just to make things
simpler. In the end one can always temporarily delete/move the contents of
that dir, if they don't want it to be used for some reason.


> (And for now, I'm ignoring the exact structure of what is under -fuzzdir, e.g., is it <fuzzdir>/github.com/some/import/path/FuzzFunc, or no additional structure under <fuzzdir>, or something else).
>
> Here are some possible variations on the above approach, where probably any of these could be independently added to the simpler baseline above, or possibly could add a combination of them, but each of these is arguably more complex than what's above:
>
> a. perhaps crashes always go to <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers, even if -fuzzdir is set to point somewhere else. The argument for this might be that crashers are interesting, you don't want to lose them, and after fixing them, it might at least feel slightly easier to do 'cd testdata/fuzz/FuzzFunc; cp crashers/* corpus/' than to copy crashers from the right location underneath -fuzzdir into the right location under the package itself. Semi-related: perhaps 'go test ./...' (without a '-fuzz' or '-fuzzdir') should always run anything in <pkgdir>/testdata/fuzz/<fuzzfunc>/crashers as unit tests in addition to <pkgdir>/testdata/fuzz/<fuzzfunc>/corpus -- in other words, 'go test ./...' (without a '-fuzz' or '-fuzzdir') would fail while you have actual crashers sitting in the crashers directory underneath testdata.

Ha! That's an interesting idea.
If this works as intended and doesn't create any unforeseen
consequences, then it looks like what users will want.
But we also need to think about continuous fuzzing integrations: they
may want new crashes to not be intermixed with older, fixed bugs.
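
For illustration, the "crashers run as unit tests" idea from variation a. could look roughly like the sketch below. It is written as a plain function so that it is self-contained; in a real `go test` integration each file would presumably become a `t.Run` subtest. The names `failingInputs` and `demo`, the `FuzzParse` directory, and the toy check that treats the input "boom" as a crash are all made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// failingInputs runs check on every regular file in each of dirs and returns
// the paths of the inputs for which check reports an error. Directories that
// do not exist are skipped.
func failingInputs(dirs []string, check func([]byte) error) ([]string, error) {
	var failed []string
	for _, dir := range dirs {
		entries, err := os.ReadDir(dir)
		if errors.Is(err, os.ErrNotExist) {
			continue
		}
		if err != nil {
			return nil, err
		}
		for _, e := range entries {
			if e.IsDir() {
				continue
			}
			p := filepath.Join(dir, e.Name())
			data, err := os.ReadFile(p)
			if err != nil {
				return nil, err
			}
			if check(data) != nil {
				failed = append(failed, p)
			}
		}
	}
	return failed, nil
}

// demo builds a throwaway testdata tree with one good corpus input and one
// crasher, runs a toy check over both directories, and returns the base
// names of the failing inputs.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "fuzzdemo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	base := filepath.Join(root, "testdata", "fuzz", "FuzzParse")
	files := map[string]string{
		filepath.Join(base, "corpus", "ok"):    "hello",
		filepath.Join(base, "crashers", "bad"): "boom",
	}
	for name, content := range files {
		if err := os.MkdirAll(filepath.Dir(name), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(name, []byte(content), 0o644); err != nil {
			return nil, err
		}
	}

	check := func(b []byte) error {
		if string(b) == "boom" {
			return errors.New("simulated crash")
		}
		return nil
	}
	dirs := []string{filepath.Join(base, "corpus"), filepath.Join(base, "crashers")}
	failed, err := failingInputs(dirs, check)
	if err != nil {
		return nil, err
	}
	for i, p := range failed {
		failed[i] = filepath.Base(p)
	}
	return failed, nil
}

func main() {
	failed, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("failing inputs:", failed)
}
```

A continuous fuzzing integration that wants to keep new crashes apart from older, fixed bugs could pass a different set of directories to the same loop.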



> b. -fuzzdir could be a comma separated list of locations, with the first value treated as the "Fixed" location (e.g., read from it but don't put new generated inputs into it).
>
> c. -fuzzdir could be a comma separated list of locations as in b., but a special value in the list like 'auto' or 'testdata' could mean <pkgdir>/testdata/fuzz/<fuzzfunc>. So a common setting might be something like '-fuzzdir=testdata,/local/big/corpus-repo'. (In terms of modeling things here after things that already exist in the go tool: 'GOPROXY=direct' means "connect directly" (e.g., connect directly to github.com), and in 1.13 it can appear in a list like 'GOPROXY=proxy1.org,proxy2.org,direct').
>
> d. I think that is much more complicated than what we need, especially for any type of first version, but -fuzzdir could be a comma separated list of wildcarded package patterns that map to corpus locations. So something like '-fuzzdir=github.com/my/proj1.*:/local/proj1-corpus,github.com/my/proj2.*:/local/proj2-corpus'. This could mean a user could set the new 'GOFLAGS' environment variable and have it be meaningful as they switch projects, or use the new in 1.13 'go env -w' to set those patterns to a user-specific environmental config file. (In terms of modeling things here after things that already exist -- in 1.13, you will be able to use wildcarded patterns to prevent private packages from having their cryptographic checksums verified against the public sumdb.golang.org notary via things like 'GONOSUMDB=*.corp.google.com,rsc.io/private').
>
> Net, I think staying on the simpler end of the spectrum here is better, but worth at least thinking through some possible alternatives...

For the dynamic corpus, it is hard to say ahead of time. I think we should
go with something simpler and not over-engineer it without a clear need.
That would be either 1 location for the dynamic corpus, or no -fuzzdir
at all if we can find a suitable default location.
However, OSS-Fuzz also uses libFuzzer's ability to merge corpuses.
This may require specifying more than 1 corpus location. Starting to
run go-fuzz on OSS-Fuzz will help figure out what exactly we need.

thepudds

May 21, 2019, 12:59:58 PM
to golang-fuzzing-proposal
I think the questions around how > 1 corpus could work might be the largest open questions for the proposal at this point.

Here are a few background thoughts, and then a more specific sketch of one possible set of behaviors, which might do a reasonable job of making it simple for people starting out, while still supporting needs of something like CI, and allowing a (hopefully?) reasonable transition path forward if you start out using the simplest approach. (Sorry this is slightly long, but I didn't have time to make it shorter ;-)

First, it might help to keep in mind slicing usage along a few different dimensions... 

For example:

Project Sophistication / Size
--------------------------------------------------

  1. A 1-2 person open source project, or a 1-2 person enterprise project, or a student/hobbyist, etc. Often wants it to be as simple as possible, at least to start.

  2. Medium size projects. Perhaps still wants to keep things simple and might not set up specific new infrastructure (e.g., might not want to go through the overhead of creating a separate repo), but might be more likely than a small project to do so or to use something that already exists.

  3. Larger project (open source, start up, enterprise). Likely runs or uses other test or CI infrastructure, more willing to have a shell script to help with setup, more likely to have a make file or similar, etc.

That is not a 100% representative split (including there is not a perfect correlation between project size and project sophistication), but it is helpful to keep in mind that some people want it simple (especially when starting out), and others are willing to tolerate a bit more complexity if it gives a better outcome.

A separate dimension is how long an individual fuzzing run lasts. You could pick many different breakpoints for duration, but here is a sample stab from the perspective of how "interactive" it might feel for the developer:

Timescale for a Single Fuzzing Run
--------------------------------------------------

  0. Want to use corpus as inputs to unit tests (but want it to be fast and deterministic, which currently means no actual fuzzing / generation of mutation-based inputs).

  1. 5-120 seconds ("I'll do a quick fuzz run while I stare at it for 5 seconds", or "Let me grab a cup of coffee while it runs", etc.)

  2. Multiple hours or days (e.g., over night, over the weekend)

  3. Continuously (oss-fuzz, fuzzbuzz, ClusterFuzz, but also probably important for people starting out: "We spun up a VM, kicked off fuzzing, and check it periodically", or ~10-line bash script kicked off by something like Bamboo or Jenkins or some other non-fuzzing-specific tool that someone already has running).

I think one interesting aspect to think about is how valuable is the resulting corpus after those different timescales. 

If you want to optimize *solely* for *always* retaining even the tiniest CPU timeslice spent on fuzzing, that might lead to a design that requires the corpus to always be in VCS.

On the other hand, short runs are not as valuable. For example, if you do a handful of 1-2 minute runs over the course of a couple days, it is not *terrible* if you end up losing the inputs generated during that time period, because you could for example always kick it off to run over night at the end of the day, which should dwarf the value from a few minutes of fuzzing time. (And it's not that we *want* to lose anything; this is just a comment on relative value).

Taking that into account, perhaps there is some additional convenience (including avoiding frequently dirtying VCS status) by having the corpus *default* to being stored in GOPATH/pkg/fuzz/corpus, and view that almost as a cache that is "OK" to lose if the user never moves it somewhere safer or never sets a non-default location.

This could end up with three types of locations for the corpus:

Three Allowed Corpus Locations
--------------------------------------------------

 1. GOPATH/pkg/fuzz/corpus/...  
      
 2. <pkgpath>/testdata/fuzz/<fuzzfunc>/corpus  
    (Underneath the code being tested, presumably in VCS)

 3. -fuzzdir=/any/directory  
    (In a separate repo, or a tar.gz unpacked from cloud storage, or wherever)
 
Putting that all together, you could then define three general principles: 

General Principles
--------------------------------------------------

 1. when fuzzing, always *read* inputs from any known corpus location, if it exists

 2. -fuzzdir controls where *writes* go

 3. GOPATH/pkg/fuzz/corpus is effectively a cache, and a storage place for lower value things until someone makes a conscious decision to store it elsewhere.

Those general principles then imply the main behavior that results, which I will try to spell out a bit more explicitly here:

Behaviors when '-fuzz' is set
--------------------------------------------------
 
If '-fuzzdir' is not set:
  read from: 
      GOPATH/pkg/fuzz/corpus/import/path
      .../import/path/testdata/fuzz/<fuzzfunc>/corpus (if it exists)
  write new inputs and crashes to:     
      GOPATH/pkg/fuzz/corpus/import/path

If '-fuzzdir' is set to an actual directory path (-fuzzdir=/some/fuzzdir/directory)
  read from: 
      GOPATH/pkg/fuzz/corpus/import/path
      .../import/path/testdata/fuzz/<fuzzfunc>/corpus
      /some/fuzzdir/directory
  write new inputs and crashes to:     
      /some/fuzzdir/directory

If '-fuzzdir' is set to special value 'testdata' (-fuzzdir=testdata)
  read from: 
      GOPATH/pkg/fuzz/corpus/import/path
      .../import/path/testdata/fuzz/<fuzzfunc>/corpus
  write new inputs and crashes to:     
      .../import/path/testdata/fuzz/<fuzzfunc>/corpus
      
If you run `go test -fuzz=. -fuzzdir=testdata -fuzzminimize` (minimizing, with an output of 'testdata')
  read inputs (and crashers) from: 
      GOPATH/pkg/fuzz/corpus/import/path
      .../import/path/testdata/fuzz/<fuzzfunc>/corpus
      /some/fuzzdir/directory
  write *minimized* inputs (and crashers) to:     
      .../import/path/testdata/fuzz/<fuzzfunc>/{corpus,crashers}
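
As a rough illustration, the first three (non-minimizing) cases above reduce to a single switch on the -fuzzdir value. This is only a sketch under stated assumptions: `Plan` and `plan` are hypothetical names, and the caller is assumed to supply GOPATH, the import path, and the package directory (the part written as "..." above).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Plan lists the corpus locations a run reads from and the single location
// it writes new inputs and crashes to.
type Plan struct {
	Read  []string
	Write string
}

// plan mirrors the three non-minimizing cases: -fuzzdir unset, -fuzzdir set
// to the special value "testdata", and -fuzzdir set to a directory path.
func plan(gopath, importPath, pkgDir, fuzzFunc, fuzzDir string) Plan {
	cache := filepath.Join(gopath, "pkg", "fuzz", "corpus", filepath.FromSlash(importPath))
	testdata := filepath.Join(pkgDir, "testdata", "fuzz", fuzzFunc, "corpus")

	// Principle 1: always read from every known corpus location.
	p := Plan{Read: []string{cache, testdata}}

	// Principle 2: -fuzzdir only controls where writes go.
	switch fuzzDir {
	case "":
		p.Write = cache // principle 3: the GOPATH location acts as a cache
	case "testdata":
		p.Write = testdata
	default:
		p.Read = append(p.Read, fuzzDir)
		p.Write = fuzzDir
	}
	return p
}

func main() {
	for _, dir := range []string{"", "testdata", "/some/fuzzdir/directory"} {
		fmt.Printf("-fuzzdir=%q: %+v\n", dir, plan("/home/me/go", "example.com/mod/pkg", "/src/pkg", "FuzzParse", dir))
	}
}
```

The sketch only determines locations; whether a listed read location actually exists would still need to be checked at run time.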
      
Another dimension to think about is how people might transition from one usage pattern to another (especially if they start out doing something simple and then only later are willing to do something like create a separate corpus repo). If someone starts simple with their corpus going to GOPATH/pkg/fuzz/corpus, but then spends enough time fuzzing such that they later want to copy that corpus elsewhere, I would be slightly sad if that means they would need to manually hunt and fish around in a directory hierarchy under GOPATH/pkg/fuzz/corpus to copy to a different destination hierarchy of <pkgpath>/testdata directories (including they might have multiple packages, multiple fuzz functions, and their hierarchy under GOPATH/pkg/fuzz/corpus might have many corpus entries from other random projects, etc.).  However, it could be reasonable to minimize your corpus when moving it, which means the last behavior describing `-fuzzminimize` above would also serve in practice as a 'cp this stuff elsewhere' command. For example, if someone does `go test -fuzz=. -fuzzminimize -fuzzdir=testdata`, it places the results underneath the various testdata directories for their packages of interest, or alternatively they could do `go test -fuzz=. -fuzzminimize -fuzzdir=/path/to/a/new/corpus/repo` if they want to populate somewhere else.

One related consideration is the fact that under modules, source code defaults to *read only* for your dependencies, which have their read-only source code stored under GOPATH/pkg/mod. This is in sharp contrast to old GOPATH behavior.

That leads in to yet another dimension to slice across, which is who is doing the fuzzing -- is it one of the primary authors, vs. a new contributor looking to test a PR, vs. someone who might just be trying to do some drive-by fuzzing, etc. Defaulting to GOPATH/pkg/fuzz/corpus is friendlier to someone fuzzing someone else's package given that it will always be a writable location. (If instead the behavior was to default to placing new inputs under <pkgpath>/testdata, that would default to a read-only location for dependencies; that is a problem, and it could be solved by something like having a rule that it falls back to GOPATH/pkg/fuzz/corpus if the default <pkgpath>/testdata is read-only, but that seems more complicated than just having the default always be GOPATH/pkg/fuzz/corpus as outlined above).
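
For comparison, the fallback rule mentioned in the parenthetical (try <pkgpath>/testdata first, fall back to the GOPATH location if it is read-only) might look something like the sketch below. The helper names are hypothetical, and probing by creating a temporary file is just one possible way to detect a read-only directory.

```go
package main

import (
	"fmt"
	"os"
)

// writable reports whether a file can be created in dir. It probes by
// creating and immediately removing a temporary file.
func writable(dir string) bool {
	f, err := os.CreateTemp(dir, ".fuzz-probe-*")
	if err != nil {
		return false
	}
	name := f.Name()
	f.Close()
	os.Remove(name)
	return true
}

// chooseWriteDir returns preferred if it is writable and fallback otherwise.
// An empty preferred is rejected up front, since os.CreateTemp would treat
// it as the system temp directory.
func chooseWriteDir(preferred, fallback string) string {
	if preferred != "" && writable(preferred) {
		return preferred
	}
	return fallback
}

func main() {
	fmt.Println(chooseWriteDir("/no/such/dir/testdata", "/fallback/cache"))
	fmt.Println(chooseWriteDir(os.TempDir(), "/fallback/cache"))
}
```

Even in this small form, the rule makes the write location depend on filesystem state, which is the extra complexity that always defaulting to GOPATH/pkg/fuzz/corpus avoids.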

In any event, those are some rough thoughts on a possible approach.

There are a few permutations off of that basic approach (e.g., maybe support a comma separated list of directories for -fuzzdir, etc.).

And then of course there are completely different approaches that could be taken...

Regards,
thepudds

Romain Baugue

May 26, 2019, 7:46:07 AM
to golang-fuzzing-proposal
In your last example (testdata output + minimize), I think you've added the last input path wrongly as it's inconsistent with the other examples.
Now, I'm not sure I'm completely on board with this, especially reading from the testdata corpus if -fuzzdir is set to something else. It seems to force users into something that they don't want.

Thinking about it, we have two "conventional" paths:
- GOPATH/pkg, which seems indicated for caching, ie "throwable" inputs
- <pkg>/testdata, which seems indicated for promoted inputs

I would say that both should be the default when running the fuzzer, with new inputs being written to GOPATH/pkg. Then, -fuzzdir should allow specifying a comma-separated list of directories, with the first one being used for writing the new entries, like the GOPATH works (and maybe we should rename it to -fuzzpath, but that's a question for later).
This allows most setups to work:
- if you just want a quick fuzz to check you're good, the caching location is there for you
- you can specify multiple corpus locations to source from while generating new inputs in a controlled place
- promoting inputs is easy enough with a copy-paste (FTR I'm not a great fan of this because it may be error-prone but this can be solved other ways)
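
A minimal sketch of this variant, where the first entry of the list is the write location (similar to how the first GOPATH entry is used). The function name `resolve` is hypothetical, and the sketch assumes that an explicit list replaces the two default locations rather than adding to them, which is one reading of the objection above.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve returns the directory new inputs are written to and the full list
// of directories inputs are read from. With no -fuzzdir value, both the
// cache (GOPATH/pkg) and the testdata location are read, and new inputs go
// to the cache. With a comma-separated list, the first entry is the write
// location and every entry is read.
func resolve(fuzzDirs, cacheDir, testdataDir string) (write string, read []string) {
	for _, d := range strings.Split(fuzzDirs, ",") {
		if d = strings.TrimSpace(d); d != "" {
			read = append(read, d)
		}
	}
	if len(read) == 0 {
		return cacheDir, []string{cacheDir, testdataDir}
	}
	return read[0], read
}

func main() {
	w, r := resolve("", "/home/me/go/pkg/fuzz/corpus/p", "p/testdata/fuzz/FuzzParse/corpus")
	fmt.Println(w, r)
	w, r = resolve("/ci/new-inputs,/ci/seed-corpus", "/home/me/go/pkg/fuzz/corpus/p", "p/testdata/fuzz/FuzzParse/corpus")
	fmt.Println(w, r)
}
```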

Dmitry Vyukov

May 28, 2019, 11:29:01 AM
to thepudds, golang-fuzzing-proposal
On Tue, May 21, 2019 at 7:00 PM thepudds <thepud...@gmail.com> wrote:
>
> I think the questions around how > 1 corpus could work might be the largest open questions for the proposal at this point.
>
> Here are a few background thoughts, and then a more specific sketch of a one possible set of behaviors, which might do a reasonable job of making it simple for people starting out, while still supporting needs of something like CI, and allowing a (hopefully?) reasonable transition path forward if you start out using the simplest approach. (Sorry this is slightly long, but I didn't have time to make it shorter ;-)
>
> First, it might help to keep in mind slicing usage along a few different dimensions...
>
> For example:
>
> Project Sophistication / Size
> --------------------------------------------------
>
> 1. A 1-2 person open source project, or a 1-2 person enterprise project, or a student/hobbyist, etc. Often wants it to be as simple as possible, at least to start.
>
> 2. Medium size projects. Perhaps still wants to keep things simple and might not set up specific new infrastructure (e.g., might not want to go through the overhead of creating a separate repo), but might be more likely than a small project to do so or to use something that already exists.
>
> 3. Larger project (open source, start up, enterprise). Likely runs or uses other test or CI infrastructure, more willing to have a shell script to help with setup, more likely to have a make file or similar, etc.
>
> That is not a 100% representative split (including there is not a perfect correlation between project size and project sophistication), but it is helpful to keep in mind that some people want it simple (especially when starting out), and others are willing to tolerate a bit more complexity if it gives a better outcome.
>
> A separate dimension is how long an individual fuzzing run lasts. You could pick many different breakpoints for duration, but here is a sample stab from the perspective of how "interactive" it might feel for the developer:
>
> Timescale for a Single Fuzzing Run
> --------------------------------------------------
>
> 0. Want to use corpus as inputs to unit tests (but want it to be fast and deterministic, which currently means no actual fuzzing / generation of mutation-based inputs).
>
> 1. 5-120 seconds ("I'll do a quick fuzz run while I stare at it for 5 seconds", or "Let me grab a cup of coffee while it runs", etc.)
>
> 2. Multiple hours or days (e.g., over night, over the weekend)
>
> 3. Continuously (oss-fuzz, fuzzbuzz, ClusterFuzz, but also probably important for people starting out: "We spun up a VM, kicked off fuzzing, and check it periodically", or ~10-line bash script kicked off by something like Bamboo or Jenkins or some other non-fuzzing-specific tool that someone already has running).
>
> I think one interesting aspect to think about is how valuable is the resulting corpus after those different timescales.
>
> If you want to optimize *solely* for *always* retaining even the tiniest CPU timeslice spent on fuzzing, that might lead to a design that requires the corpus to always be in VCS.

Thanks for a good summary.

I would say that we don't have to retain even the tiniest CPU
timeslice _especially_ if it makes other things harder.
But we don't seem to compromise it so far (or do we?).
A question: do we really need -fuzzdir in the first version?
I added it to the proposal a long time ago, and at that time the idea of
some kind of predefined location had not appeared yet, and an explicit
flag is how other fuzzers tend to work.
But we don't have to follow that just because other fuzzers do
it. No fuzzer has been integrated into a standard toolchain so far,
so we may need different solutions.

What is not possible without -fuzzdir?
A CI does need some glue for each type of fuzzer, the one that will
know that for go-fuzz the corpus is passed in -fuzzdir. And the same
glue could arrange the corpus in GOPATH/pkg/...

It seems that whatever we do, we need to make these locations
transparent for users and clearly documented so that
copying/moving/removing files between these is the expected workflow
in some cases (rather than considered hacking in implementation
details).

Thoughts?


> If '-fuzzdir' is set to special value 'testdata' (-fuzzdir=testdata)
> read from:
> GOPATH/pkg/fuzz/corpus/import/path
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus
> write new inputs and crashes to:
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus

If it's merely syntactic sugar, I would leave it aside for now.

> If you run `go test -fuzz=. -fuzzdir=testdata -fuzzminimze` (minimizing, with an output of 'testdata')
> read inputs (and crashers) from:
> GOPATH/pkg/fuzz/corpus/import/path
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus
> /some/fuzzdir/directory
> write *minimized* inputs (and crashers) to:
> .../import/path/testdata/fuzz/<fuzzfunc>/{corpus,crashers}


I am also not sure at this point if we need -fuzzminimze at all.
I think we need to minimize persistent corpus by default
(https://github.com/dvyukov/go-fuzz/issues/113). Then users/CIs don't
need to minimize manually and any kind of corpus merging can be done
simply by merging files in directories.
If there is no evident use case that does not work without manual
-fuzzminimze, I would leave it aside for now too.


One important aspect that we avoided so far is storing of crashing
inputs. We need to fit crashes into this picture too.
And to make things simpler I think we need to exclude what go-fuzz
currently stores in "suppressions" dir, because that can be extracted
from existing crashing inputs on start.
This leaves us with:
- corpus (normal inputs)
- crashing inputs
- output on the crashing inputs

Some assorted thoughts:
- there was a proposal to store crashing inputs right into testdata,
so that they are picked up by regression testing automatically and
relieve the user from copying files manually
- it's a nice idea
- but how does it play with read-only modules?
- but then where do we store output on the crashing inputs?
- it's reasonable to have it near the crashing inputs themselves
- but putting non-inputs into testdata/fuzz/corpus is not reasonable (?)
- we may not store output on the crashing inputs to make things simpler
- if the crashing inputs are stored into testdata, then user can
easily get the output by running go test
- but strictly speaking the crash may be flaky, so go test will
pass and then the question is what was the crash and what to do with
the input?
- it would probably be useful to have crash output in CI
context without running "go test" additionally (again, it may not
reproduce second time)
- corpus can be checked in into a separate repository by checking out
that repo inside of GOPATH/pkg (is this correct?)
- if corpus is checked in into the same repository, then a user could
copy everything from GOPATH/pkg into testdata after prolonged runs
- but if the corpus is checked in into the same repository, then do
we want to separate large number of normal inputs from inputs that
previously caused crashes?
- if we mix normal inputs and crashers, then there is no clear
migration path to separating them later (one will either be stuck with
huge checked-in corpus or will have to lose all regression tests for
previous crashes)
- we could have: testdata/fuzz/<fuzzfunc>/crashes and
testdata/fuzz/<fuzzfunc>/corpus
where the first one stores crashing inputs _and_ output for each input
while the second potentially stores the normal corpus
- but this leads us to 3 corpus locations, which looks like further
complicating things
- also if we store crash output files under testdata/, are users
supposed/want to check them in too?
- there is probably not much value in them? or not? over time line
numbers and everything will get out of sync anyway
- and now 'git add -u' won't do the right thing and a user will need
to carefully select and delete/ignore the output files
- but storing crashing input in testdata/ but the corresponding
output in GOPATH/pkg looks exceedingly weird
- also a CI will most likely need only _new_ crashers (along with
their output) so if we mix new crashers into testdata/ which already
contains old, fixed crashers, that's confusing, what a CI is supposed
to do? rm -rf current testdata after checking out the repo? that's
weird to ask for this....


I feel like I complicated things again. Or am I missing some obvious
solutions? What do we do with output files and crashing inputs?

Dmitry Vyukov

May 28, 2019, 11:35:13 AM5/28/19
to thepudds, golang-fuzzing-proposal
It seems that giving up on the idea of storing crashing inputs in
testdata avoids lots of these problems.
Namely, testdata/fuzz/<fuzzfunc>/corpus contains the "fixed" corpus
(small number of either manually created inputs or previous crashing
inputs). User will need to manually copy crashing inputs from
GOPATH/pkg into this location. We can even strip /corpus from the
path, because there is nothing else.
And then we have:
GOPATH/pkg/fuzz/import/path/corpus
GOPATH/pkg/fuzz/import/path/crashers
The first contains "dynamic" corpus files. The second new crashing
inputs and the corresponding output.
The rest is on user: they can move files as necessary and as they see fit.

Romain Baugue

May 28, 2019, 11:48:03 AM5/28/19
to 'Dmitry Vyukov' via golang-fuzzing-proposal, Dmitry Vyukov, thepudds
Fine by me. The only thing that bugs me is that I always considered the
`$GOPATH/pkg` directory to be some kind of blackbox I shouldn't touch.
Storing things in there and expecting the user to copy-paste by hand
feels somewhat wrong, although we can find solutions to this later.


Dmitry Vyukov

May 28, 2019, 12:09:47 PM5/28/19
to Romain Baugue, 'Dmitry Vyukov' via golang-fuzzing-proposal, thepudds
We could try to squat $GOPATH/fuzz, but I dunno what will be reaction
of Go team...

thepudds

Aug 8, 2019, 10:06:06 AM8/8/19
to golang-fuzzing-proposal
Hello everyone,

This is debatable, but I think at least from my personal perspective -- the biggest open question for the proposal for "make fuzzing a first class citizen" remains what is the approach for > 1 corpus location, including what are the allowed locations, how are they specified, and what is the default.

Here is one more variation to consider as proposed behavior for those questions.

First, there would be three allowed destinations for writing new corpus entries:
 
     A. -fuzzdir=/some/path  ==>   some user specified dir.  (perhaps outside of VCS, perhaps later stored as a tar.gz in blob storage, perhaps a separate repo)
 
     B. -fuzzdir=testdata    ==>   <pkg-path>/testdata/fuzz/<func>  (therefore, in VCS along with the code under test if the code under test is in VCS)
 
     C. If neither of those are specified, default to storing under GOPATH/pkg/fuzz/corpus/...  (which means you don't end up with a dirty VCS status by default)

(Side comment: for people starting out with fuzzing who are using their own infrastructure, I would guess those are (arguably) listed in order of "most work" to "least work" to set up and start using. It's not an insurmountable project to set up, for example, some network storage to store your corpus, but it's more work than not doing that).

Second, there are three simple rules for what corpus locations are read, and what gets written:
 
     1. always read from all known sources that exist. 
 
     2. always write to the user's requested destination (and only write there).
 
     3. if the destination does not already exist, seed it with anything found in the corresponding corpus under GOPATH/pkg/fuzz/corpus (which is usually local to your machine and not shared with anyone yet).

That description above captures the heart of the proposed behavior.
 
For 1., testdata is always a known location (which might or might not exist with a corpus), GOPATH/pkg/fuzz/corpus is always a known location (which might or might not have a corpus), and -fuzzdir=/some/path is only known if the user specifies it.

For 2., the write destination is based entirely on the setting of -fuzzdir (or if -fuzzdir is not set, the default destination).

For 3., most people could ignore this rule most of the time, but the main intent is to help someone who is transitioning from the easy-to-start-but-not-shared-with-anyone GOPATH/pkg/fuzz/corpus location to "Hey, this seems useful, let me put what I've found somewhere more persistent and shareable", or "Hey, I've now invested enough days with ad-hoc local fuzzing for packages X, Y, and Z that I don't want to lose those CPU hours", etc.
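
As a rough illustration of rules 1 and 2, here is a minimal Go sketch of how the read set and the write destination could be derived from the -fuzzdir setting. The function name `resolve`, the `Locations` type, and the exact directory layout (in particular, placing the fuzz function name under the import path) are assumptions made for illustration, not the actual fzgo or cmd/go behavior. Rule 3 (seeding) and the "does it exist" checks are left out.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Locations describes where a fuzz run reads and writes corpus files.
type Locations struct {
	Read  []string // candidate read directories; a real tool would skip missing ones
	Write string   // the single write destination
}

// resolve applies rules 1 and 2. The layout below is an assumption:
//   testdata: <pkgDir>/testdata/fuzz/<fuzzFunc>
//   cache:    <gopath>/pkg/fuzz/corpus/<importPath>/<fuzzFunc>
//   custom:   <fuzzdir>/<importPath>/<fuzzFunc>
// fuzzdir is the value of the proposed -fuzzdir flag ("" if unset).
func resolve(gopath, pkgDir, importPath, fuzzFunc, fuzzdir string) Locations {
	testdata := filepath.Join(pkgDir, "testdata", "fuzz", fuzzFunc)
	cache := filepath.Join(gopath, "pkg", "fuzz", "corpus",
		filepath.FromSlash(importPath), fuzzFunc)

	// Rule 1: testdata and the GOPATH cache are always known locations.
	loc := Locations{Read: []string{testdata, cache}}

	// Rule 2: write only to the requested destination.
	switch fuzzdir {
	case "":
		loc.Write = cache // option C, the default
	case "testdata":
		loc.Write = testdata // option B
	default:
		custom := filepath.Join(fuzzdir, filepath.FromSlash(importPath), fuzzFunc)
		loc.Read = append(loc.Read, custom) // only known when the user supplies it
		loc.Write = custom                  // option A
	}
	return loc
}

func main() {
	loc := resolve("/home/me/go", "/home/me/src/example.com/pkg",
		"example.com/pkg", "FuzzFoo", "")
	fmt.Println("read: ", loc.Read)
	fmt.Println("write:", loc.Write)
}
```

Keeping the decision in one pure function like this would also make it easy to print the resolved locations at startup.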

The behavior above is a simplified variation on what I had suggested in this prior post in this thread. (That post also includes a longer write-up on some use cases and personas to think about for this behavior).

Also, FWIW, the fzgo prototype has been updated to that behavior* described above, and so far using it has made me much happier than the prior behavior (which previously tracked the original March 2017 proposal, previously only used a single location, and previously defaulted to <pkg-path>/testdata/...).

The things that make me happier using the updated fzgo behavior:
 * not needing to delete <pkg-path>/testdata routinely (or otherwise routinely dirty the state of the code under test).
 * not needing to supply a non-default location most of the time when I'm doing something short-ish.
 * not needing to hunt around for directory locations or copy files around manually when I want to go from "ad-hoc fuzzing" to save things somewhere more persistent.

I don't know if that is perfect behavior, but it at least attempts to balance the needs of sophisticated users vs. people just starting out, with some  mechanism to make transitioning easier, while aiming to be not crazy complicated behavior... It could be simplified further, or there are of course many other possible alternatives people could suggest, but wanted to at least send something concrete for consideration.

Avoiding manually copying files around when you want to seed a more persistent corpus location is, I think, an important one, especially because the directory hierarchies are not identical across destinations (and more so with Go Modules in the future as people move outside GOPATH for their code locations), so it's not just a simple 'cp' command. Also, the annoyance compounds with support for fuzzing multiple packages at once, and hopefully we will transition to a world where people have many fuzz functions that are (hopefully eventually) much easier to create once rich signatures are supported.

Three final points that are perhaps less critical:

There is an '*' on the current fzgo behavior -- fzgo currently has a close-ish approximation of the behavior above. That said, it is a small-ish delta between the current fzgo behavior vs. the behavior proposed above, and the details of the delta are probably not immediately interesting in the context of the discussion above.

Right now, GOPATH/pkg/fuzz/corpus/... and -fuzzdir=/some/path both create an import-path based directory hierarchy under those top locations. The intent is to make it easy to have many packages be stored without collisions via a single setting of the user-supplied -fuzzdir. That could be changed, though, and is not an essential part of what is proposed above.

Finally, there is an additional detail about what to do with crashers. Right now, crashers go to the same write location as the corpus. I think that is reasonable behavior, and could be the final proposed behavior. That said, if the major questions about the > 1 corpus behavior get settled, there might be an opportunity to do something slightly smarter with crashers (some of which we've discussed here previously), but I think that would be a small "bonus" behavior that might or might not work out, so I suspect additional alternatives for crasher behavior could be evaluated later once (or if) there is some consensus on some of the bigger questions first. Maybe?

Regards,
thepudds


thepudds

Aug 8, 2019, 10:20:38 AM8/8/19
to golang-fuzzing-proposal
Also, given this is now a long thread, I'll repeat that pretty much everything I've suggested for >1 corpus behavior in this thread is a variation of what Dmitry originally suggested here:


(And there is also a summary of some other earlier conversations on this topic here).

Regards,
thepudds

t hepudds

Aug 21, 2019, 10:14:33 AM8/21/19
to golang-fuzzing-proposal
Hello there Dmitry,

My understanding is that roughly 3 or 5 posts ago in this thread, you were thinking through the implications of the following, among other things:

  1. storing crashers in testdata automatically.

  2. removing support for a user supplied -fuzzdir, with one approach being testdata/fuzz/<func> is always read but never written to, and GOPATH/pkg/fuzz/<import-path>/corpus is where new corpus files go (with a .../crashers location there as well), and then everything else is on the user in terms of file management -- they can move files as necessary and as they see fit.

For 1., it sounded like your conclusion at that time might have been that it added more complexity than it was worth, which seems like a reasonable conclusion.

For 2., your thinking sounded more nuanced, but I'm definitely curious what your latest thinking there is. While it would reduce the number of knobs in cmd/go to drop -fuzzdir, I wonder if doing so might increase the overall end-to-end complexity from the user perspective.

What you outlined is a convenient approach for example for someone just starting out and not really caring where the corpus is, but before too long it seems a healthy percentage of people will want to store their corpus somewhere more shareable and with better persistence than GOPATH/pkg/fuzz. That seems like a common enough desire that some type of direct support for storing the large corpus outside of GOPATH/pkg/fuzz seems like it would make sense in cmd/go. Fuzzing-aware CI and things like oss-fuzz could likely adapt to whatever structure is supported, but I think there is still an important swath of users who won't be using something like that (e.g., enterprise users behind firewalls, or people just starting out, etc.)

One other thing to think about is some people might easily have 10s or 100s of fuzz functions spread across many different packages. (One related factor is I suspect the frequency of having multiple fuzz functions per package likely would increase with support for rich signatures, or at least rich signature support means it will become easier to create a series of light wrappers for a series of functions). The directory structures proposed are such that it is not a simple 'cp' command in that case to move the files around (for example if you want to take an initial batch of fuzzing results and move them underneath testdata for multiple packages), and that is even more true with modules in different locations outside of GOPATH.
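
To make the "not a simple 'cp'" point concrete, here is a small Go sketch of the path mapping involved when moving cached inputs under testdata for several packages. It assumes a cache layout of <import/path>/<FuzzFunc>/<file> relative to the cache root, which is an assumption for illustration only. The names `splitCachePath` and `testdataTarget` are hypothetical, and the import-path-to-directory map is supplied by the caller (it could be built, for example, from `go list -f '{{.Dir}}'`).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitCachePath splits a path relative to the corpus cache root, assumed to
// have the form <import/path>/<FuzzFunc>/<file>, into its three parts.
func splitCachePath(rel string) (importPath, fuzzFunc, file string, ok bool) {
	parts := strings.Split(filepath.ToSlash(rel), "/")
	n := len(parts)
	if n < 3 {
		return "", "", "", false
	}
	return strings.Join(parts[:n-2], "/"), parts[n-2], parts[n-1], true
}

// testdataTarget returns where a cached input would live under the package's
// testdata directory. pkgDirs maps import paths to package directories, which
// with modules can be anywhere on disk, not just under GOPATH/src.
func testdataTarget(rel string, pkgDirs map[string]string) (string, bool) {
	importPath, fuzzFunc, file, ok := splitCachePath(rel)
	if !ok {
		return "", false
	}
	dir, found := pkgDirs[importPath]
	if !found {
		return "", false
	}
	return filepath.Join(dir, "testdata", "fuzz", fuzzFunc, file), true
}

func main() {
	pkgDirs := map[string]string{"example.com/a/b": "/work/checkout/b"}
	dst, ok := testdataTarget("example.com/a/b/FuzzFoo/abc123", pkgDirs)
	fmt.Println(dst, ok)
}
```

The per-package lookup is the part a plain recursive copy cannot do, since the source tree is keyed by import path while the destination depends on where each package is checked out.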

In any event, my guess is that not having any knob for corpus location might not make the experience friendly enough, and sometimes make fuzzing feel like more of a chore than it needs to be... but of course, I could easily be wrong about that guess.  ;-)

One other minor side note is that storing a piece of the corpus under GOPATH/pkg/fuzz probably means there would need to be a 'go clean -fuzzcache' (or something like that) to parallel things like 'go clean -modcache' that clears the module cache under GOPATH/pkg/mod for modules. (I don't think that's a problem for the fuzzing proposal, but I forget if anyone has mentioned it yet, so thought I'd mention it while on the topic).

Finally, I happen to address this to Dmitry given I am curious what his current thinking might be compared to what he wrote 3-5 posts ago in this thread, but of course anyone else should feel free to chime in as well.

Regards,
thepudds


Dmitry Vyukov

Sep 16, 2019, 1:29:00 AM9/16/19
to t hepudds, golang-fuzzing-proposal
Hi,

First of all thanks for persistence, patience and for putting these
emails together.

Reading corpus from all known sources sounds good to me.
The only scenario I can imagine where this is not desirable is researchers
doing A/B comparisons on fuzzers, but that's special enough and we can assume
these people are knowledgeable enough and will shuffle files as necessary.

Writing to the user provided -fuzzdir sounds good too. I assume this implies
writing to GOPATH if -fuzzdir is not specified, right?

Re -fuzzdir=testdata, did you use it a lot? If not, I would prefer to put
it aside for now. It looks like something that can be added later if needed.
Or maybe we will figure out some other frequent pattern based on
actual experience.

Having the same directory structure in GOPATH and -fuzzdir sounds reasonable.

Re copying corpus and automatic population of empty corpus dir.
Why do you say that directory structure is different and it's not simply a cp
invocation? Isn't the directory structure for GOPATH and a custom -fuzzdir the same?
Doing automatic population looks a bit like doing too much magic under the hood.

One minor but important part of all this may be to simply dump all locations
to console when go-fuzz starts. E.g.:

reading corpus from GOPATH/src/mypkg/testdata/fuzz/FuzzFoo (5 inputs)
reading/writing corpus from GOPATH/pkg/fuzz/mypkg/FuzzFoo (123 inputs)

This will give transparency and make it clear what needs to be copied
where, etc.

I think we need to drop suppressions and .quoted files in crash dir.
Suppressions can be (and should be) inferred from crashing inputs.
.quoted was an ad-hoc helper to simplify creation of reproducers,
but we figured out a better way of doing this - storing them in testdata.
Re crash location. Where does fzgo store them now? I can't converge on the
single best option for this. Does storing crashes in testdata/ work
well in practice?
If so, I guess we could try it.