On Tue, May 21, 2019 at 7:00 PM thepudds <
thepud...@gmail.com> wrote:
>
> I think the questions around how > 1 corpus could work might be the largest open questions for the proposal at this point.
>
> Here are a few background thoughts, and then a more specific sketch of one possible set of behaviors, which might do a reasonable job of making it simple for people starting out, while still supporting the needs of something like CI, and allowing a (hopefully?) reasonable transition path forward if you start out using the simplest approach. (Sorry this is slightly long, but I didn't have time to make it shorter ;-)
>
> First, it might help to keep in mind slicing usage along a few different dimensions...
>
> For example:
>
> Project Sophistication / Size
> --------------------------------------------------
>
> 1. A 1-2 person open source project, or a 1-2 person enterprise project, or a student/hobbyist, etc. Often wants it to be as simple as possible, at least to start.
>
> 2. Medium size projects. Perhaps still wants to keep things simple and might not set up specific new infrastructure (e.g., might not want to go through the overhead of creating a separate repo), but might be more likely than a small project to do so or to use something that already exists.
>
> 3. Larger project (open source, start up, enterprise). Likely runs or uses other test or CI infrastructure, more willing to have a shell script to help with setup, more likely to have a make file or similar, etc.
>
> That is not a 100% representative split (including there is not a perfect correlation between project size and project sophistication), but it is helpful to keep in mind that some people want it simple (especially when starting out), and others are willing to tolerate a bit more complexity if it gives a better outcome.
>
> A separate dimension is how long an individual fuzzing run lasts. You could pick many different breakpoints for duration, but here is a sample stab from the perspective of how "interactive" it might feel for the developer:
>
> Timescale for a Single Fuzzing Run
> --------------------------------------------------
>
> 0. Want to use corpus as inputs to unit tests (but want it to be fast and deterministic, which currently means no actual fuzzing / generation of mutation-based inputs).
>
> 1. 5-120 seconds ("I'll do quick fuzz run while I stare at it for 5 seconds", or "Let me grab a cup of coffee while it runs", etc.)
>
> 2. Multiple hours or days (e.g., over night, over the weekend)
>
> 3. Continuously (oss-fuzz, fuzzbuzz, ClusterFuzz, but also probably important for people starting out: "We spun up a VM, kicked off fuzzing, and check it periodically", or ~10-line bash script kicked off by something like Bamboo or Jenkins or some other non-fuzzing-specific tool that someone already has running).
>
> I think one interesting aspect to think about is how valuable is the resulting corpus after those different timescales.
>
> If you want to optimize *solely* for *always* retaining even the tiniest CPU timeslice spent on fuzzing, that might lead to a design that requires the corpus to always be in VCS.
Thanks for a good summary.
I would say that we don't have to retain even the tiniest CPU
timeslice _especially_ if it makes other things harder.
But we don't seem to compromise it so far (or do we?).
A question: do we really need -fuzzdir in the first version?
I added it to the proposal a long time ago; at that time the idea of
some kind of predefined location had not come up yet, and an explicit
flag is how other fuzzers tend to work.
But we don't have to follow that just because other fuzzers do it
that way: no fuzzer has been integrated into a standard toolchain so
far, so we may need different solutions.
What is not possible without -fuzzdir?
A CI needs some glue for each type of fuzzer anyway: the glue that
knows that for go-fuzz the corpus is passed via -fuzzdir. The same
glue could arrange the corpus in GOPATH/pkg/...
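To make that concrete, such glue could look roughly like the sketch
below. The cache path under GOPATH/pkg, the import path, and the
commented-out flag names are all illustrative assumptions, not
settled parts of the proposal:

```shell
#!/bin/sh
# Hypothetical CI glue for a go-fuzz-style fuzzer: seed a run from a
# persistent corpus cache, fuzz for a while, then persist new inputs.
set -eu

GOPATH="${GOPATH:-$HOME/go}"
IMPORT_PATH="example.com/mypkg"                 # assumed package under test
CACHE="$GOPATH/pkg/fuzz/corpus/$IMPORT_PATH"    # persistent corpus cache
WORK="$(mktemp -d)"                             # per-run scratch corpus

mkdir -p "$CACHE"
cp -R "$CACHE/." "$WORK/"    # seed this run from the cache

# Bounded fuzzing run; the flag names here are placeholders:
# go test -fuzz=. -fuzzdir="$WORK" ./...

cp -R "$WORK/." "$CACHE/"    # persist any new inputs for the next run
rm -rf "$WORK"
```

The point being that this is ~10 lines of shell that any existing CI
system can run, with no fuzzing-specific infrastructure.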
It seems that whatever we do, we need to make these locations
transparent for users and clearly documented, so that
copying/moving/removing files between them is the expected workflow
in some cases (rather than considered hacking into implementation
details).
Thoughts?
> If '-fuzzdir' is set to special value 'testdata' (-fuzzdir=testdata)
> read from:
> GOPATH/pkg/fuzz/corpus/import/path
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus
> write new inputs and crashes to:
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus
If it's merely syntactic sugar, I would leave it aside for now.
> If you run `go test -fuzz=. -fuzzdir=testdata -fuzzminimize` (minimizing, with an output of 'testdata')
> read inputs (and crashers) from:
> GOPATH/pkg/fuzz/corpus/import/path
> .../import/path/testdata/fuzz/<fuzzfunc>/corpus
> /some/fuzzdir/directory
> write *minimized* inputs (and crashers) to:
> .../import/path/testdata/fuzz/<fuzzfunc>/{corpus,crashers}
I am also not sure at this point if we need -fuzzminimize at all.
I think we need to minimize the persistent corpus by default
(https://github.com/dvyukov/go-fuzz/issues/113). Then users/CIs don't
need to minimize manually, and any kind of corpus merging can be done
simply by merging files in directories.
If there is no evident use case that does not work without manual
-fuzzminimize, I would leave it aside for now too.
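FWIW, if inputs are stored one per file and named by a hash of their
content (as go-fuzz does today), "merging" really is just a no-clobber
copy. A rough sketch, with made-up directory names:

```shell
# Merge a corpus obtained elsewhere (e.g. from a CI run) into a local
# one. Assumes one input per file with filenames derived from content
# hashes, so identical inputs collide on the same name and duplicates
# are simply skipped.
SRC="corpus-from-ci"                    # hypothetical source corpus
DST="testdata/fuzz/FuzzParse/corpus"    # hypothetical destination

mkdir -p "$SRC" "$DST"
# cp -n: never overwrite, so existing (identical) inputs are left alone.
find "$SRC" -type f -exec cp -n {} "$DST"/ \;
```

With default minimization, that copy is the whole merge story; no
extra tooling or flags needed.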
One important aspect that we have avoided so far is the storing of
crashing inputs. We need to fit crashes into this picture too.
And to make things simpler, I think we should exclude what go-fuzz
currently stores in the "suppressions" dir, because that can be
extracted from the existing crashing inputs on start.
This leaves us with:
- corpus (normal inputs)
- crashing inputs
- output on the crashing inputs
Some assorted thoughts:
- there was a proposal to store crashing inputs right into testdata,
so that they are picked up by regression testing automatically and
the user is relieved from copying files manually
- it's a nice idea
- but how does it play with read-only modules?
- but then where do we store output on the crashing inputs?
- it's reasonable to have it near the crashing inputs themselves
- but putting non-inputs into testdata/fuzz/corpus is not reasonable (?)
- we could skip storing the output of the crashing inputs, to make things simpler
- if the crashing inputs are stored in testdata, then the user can
easily get the output by running go test
- but strictly speaking the crash may be flaky, so go test will
pass, and then the question is what the crash was and what to do with
the input?
- it would probably be useful to have the crash output in a CI
context without running "go test" additionally (again, it may not
reproduce a second time)
- the corpus can be checked into a separate repository by checking out
that repo inside of GOPATH/pkg (is this correct?)
- if the corpus is checked into the same repository, then a user could
copy everything from GOPATH/pkg into testdata after prolonged runs
- but if the corpus is checked into the same repository, then do
we want to separate the large number of normal inputs from the inputs
that previously caused crashes?
- if we mix normal inputs and crashers, then there is no clear
migration path to separating them later (one will either be stuck with
a huge checked-in corpus or will have to lose all regression tests for
previous crashes)
- we could have: testdata/fuzz/<fuzzfunc>/crashes and
testdata/fuzz/<fuzzfunc>/corpus
where the first one stores crashing inputs _and_ output for each input
while the second potentially stores the normal corpus
- but this leads us to 3 corpus locations, which looks like further
complicating things
- also, if we store crash output files under testdata/, are users
supposed to (or do they want to) check them in too?
- there is probably not much value in them? or is there? over time,
line numbers and everything else will get out of sync anyway
- and now 'git add -u' won't do the right thing and a user will need
to carefully select and delete/ignore the output files
- but storing a crashing input in testdata/ while the corresponding
output goes to GOPATH/pkg looks exceedingly weird
- also, a CI will most likely need only the _new_ crashers (along with
their output), so if we mix new crashers into testdata/, which already
contains old, fixed crashers, that's confusing: what is a CI supposed
to do? rm -rf the current testdata after checking out the repo? that's
a weird thing to ask for....
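To make the testdata/fuzz/<fuzzfunc>/{corpus,crashes} split discussed
above concrete, it might look something like this (the function name
and hash-style filenames are purely illustrative):

```
testdata/fuzz/FuzzParse/
    corpus/               # normal inputs, one per file
        0a1b2c...         # filename derived from input contents
    crashes/
        8f9e0d...         # a crashing input
        8f9e0d....output  # captured crash output for that input
```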
I feel like I complicated things again. Or am I missing some obvious
solutions? What do we do with output files and crashing inputs?