Estimations, Predictions, and Statistical Correctness Guarantees for AFL


Marcel Böhme

Aug 5, 2017, 6:14:42 AM
to afl-users
Hi all,

Recently, someone posted a question to this forum: When does AFL end? Michal answered that it is quite difficult to say.

Actually, it is not really that difficult (on average, in the long run, and once the central ideas are clear). My most recent research submission (41 journal pages, currently under peer review) has been on estimating and predicting progress and providing statistical correctness guarantees for automated testing and analysis techniques.

You can check out (and try out) my implementation into AFL (2.49b), called Pythia.
$ git clone https://github.com/mboehme/pythia.git
  • My extension (Pythia) allows you to obtain statistical correctness guarantees for fuzzing campaigns (see correctness). For fuzzing campaigns of the same program, fuzzed for the same time using the same seeds, Pythia shows an upper bound on the probability of discovering a crashing input when no crashes have been found.
  • My extension (Pythia) allows you to quantify just how "difficult" it is to fuzz a program (see difficulty). This is interesting if you want to improve the fuzzability of your test drivers (e.g., in OSS-Fuzz), or if you want to gauge how long it might take to fuzz that program to achieve decent progress / correctness guarantees.
  • My extension (Pythia) allows you to determine the progress of the fuzzing campaign towards completion (i.e., path coverage = current paths / total paths). Once about 99% of paths have been discovered, you can abort the fuzzing campaign without expecting too many new discoveries. Definitely more reliable than setting AFL_EXIT_WHEN_DONE.
  • Of course, you can play around with the UI presentation. I could easily add the expected time until the next path / unique crash is discovered, or other prediction intervals. I could also predict the map coverage expected in the future; I wasn't sure whether that would be interesting. I could show path coverage as a percentage, provide 95% confidence intervals, and "smooth" the predictions and estimates using a moving average to reduce the swings.
  • Estimation and prediction are quite efficient and scale easily! At the beginning of a fuzzing campaign the estimates and predictions might seem all over the place. However, in the accompanying research article, I show (empirically and theoretically) that the estimator bias reduces and the precision increases as the fuzzing campaign continues. You should try it out and see for yourself :)
  • Also quite useful for implementing smart scheduling of fuzzing campaigns, where each campaign runs long enough to make sufficient progress towards completion but not so long that it wastes resources. See, some programs are "done" in a few hours while others show good progress for days on end -- Pythia finds that out for you :).
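For the statistically minded: the correctness bound above can be sketched with the classical Good-Turing estimator from biostatistics (the exact estimators are derived in the research article; the function name and the example counts here are mine, for illustration only):

```python
def residual_risk(f1, n):
    """Good-Turing estimate of the probability that the next generated
    input exercises a previously unseen path. If no crash has been found,
    a crashing input must lie on an undiscovered path, so this value also
    upper-bounds the probability of discovering a unique crash.

    f1: number of paths exercised exactly once (singletons)
    n:  number of inputs generated so far
    """
    return f1 / n

# Hypothetical counts: 297 singleton paths after 197,000 generated inputs.
print(residual_risk(297, 197_000))  # ~1.5e-03
```

The intuition: if many discovered paths have been hit only once, the campaign is still regularly stumbling on rare behavior, so the chance that yet-unseen behavior remains is high.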
@Michal: If you decide to merge, you are welcome to do so.

Oh, I should say that I did not test Pythia when more than one instance is running (-M / -S). Maybe this needs to be accounted for in the implementation. However, again I expect that estimator performance improves with time (i.e., #cycles).

I'm pasting the rendered Readme.md below. Hoping it retains the formatting.

                   american fuzzy lop 2.49b (magic_fuzzer)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│        run time : 0 days, 0 hrs, 2 min, 29 sec       │  cycles done : 0      │
│   last new path : 0 days, 0 hrs, 0 min, 0 sec        │current paths : 533    │
│ prediction info : 297 / 32 / 197k                    │  1 min paths : 652    │
│ last uniq crash : none seen yet                      │  total paths : 1911   │
│  last uniq hang : none seen yet                      │  correctness : 2e-03  │
│ uniq crash/hang : 0 / 0                              │   difficulty : 5e-02  │
  • current paths: shows the current number of paths discovered.
  • 1 min paths: shows the number of paths predicted to be discovered within the next minute (predictions for 10 min, 1 hour, 10 hrs, and 1 day are possible as well). However, the prediction does not anticipate sudden increases in the number of paths discovered; it expects path discovery to keep decelerating at the same rate (approaching an asymptote). The prediction quickly adjusts once a sudden increase has happened.
  • total paths: shows the estimated asymptotic total number of paths. In the beginning, when many paths are being discovered, the estimates are quite strongly biased. Later, when an asymptote is discernible, the estimate becomes slightly negatively biased. However, the magnitude of the bias continues to reduce as the fuzzing campaign progresses.
  • prediction info: shows the number of paths exercised exactly once / exactly twice / #inputs. All estimates and predictions are computed from these numbers.
  • correctness: shows an upper bound on the probability of exposing a unique crash if no crashes have been found. The number is fairly representative of fuzzing campaigns of similar length, assuming AFL is started with the same seeds, parameters, and test driver.
  • difficulty: shows how "difficult" it is to discover new paths in that program. Difficulty is a program property where programs with difficulty 0 are extremely difficult to fuzz while programs with difficulty 1 are extremely easy to fuzz.

Note: For some programs, the asymptotic total number of paths might be estimated several times higher than the current number of paths even after few days. The estimate is based on the number of paths exercised exactly once (see first number @ prediction info), which in such a case would be relatively high (e.g., 80% of paths are exercised exactly once). Consider this: Even after such a long time most of the discovered paths have still been exercised not more than once. Quite certainly, there are many other paths that have not been exercised at all.
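The behavior described in the note is exactly that of the classical Chao1 species-richness estimator from biostatistics, which computes the asymptotic total from the observed count plus the singleton and doubleton counts. Assuming Pythia uses a Chao1-style estimate (my inference; the numbers on the screen above are consistent with it), the "total paths" field can be reproduced as:

```python
def chao1(s_obs, f1, f2):
    """Chao1 lower bound on the asymptotic total number of species
    (here: paths), given s_obs observed paths, f1 paths exercised
    exactly once, and f2 paths exercised exactly twice."""
    return s_obs + f1 * f1 / (2 * f2)

# Plugging in the screen above: 533 current paths, prediction info 297 / 32 / 197k.
print(int(chao1(533, 297, 32)))  # 1911 -- the "total paths" field
```

Note how a large singleton count f1 inflates the estimate: if most paths have been seen only once, the undetected remainder f1^2/(2*f2) dominates, which is the scenario the note describes.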

If there is enough interest, I'll put up a technical report explaining the research behind this. Also, let me know if you want to build on this research and cite my paper :) 
The central ideas also work for libFuzzer, syzkaller, Peach, CSmith, and many other fuzzers (except for fuzzers based on symbolic execution), and for other program analyses (not only path coverage).

Looking forward to the feedback from the AFL community!

Cheers!
- Marcel

PS: Our (unrelated) paper on "Directed Greybox Fuzzing" has been accepted at ACM CCS'17 (e.g., use AFL for patch testing or directed towards "dangerous" program locations). Stay tuned!

PPS: My research profile: https://www.comp.nus.edu.sg/~mboehme

Michal Zalewski

Aug 6, 2017, 10:43:43 AM
to afl-users
> Actually, it is not really that difficult (on the average, in the long run,
> and once the central ideas are clear). My most recent research submission
> (41 journal pages currently under peer review) has been on estimating and
> predicting progress and providing statistical correctness guarantees for
> automated testing and analysis techniques.

This is pretty cool! I think that for the current versions, given that
we have very limited screen estate, I'd be hesitant to dedicate so
much of it to predictions. But if it works, it seems worth
incorporating for the next major redesign (in the works).

Cheers,
/mz

Marcel Böhme

Aug 8, 2017, 11:15:04 PM
to afl-users
Hi Michal,

Great! In that case, correctness and path coverage might provide the most interesting pieces of information.

Path coverage might be most actionable since it tells you about the progress of the fuzzing campaign towards completion (i.e., towards approaching the asymptotic total number of paths). The rule of considering the fuzzing campaign as "completed" at about 98% worked quite well for 10 runs of 4 subjects that I tested in OSS-Fuzz: 
* Wireshark (fuzzshark_media_type-json)
* OpenSSL (server)
* FFMPEG (AV_CODEC_ID_MPEG4_fuzzer)
* Libjpeg-turbo (libjpeg_turbo_fuzzer)


[Table: average number of discovered paths (S_obs) and estimated path coverage (Ĝ) over 10 runs at 6, 48, and 100 hours into the fuzzing campaign, respectively.]
Notice how OpenSSL makes steady progress even after 100hrs while Wireshark is done pretty much after 6 hours. More details in the technical report, available on request!

Correctness might be interesting since fuzzing campaigns of the same length achieve different degrees of correctness for different subjects. Correctness provides the probability of discovering a new path for fuzzing campaigns of the same length (and thus gives an upper bound on the probability of discovering a unique crash [proof in technical report]).

Difficulty (together with the asymptotic total number of paths) might be interesting only to improve the test drivers for a given program (e.g., for OSS-Fuzz).

Estimates for correctness and difficulty are quite robust (modulo technical caveats such as overflows when casting u64 to int). 
Estimates of path coverage can be improved at the cost of a few CPU cycles. 

Hoping to hear from the AFL community -- what is your experience?

Cheers!
- Marcel

Marcel Böhme

Apr 4, 2018, 6:40:54 AM
to afl-users
Accepted! 

Fuzzing is essentially about discovery, e.g., of bugs in software. Ecology is also about discovery, e.g., of bugs in a tropical rain forest. Turns out that we can use methodologies from ecological biostatistics to solve long-standing, fundamental problems in software testing.

What does that mean in practice? The article provides the statistical tools for security researchers to conduct 
* residual risk analyses (e.g., "What is the probability that a vulnerability exists when none has been detected after 2 weeks of fuzzing?") and 
* cost-benefit analyses (e.g., "How much more coverage will be achieved (bugs will be found, ..) if the fuzzer is run for 2 more weeks?").
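To sketch what such a cost-benefit analysis could look like, here is the standard species-accumulation extrapolation from biostatistics (a Chao-style formula; the article's exact predictor may differ, and the numbers below are only illustrative):

```python
def chao1(s_obs, f1, f2):
    """Chao1 estimate of the asymptotic total number of paths."""
    return s_obs + f1 * f1 / (2 * f2)

def extrapolate(s_obs, f1, f2, n, m):
    """Expected number of discovered paths after m additional inputs,
    given that n inputs so far discovered s_obs paths, of which f1 were
    exercised exactly once and f2 exactly twice."""
    f0_hat = f1 * f1 / (2 * f2)  # estimated number of undetected paths
    return s_obs + f0_hat * (1 - (1 - f1 / (n * f0_hat + f1)) ** m)

# Illustrative campaign: 533 paths after 197k inputs (297 singletons, 32 doubletons).
doubled = extrapolate(533, 297, 32, 197_000, 197_000)
print(round(doubled))              # paths expected if the budget is doubled
print(round(chao1(533, 297, 32)))  # asymptotic total for comparison
```

The gap between the two printed numbers is the benefit the remaining (open-ended) effort could still buy; when doubling the budget closes only a small fraction of it, stopping is the economical choice.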

For OSS-Fuzz (and other large-scale fuzzing infrastructures) this means that fuzzing campaigns could be scheduled more efficiently (saving resources or reducing the residual risk). A campaign can automatically be stopped when insufficient progress is made. There are programs where the fuzzer stops being effective (e.g., increase in map coverage is insubstantial) after a few hours and other programs where the fuzzer remains effective for weeks on end.

My article is called "Software Testing as Species Discovery" and has been accepted at the ACM Transactions on Software Engineering and Methodology (TOSEM; 52 pages, journal-first, preprint: https://arxiv.org/abs/1803.02130). The article is a bit long but skimming the intro, you should get the gist of it.

I'm hoping that this solves some of your practical problems in fuzzing, such as: "When should I stop fuzzing?" or "Having invested so many resources, how confident can I be that no vulnerability exists?". At the very least it is a nice story about how finding new bugs in software is statistically the same as finding new bugs in nature (no matter how difficult you might think it is :)

Let me know if there are any questions, feedback, concerns, or suggestions.

Cheers!
- Marcel