| Estimations, Predictions, and Statistical Correctness Guarantees for AFL | Marcel Böhme | 05/08/17 03:14 | Hi all, Recently, someone posted a question to this forum: When does AFL end? Michal answered that it is quite difficult to say. Actually, it is not really that difficult (on average, in the long run, and once the central ideas are clear). My most recent research submission (41 journal pages, currently under peer review) is about estimating and predicting progress and providing statistical correctness guarantees for automated testing and analysis techniques. You can check out (and try out) my implementation on top of AFL (2.49b), called Pythia.
@Michal: If you decide to merge, you are welcome to do so. Oh, I should say that I did not test Pythia with more than one instance (-M / -S). Maybe this needs to be accounted for in the implementation. However, again I expect that estimator performance improves with time (i.e., #cycles). I'm pasting the rendered Readme.md below. Hoping it retains the formatting.

american fuzzy lop 2.49b (magic_fuzzer)

┌─ process timing ─────────────────────────────────────┬─ overall results ─────┐
│        run time : 0 days, 0 hrs, 2 min, 29 sec       │  cycles done : 0      │
│   last new path : 0 days, 0 hrs, 0 min, 0 sec        │current paths : 533    │
│ prediction info : 297 / 32 / 197k                    │  1 min paths : 652    │
│ last uniq crash : none seen yet                      │  total paths : 1911   │
│  last uniq hang : none seen yet                      │  correctness : 2e-03  │
│ uniq crash/hang : 0 / 0                              │   difficulty : 5e-02  │
└──────────────────────────────────────────────────────┴───────────────────────┘
Note: For some programs, the asymptotic total number of paths might be estimated several times higher than the current number of paths even after a few days. The estimate is based on the number of paths exercised exactly once (see the first number @ prediction info), which in such a case would be relatively high (e.g., 80% of paths are exercised exactly once). Consider this: even after such a long time, most of the discovered paths have been exercised no more than once. Quite certainly, there are many other paths that have not been exercised at all. If there is enough interest, I'll put up a technical report explaining the research behind this. Also, let me know if you want to build on this research and cite my paper :) The central ideas also work for libFuzzer, syzkaller, Peach, CSmith, and many other fuzzers (except for fuzzers based on symbolic execution) and for other program analyses (not only path coverage). Looking forward to feedback from the AFL community! Cheers! - Marcel PS: Our (unrelated) paper on "Directed Greybox Fuzzing" has been accepted at ACM CCS'17 (e.g., using AFL for patch testing or directing it towards "dangerous" program locations). Stay tuned! PPS: My research profile: https://www.comp.nus.edu.sg/~mboehme |
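The estimate described above (asymptotic total number of paths, derived from the number of paths exercised exactly once) matches the classic Chao1 species-richness estimator from biostatistics. A minimal sketch, assuming Pythia uses the standard Chao1 formula with f1 = paths exercised exactly once and f2 = paths exercised exactly twice (the first two numbers in "prediction info"); the function and variable names here are illustrative, not Pythia's:

```python
def chao1(s_obs, f1, f2):
    """Chao1 estimate of the asymptotic total number of paths.

    s_obs -- number of distinct paths discovered so far
    f1    -- paths exercised exactly once ("singletons")
    f2    -- paths exercised exactly twice ("doubletons")
    """
    if f2 > 0:
        return s_obs + (f1 * f1) / (2.0 * f2)
    # Bias-corrected variant for the edge case f2 == 0.
    return s_obs + (f1 * (f1 - 1)) / 2.0
```

Plugging in the numbers from the status screen above (533 current paths, prediction info 297 / 32 / ...), chao1(533, 297, 32) ≈ 1911, which lines up with the displayed "total paths : 1911".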
| Re: [afl-users] Estimations, Predictions, and Statistical Correctness Guarantees for AFL | Michal Zalewski | 06/08/17 07:43 | > Actually, it is not really that difficult (on average, in the long run, ...) This is pretty cool! I think that for the current versions, given that we have very limited screen real estate, I'd be hesitant to dedicate so much of it to predictions. But if it works, it seems worth incorporating into the next major redesign (in the works). Cheers, /mz |
| Re: [afl-users] Estimations, Predictions, and Statistical Correctness Guarantees for AFL | Marcel Böhme | 08/08/17 20:15 | Hi Michal, Path coverage might be most actionable since it tells you about the progress of the fuzzing campaign towards completion (i.e., towards approaching the asymptotic total number of paths). The rule of considering the fuzzing campaign as "completed" at about 98% estimated path coverage worked quite well for 10 runs of 4 subjects that I tested in OSS-Fuzz: * Wireshark (fuzzshark_media_type-json), * OpenSSL (server), * FFmpeg (AV_CODEC_ID_MPEG4_fuzzer), * libjpeg-turbo (libjpeg_turbo_fuzzer). Average number of discovered paths (S_obs) and estimated path coverage (Ĝ) over 10 runs at 6, 48, and 100 hours into the fuzzing campaign, respectively. Notice how OpenSSL makes steady progress even after 100 hours while Wireshark is pretty much done after 6 hours. More details in the technical report, available on request! Correctness might be interesting since fuzzing campaigns of the same length achieve different degrees of correctness for different subjects. Correctness provides the probability of discovering a new path for fuzzing campaigns of the same length (and thus gives an upper bound on the probability of discovering a unique crash [proof in technical report]). Difficulty (together with the asymptotic total number of paths) might be interesting only for improving the test drivers for a given program (e.g., for OSS-Fuzz). Estimates for correctness and difficulty are quite robust (modulo technical caveats such as overflows when casting u64 to int). Estimates of path coverage can be improved at the cost of a few CPU cycles. Hoping to hear from the AFL community: what is your experience? Cheers! - Marcel |
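The "correctness" statistic described in the post (the probability of discovering a new path) corresponds to the classic Good-Turing estimate from biostatistics: the fraction of the sample made up of singletons. A minimal sketch under that assumption; whether Pythia computes exactly this quantity, and whether the third number in "prediction info" is the total execution count, are guesses on my part:

```python
def good_turing(f1, n):
    """Good-Turing estimate of the probability that the next
    generated input exercises a previously unseen path.

    f1 -- number of paths exercised exactly once so far
    n  -- total number of inputs executed so far
    """
    return f1 / n
```

For example, good_turing(297, 197000) ≈ 1.5e-3, which is on the order of the "correctness : 2e-03" shown in the status screen earlier in the thread (if 197k is indeed the execution count).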
| Re: [afl-users] Estimations, Predictions, and Statistical Correctness Guarantees for AFL | Marcel Böhme | 04/04/18 03:40 | Accepted! Fuzzing is essentially about discovery, e.g., of bugs in software. Ecology is also about discovery, e.g., of bugs in a tropical rain forest. It turns out that we can use methodologies from ecological biostatistics to solve long-standing, fundamental problems in software testing. What does that mean in practice? The article provides the statistical tools for security researchers to conduct * residual risk analyses (e.g., "What is the probability that a vulnerability exists when none has been detected after 2 weeks of fuzzing?") and * cost-benefit analyses (e.g., "How much more coverage will be achieved (or how many more bugs will be found) if the fuzzer is run for 2 more weeks?"). For OSS-Fuzz (and other large-scale fuzzing infrastructures) this means that fuzzing campaigns could be scheduled more efficiently (saving resources or reducing the residual risk). A campaign can automatically be stopped when insufficient progress is made. There are programs where the fuzzer stops being effective (e.g., the increase in map coverage is insubstantial) after a few hours and other programs where the fuzzer remains effective for weeks on end. My article is called "Software Testing as Species Discovery" and has been accepted in the ACM Transactions on Software Engineering and Methodology (TOSEM; 52 pages, journal-first, preprint: https://arxiv.org/abs/1803.02130). The article is a bit long, but by skimming the intro you should get the gist of it. I'm hoping that this solves some of your practical problems in fuzzing, such as: "When should I stop fuzzing?" or "Having invested so many resources, how confident can I be that no vulnerability exists?".
At the very least, it is a nice story about how finding new bugs in software is statistically the same as finding new bugs in nature (no matter how difficult you might think it is :) Let me know if there are any questions, feedback, concerns, or suggestions. Cheers! - Marcel |
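The residual risk question posed in the last post ("What is the probability that a vulnerability exists when none has been detected?") can be given some quick intuition with a textbook device, the statistical "rule of three": after n independent crash-free trials, an approximate 95% upper confidence bound on the per-input crash probability is 3/n. This is only a classical sketch for illustration, not the estimator developed in the article:

```python
def rule_of_three(n_inputs):
    """Approximate 95% upper confidence bound on the per-input
    crash probability after n_inputs executions with no crash
    observed (classical "rule of three")."""
    return 3.0 / n_inputs

# After one million crash-free executions, the per-input crash
# probability is below roughly 3e-06 with about 95% confidence.
```

The bound shrinks only linearly in the number of executions, which is one way to see why "how long should I keep fuzzing?" is a genuine resource-allocation question rather than a formality.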