CCS'16 Paper on Improvements of AFL


boehme...@gmail.com

Aug 16, 2016, 9:31:10 PM
to afl-users

Dear all,


tl;dr: Check out AFL on caffeine (= awesome performance improvements) at https://github.com/mboehme/aflfast!


Our paper on improvements to AFL has just been accepted at the 23rd ACM Conference on Computer and Communications Security (CCS'16). You can find the paper here: https://www.comp.nus.edu.sg/~mboehme/paper/CCS16.pdf ("Coverage-based Greybox Fuzzing as Markov Chain"). You can find our modifications here: https://github.com/mboehme/aflfast. The paper makes an attempt to formalise the machinery that drives AFL and discusses challenges and opportunities (+ technical improvements). Kudos to Michal Zalewski for the awesome work on AFL, and to everybody else who has contributed! Without your work, ours would not be possible.


AFLFast is a fork of AFL that, in experiments with Binutils, outperformed AFL by an order of magnitude! It contributed to the success of Team Codejitsu at the finals of the DARPA Cyber Grand Challenge, where their bot Galactica took 2nd place in terms of #POVs proven (see the red bar at https://www.cybergrandchallenge.com/event#results; maybe it is even a tie for 1st place). In our experiments, AFLFast exposed several previously unreported CVEs that could not be exposed by AFL in 24 hours, and otherwise exposed vulnerabilities significantly faster than AFL while generating orders of magnitude more unique crashes. Our experiments ran on AFL 1.95b, but we ported AFLFast to the most recent version, AFL 2.30b. Check out AFLFast on GitHub and evaluate it yourself.


Essentially, we observed that most generated inputs exercise the same few “high-frequency” paths, and we developed strategies to gravitate towards low-frequency paths in order to stress significantly more program behavior in the same amount of time. We devised several search strategies that decide in which order the seeds should be fuzzed, and power schedules that smartly regulate the number of inputs generated from a seed (i.e., the time spent fuzzing a seed). We call the number of inputs generated from a seed the seed's energy.

We find that AFL's exploitation-based constant schedule assigns too much energy to seeds exercising high-frequency paths (e.g., paths that reject invalid inputs) and not enough energy to seeds exercising low-frequency paths (e.g., paths that stress interesting behaviors). Technically, we modified the computation of a seed's performance score (calculate_score), which seed is marked as favourite (update_bitmap_score), and which seed is chosen next from the circular queue (main).
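
To give a rough idea, here is a minimal sketch in C of such an exponential power schedule. The names and constants are illustrative and deliberately simplified; this is not the actual aflfast code, which also folds in AFL's original performance score:

#include <stdint.h>

/* Illustrative cap on the energy multiplier (not the aflfast constant). */
#define MAX_FACTOR 1024

/* times_fuzzed:   how often this seed has been picked from the queue so far.
   path_frequency: how many generated inputs exercised this seed's path.     */
uint32_t assign_energy(uint32_t base_score, uint32_t times_fuzzed,
                       uint64_t path_frequency) {

  /* Grant exponentially more energy each time the seed is revisited ... */
  uint64_t factor = 1ULL << (times_fuzzed < 10 ? times_fuzzed : 10);

  /* ... but throttle seeds whose path is exercised very frequently. */
  factor /= (path_frequency ? path_frequency : 1);

  if (factor == 0)         factor = 1;
  if (factor > MAX_FACTOR) factor = MAX_FACTOR;

  return base_score * (uint32_t)factor;
}

The intended effect: seeds on rarely exercised paths quickly accumulate energy, while seeds stuck on high-frequency paths are starved.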


More details on the power schedules can be found in the paper and here: https://github.com/mboehme/aflfast/blob/master/Readme.md


Happy to hear your thoughts!


Best regards,

- Marcel


Michal Zalewski

Aug 16, 2016, 9:39:16 PM
to afl-users
Thanks for sharing. Interesting optimization! I'll experiment with it
a bit and report back.

Skimming through the paper, it sounds like your implementation is more
aggressive about skipping deterministic steps:

"Secondly, AFL executes the deterministic stage the first time ti is
fuzzed. Since our power schedules assign significantly less energy for
the first stage, our extension executes the deterministic stage later
when the assigned energy is equal to the energy spent by deterministic
fuzzing."

Have you benchmarked this against afl-fuzz -d? The deterministic
stages are an intentional performance trade-off designed to produce
smaller diffs for easily discoverable issues, to simplify debugging at
the expense of discovering paths more slowly - so I wonder if some of
the speedup is attributable to this.
/mz

Marcel Böhme

Aug 16, 2016, 10:34:25 PM
to afl-users
The explore schedule is essentially "afl -d", but with only about 1% of the inputs generated from each seed compared to "afl -d". We find that explore generates more unique crashes than any other schedule for the first few hours. However, it quickly plateaus, and most other schedules start to outperform the explore schedule considerably.

In the beginning it is easy for explore to make progress. Later, energy is wasted on fuzzing seeds that exercise high-frequency paths rather than low-frequency paths.

Michal Zalewski

Aug 17, 2016, 12:12:02 AM
to afl-users
Nice. I'm running some experiments right now. I'm trying to do this
incrementally, so here's what my first tests are focusing on:

1) I'm looking specifically at the ability to resolve execution paths
(i.e., the "bitmap density" metric), rather than crash counts. This
makes it easier to experiment with a broader range of targets, and is
how I normally benchmark changes to AFL. In the past, I found this to
be a very good proxy for the overall quality of the mutation & queue
scoring algorithms.

2) In my initial runs, I'm trying to simply compare the traditional -d
mode to -d with your checksum frequency optimization. This is not to
reject the idea of skipping deterministic checks for less interesting
paths - but my reasoning is that this is a simpler (albeit more noisy
and perhaps less effective) comparison - and it takes less time for
the algorithm to reach a plateau, so I don't have to wait several days
to be sure.

3) I'm trying to benchmark against a different range of targets. This
is just out of an abundance of caution - in the past, I've often noticed
that optimizations that work for one target often don't work for
others. For example, adding -fno-if-conversion -fno-if-conversion2
to CFLAGS improves coverage in libpng, but actually reduces it in
libjpeg-turbo :-)

Stay tuned!

/mz

Michal Zalewski

Aug 17, 2016, 12:37:41 AM
to afl-users
> 3) I'm trying to benchmark against a different range of targets. This
> is just out of an abundance of caution - in the past, I've often noticed
> that optimizations that work for one target often don't work for
> others. For example, adding -fno-if-conversion -fno-if-conversion2
> to CFLAGS improves coverage in libpng, but actually reduces it in
> libjpeg-turbo :-)

Specifically, my usual benchmark targets are, in order of
complexity: zlib (minigzip -d), GNU patch, libjpeg-turbo (djpeg, with
dictionary), libpng (readpng, CRC check removed, with dictionary), and
sqlite3 (with dictionary).

/mz

Gustavo Grieco

Aug 17, 2016, 7:27:08 AM
to afl-...@googlegroups.com
Hi,

This paper looks very impressive indeed. I have just one question.
I can't find in the evaluation any measure of unique crashes
according to their backtraces (as is usual in papers measuring the
real number of bugs/vulnerabilities found in a fuzzing campaign).
I'm asking because I'm doing some simple experiments on seed
selection to try to improve AFL, and it is easy to see that in most
of the programs I tested you can reduce the number of seeds computed
by afl-cmin to obtain more "afl-unique" crashes in the first 48 hours.

Let me show you an example: I attached the results of an experiment
using our approach (labeled "vd") vs. AFL's seed selection (labeled
"afl") to fuzz an old version of tidy with HTML fragments from the
internet.

Although our approach finds 10 times more unique crashes, there are
only 2 *real* different crashes, and the original AFL is better! The
question is: have you reviewed this type of measure, just to be sure
that the results are not telling a completely different story?

Regards,
Gustavo.
[Attachments: crashes.png, uniq_crashes.png]

Marcel Böhme

Aug 17, 2016, 8:11:56 AM
to afl-users
Hi Gustavo,

Yes, we also evaluate the number of vulnerabilities/bugs exposed. For nm, we went through the >1.2k unique crashes to identify the vulnerabilities and bugs they exposed. AFLFast turns out to expose 6 vulnerabilities and 2 bugs about 7x faster than AFL in 24h experiments -- plus AFLFast exposed 3 vulnerabilities and 1 bug that AFL did not expose at all. However, we expect that, given more time, AFL would still find the same vulnerabilities, since our modifications have no detrimental impact on AFL's effectiveness.

By "expose" we mean to generate the first unique crash that witnesses the vulnerability/bug.

In order to identify which unique crash belongs to which vulnerability/bug, we fixed one vulnerability/bug at a time, prepared one fixed version for each vulnerability/bug, and executed the crashing input on the version with and without the vulnerability/bug.
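
To make this concrete, here is a minimal sketch of the per-input check involved (the paths, and the assumption that the target takes its input file as a command-line argument, are illustrative; stdin-based targets need a small adaptation):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one crashing input against one build of the target (e.g. the version
   with and the version without a particular fix) and report whether the
   target still crashes. */
int main(int argc, char **argv) {

  if (argc != 3) {
    fprintf(stderr, "usage: %s /path/to/target /path/to/crashing_input\n", argv[0]);
    return 2;
  }

  pid_t pid = fork();

  if (pid == 0) {
    /* Child: execute the target on the crashing input, discarding output. */
    freopen("/dev/null", "w", stdout);
    freopen("/dev/null", "w", stderr);
    execl(argv[1], argv[1], argv[2], (char *)NULL);
    _exit(127);  /* exec failed */
  }

  int status = 0;
  waitpid(pid, &status, 0);

  if (WIFSIGNALED(status)) {
    printf("still crashes (signal %d)\n", WTERMSIG(status));
    return 1;
  }

  printf("no crash\n");  /* the fix in this build covers the crashing input */
  return 0;
}

Running such a check against the unfixed build and each singly-fixed build tells us which fix "absorbs" a given crashing input.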

Best regards,
- Marcel

Michal Zalewski

Aug 18, 2016, 10:23:58 AM
to afl-users
Hey folks,

Okay, so I think your approach works. But based on my preliminary
experiments, there are two potential caveats:

1) The improvements I'm seeing in my (admittedly crude) testing aren't
as dramatic as outlined in the paper. That said, I'm looking at
different targets and using different metrics, so maybe that's not a
very interesting observation.

2) More curiously, I was able to get very close to the results I'm
seeing with AFLFast with a two-line change to stock AFL; as a
consequence, I am not sure if the checksum counting algorithm is truly
responsible for the bulk of the gains you are seeing in your code
(although I have no doubt that it confers some benefits).

Specifically, I sought to replicate two artifacts of your
implementation: the fact that it spends a much greater proportion of
time on non-deterministic fuzzing early on in the game; and that most
of the non-deterministic cycles it carries out end up being very
short, so the queue is cycled very rapidly and new queue entries get
air time almost right away. I approximated these by reducing
HAVOC_CYCLES and SPLICE_HAVOC 20-fold (to roughly match the queue
cycling cadence of AFLFast) and then by specifying -d.

I'll refer to this modified setup as FidgetyAFL.
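
For reference, the whole tweak boils down to something like this in config.h (the numbers here are just illustrative of the ratio; take whatever values your config.h ships with, divide by roughly 20, rebuild, and run afl-fuzz with -d):

/* FidgetyAFL sketch: cut the havoc / splice cycle counts roughly 20-fold. */

#define HAVOC_CYCLES   256   /* assumed stock value ~5000, reduced ~20x */
#define SPLICE_HAVOC    25   /* assumed stock value ~500,  reduced ~20x */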

Some rudimentary measurements follow.

== Experiment 1 ==

This involved running AFL against the contrib/libtests/readpng.c
utility from libpng (CRC patch applied). I like this benchmark because
it's fast and predictable, but the target is fairly complex, and it
can take a week or two for AFL to reach a plateau.

I ran this benchmark on FreeBSD, single core, library compiled using
afl-clang. I did two sessions, each 6 hours long. The command line
was:

$ ./afl-fuzz -i testcases/images/png -x dictionaries/png.dict -t 5 -o
out_dir ./readpng

Stock AFL vs AFLFast: stock AFL was consistently but very slightly
(~1%) ahead of AFLFast in terms of the number of tuples seen, with
little variation over time.

FidgetyAFL: this implementation outperformed both "competitors" by
around +10% at the 1h mark, then trailed very slightly by the 6h mark
(-1%).

== Experiment 2 ==

This involved running AFL against the djpeg utility from the latest
libjpeg-turbo. It's another very predictable and easy-going benchmark
with appreciable complexity but no hard puzzles to solve.

I tried this one on Linux, single core, library compiled with afl-gcc.
Two sessions, each six hours long. The command line was:

$ ./afl-fuzz -i testcases/images/jpeg -x dictionaries/jpeg.dict -t 5
-o out_dir ./djpeg

Stock AFL vs AFLFast: AFLFast took a robust lead (+15% tuples seen) by
the 1h mark. It remained in the lead until the end, although its
advantage gradually tapered off to +5%.

FidgetyAFL: robustly outperforms stock AFL and scores slightly below
AFLFast at the 1h mark (+13%). At the 6h mark, the advantage tapered
off to +4%.

== Experiment 3 ==

This involved running AFL against the minigzip [-d] utility from the
latest zlib. It's a simple and graceful target that doesn't take too
long to meaningfully explore, so it's good for shorter-running
benchmarks.

I did three one-hour sessions on Linux, single core, library compiled
with afl-gcc. The command line was:

$ ./afl-fuzz -i testcases/archives/common/gzip -t 5 -o out_dir ./minigzip -d

Stock AFL vs AFLFast: the implementations go toe-to-toe for a while,
but AFLFast tends to take off and secure a +20% lead that it keeps
later on.

FidgetyAFL: the implementation roughly tied with AFLFast.

== Experiment 4 ==

This involved running AFL against the GNU patch utility. In contrast
to other experiments, it started with a bogus input file, in order to
test for the ability to discover grammar. It's a very graceful target
for that, because it supports several input formats and parses them in
interesting ways.

The test environment was FreeBSD + afl-clang. I did three one-hour
runs. Cmdline:

$ ./afl-fuzz -i testcases/others/text -t 5 -o out_dir ./patch --dry-run

Stock AFL vs AFLFast: AFLFast tends to win *very* decisively,
typically +50% tuples compared to stock AFL. A very clear success
story.

FidgetyAFL: the implementation tied with AFLFast.

/mz

Marcel Böhme

Aug 19, 2016, 10:26:41 PM
to afl-users
Hi Michal,

Great to hear you are experimenting with AFLFast!

There are a few points that we would like to make,
* FidgetyAFL appears to be similar to our Explore schedule, which significantly outperforms the other schedules in terms of the number of crashing inputs for the first few hours BUT is definitely the worst schedule at the 24-hour mark. Perhaps we should also use time-to-exposure for each vulnerability as a better measure than #unique crashes.
* The number of tuples seen over time might not be a good measure of efficiency. One might think that low-frequency paths stress more interesting behavior, i.e., that vulnerabilities lurk in paths that are not "usually" exercised. More specifically, explore (or FidgetyAFL for that matter) might still exercise more tuples (or paths) after 6 hours, but the number of vulnerabilities exposed may still be much lower. Importantly, AFLFast increases the number of low-frequency paths explored.
* There is a lot of randomness. Sometimes AFLFast doesn't move at all. For this reason we ran each experiment 8 times and reported averages.
* In recent experiments with more subjects, Cut-Off-Exponential (COE) seems to consistently outperform even the exponential schedule (FAST).

Rules of the Game (Suggestion for Experiments):
1) 8 runs of 24 (better 48) hours with AFLFAST (COE) versus AFLFAST (FAST) versus FidgetyAFL versus Stock AFL.
2) Take a few subjects with a known vulnerability and measure the time to generate the first vulnerability-exposing input (That's what we are interested in, right?).
3) Alternatively, take libjpeg and measure the time to generate the first valid JPEG :)

Again, great work, and good to hear you are taking AFLFast for a spin! We are getting our hands dirty with FidgetyAFL and will run some experiments. Reporting back soon.

Cheers!
- Marcel

PS: Make some noise on Twitter #aflfast #CCS16 :)

Michal Zalewski

Aug 19, 2016, 10:52:22 PM
to afl-users
Hey,

> Perhaps we should also use time-to-exposure for each vulnerability as a better measure than #unique crashes.

FWIW, I'm not a huge fan of measuring crashes, for two reasons. First of all, they are relatively sparse in targets that are actually interesting to fuzz, which (a) limits your choice of targets; (b) makes the results more heavily affected by random flukes (sometimes, one AFL run will find a particular bug in 5 minutes, only to need a day on a second run); and (c) makes it easier to accidentally come up with optimizations that do not generalize to other targets, without a robust way to detect that outcome. (That last part is a very clear failing in some of the other academic papers on automated vulnerability detection, although probably not in yours.)

The other reason why I'm wary of this metric is that crashes are difficult to automatically de-dupe without introducing additional biases. Every de-duping technique has fairly common failure modes. This applies about equally to attempts to fingerprint backtraces, to looking at faulting addresses, and to relying on the AFL crash counter. The picture is muddied even more by many classes of memory corruption issues that may not abort the program immediately, but corrupt program state and cause latent crashes in completely unrelated and variable locations later on (here, ASAN and MSAN can help to a large extent).

I've sort of come to accept that the ability to resolve tuples in several classes of targets is a better predictor of how the tool will fare in the wild. This is not to say that your methodology is not valid, just that I'm very nervous when using it as a benchmark for AFL or any other testing tool :-)

PS. If you're playing with it, one recommendation I can probably make is to compare parallelized instances, since it reduces the effects of blind luck and some of the disparities you may see simply as a consequence of shifting the balance between deterministic and non-deterministic fuzzing stages. For example, on an 8-core machine, it should be quite easy to do an A/B experiment with two sets consisting of 1 master (-M) and 3 slaves (-S) syncing within their respective sets.

Cheers,
/mz

Marcel Böhme

Aug 19, 2016, 11:31:58 PM
to afl-users

Absolutely agree. In general, #unique crashes over time is one of the worst measures of effectiveness. I believe Gustavo made that point earlier. We painstakingly de-duped by understanding and fixing the root cause of each unique crash. If a crashing input was the first that no longer crashed in the fixed version, it was considered to expose that vulnerability. MITRE gave us separate CVEs for each vulnerability.

However, using the Markov chain model, we informally argue that our modifications have no detrimental impact on effectiveness. That is, AFL has the *same potential* to generate the unique crashes that AFLFast generated earlier. Because of the circular queue, given the same seeds, there is no reason why AFL should not generate the same unique crashes (at a later point in time) -- on average. We plan to formalize this argument in the journal extension. Simply put: in expectation, AFLFast generates the same unique crashes, only at different points in time (as it turns out, much faster). So, in our case, #unique crashes over time should be a fairly good approximation of efficiency.

To address the problem of evaluating with a small number of targets, we put our modifications of AFL out there for everybody to evaluate. Thanks again for the great work on AFL and for considering integrating some of our ideas. It's awesome to see how vulnerability detection research and practice finally connect :)

Cheers!
- Marcel

Michal Zalewski

Aug 20, 2016, 12:10:48 AM
to afl-users
> However, using the Markov chain model, we informally argue that our modifications have no detrimental impact on effectiveness. That is, AFL has the *same potential* to generate the unique crashes that AFLFast generated earlier. Because of the circular queue, given the same seeds, there is no reason why AFL should not generate the same unique crashes (at a later point in time) -- on average.

Sure, but that's not equivalent to having no detrimental effect, at least not as the average user would understand it; just about any fuzzer will generate just about any crashing input given enough time, but you obviously don't want to use /dev/urandom when a better tool will hit the condition in a more reasonable amount of time :-)

But back to testing... by the way, if anyone is following this, your results would be very welcome, too!

Cheers,
/mz

Brian 'geeknik' Carpenter

Aug 21, 2016, 4:34:44 AM
to afl-...@googlegroups.com
Fuzzing Perl: A Tale of Two American Fuzzy Lops is up at http://www.geeknik.net/71nvhf1fp. A very unscientific study performed over the last 48 hours. =)

Kurt Roeckx

Aug 21, 2016, 6:23:18 AM
to afl-...@googlegroups.com
On Sun, Aug 21, 2016 at 03:34:02AM -0500, Brian 'geeknik' Carpenter wrote:
> Fuzzing Perl: A Tale of Two American Fuzzy Lops is up at
> http://www.geeknik.net/71nvhf1fp. A very unscientific study performed over
> the last 48 hours. =)

So one thing I noticed is that for AFLFast you have a stability of
87%, and for FidgetyAFL 84%.

You might want to look into getting that higher; it seems to have
had a great effect for me. I currently have them all higher than
99.5%.


Kurt

Marcel Böhme

Aug 21, 2016, 7:04:21 AM
to afl-users
On Sunday, 21 August 2016 16:34:44 UTC+8, Brian Carpenter wrote:
> Fuzzing Perl: A Tale of Two American Fuzzy Lops is up at http://www.geeknik.net/71nvhf1fp. A very unscientific study performed over the last 48 hours. =)

Nice write-up! Did you also try the Cut-Off-Exponential: 'afl-fuzz -p coe ..'?
It seems the Cut-Off-Exponential might have an edge over the default exponential schedule (fast) after all.

Brian 'geeknik' Carpenter

Aug 21, 2016, 7:36:23 AM
to afl-...@googlegroups.com
I did not, but this is the first in a series that documents what works and doesn't work so well. Watch for a follow-up next week. Missed a few things in my haste to get the first one out the door. 


Emilien Gaspar

Aug 21, 2016, 4:24:32 PM
to afl-...@googlegroups.com
On Sun, Aug 21, 2016 at 12:23:15PM +0200, Kurt Roeckx wrote :
> On Sun, Aug 21, 2016 at 03:34:02AM -0500, Brian 'geeknik' Carpenter wrote:
> > Fuzzing Perl: A Tale of Two American Fuzzy Lops is up at
> > http://www.geeknik.net/71nvhf1fp. A very unscientific study performed over
> > the last 48 hours. =)
>
> So one thing I noticed is that for AFLFast you have a stability of
> 87%, for FidgetyAFL 84%.
>

BTW, maybe section 8 of docs/status_screen.txt should be updated to
explain what "stability" is.

Thank you :-).

--
gapz -- https://residus.eu.org

Kurt Roeckx

Aug 21, 2016, 4:49:31 PM
to afl-...@googlegroups.com
It's already explained in that section?


Kurt

Emilien Gaspar

Aug 21, 2016, 5:06:47 PM
to afl-...@googlegroups.com
On Sun, Aug 21, 2016 at 10:49:28PM +0200, Kurt Roeckx wrote :
> > BTW, maybe the section 8 of docs/status_screen.txt should be updated to
> > explain what "stability" is.
>
> It's already explained in that section?

My mistake, checked against an older version.

Simon Pinfold

Aug 21, 2016, 9:03:32 PM
to afl-users
On Saturday, August 20, 2016 at 4:10:48 PM UTC+12, Michal Zalewski wrote:
> if anyone is following this, your results would be very welcome, too!

I am doing a university project looking into the nature of parallelizing AFL.
So far I have results for libxml, tcpdump, libjpeg and libpng at 1, 2, 4, 8 and 16 slave instances (with no master) on 2.30b. I'll run the same tests on 2.32b and get back to you on the differences. Depending on what I find from this I can do tests for afl-fast and with master instances.

Bhargava Shastry

Oct 19, 2016, 7:06:39 AM
to afl-users
Hi all,

With timely and very helpful support from @rc0r, I managed to hack together A/B test support in the dev branch of Orthrus[1] to enable exactly the kind of experiments being talked about in this thread. You can read more about how to use Orthrus to carry out A/B tests here[2]. A console shot follows (control group: afl-2.33b, experiment group: aflfast-2.33b, both running in master/slave mode, each with 60 virtual cores and default fuzzer args; the control group runs afl-fuzz, the experiment group afl-fuzz-fast):

======
A/B test status
Control group
               Fuzzers alive : 60
              Total run time : 31 days, 6 hours
                 Total execs : 78 million
            Cumulative speed : 1710 execs/sec
               Pending paths : 5190 faves, 75208 total
          Pending per fuzzer : 86 faves, 1253 total (on average)
               Crashes found : 97 locally unique

             Triaged crashes : 1
Experiment group
               Fuzzers alive : 60
              Total run time : 31 days, 6 hours
                 Total execs : 55 million
            Cumulative speed : 1200 execs/sec
               Pending paths : 3305 faves, 80605 total
          Pending per fuzzer : 55 faves, 1343 total (on average)
               Crashes found : 50 locally unique

             Triaged crashes : 1
===

Known limitations:
  - Supports only fuzzer comparison at the moment
  - Since both control and experiment fuzzers are run simultaneously, OS scheduling comes into play
     - Initial experiments show that the two fuzzers do not have the **exact** same CPU exposure time, but on average and over a long period (say 12 hours) they get roughly the same cumulative CPU exposure (as the console shot shows)

Please try it out! Feedback welcome :-)


Cheers,
Bhargava