For part 1 click here, for part 2 click here.
Removing -fno-inline
hfuzz-cc (hfuzz-clang, hfuzz-clang++) used to add -fno-inline to its clang/gcc parameters. The original thinking was that it would make code branching more explicit: non-inlined functions would remain reachable as separate functions, possibly returning better signals to the fuzzer.
While that may still be the case, the flag had a surprising effect on the jsoncpp_fuzzer benchmark: honggfuzz was consistently achieving code coverage numbers about 5% lower than other fuzzers. After a not-so-quick debugging session, I figured out that it was because honggfuzz used -fno-inline in its hfuzz-cc compilation wrapper. So, I removed it.
In retrospect it seems understandable. The inlined code is also instrumented, and the coverage binary sees it not as a single function, but as new code blocks sprayed across the code. Using -fno-inline prevented honggfuzz from producing inputs that would make the coverage binary used by fuzzbench reach those inlined code blocks.
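To illustrate the mechanics (a contrived example, not code from jsoncpp): with SanitizerCoverage-style edge instrumentation, each inlined copy of a helper gets its own counters, so the coverage binary sees distinct code blocks per call site.

    /* Contrived illustration: with inlining enabled, each call site
     * receives its own instrumented copy of the branch inside isEven(),
     * so an edge-coverage-instrumented binary reports separate edges
     * for each copy. With -fno-inline there is a single instrumented
     * copy, reached via a plain call from both sites. */
    static inline int isEven(int x) {
        return (x % 2) == 0; /* this branch is duplicated per inlined copy */
    }

    int classifyA(int x) { return isEven(x) ? 1 : 2; }
    int classifyB(int x) { return isEven(x) ? 3 : 4; }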
I don’t think this change made honggfuzz any better at discovering new code paths; it merely made it perform better during fuzzbench tests. All in all, it was a fun thing to debug, and it improved honggfuzz’s fuzzbench score a lot, but it probably didn’t improve honggfuzz itself that much.
Adding a new code mutation function - mangle_ASCIINumChange()
After the aforementioned change (removing -fno-inline from the compilation wrappers), the jsoncpp_fuzzer benchmark results started to look much better. Still, the maximum code coverage for honggfuzz and most other fuzzers was capped at 634 edges (see the Sample statistics and statistical significance table posted under this benchmark), and only libfuzzer/entropic were able to reach 635 edges (of course, those numbers per se are not meaningful). A small, but still a bit irritating, problem :).
It turned out that libfuzzer and entropic implement a unique mutation function, Mutate_ChangeASCIIInteger(), which, as you might have guessed from its name, randomly modifies numbers represented as ASCII in input files. To reach 635 edges, a block of zeroes (0) had to be changed into some other, pretty much random number. That's something libfuzzer/entropic did easily, but which was very hard for other fuzzers.
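For reference, here’s a minimal sketch of the idea behind such a mutator. It’s an illustration of the technique only, not honggfuzz’s actual mangle_ASCIINumChange() implementation; the function name and the specific mutation choices below are made up.

    #include <ctype.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Find a run of ASCII digits in buf, parse it, and splice a randomly
     * modified value back in. Returns the (possibly changed) length. */
    static size_t asciiNumChange(uint8_t *buf, size_t len, size_t cap) {
        if (len == 0) return 0;

        /* Pick a random offset and scan forward for the first digit run. */
        size_t start = (size_t)rand() % len;
        while (start < len && !isdigit(buf[start])) start++;
        if (start == len) return len; /* no digits found */
        size_t end = start;
        while (end < len && isdigit(buf[end])) end++;

        /* Parse the digits manually (bounded; buf is not NUL-terminated). */
        uint64_t val = 0;
        for (size_t i = start; i < end && (i - start) < 19; i++) {
            val = val * 10 + (uint64_t)(buf[i] - '0');
        }

        /* Mutate the value: increment, decrement, double, or randomize. */
        switch (rand() % 4) {
            case 0: val++; break;
            case 1: val--; break;
            case 2: val *= 2; break;
            default: val = (uint64_t)rand(); break;
        }

        /* Re-render the number and splice it in place of the old digits. */
        char num[32];
        int n = snprintf(num, sizeof(num), "%" PRIu64, val);
        if (n <= 0) return len;
        size_t newLen = len - (end - start) + (size_t)n;
        if (newLen > cap) return len; /* new number doesn't fit; skip */
        memmove(&buf[start + (size_t)n], &buf[end], len - end);
        memcpy(&buf[start], num, (size_t)n);
        return newLen;
    }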
After similar functionality was added to honggfuzz, it can now reach this magical 635-edge mark in this specific test too. See the Sample statistics and statistical significance section posted under the benchmark for the actual numbers.
Implementing dynamic timeouts for “broken” benchmarks
IMO, the irssi_server-fuzz benchmark is somewhat broken. It tends to accumulate state, and over time (within 60 or so seconds) the fuzzing speed drops from ~2k iterations/second to ~50-100 iterations/second. Honggfuzz does poorly in this test, as it simply applies the (default) 1-second timeout, which doesn’t cover this case. The same problem seems to affect libfuzzer/entropic. OTOH, AFL and its descendants try to guess the correct timeout values by timing the input corpus. When the fuzzing binary goes over that initial timeout for some period of time, it’s restarted, and the fuzzing speed temporarily rises back to ~2k iterations/sec. This seems to work reasonably well for this specific case.
I’ve implemented similar functionality in honggfuzz, but I still need to tune it properly in order to get good results.
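The calibration idea could look roughly like the sketch below; runOneInput() is a hypothetical stand-in for executing the target once, and the 5x multiplier and 20 ms floor are illustrative guesses, not AFL’s or honggfuzz’s exact values.

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    static uint64_t monotonicUsec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000ULL;
    }

    /* Hypothetical stand-in: run the target once on a given seed. */
    extern void runOneInput(const uint8_t *data, size_t len);

    /* Time every corpus entry once and derive a timeout from the average
     * execution time, with a generous multiplier and a floor. */
    uint64_t calibrateTimeoutUsec(const uint8_t **seeds, const size_t *lens,
                                  size_t nSeeds) {
        uint64_t total = 0;
        for (size_t i = 0; i < nSeeds; i++) {
            uint64_t start = monotonicUsec();
            runOneInput(seeds[i], lens[i]);
            total += monotonicUsec() - start;
        }
        uint64_t avg = nSeeds ? total / nSeeds : 0;
        uint64_t timeout = avg * 5;           /* generous multiple of the mean */
        if (timeout < 20000) timeout = 20000; /* floor: at least 20 ms */
        return timeout;
    }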
I talked to the fuzzbench team members about the fate of this specific benchmark, and after a discussion, I think the consensus is for it to stay. There’s some value in testing how well benchmarks whose performance degrades over time are handled by various fuzzing engines.
EOPart 3
Thanks so much for your insights!

On a high level, it seems these were the main bottlenecks:
1) needed better instrumentation / coverage-feedback (e.g., "removing -fno-inline", "Switching to the 8bit-inline instrumentation")
2) needed better mutation operators / dictionaries (e.g., "Adding a new code mutation function - mangle_ASCIINumChange()", "Adding 4- and 8-byte integers to the dynamic dictionary", "Improving string/mem*cmp calls, by adding good candidates to the internal/dynamic dictionary")
3) needed better power schedules (e.g., "Preferring faster/smaller/more-potent inputs")
4) needed better fuzzer configuration ("Reading inputs in smaller batches", "Fixing builds")
5) needed to restart benchmark (e.g., "Implementing dynamic timeouts for 'broken' benchmarks")
I am wondering about (5) where the benchmark is restarted. Does the benchmark slow down because the generated inputs just take that long, or is it because of accumulating state?
If it is the latter, is "state pollution" a general problem for fuzzers that link to a dedicated fuzzing method (like libfuzzer)?
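(For what it's worth, a contrived sketch of what such state pollution can look like in an in-process target; made-up code, not from irssi or any real target:)

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* State retained across iterations: the cache only grows, so later
     * executions get slower and their behavior depends on earlier inputs. */
    static size_t gCacheCount = 0;
    static uint8_t *gCache[1 << 16];

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        if (size == 0) return 0;
        if (gCacheCount < sizeof(gCache) / sizeof(gCache[0])) {
            uint8_t *copy = malloc(size);
            if (copy) {
                memcpy(copy, data, size);
                gCache[gCacheCount++] = copy; /* never freed or reset */
            }
        }
        /* ... parsing whose behavior is subtly influenced by gCache ... */
        return 0;
    }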
For instance, (in OSS-Fuzz) are there lots of crashes that cannot be reproduced (false positives)?