Improving honggfuzz for fuzzbench, part 1 (end-of-February - mid-March 2020)


Robert Święcki

Apr 7, 2020, 9:54:36 AM4/7/20
to fuzzing...@googlegroups.com

When fuzzbench was made publicly available, the very first results were not that bad for honggfuzz: it was in the middle of the pack. Also, honggfuzz-specific features (minimizing the hamming distance for certain string/mem*cmp calls) were responsible for it doing particularly well in the libxml2 and php benchmarks.


During the next few weeks I worked on improving other benchmarks, and the results can be seen here (2020-03-16). Some benchmarks (like proj4, systemd or freetype2) improved, while others stayed pretty much the same. The "improvements" I made fall into 2 categories:


"Improvements" that may have helped, but nobody knows for sure:


Reading inputs in smaller batches:

Originally (as with afl+descendants and libfuzzer+descendants) honggfuzz was reading files from the seed corpus one by one, testing whether they add to the coverage metrics (and discarding them otherwise). I thought that by starting with a smaller input size (like 1kB, or even a couple of bytes), testing it for coverage, and then increasing the size in the next step (either by a factor of 2, or by adding the next 1kB and repeating the process), I could make honggfuzz prefer smaller inputs and discard unnecessary parts of seed files.


This potentially could work well for e.g. the sqlite3 benchmark, which provided big seed files (>1MB). But given that honggfuzz achieved pretty much the same results in the initial and the mid-March runs, this is unlikely. Still, the feature is present, and can be seen here.


Preferring faster/smaller/more-potent inputs:

In a similar vein to the above, I tried to improve the fuzzing scheduler by preferring potentially more interesting inputs (already in the dynamic corpus/queue) over others. The procedure is self-explanatory in the code; suffice it to say that inputs with a bigger penalty factor were less likely to be picked for processing, by using a form of:


if (rnd() % per_input_skip_factor) {
  /* skip this input in this round */
  continue;
}


It smoothed out the execution speed of the sqlite3 benchmark, resulting in higher execs/sec than before. But, again, this didn't help it much to achieve better results.



Improvements that actually worked:


Fixing builds:

The initial results for the systemd benchmark were disappointing. After a somewhat lengthy debugging session, I figured out that it wasn't about the fuzzer operations; the problem was in the code linking stage.


Systemd uses the main instrumented binary and the instrumented libsystemd.so, where most of the benchmark's logic resides. I will spare you the more technical details and just say that hfuzz-cc interacted incorrectly with ASAN: while the main binary was instrumented properly, all instrumentation calls in libsystemd.so were redirected to empty ASAN stubs.


After the fix was implemented, honggfuzz did well in this category.


Figuring all of this out was not easy, and required some longish hours with binutils and gdb, but I feel like I understand static linking way better now :). Suffice it to say that this was all very honggfuzz/asan-specific.



Improving string/mem*cmp calls, by adding good candidates to the internal/dynamic dictionary:

Both honggfuzz and libfuzzer+descendants (and some afl-inspired fuzzers) use a rather obvious trick to improve code coverage. They instrument str/mem*cmp function calls, trying to minimize the hamming distance (or byte-based hamming distance) of specific string/memory comparisons. This typically takes the form of:


int intercept_strcmp(const char *s1, const char *s2) {
  uintptr_t pc = (uintptr_t)__builtin_return_address(0);
  int score = compute_score(s1, s2);
  if (score < cmpmap_score_for_pc(pc)) {
    cmpmap_score_for_pc(pc) = score;
    save_input_as_interesting();
  }
  return original_strcmp(s1, s2);
}


It still works reasonably well, being partly responsible (along with fixing the build) for the good results of the systemd benchmark, which is heavily text-based. Still, something better was needed, and libfuzzer already had it, utilizing a form of a temporary dictionary for recent string/mem*cmp values.


I decided to go a step further and actually check whether the values used in mem/str*cmp comparisons are valid dictionary candidates, by checking whether the pointers passed to those functions reside within read-only sections of loaded files (binary + libraries). This can be done by calling dl_iterate_phdr() (or by analyzing /proc/<pid>/maps, but that would be Linux-specific), and checking whether the addresses used fall within the bounds of loaded sections of files. If so, the fuzzer adds them to a dynamic dictionary. The current implementation of this check can be seen here.


Caveat: calling dl_iterate_phdr() on every string/mem*cmp would slow down the fuzzing process immensely for some benchmarks, hence the check is only performed every ~100 calls or so. This is fine, as fuzzed inputs repeat often, and within a minute or so pretty much all interesting string/memory values land in the dynamic dictionary and become available to the corpus mutation stage.


This improved honggfuzz results for the freetype2 (initial vs mid-March) and for proj4 (initial vs mid-March) benchmarks.


EOPart 1.


--
Robert Święcki

Andrea Fioraldi

Apr 8, 2020, 4:24:52 AM4/8/20
to fuzzing-discuss

Hi Robert, glad to see that token extraction improved honggfuzz in fuzzbench.

In AFL++ we also log routine tokens from the execution (however, this is a mode that was never tested in fuzzbench).
https://github.com/AFLplusplus/AFLplusplus/blob/master/llvm_mode/afl-llvm-rt.o.c#L545

We don't simply log the tokens of *cmp functions, but of all functions that take two pointers as their first arguments.
We dump the first 32 bytes of memory, as you can see.

I found that this improves coverage beyond logging only read-only tokens, as many tokens are computed at runtime (many live on the heap).
If you want to log only read-only tokens, why not do it before fuzzing, by extracting them from the binaries?

Andrea


Robert Święcki

Apr 8, 2020, 5:39:53 AM4/8/20
to Andrea Fioraldi, fuzzing-discuss
Hi,

On Wed, Apr 8, 2020 at 10:24 Andrea Fioraldi <andreaf...@gmail.com> wrote:

> Hi Robert, glad to see that token extraction improved honggfuzz in fuzzbench.
>
> In AFL++ we also log routine tokens from the execution (however, this is a mode that was never tested in fuzzbench).
> https://github.com/AFLplusplus/AFLplusplus/blob/master/llvm_mode/afl-llvm-rt.o.c#L545

I can only say: go for it :). The fuzzbench authors have so far seemed kind enough to allow experiments.

> We don't simply log the tokens of *cmp functions, but of all functions that take two pointers as their first arguments.
> We dump the first 32 bytes of memory, as you can see.
>
> I found that this improves coverage beyond logging only read-only tokens, as many tokens are computed at runtime (many live on the heap).
> If you want to log only read-only tokens, why not do it before fuzzing, by extracting them from the binaries?

I tried that, but there were problems with it. However, maybe we could find some working solution here.

When doing it at run-time, the str/mem*cmp interceptor gets a raw pointer for the memory comparison arguments, and can verify it's within the bounds of the loaded binary (which AFL++ does too). And by performing this check every n invocations of *cmp(), I get the best of both worlds: speed and precision, even if one needs to wait a minute or two for the dictionary to populate.

It turned out to be pretty problematic to extract valid strings and memory blocks from the binary. These are not marked in any specific way, and some heuristics are needed (unless one wants to decompile the program, resolving relocations, and look for the arguments to *cmp). One such check could be whether a memory block resides in the .rodata segment (but what about .text and .data?), and there's still the problem of figuring out whether given data is actually used in comparisons. Even my libhfuzz.a, which gets linked into the final binary, delivers tons of strings (e.g. for logging) which are not interesting from a fuzzing perspective. The initial dictionary could then be 90% composed of not-very-useful entries. But, hey, fuzzing is a statistical business, maybe that's enough :).

--
Robert Święcki

van Hauser

Apr 8, 2020, 6:21:59 AM4/8/20
to fuzzing-discuss
> To log only read-only tokens, why not do it before fuzzing extracting them from binaries?

That was my thinking too.
Why not gather all fixed tokens from strcmp, memcmp, etc. at compile
time and pass them to the fuzzer on start? This would have no overhead
compared to what is implemented in honggfuzz.

Regards,
vh/Marc

Robert Święcki

Apr 8, 2020, 9:45:47 AM4/8/20
to van Hauser, fuzzing-discuss
Hi,
Indeed, though if we wanted to keep those tokens attached to the binary (e.g. in some separate section), because in my view the instrumented binary should be the only thing needed to start effective fuzzing, then this would require at least some libclang module (which might be a good idea anyway). It would be pretty hard to do with just the preprocessor + gcc|clang options, I guess, but I need to think about those libclang modules; they are pretty powerful.

PS: There would still be 2 small drawbacks, though maybe not that important in the larger scheme of things: a) it'll log/save tokens which are unreachable in the current code; this can happen with huge projects which are mostly linked statically and which have extensive (yet mostly unused) deps; b) it'll not save tokens from libraries which are not instrumented, e.g. if a project dynamically links in an uninstrumented libXML.so to parse some quick XML file, but maybe that's too esoteric.

--
Robert Święcki

van Hauser

Apr 8, 2020, 10:03:08 AM4/8/20
to Robert Święcki, fuzzing-discuss
Hi!

On 08.04.20 15:45, Robert Święcki wrote:
> śr., 8 kwi 2020 o 12:21 van Hauser <v...@thc.org <mailto:v...@thc.org>>
> napisał(a):
>
> > To log only read-only tokens, why not do it before fuzzing
> extracting them from binaries?
>
> that was my thinking too.
> Why not at compile time gather all fixed tokens from strcmp, memcmp,
> etc. and pass that to the fuzzer on start? This would have no overhead
> compared to what is implemented in honggfuzz.
>
>
> Indeed, though if we wanted to keep those tokens somehow still attached
> to the binary (e.g. in some separate section), because in my view the
> instrumented binary should be the only thing that's needed to start
> effective fuzzing, then this would require at least some libclang module
> (which might be a good idea anyway)? Pretty hard to do that with
> preprocessor + gcc|clang options I guess, but I need to think about
> those libclang modules, these are pretty powerful.

Yes, that was my thinking: to put this into the laf-intel llvm_mode pass.
I will implement a prototype in the next days in afl++ (branch:
autodictionary).

I would put that dictionary at compile time into a global array that is
transferred via the forkserver FDs to afl-fuzz.

Let's see how this will work out.

(That being said, Andrea's redqueen/cmplog implementation is way more
powerful and effective.)

> PS: There would be still 2 small drawbacks, but maybe not that important
> on the larger scale of things: a). it'll log/save tokens which are
> unreachable in the current code, this can happen with huge projects
> which are mostly linked statically, and which have extensive (yet mostly
> unused) deps.

Yes, true. But this should not be a big issue, as it just means some
extra useless fuzzing tests; on the other hand, you save the cost of
intercepting memcmp, strcmp, etc. at runtime and testing where their
arguments reside, whether they have already been learned, and so on.

> b). it'll not save tokens from libraries which are not
> instrumented - e.g. if a project links dynamically in uninstrumented
> libXML.so to parse some quick XML file - but maybe that's too esoteric.

Yes. IMHO that is not an issue, though: if you don't instrument it, then
it's not the target.
I would even say that by learning/solving these strings you might fuzz
into things that are not the goal of the fuzzing campaign.

Regards,
Marc/vH

Robert Święcki

Apr 8, 2020, 10:21:18 AM4/8/20
to van Hauser, fuzzing-discuss
Hi,
 
> Indeed, though if we wanted to keep those tokens somehow still attached
> to the binary (e.g. in some separate section), because in my view the
> instrumented binary should be the only thing that's needed to start
> effective fuzzing, then this would require at least some libclang module
> (which might be a good idea anyway)? Pretty hard to do that with
> preprocessor + gcc|clang options I guess, but I need to think about
> those libclang modules, these are pretty powerful.

yes that was my thinking, to put this into the laf-intel llvm_mode pass.
I will implement a prototype in the next days in afl++ (branch:
autodictionary).

Cool. I suggest testing with the 'proj4' benchmark: it can easily be built locally with just ./configure && make, and it has tons of interesting strings which help it gain coverage:

$ cat /proc/"`pidof -s honggfuzz`"/fd/6 | strings | wc -l
274

$ cat /proc/"`pidof -s honggfuzz`"/fd/6 | strings | tail -n 10
no_uoff
o_proj
epoch
tobs
transpose
dtheta
approx
theta
tilt
south

--
Robert Święcki

Josh Bundt

Apr 8, 2020, 12:47:54 PM4/8/20
to fuzzing-discuss
Robert,

Thanks for sharing your insights! Not sure if you looked at your improvements longitudinally, but the attached report shows 19 Mar to 01 April, when the edge counts were comparable. You made significant improvements on almost all of the benchmarks. Do you have any ideas on what could have caused the regressions for curl, openssl and mbedtls?

Curl and openssl both have a large number of seeds, so maybe reading inputs in small batches slows down the initial coverage for those two? Mbedtls only has two seeds, so the difference there must be something else; in fact, that regression happened after 03-19, earlier than the other two.


Josh
honggfuzz_report.tgz

Robert Święcki

Apr 8, 2020, 1:03:51 PM4/8/20
to Josh Bundt, fuzzing-discuss
Hi,

> Thanks for sharing your insights! Not sure if you looked at your improvements longitudinally, but the attached report shows 19 Mar to 01 April, when the edge counts were comparable. You made significant improvements on almost all of the benchmarks. Do you have any ideas on what could have caused the regressions for curl, openssl and mbedtls?
>
> Curl and openssl both have a large number of seeds, so maybe reading inputs in small batches slows down the initial coverage for those two? Mbedtls only has two seeds, so the difference there must be something else, and actually the regression happened after 03-19, which
 
I think so. It's probably the switch from pc-guard counting to 8-bit inline counters with https://github.com/google/honggfuzz/commit/64295f5487ac60110f7a249e0e3feeb257ef8787. Some benchmarks benefited from it (e.g. the range of findings for the zlib benchmarks), but some degraded (like curl), because of the difference in fuzzing speed between those two methods in some cases. I'll send another post later today covering this exact topic; I would also like to test both instrumentation methods on fuzzbench soon, as it's not obvious which one produces better results for an average benchmark.

PS: The way edges are counted changed sometime in mid-March, probably because ASAN was disabled for the majority of the fuzzers/targets. So, in certain cases it might not make sense to compare raw numbers.

--
Robert Święcki

Jonathan Metzman

Apr 8, 2020, 4:24:18 PM4/8/20
to Robert Święcki, fuzzing...@googlegroups.com
Thanks for the post, Robert!

I just want to clarify something about seeds in FuzzBench.

>This potentially could work well for e.g. the sqlite3 benchmark, which provided big seed files (>1MB).

FuzzBench actually removes any seed larger than 1 MB, for all fuzzers. This is because we want a fair comparison, and AFL (which most fuzzers in FuzzBench are based on) refuses to run when given a seed larger than 1 MB.
But the point about large seeds still applies, since a seed that is almost 1 MB is still very large IMO.


Robert Święcki

Apr 9, 2020, 7:47:46 AM4/9/20
to van Hauser, fuzzing-discuss
 
> Yes, that was my thinking: to put this into the laf-intel llvm_mode pass.
> I will implement a prototype in the next days in afl++ (branch:
> autodictionary).
>
> I would put that dictionary at compile time into a global array that is
> transferred via the forkserver FDs to afl-fuzz.
>
> Let's see how this will work out.


$  rm corpus/*; tar -xvzf ~/Downloads/corpus-archive-0042.tar.gz; rm *.sancov; mkdir out; rm out/*; ASAN_OPTIONS=hard_rss_mb=0 ./cov.png -reduce_inputs=0 -merge=1 -dump_coverage=1 out/ corpus/queue/
...
SanitizerCoverage: ./cov.png.438790.sancov: 629 PCs written
...

Only honggfuzz was able to go over 600 edges in recent runs: https://www.fuzzbench.com/reports/2020-04-01/libpng-1.2.56_coverage_growth.svg. I guess it's working. The corpus, based on timestamps, is ~10h into the fuzzing.


$ rm corpus/*; tar -xvzf ~/Downloads/corpus-archive-0042.tar.gz; rm *.sancov; mkdir out; rm out/*; ASAN_OPTIONS=hard_rss_mb=0 ./cov.png -reduce_inputs=0 -merge=1 -dump_coverage=1 out/ corpus/queue/
...
SanitizerCoverage: ./cov.png.439176.sancov: 517 PCs written
...

--
Robert Święcki