For part 1 click here.
Adding 4- and 8-byte integers to the dynamic dictionary:
In the previous post I mentioned the benefits of a dynamic dictionary holding interesting blocks of bytes (tokens) seen in str/mem*cmp() calls. Tokens which already exist in the binary file are added to this dictionary over time and used in the mutation stage. But this idea doesn’t exhaust all the possibilities here:
Earlier, in March, honggfuzz was already using a method of minimizing the Hamming distance between two integers being compared (by saving inputs which fared better on a per-individual-comparison basis, where an “individual” comparison is identified by a unique __builtin_return_address(0)). This method is also used in libFuzzer and in some AFL forks.
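A minimal sketch of that idea, with a hypothetical update_cmp_score() standing in for honggfuzz's real feedback bookkeeping (here just a printf):

```c
#include <stdint.h>
#include <stdio.h>

/* hypothetical: the real code keeps the best score per call site in
 * shared memory, and the fuzzer saves inputs that improve it */
static void update_cmp_score(uintptr_t site, unsigned matching_bits) {
    printf("cmp at %#lx: %u/32 bits match\n",
           (unsigned long)site, matching_bits);
}

void __sanitizer_cov_trace_cmp4(uint32_t arg1, uint32_t arg2) {
    /* Hamming distance == number of differing bits between the operands */
    unsigned dist = (unsigned)__builtin_popcount(arg1 ^ arg2);
    /* identify the individual comparison by its return address */
    update_cmp_score((uintptr_t)__builtin_return_address(0), 32U - dist);
}
```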
As with str/mem*cmp() tokens, honggfuzz started adding interesting 4-byte and 8-byte integers to the dynamic dictionary. However, only values which are marked as const by the compiler can be used/added directly, and unfortunately not all of them are. In some cases (as happens with e.g. the lcms benchmark) values that are effectively constant are not actually marked as such, but live inside the data section of the binary (either by a programmer’s mistake or for other reasons). Therefore we cannot always trust the instrumentation to use __sanitizer_cov_trace_const_cmp1|2|4|8() with them.
Currently honggfuzz tries to verify which comparison values are interesting by searching through the process’s loaded sections (to be exact: through those integer values which exist in the program image at load time), looking for the values with a linear memory scan. Again, this happens with the help of dl_iterate_phdr(). If a given value exists in the program (binary) image, it’s added to the dynamic dictionary. The current implementation of this scan can be seen here. As with the str/mem*cmp() instrumentation, such operations are CPU-intensive, so they are performed only on every n-th (here: every 4095-th) invocation of __sanitizer_cov_trace*_cmp*(). But this seems “good enough” for the purpose.
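The linked honggfuzz source is the authoritative implementation; the sketch below only illustrates the mechanism, walking the loaded segments with dl_iterate_phdr() and memmem()-ing each readable PT_LOAD segment for the value’s bytes:

```c
#define _GNU_SOURCE /* for memmem() */
#include <link.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct needle {
    const void *bytes;
    size_t      len;
    bool        found;
};

static int scan_cb(struct dl_phdr_info *info, size_t size, void *data) {
    (void)size;
    struct needle *n = (struct needle *)data;
    for (size_t i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD || !(ph->p_flags & PF_R)) {
            continue; /* only scan readable, loaded segments */
        }
        const uint8_t *seg = (const uint8_t *)(info->dlpi_addr + ph->p_vaddr);
        if (memmem(seg, ph->p_memsz, n->bytes, n->len) != NULL) {
            n->found = true;
            return 1; /* non-zero return stops the iteration */
        }
    }
    return 0;
}

/* true if the first 'len' bytes of 'val' occur verbatim somewhere in the
 * mapped program image (the real code snapshots this at load time) */
static bool value_in_image(uint64_t val, size_t len) {
    struct needle n = { .bytes = &val, .len = len, .found = false };
    dl_iterate_phdr(scan_cb, &n);
    return n.found;
}
```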
This allowed honggfuzz to improve results on at least two benchmarks: libpng, where checks for new PNG tags are implemented as 4-byte asm comparisons, and lcms, where many interesting tags are grouped in non-const arrays of 32-bit values and searched for sequentially.
Switching to the 8bit-inline instrumentation (from PC-guard callbacks):
Previously honggfuzz didn’t use the concept of counting invocations of PC-to-PC edges: if a given edge was hit more than once, that didn’t change anything with regard to the fuzzing logic. Following libFuzzer’s lead, I implemented support for 8-bit inline counters (by using -fsanitize=fuzzer-no-link in place of -fsanitize-coverage=trace-pc-guard) and made it the default instrumentation mode in hfuzz-clang.
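To make the difference concrete, here is a rough sketch of what the two modes boil down to. The two callback signatures are the real SanitizerCoverage interface; seen[] and the guard-numbering scheme are simplifications, not honggfuzz’s actual code:

```c
#include <stdint.h>

static uint8_t seen[1u << 16]; /* hypothetical "edge was hit" table */

/* -fsanitize-coverage=trace-pc-guard: the compiler plants a function
 * call at every edge; this init callback numbers the guards once */
void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint32_t n = 1;
    for (uint32_t *g = start; g < stop; g++) {
        if (*g == 0) {
            *g = n++;
        }
    }
}

void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    if (*guard == 0) {
        return; /* this edge was already reported */
    }
    seen[*guard % sizeof(seen)] = 1;
    *guard = 0; /* hit counts beyond the first are invisible here */
}

/* With inline 8-bit counters (enabled by -fsanitize=fuzzer-no-link)
 * there is no call at all: for edge i the compiler emits the moral
 * equivalent of
 *
 *     counters[i]++;
 *
 * so per-edge hit counts are recorded, at the price of a counter array
 * that must be scanned and cleared after every iteration. */
```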
This seems to help some benchmarks perform a bit better. E.g. the zlib one is now able to produce a better corpus (w/o 8-bit inline counters vs with them; note: it’s not about the raw values here, but about the fact that honggfuzz can produce corpora with better maximal coverage).
Unfortunately this change is also a double-edged sword: while 8-bit inline counters can be faster for smaller targets with longer per-iteration execution times, when the number of edges per target grows to ~10-20k and the per-iteration execution time is small, the number of execs/sec drops significantly below what the PC-guard method offers. That’s because a fairly large memory bitmap must be analyzed and cleared after every fuzzing iteration, and this eats lots of CPU cycles. This became a problem with e.g. the curl benchmark, which instruments pretty much every library used there (openssl, zlib), totalling ~80k edges to analyze. With 8-bit inline counters honggfuzz is able to perform only around 20k iterations per second vs >60k iterations when PC-guards are used.
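For illustration, a sketch of that per-iteration scan-and-clear loop (assumed names, not honggfuzz’s code; it presumes the map is 8-byte aligned and its length a multiple of 8):

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical: feed a hit edge and its counter into the coverage logic */
static void note_edge_hit(size_t idx, uint8_t count) {
    (void)idx;
    (void)count;
}

/* walk the whole counter map after every iteration and zero it again */
static void scan_and_clear(uint8_t *map, size_t len) {
    uint64_t *words = (uint64_t *)map;
    for (size_t i = 0; i < len / 8; i++) {
        if (words[i] == 0) {
            continue; /* the common case: nothing hit in this word */
        }
        for (size_t j = 0; j < 8; j++) {
            size_t idx = i * 8 + j;
            if (map[idx] != 0) {
                note_edge_hit(idx, map[idx]);
            }
        }
        words[i] = 0; /* clear while the line is still hot in cache */
    }
}
```

Even with the empty-word fast path, an ~80 kB map means ~10k 64-bit loads per iteration, every iteration, which is where the CPU cycles go.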
AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. The ~70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.
EOPart 2.
Very interesting! We've been aware of this problem in libFuzzer for a while now. We've seen libFuzzer for certain large targets spend a disproportionate amount of time iterating over the coverage array. I guess the fact that AFL's coverage map has a fixed size is what makes it better here. Nice to have some data on which method is better. +Kostya Serebryany who looked into this problem (I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution).
> AFL+descendants seem to be using a concept of counters by utilizing the PC-guards method (it's possible to implement it in honggfuzz as well) [...]
Yes, that's why we don't use trace-pc-guard but a custom LLVM pass with inlined instrumentation for everything. SanCov is still supported as an optional compile flag, but we discourage its use because it is detrimental to performance, and many of the new AFL++ features (like the code whitelist) in LLVM mode are not compatible with it.
> I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution
I was doing something similar for QEMU mode, in order to remove collisions and support a variable-sized map. My idea is to assign IDs based on distance from the entry point. This is easy in QEMU because I assign an ID to an edge at runtime when two blocks are linked for the first time (i.e. when a new edge is discovered), so the progress of the fuzzer itself defines the distance between the edge and the entry point. For LLVM, a proxy could be the number of hops in the CFG, I guess.
Returning to the variable-sized map: each time an edge is executed, I increase the corresponding map entry and bit-OR the edge's ID into a shared integer, size. At the end of the execution, the fuzzer iterates the map only up to size (or the next power of 2 of size).
In this way, all the invalid inputs generated by the fuzzer will produce a small size, and so a low overhead when iterating the coverage map.
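If I understand the scheme correctly, a small sketch (assumed names, a fixed maximum map length, and IDs assumed smaller than that length) would look like this:

```c
#include <stdint.h>

#define MAP_LEN (1u << 16)

static uint8_t  map[MAP_LEN];
static uint32_t size; /* bit-OR of all edge IDs touched this execution */

/* called on every edge execution; assumes id < MAP_LEN and that IDs were
 * assigned roughly by distance from the entry point */
static inline void hit_edge(uint32_t id) {
    map[id]++;
    size |= id; /* 'size' ends up >= the largest ID touched */
}

/* round v up to the next power of two (v >= 1) */
static inline uint32_t next_pow2(uint32_t v) {
    return (v <= 1) ? 1 : 1u << (32 - __builtin_clz(v - 1));
}

static void post_run_scan(void) {
    uint32_t limit = next_pow2(size + 1);
    if (limit > MAP_LEN) {
        limit = MAP_LEN;
    }
    /* invalid inputs bail out early, touch only low IDs, and thus
     * keep this loop short */
    for (uint32_t i = 0; i < limit; i++) {
        /* ... process map[i] ... */
        map[i] = 0;
    }
    size = 0;
}
```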
Hi,
> Very interesting! We've been aware of this problem in libFuzzer for a while now. We've seen libFuzzer for certain large targets spend a disproportionate amount of time iterating over the coverage array. [...]
At the risk of saying something you might have already analyzed, I've spent some time thinking about optimizations and the following came to mind:

1. Ask an x86 asm magician to implement memchr() (but with the semantics of looking for non-NUL bytes rather than for a specific byte value) using SSE4 or even AVX(2). The inline bitmap (bytemap, actually) is pretty much empty, so speeding up the process of looking for the first non-NUL byte could help here. For example, with curl's 80k edges (bytes), only a few hundred/thousand are set on average. Glibc already implements memchr() and friends with such optimizations.
2. I wonder if we could ask the OS to zero the coverage bitmap memory for us by using some hack, like doing munmap/mmap over it. I guess on the OS/HW level it's possible to zero memory pages very quickly? (But maybe not, I'm not sure.) Though this probably won't help much, because clearing bytes while iterating over the bitmap is faster than a memset() after the full bitmap scan, due to memory caching effects.
--Robert Święcki
> 1. Ask an x86 asm magician to implement memchr() (but with the semantics of looking for non-NUL bytes rather than for a specific byte value) using SSE4 or even AVX(2). [...]

I've tried a couple of times and the results weren't much better than the current code that reads 64 bits at a time. Cache is the bottleneck, not the instructions retired.

> 2. I wonder if we could ask the OS to zero the coverage bitmap memory for us by using some hack [...]

Same here: I did a few experiments that didn't show improvements. It is quite possible that I am missing something.
Jonathan mentions my experiment with page faults, etc. It did speed up several HUGE benchmarks by ~10%, but I still think it's not worth doing. I'll probably delete this code next time I venture into libFuzzer. (Also, it's probably obvious anyway: I have not been working on libFuzzer for quite a while now; I still hope to get back sometime soon.)