Improving honggfuzz for fuzzbench, part 2 (mid-March, end-of-March 2020)


Robert Święcki

Apr 8, 2020, 1:34:45 PM4/8/20
to fuzzing-discuss

For part 1 click here.



Adding 4- and 8-byte integers to the dynamic dictionary:


In the previous post I mentioned the benefits of a dynamic dictionary holding interesting str/mem*cmp() blocks of bytes (tokens). Tokens which already exist in the binary file are, over time, added to this dictionary and used in the mutation stage. But this idea doesn’t exhaust all the possibilities here:


Earlier in March honggfuzz was already using a method of minimizing the Hamming distance between two integers when comparing them (by saving inputs which fared better on a per-comparison basis, where an individual comparison is identified by a unique __builtin_return_address(0)). This method is also used in libFuzzer and in some AFL forks.
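For illustration, a minimal sketch of such a comparison hook (this is not honggfuzz’s actual code; the map size and scoring are made up for the example, but the callback name and signature are the real sanitizer-coverage API):

#include <stdint.h>

/* Sketch only: score each integer comparison by how many bits of the two
 * operands already match, keyed by the call site's return address. An input
 * that improves the best score for any site is worth keeping. */
#define CMP_MAP_SIZE 0x10000
static uint8_t cmp_best[CMP_MAP_SIZE];

void __sanitizer_cov_trace_cmp4(uint32_t arg1, uint32_t arg2) {
    uintptr_t site = (uintptr_t)__builtin_return_address(0);
    unsigned idx = (unsigned)((site >> 2) % CMP_MAP_SIZE); /* hash the site */
    /* 32 minus the Hamming distance: higher means "closer to equal" */
    unsigned score = 32 - (unsigned)__builtin_popcount(arg1 ^ arg2);
    if (score > cmp_best[idx]) {
        cmp_best[idx] = (uint8_t)score;
        /* the fuzzer would flag the current input as interesting here */
    }
}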


As with str/mem*cmp() tokens, honggfuzz started adding interesting 4-byte and 8-byte integers to the dynamic dictionary. However, only values which are marked as const by the compiler can be used/added directly, and unfortunately not all of them are. In some cases (as happens with e.g. the lcms benchmark) those values are not really marked as constant, but are located inside the data section of the binary (either by a programmer’s mistake or for other reasons). Therefore we cannot always trust the instrumentation to report them via __sanitizer_cov_trace_const_cmp1|2|4|8().
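A hypothetical illustration of the lcms situation: both checks below look “constant” to a human, but only the comparison against the literal is reported through the const variant of the callback:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* These tags live in the .data section (no 'const'), so comparisons against
 * them are instrumented as plain __sanitizer_cov_trace_cmp4() calls. */
static uint32_t known_tags[] = { 0x61637370, 0x6d6c7563 };

bool tag_is_known(uint32_t tag) {
    if (tag == 0x64657363)        /* literal: __sanitizer_cov_trace_const_cmp4() */
        return true;
    for (size_t i = 0; i < sizeof(known_tags) / sizeof(known_tags[0]); i++)
        if (tag == known_tags[i]) /* non-const operand: plain ..._cmp4() */
            return true;
    return false;
}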


Currently honggfuzz tries to verify which comparison values are interesting by searching through the process’s loaded sections (to be exact: those integer values which exist in the program’s image at load time), looking for the values with a linear memory scan. Again, this happens with the help of dl_iterate_phdr(). If a given value exists in the program’s binary image, it’s added to the dynamic dictionary. The current implementation of this scan can be seen here. Analogously to the str/mem*cmp() instrumentation, such scans are CPU-intensive, so they are performed only on every n-th (here: every 4095-th) invocation of __sanitizer_cov_trace*_cmp*(). But this seems “good enough” for the purpose.
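A simplified sketch of what such a scan can look like, assuming the value is searched for in its raw in-memory byte order (the honggfuzz implementation linked above is the authoritative version):

#define _GNU_SOURCE /* for memmem() */
#include <link.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct scan_arg { const void *needle; size_t len; bool found; };

/* Callback invoked by dl_iterate_phdr() for every loaded object. */
static int phdr_cb(struct dl_phdr_info *info, size_t size, void *data) {
    (void)size;
    struct scan_arg *arg = data;
    for (size_t i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD || !(ph->p_flags & PF_R))
            continue; /* only readable, loaded segments */
        const uint8_t *seg = (const uint8_t *)(info->dlpi_addr + ph->p_vaddr);
        if (memmem(seg, ph->p_memsz, arg->needle, arg->len) != NULL) {
            arg->found = true;
            return 1; /* non-zero return stops the iteration */
        }
    }
    return 0;
}

/* Does this comparison operand occur anywhere in the loaded program image? */
static bool value_in_image(const void *val, size_t len) {
    struct scan_arg arg = { .needle = val, .len = len, .found = false };
    dl_iterate_phdr(phdr_cb, &arg);
    return arg.found;
}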


This allowed honggfuzz to improve the results of at least two benchmarks: libpng, where checks for new PNG tags are implemented as 4-byte asm comparisons, and lcms, where many interesting tags are grouped in non-const arrays of 32-bit values and searched for sequentially.



Switching to the 8bit-inline instrumentation (from PC-guard callbacks):


Previously honggfuzz didn’t use the concept of counting invocations of PC-PC edges: if a given edge was hit more than once, that didn’t change anything with regard to the fuzzing logic. Following libFuzzer’s lead, I implemented support for 8-bit inline counters (by using -fsanitize=fuzzer-no-link in place of -fsanitize-coverage=trace-pc-guard), and made it the default instrumentation mode in hfuzz-clang.


This seems to help some benchmarks perform a bit better. E.g. the zlib one is now able to produce a better corpus (w/o 8-bit inline counters vs with them; note: it’s not about the raw values here, but about the fact that honggfuzz can produce corpora with better maximal coverage).


Unfortunately this change is also a double-edged sword: while 8-bit inline counters can be faster for smaller targets, and for targets with longer per-exec times, once the number of edges grows to ~10-20k and the per-iteration exec time is small, the number of execs/sec drops significantly below what’s offered by the PC-guard method. That’s because a fairly large memory bitmap must be analyzed and cleared after every fuzzing iteration, and this eats lots of CPU cycles. This became a problem with e.g. the curl benchmark, which instruments pretty much every library used there (openssl, zlib), totalling ~80k edges to analyze. With 8-bit inline counters honggfuzz is able to perform only around 20k iterations per second vs >60k iterations when PC-guards are used.
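To see why this is expensive, here is a rough sketch (not honggfuzz’s actual code) of the per-iteration work: the whole counter region has to be walked word by word, even though almost all of it is zero:

#include <stddef.h>
#include <stdint.h>

/* Sketch: post-iteration scan of an 8-bit-counter region, 64 bits at a time.
 * Assumes the region is 8-byte aligned and its size a multiple of 8. With
 * ~80k edges this runs after every single fuzzing iteration. */
static void scan_and_clear(uint8_t *counters, size_t n_edges,
                           void (*on_hit)(size_t edge, uint8_t cnt)) {
    uint64_t *words = (uint64_t *)counters;
    for (size_t w = 0; w < n_edges / 8; w++) {
        if (words[w] == 0)
            continue; /* fast path: whole 8-byte chunk untouched */
        for (size_t b = 0; b < 8; b++) {
            size_t edge = w * 8 + b;
            if (counters[edge])
                on_hit(edge, counters[edge]);
        }
        words[w] = 0; /* clear while the cache line is still hot */
    }
}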


AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. Unfortunately, a 70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.



EOPart 2.


--
Robert Święcki

Jonathan Metzman

Apr 8, 2020, 4:46:47 PM4/8/20
to Robert Święcki, fuzzing-discuss, Kostya Serebryany
On Wed, Apr 8, 2020 at 10:34 AM Robert Święcki <rob...@swiecki.net> wrote:


> Switching to the 8bit-inline instrumentation (from PC-guard callbacks):
>
> Previously honggfuzz didn’t use the concept of counting invocations of PC-PC edges: if a given edge was hit more than once, that didn’t change anything with regard to the fuzzing logic. Following libFuzzer’s lead, I implemented support for 8-bit inline counters (by using -fsanitize=fuzzer-no-link in place of -fsanitize-coverage=trace-pc-guard), and made it the default instrumentation mode in hfuzz-clang.
>
> This seems to help some benchmarks perform a bit better. E.g. the zlib one is now able to produce a better corpus (w/o 8-bit inline counters vs with them; note: it’s not about the raw values here, but about the fact that honggfuzz can produce corpora with better maximal coverage).
>
> Unfortunately this change is also a double-edged sword: while 8-bit inline counters can be faster for smaller targets, and for targets with longer per-exec times, once the number of edges grows to ~10-20k and the per-iteration exec time is small, the number of execs/sec drops significantly below what’s offered by the PC-guard method. That’s because a fairly large memory bitmap must be analyzed and cleared after every fuzzing iteration, and this eats lots of CPU cycles. This became a problem with e.g. the curl benchmark, which instruments pretty much every library used there (openssl, zlib), totalling ~80k edges to analyze. With 8-bit inline counters honggfuzz is able to perform only around 20k iterations per second vs >60k iterations when PC-guards are used.
>
> AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. Unfortunately, a 70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.




Very interesting! 
We've been aware of this problem in libFuzzer for a while now.
We've seen libFuzzer for certain large targets spend a disproportionate amount of time iterating over the coverage array.
I guess the fact that AFL's coverage map has a fixed size is what makes it better here.
Nice to have some data on which method is better.

+Kostya Serebryany who looked into this problem (I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution).

Robert Święcki

Apr 8, 2020, 5:14:42 PM4/8/20
to Jonathan Metzman, fuzzing-discuss, Kostya Serebryany
Hi,
 

>> Unfortunately this change is also a double-edged sword: while 8-bit inline counters can be faster for smaller targets, and for targets with longer per-exec times, once the number of edges grows to ~10-20k and the per-iteration exec time is small, the number of execs/sec drops significantly below what’s offered by the PC-guard method. That’s because a fairly large memory bitmap must be analyzed and cleared after every fuzzing iteration, and this eats lots of CPU cycles. This became a problem with e.g. the curl benchmark, which instruments pretty much every library used there (openssl, zlib), totalling ~80k edges to analyze. With 8-bit inline counters honggfuzz is able to perform only around 20k iterations per second vs >60k iterations when PC-guards are used.
>>
>> AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. Unfortunately, a 70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.
>
> Very interesting!
> We've been aware of this problem in libFuzzer for a while now.
> We've seen libFuzzer for certain large targets spend a disproportionate amount of time iterating over the coverage array.
> I guess the fact that AFL's coverage map has a fixed size is what makes it better here.
> Nice to have some data on which method is better.
>
> +Kostya Serebryany who looked into this problem (I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution).

At the risk of saying something you might have already analyzed: I’ve spent some time thinking about optimizations, and the following came to mind:

1. Ask an x86 asm magician to implement memchr() (but with the semantics of looking for non-zero bytes rather than for a specific byte value) using SSE4 or even AVX(2). The inline bitmap (a bytemap, actually) is pretty much empty, so speeding up the search for the first non-zero byte could help here. For example, with curl’s 80k edges (bytes), only a few hundred/thousand are set on average. Glibc already implements memchr() and friends with such optimizations. (See the sketch after point 2.)

2. I wonder if we could ask the OS to zero the coverage bitmap memory for us using some hack, like doing munmap()/mmap() over it, or maybe some other clever trick. I guess at the OS/HW level it’s possible to zero memory pages very quickly? (But maybe not, I’m not sure.) Though this probably won’t help much, because clearing bytes while iterating over the bitmap is faster than a memset() after a full bitmap scan, due to memory caching effects.
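For point 1, a sketch of what such a routine might look like with plain SSE2, 16 bytes per step (an AVX2 version would widen this to 32). It assumes a 16-byte-aligned buffer whose length is a multiple of 16:

#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Sketch: return the offset of the first non-zero byte in 'buf', or 'len'
 * if all bytes are zero. Assumes 16-byte alignment and len % 16 == 0. */
static size_t first_nonzero(const unsigned char *buf, size_t len) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        /* per-byte compare against zero; bit set in 'mask' => byte was zero */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0xFFFF) /* at least one byte differs from zero */
            return i + (size_t)__builtin_ctz(~mask & 0xFFFF);
    }
    return len;
}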

--
Robert Święcki

Andrea Fioraldi

Apr 9, 2020, 3:59:22 AM4/9/20
to fuzzing-discuss
> AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. Unfortunately, a 70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.


Yes, that's why we don't use trace-pc-guard but a custom LLVM pass with inlined instrumentation for everything. SanCov is still supported as an optional compile flag, but we discourage its use because it's detrimental to performance, and many of the new AFL++ features in LLVM mode (like the code whitelist) are not compatible with it.


> I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution


I was doing something similar for QEMU mode, in order to remove collisions and support a variable-sized map. My idea is to assign IDs based on the distance from the entry point. This is easy in QEMU because I assign an ID to an edge at runtime, when two blocks are linked for the first time (i.e. when a new edge is discovered), and so the progress of the fuzzer itself defines the distance between the edge and the entry point. For LLVM, a proxy could be the number of hops in the CFG, I guess.


Returning to the variable-sized map: each time an edge is executed, I increase the corresponding map entry and bitwise-OR the edge ID into a shared integer, size. At the end of the execution, the fuzzer iterates over the map only up to size (or the next power of two of size).

In this way, all the invalid inputs generated by the fuzzer will yield a small size, and so a low overhead when iterating over the coverage map.
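If I read the idea correctly, a minimal sketch could look like this (the names and the power-of-two rounding are my own interpretation, not AFL++ code):

#include <stdint.h>

#define MAP_SIZE (1 << 18)
static uint8_t cov_map[MAP_SIZE];
static uint32_t size_acc; /* bit-OR of all edge IDs hit in this execution */

/* Runtime side: bump the counter and OR the ID into the shared integer. */
static inline void hit_edge(uint32_t id) {
    cov_map[id]++;
    size_acc |= id;
}

/* Post-execution side: scan only up to the next power of two of size_acc;
 * OR-ing guarantees size_acc >= the largest ID that was hit. */
static void post_exec_scan(void (*on_hit)(uint32_t id, uint8_t cnt)) {
    uint32_t limit = size_acc ? 2u << (31 - __builtin_clz(size_acc)) : 1;
    if (limit > MAP_SIZE)
        limit = MAP_SIZE;
    for (uint32_t i = 0; i < limit; i++) {
        if (cov_map[i]) {
            on_hit(i, cov_map[i]);
            cov_map[i] = 0;
        }
    }
    size_acc = 0;
}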

Kostya Serebryany

Apr 9, 2020, 3:33:35 PM4/9/20
to Robert Święcki, Jonathan Metzman, fuzzing-discuss
On Wed, Apr 8, 2020 at 2:14 PM Robert Święcki <rob...@swiecki.net> wrote:
Hi,
 

>>> Unfortunately this change is also a double-edged sword: while 8-bit inline counters can be faster for smaller targets, and for targets with longer per-exec times, once the number of edges grows to ~10-20k and the per-iteration exec time is small, the number of execs/sec drops significantly below what’s offered by the PC-guard method. That’s because a fairly large memory bitmap must be analyzed and cleared after every fuzzing iteration, and this eats lots of CPU cycles. This became a problem with e.g. the curl benchmark, which instruments pretty much every library used there (openssl, zlib), totalling ~80k edges to analyze. With 8-bit inline counters honggfuzz is able to perform only around 20k iterations per second vs >60k iterations when PC-guards are used.
>>>
>>> AFL and its descendants seem to implement counters on top of the PC-guard method (it’s possible to implement this in honggfuzz as well), and I think that’s what I’ll use there in the future. Unfortunately, a 70% drop in execs/sec in some cases seems too costly to depend on 8-bit inline counters going forward, even if they’re simpler to use.
>>
>> Very interesting!
>> We've been aware of this problem in libFuzzer for a while now.
>> We've seen libFuzzer for certain large targets spend a disproportionate amount of time iterating over the coverage array.
>> I guess the fact that AFL's coverage map has a fixed size is what makes it better here.
>> Nice to have some data on which method is better.
>>
>> +Kostya Serebryany who looked into this problem (I think Kostya tried using page faults to determine which parts of the coverage array were written to on each execution).

Yea, I am not particularly happy with the current instrumentation either. 
inline-8bit-counters are very cheap during execution, but for large binaries and short executions the cost of post-processing is indeed too high. 
Any callback-based instrumentation suffers from the extra cost of the callbacks. 
More complicated custom inline instrumentation is harder to maintain. 
Perhaps, at some point, we need to experiment with callback-based instrumentation combined with LTO-based late inlining. 

However, let me explain my current choice: inline-8bit-counters are 
pretty good for small targets, and for relatively long-running targets. 
They're miserable for huge targets that exit very quickly. 
But we should discourage users from using such targets (I know, it's a poor excuse). 

Note also that libFuzzer prints the currently achieved coverage (as symbolized PCs) as a form of user-friendly logging, 
and if we just use a compressed map (w/o a precise counter => PC mapping) we'll lose this ability. 


> At the risk of saying something you might have already analyzed: I’ve spent some time thinking about optimizations, and the following came to mind:
>
> 1. Ask an x86 asm magician to implement memchr() (but with the semantics of looking for non-zero bytes rather than for a specific byte value) using SSE4 or even AVX(2). The inline bitmap (a bytemap, actually) is pretty much empty, so speeding up the search for the first non-zero byte could help here. For example, with curl’s 80k edges (bytes), only a few hundred/thousand are set on average. Glibc already implements memchr() and friends with such optimizations.

I've tried a couple of times and the results weren't much better than the current code that reads 64 bits at a time. 
Cache is the bottleneck, not the instructions retired. 
 

> 2. I wonder if we could ask the OS to zero the coverage bitmap memory for us using some hack, like doing munmap()/mmap() over it, or maybe some other clever trick. I guess at the OS/HW level it’s possible to zero memory pages very quickly? (But maybe not, I’m not sure.) Though this probably won’t help much, because clearing bytes while iterating over the bitmap is faster than a memset() after a full bitmap scan, due to memory caching effects.

Same here: I did a few experiments that didn't show improvements. 
It is quite possible that I am missing something. 

Jonathan mentions my experiment with page faults, etc. 
It did speed up several HUGE benchmarks by ~10%, but I still think it's not worth doing. 
I'll probably delete this code next time I venture into libFuzzer. 
(Also: it's probably obvious anyway: I have not been working on libFuzzer for quite a while now; I still hope to get back sometime soon) 
 
--kcc 

--
Robert Święcki

Robert Święcki

Apr 11, 2020, 10:28:59 AM4/11/20
to Kostya Serebryany, Jonathan Metzman, fuzzing-discuss
Hi,

In the end, I changed the default instrumentation mode in honggfuzz from 8-bit counters back to PC-guards. With the introduction of an auxiliary guard map (needed in multi-threaded setups), it’s now possible to implement counters on top of PC-guards, and the effects are pretty much identical to 8-bit counters, but faster.

As an example, for the curl benchmark (~80k edges): with 8-bit counters, 1 thread, on an i9-9900KS, I was getting ~18k iterations/s on average. With PC-guards and simulated counters it reaches 43k iterations/s. Without counters at all, it’s ~65k iterations/s.
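For reference, a minimal sketch of the counters-on-PC-guards idea (not honggfuzz’s actual implementation, which additionally uses the auxiliary guard map mentioned above; the callback names and signatures are the real sanitizer-coverage API). The init callback assigns each guard a stable edge ID, and the per-edge callback just bumps one byte, so only edges actually executed are ever touched:

#include <stdint.h>

#define MAP_SIZE (1 << 17)
static uint8_t counter_map[MAP_SIZE];

/* Called by the instrumented module at startup for each range of guards;
 * give every edge a stable non-zero ID. */
void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint32_t next_id = 1;
    for (uint32_t *g = start; g < stop; g++)
        if (*g == 0)
            *g = next_id++;
}

/* Called on every edge execution: a single byte increment per hit. */
void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    counter_map[*guard % MAP_SIZE]++;
}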

>> At the risk of saying something you might have already analyzed: I’ve spent some time thinking about optimizations, and the following came to mind:
>>
>> 1. Ask an x86 asm magician to implement memchr() (but with the semantics of looking for non-zero bytes rather than for a specific byte value) using SSE4 or even AVX(2). The inline bitmap (a bytemap, actually) is pretty much empty, so speeding up the search for the first non-zero byte could help here. For example, with curl’s 80k edges (bytes), only a few hundred/thousand are set on average. Glibc already implements memchr() and friends with such optimizations.

> I've tried a couple of times and the results weren't much better than the current code that reads 64 bits at a time. 
> Cache is the bottleneck, not the instructions retired. 
 

>> 2. I wonder if we could ask the OS to zero the coverage bitmap memory for us using some hack, like doing munmap()/mmap() over it, or maybe some other clever trick. I guess at the OS/HW level it’s possible to zero memory pages very quickly? (But maybe not, I’m not sure.) Though this probably won’t help much, because clearing bytes while iterating over the bitmap is faster than a memset() after a full bitmap scan, due to memory caching effects.

> Same here: I did a few experiments that didn't show improvements. 
> It is quite possible that I am missing something. 

Yup, my tests confirm that. Two observations, in case someone ever tries to repeat those experiments:

1) It might be faster to clear memory with fallocate(fd, FALLOC_FL_PUNCH_HOLE...) than with memset(). This requires that the memory block be mapped from a file descriptor, but as that can be a shm/memfd_create file descriptor, it should be as fast as regular memory. fallocate() (or even { ftruncate(fd, 0); ftruncate(fd, size); }) retires the backing pages and maps new ones in their place (possibly pages already zeroed at the HW level, hence fast), and this can be faster than having the CPU fill the pages with zeroes iteratively. (A sketch follows after point 2.)
 
2) Using HugeTLB seems, for some reason, slower under Linux than regular 4 kB pages when using the fallocate() trick to clear memory. Not sure why; technically it should be faster, as for, say, a 10 MB memory block it should replace just 5 huge pages instead of ~2500 4 kB ones, but that doesn’t seem to be the case here.
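A minimal sketch of the trick from point 1 (error handling omitted; memfd_create() needs Linux and _GNU_SOURCE):

#define _GNU_SOURCE
#include <fcntl.h>    /* fallocate(), FALLOC_FL_* */
#include <stdint.h>
#include <sys/mman.h> /* memfd_create(), mmap() */
#include <unistd.h>

/* Back the coverage map with a memfd, so it lives on tmpfs. */
static uint8_t *covmap_create(int *fd_out, size_t size) {
    int fd = memfd_create("covmap", 0);
    ftruncate(fd, (off_t)size);
    *fd_out = fd;
    return (uint8_t *)mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Zero it by punching a hole: the kernel drops the dirty pages, and any
 * later access faults in fresh zero pages, instead of memset()ing. */
static void covmap_zero(int fd, size_t size) {
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, (off_t)size);
}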
 
> Jonathan mentions my experiment with page faults, etc. 
> It did speed up several HUGE benchmarks by ~10%, but I still think it's not worth doing. 
> I'll probably delete this code next time I venture into libFuzzer. 
> (Also: it's probably obvious anyway: I have not been working on libFuzzer for quite a while now; I still hope to get back sometime soon) 

This was an interesting idea, and as I couldn’t find the code in the llvm repo, I’ve written something of my own. It uses page faults very indirectly, but to similar effect (via lseek(SEEK_DATA)). It’s obsolete now, but I’m attaching the code for posterity:

$ ./page-read-write `expr $RANDOM '*' $RANDOM`
Writing byte at offset 248888601
Memory written at page offset 248885248

$ ./page-read-write `expr $RANDOM '*' $RANDOM`
Writing byte at offset 20189351
Memory written at page offset 20189184
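The attachment itself isn’t reproduced in this archive, so here is my own minimal sketch of the same idea (hypothetical, error handling omitted; it assumes the filesystem backing the memfd supports SEEK_DATA, which tmpfs does on modern Linux): write one byte into a sparse memfd-backed mapping, then let lseek(SEEK_DATA) reveal which page actually became populated:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const size_t size = 1024UL * 1024 * 1024; /* 1 GiB of sparse memory */
    long off = (argc > 1) ? atol(argv[1]) : 0;
    int fd = memfd_create("probe", 0);
    ftruncate(fd, (off_t)size);
    uint8_t *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    printf("Writing byte at offset %ld\n", off);
    mem[(size_t)off % size] = 0xff; /* dirties exactly one page */

    /* SEEK_DATA skips holes: the first allocated extent is the dirty page */
    off_t page = lseek(fd, 0, SEEK_DATA);
    printf("Memory written at page offset %ld\n", (long)page);
    return 0;
}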

--
Robert Święcki
page-read-write.c