speeding up instrumentation

Michal Zalewski

Feb 15, 2015, 6:42:56 PM
to afl-users
The instrumentation injected by AFL isn't free, particularly for
computationally-heavy targets (Spidermonkey, V8, etc). You can always
work around this with AFL_INST_RATIO, but it's a so-so compromise.

So, here's a challenge: try to think of ways to make afl-as.h faster!

My usual benchmarks include gzip with a 10 MB file from /dev/urandom
and djpeg with a 3 MB JPEG file. These are essentially the worst cases
for the instrumentation. You want to benchmark like so:

time ./afl-showmap -o foo ./gzip -c <big_file >/dev/null

For the record, here's what I played with:

1) Having the trampoline code read a pointer from memory and do a
register call to it, hitting the SHM setup code (default), __afl_log
without the "maybe" part (set if SHM setup succeeds), or a no-op
return routine (if SHM setup fails).

This speeds up non-instrumented runs very significantly, but seems to
have essentially no impact on instrumented runs.

2) Mapping SHM to a fixed address, allowing us to insert __afl_log
without the address-fetch part when instrumenting in-function
conditional branches. This gives us a fairly modest performance boost
(~10%), probably not enough to justify the risk of compatibility
problems with programs that map something of their own into the same
region (possibly due to ASLR).

3) Modifying afl-as.c to scan the entire file to see if a particular
label appears in a context other than non-conditional jmp / call. This
is easy, but as it turns out, 90%+ of the labels we instrument are
associated with conditional code / jump tables, so the gains are very
modest (~5%) at the expense of making the code more complicated.

4) "Textbook" instruction reordering; pipelining and out-of-order
execution seems good enough that, say, moving memory operations few
instructions apart makes no difference.

Here are some other ideas I can think of:

1) Can we *not* touch eax / rax? It's by far the most commonly-used
register, and so, saving, modifying, and restoring it is a common
source of pipeline dependencies. Unfortunately, lahf / sahf
unconditionally use %ah. Can we replace them with a comparably-fast
combo of instructions that would preserve key flags?

2) Alternatively, can we implement the arithmetic without touching
the flags? The __afl_area_ptr check can be done away with via a fixed
pointer, but the xor / shr / incb stuff is obviously tricky (there's a
C sketch of what it needs to compute right after this list).

3) Can we use fewer registers without sacrificing performance on other fronts?

4) Instead of saving registers, can we ask gcc / clang to leave some
registers for the instrumentation and not rely on them, so that we
don't have to push / pop them (-ffixed-reg)?

5) Alternatively, without writing a compiler plugin (unless you're so
inclined!), any simple and robust way for afl-as.c to figure out which
registers the code is actually using right now?
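
For reference, the logic the shim has to implement at every
instrumentation point boils down to the following C sketch (the real thing
is the hand-written assembly in afl-as.h, and __afl_prev_loc here stands
for whatever register or memory slot ends up holding the previous
location):

extern unsigned char *__afl_area_ptr;   /* SHM bitmap */
extern unsigned short __afl_prev_loc;   /* previous location ID */

void shim(unsigned short cur_loc) {     /* cur_loc = random compile-time ID */

  __afl_area_ptr[cur_loc ^ __afl_prev_loc]++;   /* the xor + incb */
  __afl_prev_loc = cur_loc >> 1;                /* the shr */

}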

There's a lot of comments in afl-as.h and afl-as.c to help navigate the code.

Parker Thompson

Feb 15, 2015, 6:52:13 PM
to afl-...@googlegroups.com

Michal, and all.

This is not related to instrumentation speed-ups, but to single-testcase runtimes:

I was kinda curious how much of the run time (for certain applications) is spent writing the testcase file out to disk, and whether it would be possible to hijack open/read calls (via LD_PRELOAD) to read from a second SHM region instead of from disk. That way the I/O never needs to touch the disk.

It would be complex to track file descriptors, but it should be possible by keeping a table somewhere in afl-fuzz.

Thoughts?

-Parker


Michal Zalewski

Feb 15, 2015, 7:04:46 PM
to afl-users
> I was kinda curious how much of the run time (for certain applications)
> writing out the testcase file to disk is, and if it would be possible to
> hijack open/read calls (LD_PRELOAD) to be reading from a second SHM instead
> of disk. That way all of the I/O never needs to touch the disk.

You can always do -f /dev/shm/foo.txt on many Linux distros, or set up
a "proper" ramdisk. I could never see a huge difference on physical
Linux boxes (presumably due to heavy caching), but some folks reported
more profound gains on MacOS X and the likes.

Doing it over real SHM would probably be a bit faster by virtue of
doing fewer context switches, but there's some serious complexity in
getting it right (you essentially need to mock out almost all file
I/O, including auxiliary stuff like *stat(), fcntl(), flock(), mmap(),
lseek(), etc).
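
If somebody wants to take a stab at it, the interposer itself isn't rocket
science - something along these lines for open(), with read(), lseek(),
close(), fstat() and friends wrapped the same way (untested sketch;
afl_input_path and afl_input_fd are made-up names that would need to be
wired up to afl-fuzz somehow):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

extern const char *afl_input_path;  /* path afl-fuzz pretends to write to */
extern int afl_input_fd;            /* fd backed by SHM, e.g. via shm_open() */

int open(const char *path, int flags, ...) {

  static int (*real_open)(const char *, int, ...);
  if (!real_open) real_open = dlsym(RTLD_NEXT, "open");

  /* Hand out the SHM-backed descriptor instead of touching the disk. */
  if (afl_input_path && !strcmp(path, afl_input_path))
    return dup(afl_input_fd);

  mode_t mode = 0;

  if (flags & O_CREAT) {
    va_list ap;
    va_start(ap, flags);
    mode = va_arg(ap, mode_t);
    va_end(ap);
  }

  return real_open(path, flags, mode);

}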

/mz

Jann Horn

Feb 15, 2015, 7:57:25 PM
to afl-...@googlegroups.com
On Sun, Feb 15, 2015 at 03:42:35PM -0800, Michal Zalewski wrote:
> 1) Can we *not* touch eax / rax? It's by far the most commonly-used
> register, and so, saving, modifying, and restoring it is a common
> source of pipeline dependencies. Unfortunately, lahf / sahf
> unconditionally use %ah. Can we replace them with a comparably-fast
> combo of instructions that would preserve key flags?
>
> 2) Alternatively, can we implement the arithmetic without touching
> the flags? The __afl_area_ptr check can be done away with via a fixed
> pointer, but the xor / shr / incb stuff is obviously tricky.
>
> 3) Can we use fewer registers without sacrificing performance on other fronts?

What afl needs isn't really source^dest, it's the transitions, right? In
other words, what if you instrument conditional jumps instead of basic
blocks and give each one two offsets in the bitmap, one for "jump taken"
and another for "jump not taken"? If you combine that with mapping the
bitmap to a fixed memory position, as far as I can tell, the
instrumentation could use INC with a memory operand; the only remaining
trouble would be saving the flags.

I'm imagining replacing "jne original_target" with something like this:

...
jne jump_taken_helper
; inc counter for "jump not taken" here
...

; outside of any methods:
jump_taken_helper:
; inc counter for "jump taken" here
jmp original_target


Avoiding the flags might be possible using the LEA instruction - load
the counter into a register, then treat the counter as a pointer and
use LEA to compute an address relative to it (offset 1), then write the
resulting "address" back into the counter. That would still require one
register to be saved though, I think. Something like this (in intel
syntax)?

; no need to decrement rax if we just save one value to another memory
; location
mov [saved_rax], rax
mov al, [counter] ; load the counter
; add one. the upper bits of rax are garbage here
lea rax, [rax + 1]
mov [counter], al
mov rax, [saved_rax]

I haven't tested that yet though, and I could imagine that the
processor doesn't appreciate the trickery and slows down.

(I've thought about this stuff a bit because I also want to take a shot
at implementing binary instrumentation - but that's going to take some
time, I have a lot to do at the moment.)

Michal Zalewski

Feb 15, 2015, 8:19:39 PM
to afl-users
> What afl needs isn't really source^dest, it's the transitions, right?

Kinda, but it's tricky. For example, the compiler will often output
jump tables for larger switch or multi-part if statements. That can be
dealt with if you detect and tweak the jump tables.

It gets harder when dealing with function pointers that are passed
around in data structures (not uncommon in C, extremely common in
C++); these have "many-to-one" control flow patterns. You could
conceivably fix this by using the old style of instrumentation for
function prologues (which are executed fairly infrequently), and the
jump-helper trick for conditional jumps within the body of the
function.

Worth playing with...

> (I've thought about this stuff a bit because I also want to take a shot
> at implementing binary instrumentation - but that's going to take some
> time, I have a lot to do at the moment.)

Why?:-) What's the benefit over QEMU?

/mz

Jann Horn

Feb 15, 2015, 10:41:03 PM
to afl-...@googlegroups.com
On Sun, Feb 15, 2015 at 05:19:17PM -0800, Michal Zalewski wrote:
> > What afl needs isn't really source^dest, it's the transitions, right?
>
> Kinda, but it's tricky. For example, the compiler will often output
> jump tables for larger switch or multi-part if statements. That can be
> dealt with if you detect and tweak the jump tables.
>
> It gets harder when dealing with function pointers that are passed
> around in data structures (not uncommon in C, extremely common in
> C++); these have "many-to-one" control flow patterns. You could
> conceivably fix this by using the old style of instrumentation for
> function prologues (which are executed fairly infrequently), and the
> jump-helper trick for conditional jumps within the body of the
> function.
>
> Worth playing with...

I was thinking of replacing code like "jmp rax" with something
similar to the current style, where the jump target (the value of rax
in this case) is XORed with a random value specific to the
instrumentation point, the truncated result is taken as an index into
the array, and the actual jump or call is then performed with the same
register afterwards.


> > (I've thought about this stuff a bit because I also want to take a shot
> > at implementing binary instrumentation - but that's going to take some
> > time, I have a lot to do at the moment.)
>
> Why?:-) What's the benefit over QEMU?

No big benefit and a lot of disadvantages, I guess - but it'd be fun to
see whether it's possible to get the same speed as with source
instrumentation. Unless I've missed something, QEMU is still slower
than source instrumentation by a small linear factor, right?

My idea is to create a modified copy of the code of the target in a
new memory location so that every conditional or non-fixed jump/call
is instrumented. When a new branch is taken on a calculated jump,
that would cause the jump target to be rewritten and the control flow
to be redirected into the rewritten copy.
You could probably even make it so that code from certain libraries
doesn't get instrumented, by making the original of the instrumented
code non-executable and using a SIGSEGV handler to catch callbacks
from second-order library functions (although the performance of that
would probably suck).

As far as I know, the only rewriting necessary apart from the
instrumentation and fixing up call/jump targets would be RIP-relative
addressing. In 32-bit mode, the way it is done is pretty ugly and
probably hard to rewrite automatically, but with the RIP-relative
addressing mode for instructions with a mod r/m byte in 64-bit code
it should be possible to easily fix up offsets.

I think it should be possible to get close to the speed of source
instrumentation with this. But yeah, I'm not sure whether it's worth
the effort. :D

Michal Zalewski

Feb 15, 2015, 11:08:59 PM
to afl-users
>> > (I've thought about this stuff a bit because I also want to take a shot
>> > at implementing binary instrumentation - but that's going to take some
>> > time, I have a lot to do at the moment.)
>>
>> Why?:-) What's the benefit over QEMU?
>
> No big benefit and a lot of disadvantages, I guess - but it'd be fun to
> see whether it's possible to get the same speed as with source
> instrumentation. Unless I've missed something, QEMU is still slower
> than source instrumentation by a small linear factor, right?

It's quite a bit slower, yup. "Making it faster" sounds like an OK goal :-)

/mz

Sami Liedes

Feb 16, 2015, 4:58:23 AM
to afl-users
On Sun, Feb 15, 2015 at 03:42:35PM -0800, Michal Zalewski wrote:
> The instrumentation injected by AFL isn't free, particularly for
> computationally-heavy targets (Spidermonkey, V8, etc). You can always
> work around this with AFL_INST_RATIO, but it's a so-so compromise.
>
> So, here's a challenge: try to think of ways to make afl-as.h faster!

I should really try out this idea before thinking out loud, but I
don't have time for that now. (Sorry.) Also, my assembly-fu is too
rusty to know if this should be fast :-) Perhaps some useful ideas can
still be distilled from this.

> 2) Alternatively, can we implement the arithmetic without touching
> the flags? The __afl_area_ptr check can be done away with via a fixed
> pointer, but the xor / shr / incb stuff is obviously tricky.

How about this: In setup, allocate a buffer of some size; easiest
would be 64k, but bigger sizes might make sense. At each
instrumentation point, only append the code location into this buffer.
Only if the buffer becomes full (I think it should be possible to use
jcxz here so you don't need to rely on flags if your buffer is
properly aligned), call a function that processes the buffer into the
bitmap. Do all the xor/shr/incb stuff in this function. This way, you
should mostly have predictable branches, and if I'm not mistaken - I
might well be! - it should be possible to avoid touching flags in the
fast path.
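
The slow-path function could then be plain C, something like this (sketch
only, assuming 16-bit location IDs like the current scheme uses):

extern unsigned char *__afl_area_ptr;

/* Drain the per-run location buffer into the bitmap; all of the
   xor / shr / increment work happens here, off the fast path. */
static void afl_flush_buffer(unsigned short *buf, unsigned long count) {

  static unsigned short prev_loc;
  unsigned long i;

  for (i = 0; i < count; i++) {
    unsigned short cur_loc = buf[i];
    __afl_area_ptr[cur_loc ^ prev_loc]++;
    prev_loc = cur_loc >> 1;
  }

}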

Sami

Michal Zalewski

Feb 16, 2015, 10:22:49 AM
to afl-users
> How about this: In setup, allocate a buffer of some size; easiest
> would be 64k, but bigger sizes might make sense. At each
> instrumentation point, only append the code location into this buffer.
> Only if the buffer becomes full (I think it should be possible to use
> jcxz here so you don't need to rely on flags if your buffer is
> properly aligned), call a function that processes the buffer into the
> bitmap. Do all the xor/shr/incb stuff in this function.

Interesting. And you could use lea to increment buffer offset without
touching flags. Easy enough to do a mock version and try it out!

/mz

Michal Zalewski

Feb 16, 2015, 10:47:36 AM
to afl-users
> Interesting. And you could use lea to increment buffer offset without
> touching flags. Easy enough to do a mock version and try it out!

Oddly enough, doesn't seem to be a huge gain:

$ time ~/devel/afl/afl-showmap -o foo ./gzip -c -9 <bigfile >/dev/null
user 0m2.513s

$ time ~/devel/afl/afl-showmap -o foo ./gzip_old_method -c -9 <bigfile >/dev/null
user 0m2.602s

I suspect this is due to doing more memory access, which likely undoes
the gains.

Here's the mock version that doesn't touch flags or eax (eax also
commented out in trampoline_fmt, edx used to pass around ID):

"__afl_maybe_log:\n"
" movl __afl_area_ptr, %ecx\n"
" jecxz __afl_setup\n" // <- this is also modified to return in ecx

"__afl_store:\n"
" movl %ecx, %edi\n" // edi = area_ptr; edx = cur_id
" movl (%edi), %ecx\n" // ecx = area offset
" leal 4(%ecx), %ecx\n" // offset += 4
" jcxz __afl_rewind\n" // if overflow...
"__afl_rewound:\n"
" movl %ecx, (%edi)\n" // update offset in memory
" movw %dx, (%edi, %ecx, 1)\n" // store cur_loc; not exactly right, but good enough for testing perf

"__afl_return:\n"
" ret\n"

"__afl_rewind:\n"
" leal -65532(%ecx), %ecx\n" // go back, staying clear of the offset ptr (4 first bytes)
" jmp __afl_rewound\n"

Sami Liedes

Feb 16, 2015, 11:26:16 AM
to afl-users
On Mon, Feb 16, 2015 at 07:47:15AM -0800, Michal Zalewski wrote:
> > Interesting. And you could use lea to increment buffer offset without
> > touching flags. Easy enough to do a mock version and try it out!
>
> Oddly enough, doesn't seem to be a huge gain:

Have you inspected (with something like perf or even vtune, the latter
probably being the more intuitive & informative of the two...) where
the perf is going? Mispredicted branches, cache misses, something
else?

Sami

Michal Zalewski

Feb 16, 2015, 12:21:05 PM
to afl-users
> Have you inspected (with something like perf or even vtune, the latter
> probably being the more intuitive & informative of the two...) where
> the perf is going? Mispredicted branches, cache misses, something
> else?

I haven't; I don't have $$$ for vtune. But it's just a couple of lines
of assembly... we almost certainly spend most of the time waiting for
the cache. We have some stack ops which probably go through L1, and
the bitmap likely fits into L2.

My second best guess would be the pipeline cost of register
dependencies. The branches we take are ~100% predictable, so that part
shouldn't matter much.

/mz

svfu...@gmail.com

Feb 16, 2015, 2:00:16 PM
to afl-...@googlegroups.com

I could gain a speedup of a few cycles with a little re-ordering:

diff -du afl-as_orig.h afl-as.h
--- afl-as_orig.h    2015-02-16 09:39:32.802106544 -0800
+++ afl-as.h    2015-02-16 10:42:25.074170941 -0800
@@ -125,18 +125,10 @@
   "\n"
   "/* --- AFL TRAMPOLINE (64-BIT) --- */\n"
   "\n"
-  ".align 4\n"
-  "\n"
-  "leaq -(128+24)(%%rsp), %%rsp\n"
-  "movq %%rdx,  0(%%rsp)\n"
-  "movq %%rcx,  8(%%rsp)\n"
-  "movq %%rax, 16(%%rsp)\n"
+  "leaq -(128)(%%rsp), %%rsp\n"
+  "push %%rcx\n"
   "movq $0x%08x, %%rcx\n"
   "call __afl_maybe_log\n"
-  "movq 16(%%rsp), %%rax\n"
-  "movq  8(%%rsp), %%rcx\n"
-  "movq  0(%%rsp), %%rdx\n"
-  "leaq (128+24)(%%rsp), %%rsp\n"
   "\n"
   "/* --- END --- */\n"
   "\n";
@@ -380,12 +372,14 @@
   ".align 8\n"
   "\n"
   "__afl_maybe_log:\n"
+  "  push %rax\n"
   "\n"
 #ifdef  __OpenBSD__
   "  .byte 0x9f /* lahf */\n"
 #else
   "  lahf\n"
 #endif /* ^__OpenBSD__ */
+  "  push %rdx\n"
   "  seto  %al\n"
   "\n"
   "  /* Check if SHM region is already mapped. */\n"
@@ -412,13 +406,16 @@
   "\n"
   "__afl_return:\n"
   "\n"
+  "  movq 24(%rsp), %rcx\n"
   "  addb $127, %al\n"
+  "  pop %rdx\n"
 #ifdef  __OpenBSD__
   "  .byte 0x9e /* sahf */\n"
 #else
   "  sahf\n"
 #endif /* ^__OpenBSD__ */
-  "  ret\n"
+  "  pop %rax\n"
+  "  ret $128+8\n"
   "\n"
   ".align 8\n"
   "\n"

(I didn't bother looking at the 32-bit case... but I guess similar tricks might work there.)

The gain is larger on bigger programs because the smaller trampoline means a smaller I-cache hit.  The use of push/pop instead of movs to and from the stack seems to be better in this case.  The "slow" ret $n instruction is offset by the removal of the very large lea instruction in the trampoline.

What would really speed things up is a way to delay the start of the fork-server until needed.  I'm using a hack where I toggle a weak global variable inspected by the afl setup routine.  It'd be nice if there were a more official way to do this.

Other possible improvements:
1) Use the return-address pointer on the stack instead of the explicit random number in the hash.
2) Use some other hash instead of xor(old/2, new).  How about keeping a hash like new_hash = hash(old_hash / 256, new_value).  This would keep some more entropy describing the "path" taken to get to this point, rather than only the previous single label.  This would provide more information to the fuzzer about what is going on with control flow, perhaps.
3) How about instrumenting cmp instructions?
  Replace 64 bit, 32 bit and 16-bit cmps with a series of 8-bit cmp instructions, followed by the "real" n-bit cmp.  This would provide a way for the fuzzer to solve the difficult case
   cmp %rax, $password
   je somewhere
by noticing correct fragments a byte at a time.  Otherwise, it has to get all 64 bits right before anything happens.
You might want to do similar tricks for instructions used by the compiler for bit-tricks.  sbb and adc, for example, where you'd add an extra instrumented jc or jnc beforehand.

Michal Zalewski

Feb 16, 2015, 2:51:30 PM
to afl-users
> The gain is larger on bigger programs because the smaller trampoline means a
> smaller I-cache hit. The use of push/pop instead of movs to and from the
> stack seems to be better in this case. The "slow" ret $n instruction is
> offset by the removal of the very large lea instruction in the trampoline.

That's interesting! I tried the same optimization on 32-bit before,
and it was markedly slower; in fact, older versions were doing push /
pop before switching over to lea. So perhaps we just need to approach
64-bit and 32-bit differently, but I'm surprised by the discrepancy.

> What would really speed things up is a way to delay the start of the fork-server until needed. I'm using a hack where I toggle a weak global variable inspected by the afl setup routine. It'd be nice if there were a more official way to do this.

Sure. I couldn't think of anything robust / simple enough without
refactoring the forkserver code or requiring other, more substantial
changes to how binaries are linked - but ideas welcome :-)

(Ideally, it would be cool to be able to specify something like
AFL_SKIP=nnnn to skip nnnn initial instrumentation points; but some
compile-time "start here" shim may also work, with less flexibility.)

> 1) Use the return-address pointer on the stack instead of the explicit
> random number in the hash.

This is probably a no-go unless we explicitly disable ASLR (which
would be platform-dependent and messy). Otherwise, if addresses differ
across runs, things will go south.

We could rely just on the less-significant bits of the address, but
these are often aligned (especially around jumps, so 3-4 bits of that
will show heavy bias).

> 2) Use some other hash instead of xor(old/2, new). How about keeping a hash
> like new_hash = hash(old_hash / 256, new_value). This would keep some more
> entropy describing the "path" taken to get to this point, rather than only
> the previous single label.

I played with that, essentially doing >> 8 (I think it came up in one
of the earlier threads). It didn't seem to have a particularly
meaningful impact on the ultimate edge coverage with my benchmark
targets, while producing a denser bitmap. Anyway, it's a very simple
tweak if you want to play with it :-)

> 3) How about instrumenting cmp instructions?
> Replace 64 bit, 32 bit and 16-bit cmps with a series of 8-bit cmp
> instructions, followed by the "real" n-bit cmp. This would provide a way
> for the fuzzer to solve the difficult case
> cmp %rax, $password
> je somewhere
> by noticing correct fragments a byte at a time. Otherwise, it has to get
> all 64 bits right before anything happens.

There's something equivalent in experimental/, replacing memcmp with
an unrolled loop. It works, sorta. I also did a similar trick
expanding rep / loop statements on assembly level (not unrolling them,
just turning them into an instrumented loop).

But part of the problem is that counterintuitively, the !strcmp(foo,
"secret!") is actually a pretty rare pattern in programs that parse
human-readable languages. Many optimized parsers instead use something
resembling char-by-char binary search, because it's more efficient
than a long series of disjoint strcmp calls. There is a hand-written
parser designed like that in libxml2; many auto-generated parsers look
roughly the same.

A more common pattern is magic values in binary formats, which are a
challenge if the value is 32 bits long. That certainly gets in the way
of afl-fuzz inventing PNG files, because the likelihood of randomly
generating section headers is low. But this is hard to solve. I guess
you could generate a dictionary by grepping for immediate values in
testl / cmpl, which will get you the most obvious cases.
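
Completely untested, but since afl-as.c already walks the generated .s
file line by line, the extraction could be as dumb as the helper below
(hypothetical; dictionary escaping, -x plumbing, 64-bit values and so on
left as an exercise):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pull large immediates out of cmp / test lines and dump them as raw
   little-endian bytes in the -x dictionary format. */
static void maybe_extract_imm(const char *line, FILE *dict) {

  const char *imm = strstr(line, "$0x");
  unsigned int val;

  if (!imm) return;
  if (!strstr(line, "cmp") && !strstr(line, "test")) return;

  val = (unsigned int)strtoul(imm + 1, NULL, 0);
  if (val <= 0xffff) return;  /* short values are easy to hit by chance anyway */

  fprintf(dict, "auto_imm=\"\\x%02x\\x%02x\\x%02x\\x%02x\"\n",
          val & 0xff, (val >> 8) & 0xff, (val >> 16) & 0xff, (val >> 24) & 0xff);

}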

Anyway, the unroll trick can help every now and then, but I've
actually come to believe that it's less useful than it seems.
Dictionaries pieced together by grepping the binary or so are probably
the best we can do for now.

/mz

Sami Liedes

Feb 16, 2015, 5:11:47 PM
to afl-users
Ok, here's some perf measurements on bzip2, non-instrumented,
instrumented and instrumented & running under afl-showmap using
unmodified afl-fuzz 1.46b. Non-instrumented execution time is roughly
4.5 seconds. From three runs of perf stat:

Non-instr Instr showmap
cycles 1.00 3.95 4.38
ref-cycles 1.00 3.97 4.42
cpu-clock 1.00 3.96 4.43
bus-cycles 1.00 3.97 4.42
instructions 1.00 4.24 4.56
cache-references 1.00 1.03 1.07
cache-misses 1.00 1.50 2.07
branches 1.00 5.12 4.10
branch-misses 1.00 0.98 1.04
idle-cycles-frontend 1.00 2.55 3.73
idle-cycles-backend 1.00 1.74 2.41
L1-dcache-loads 1.00 5.31 7.41
L1-dcache-load-misses 1.00 1.15 1.18
L1-dcache-stores 1.00 4.74 7.54
L1-dcache-store-misses 1.00 1.17 1.14
L1-dcache-prefetch-misses 1.00 1.27 1.29
L1-icache-load-misses 1.00 3.77 5.41
LLC-loads 1.00 1.06 1.19
LLC-stores 1.00 1.13 1.07
LLC-prefetches 1.00 1.09 1.28
dTLB-loads 1.00 5.46 7.69
dTLB-load-misses 1.00 0.97 1.10
dTLB-stores 1.00 4.85 7.73
dTLB-store-misses 1.00 0.74 0.82
iTLB-loads 1.00 1.93 4.07
iTLB-load-misses 1.00 1.63 2.23
branch-loads 1.00 5.17 4.15
branch-load-misses 1.00 0.99 1.05

And some calculations:

idle-cycles-frontend/cycles 41.5% 28.6% 35.4%
idle-cycles-backend/cycles 28.6% 12.6% 15.7%

----

The instrumented binary causes (not surprisingly) quite a bit more L1
instruction cache misses, but I don't know how much that matters. The
bigger memory bandwidth usage also shows. However, if I interpret the
numbers correctly, ultimately the instrumented binaries execute more
instructions per cycle than the non-instrumented ones, although of
course most of those instructions are instrumentation. Turns out those
few instructions add up; you are executing 4x as many instructions as
the non-instrumented binary.

I don't know how much the data dependencies matter in the big picture
(they might); some analysis anyway: Under showmap, 25% of the
front-end stalls happen in __afl_store (mainly the incb (%rdx,%rcx,1)),
10% in __afl_return (add 127,%al) and 5% in __afl_maybe_log (seto %al,
then lahf). In one tight loop, the mov 0x10(%rsp),%rax after callq
__afl_maybe_log causes a frontend stall that amounts to roughly 4% of
global frontend stalls.

Sami

ronno...@gmail.com

Feb 16, 2015, 5:15:10 PM
to afl-...@googlegroups.com


>> What would really speed things up is a way to delay the start of the fork-server until needed.  I'm using a hack where I toggle a weak global variable inspected by the afl setup routine.  It'd be nice if there were a more official way to do this.

> Sure. I couldn't think of anything robust / simple enough without
> refactoring the forkserver code or requiring other, more substantial
> changes to how binaries are linked - but ideas welcome :-)
>
> (Ideally, it would be cool to be able to specify something like
> AFL_SKIP=nnnn to skip nnnn initial instrumentation points; but some
> compile-time "start here" shim may also work, with less flexibility.)

Yeah, I'm using a hacked-up "start here" method.  If the weak global isn't linked in, then the afl code can notice that, and ignore it.  If the weak variable is linked in, then its state can control when afl initializes the fork server.  It's a little tricky because you need to make sure afl gets called by doing a traced branch when you want to initialize.  I think an explicit function call that sets the variable and turns everything on might be the way to go.  (Otherwise you can enter read() before afl has forked, and then all the copies use the same input.)
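
Roughly, the target-side half of what I'm doing looks like this (names made
up; the matching check lives in a patched version of the afl setup code):

/* Define this in the fuzzed program. The setup code takes a weak
   reference to it: if the symbol is absent, behave as today; if it is
   present, don't start the fork server until the flag goes non-zero. */
volatile unsigned char __afl_start_flag;

void start_fuzzing_here(void) {

  __afl_start_flag = 1;   /* the next instrumented branch starts the fork server */

}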

>> 1) Use the return-address pointer on the stack instead of the explicit
>> random number in the hash.
>
> This is probably a no-go unless we explicitly disable ASLR (which
> would be platform-dependent and messy). Otherwise, if addresses differ
> across runs, things will go south.

Actually, having differing runs have differing collisions in the hash table might be a good thing.  It certainly reduces the effect of false collisions in a parallel fuzz-test.  The problem here is that it seems to be slightly slower than the explicit random number method because converting the 64-bit address into an n-bit hash isn't done at compile time.

> We could rely just on the less-significant bits of the address, but
> these are often aligned (especially around jumps, so 3-4 bits of that
> will show heavy bias).

Yes, this method probably wants to use all the address bits in the hash.  (With the corresponding slow-down of doing so.)
 
>> 2) Use some other hash instead of xor(old/2, new).  How about keeping a hash
>> like new_hash = hash(old_hash / 256, new_value).  This would keep some more
>> entropy describing the "path" taken to get to this point, rather than only
>> the previous single label.
>
> I played with that, essentially doing >> 8 (I think it came up in one
> of the earlier threads). It didn't seem to have a particularly
> meaningful impact on the ultimate edge coverage with my benchmark
> targets, while producing a denser bitmap. Anyway, it's a very simple
> tweak if you want to play with it :-)

It's a little more subtle than that.  Instead of storing the previous location, you store a hash of the previous set of locations.  This hash is updated with the new location in such a way that statistically older locations make less and less difference to the result.  The easiest way of doing that is to only use ~1 byte of stored hash, making sure to mix that entropy well in creating the new hash.

i.e.

table_entry = old_hash ^ new_location
trace_table[table_entry]++   (or trace_table[table_entry] |= 1)
new_hash = (table_entry * large_odd_integer) >> 56

might be one possibility.  (You can save on the shift by using a byte-sized load from memory to retrieve old_hash).  What this does is make more than two locations determine the trace entries, however not so many that the table gets full of garbage.  This probably alters where the "interesting" cut-off values are for the trace counts since low values and collisions are more likely.

The code I'm fuzzing is actually using the wide compares for fast parsing of text.  (No memcmp/strcmp, or byte-by-byte checks.)   I know, because I wrote it. :-P  Also, the comparisons aren't with constant values, because there is quite a bit of hashing and masking going on, and everything interesting is stored within a pre-computed table.

Anyway, in more detail, the idea is to replace
"cmp foo, bar"
"je / jle" etc.
with

Adjust the stack out of the red zone
save rax, rdx on the stack
mov foo, rax
mov bar, rdx
save rax and rdx again.
shr 56, rax
shr 56  rdx
cmp al, dl
jne done
<afl label + call to update trace here>
get rax, rdx again from the stack
shr 48, rax
shr 48, rdx
cmp al,dl
jne done
<afl label etc.>
...

shr 24,rax
shr 24,rdx
cmp al, dl
jne done
<afl label etc.>
get rax,rdx
cmp ah,dh
jne done
<afl label etc.>
done:
get rax, rdx
cmp rax, rdx
restore rax and rdx to original values
restore stack.

je/jl etc. from original code.

This might be a little large... so a function call that somehow does all this at once is probably better.  Notice how we can check lazily, high order bytes on down.  We also don't have to check the lowest byte since that will implicitly be tested in the conditional jump from the original code.  Also note how each test is a test for equality/inequality.  This works even when the original jump is something different (say for greater-than or less-than) because boundary values tend to be the interesting ones.

This trick will probably let the fuzzer find 32-bit constants fairly easily for PNG fuzzing.

Michal Zalewski

Feb 16, 2015, 5:30:46 PM
to afl-users
Thanks!

> The instrumented binary causes (not surprisingly) quite a bit more L1
> instruction cache misses, but I don't know how much that matters. The
> bigger memory bandwidth usage also shows. However, if I interpret the
> numbers correctly, ultimately the instrumented binaries execute more
> instructions per cycle than the non-instrumented ones, although of
> course most of those instructions are instrumentation. Turns out those
> few instructions add up; you are executing 4x as many instructions as
> the non-instrumented binary.

Yup. It's probably worth noting that large file compression /
decompression is essentially the worst case: a tight loop with a bunch
of conditionals being executed over and over again, with relatively
little code in between. In more practical settings, most of the
overhead should be somewhere else (well, fork and context switching,
probably), up until you get to very heavy-duty apps (Spidermonkey &
co).

> (they might); some analysis anyway: Under showmap, 25% of the
> front-end stalls happen in __afl_store (mainly the incb (%rdx,%rcx,1)),

Some time ago, I tried moving the calculation of rcx (tuple offset)
higher up, but ultimately, there are only so many instructions in the
shim. I thought that switching rdx (SHM base address) to a fixed
offset would help, but it helps only very slightly (~5%), and it's
always risky to use a fixed address because somebody else might have
the same brilliant idea in the fuzzed code =)

/mz

Michal Zalewski

Feb 16, 2015, 5:44:51 PM
to afl-users
>> > 1) Use the return-address pointer on the stack instead of the explicit
>> > random number in the hash.
>> This is probably a no-go unless we explicitly disable ASLR (which
>> would be platform-dependent and messy). Otherwise, if addresses differ
>> across runs, things will go south.
>
> Actually, having differing runs have differing collisions in the hash table
> might be a good thing.

Sure, but you don't want the same locations giving you different
tuples within a single job, otherwise, it becomes impossible to tell
if the mutation produced any changes in the behavior of the target
binary (well, within the scope of what afl-fuzz tries to do).

It's less of a problem for afl-fuzz, since the fork() clones will have
stable mappings, but it becomes an issue if you need to disable the
forkserver (there are several cases where this is necessary) or with
tools such as afl-tmin or afl-cmin (neither of which use the
forkserver approach).

>> I played with that, essentially doing >> 8 (I think it came up in one
>> of the earlier threads). It didn't seem to have a particularly
>> meaningful impact on the ultimate edge coverage with my benchmark
>> targets, while producing a denser bitmap. Anyway, it's a very simple
>> tweak if you want to play with it :-)
>
> It's a little more subtle than that. Instead of storing the previous
> location, you store a hash of the previous set of locations.

Sorry, what I meant to say is, I experimented with this:

shm_region[cur_loc ^ prev_loc]++;
prev_loc = cur_loc ^ (prev_loc >> 4);

This, in effect, ensures that the write location is a function of
four locations, not just two. So, you preserve more info about control
flow. But it did not seem to make a noteworthy difference compared to
simple edge cov.

> This trick will probably let the fuzzer find 32-bit constants fairly easily
> for PNG fuzzing.

So would extracting immediate values from testl / testq / cmpl / cmpq
to build a dictionary at compile time, right?

/mz

svfu...@gmail.com

Feb 16, 2015, 8:20:21 PM
to afl-...@googlegroups.com


On Monday, February 16, 2015 at 2:44:51 PM UTC-8, Michal Zalewski wrote:
>> > 1) Use the return-address pointer on the stack instead of the explicit
>> > random number in the hash.
>> This is probably a no-go unless we explicitly disable ASLR (which
>> would be platform-dependent and messy). Otherwise, if addresses differ
>> across runs, things will go south.
>
> Actually, having differing runs have differing collisions in the hash table
> might be a good thing.

> Sure, but you don't want the same locations giving you different
> tuples within a single job, otherwise, it becomes impossible to tell
> if the mutation produced any changes in the behavior of the target
> binary (well, within the scope of what afl-fuzz tries to do).
>
> It's less of a problem for afl-fuzz, since the fork() clones will have
> stable mappings, but it becomes an issue if you need to disable the
> forkserver (there are several cases where this is necessary) or with
> tools such as afl-tmin or afl-cmin (neither of which use the
> forkserver approach).

I guess you are right.  I was only considering afl-fuzz, not the other cases.
 

>> I played with that, essentially doing >> 8 (I think it came up in one
>> of the earlier threads). It didn't seem to have a particularly
>> meaningful impact on the ultimate edge coverage with my benchmark
>> targets, while producing a denser bitmap. Anyway, it's a very simple
>> tweak if you want to play with it :-)
>
> It's a little more subtle than that.  Instead of storing the previous
> location, you store a hash of the previous set of locations.

> Sorry, what I meant to say is, I experimented with this:
>
> shm_region[cur_loc ^ prev_loc]++;
> prev_loc = cur_loc ^ (prev_loc >> 4);
>
> This, in effect, ensures that the write location is a function of
> four locations, not just two. So, you preserve more info about control
> flow. But it did not seem to make a noteworthy difference compared to
> simple edge cov.

Yep, it probably depends on the program being instrumented.  Using tuple pairs is the simplest, and probably the fastest way to go.


> This trick will probably let the fuzzer find 32-bit constants fairly easily
> for PNG fuzzing.

> So would extracting immediate values from testl / testq / cmpl / cmpq
> to build a dictionary at compile time, right?
>
> /mz

Using the values directly probably wouldn't work.  They can be quite "munged" from what the input is.  However, a user could fill a dictionary with whatever the code-words are beforehand, saving afl from the effort of finding them.  It just would be nice if afl could automagically get them itself. :-)

As an aside... it seems to be possible to do this without touching the flags at all:
 
mov %rdx, __afl_save_rdx(%rip)
movzbl 1+__afl_old_hash(%rip),%edx
mov %rax, __afl_save_rax(%rip)
lea random+__afl_area(%rip), %rax
lea (%rax, %rdx), %rdx
movb (%rdx), %al
lea 1(%eax),%eax
movb %al, (%rdx)
mov %edx, __afl_old_hash(%rip)
mov __afl_save_rax(%rip), %rax
mov __afl_save_rdx(%rip), %rdx


This hack assumes you can remap the static __afl_area's in each object file to all point to the same underlying SHM mapping.  With mremap, this doesn't seem to be difficult.  You'd just need to use a linker-set to get all the memory ranges from the object files in order to find what to remap.

The problem here is that lea isn't quite as nice as xor.  You'd need to also use the circular buffer trick of mapping the page-beyond-the-top to the bottom page.  This prevents the 255-byte overflow from being a problem.

For initialization, you can either do it at __attribute__(constructor) or __attribute__(ifunc) time.  I think ifunc is earlier.  This saves the decision about whether or not afl is enabled from every hook invocation.
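
For the constructor route, something like this (sketch; I'm quoting the
environment variable name from memory, so double-check it against afl's
config.h):

#include <stdlib.h>
#include <sys/shm.h>

unsigned char *__afl_area_ptr;

__attribute__((constructor))
static void __afl_map_shm_early(void) {

  char *id_str = getenv("__AFL_SHM_ID");   /* assumed env variable name */
  void *map;

  if (!id_str) return;                     /* not running under afl */

  map = shmat(atoi(id_str), NULL, 0);
  if (map != (void *)-1) __afl_area_ptr = map;

}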

Of course... since the resulting instructions are much larger and more interdependent, the code may end up being slower.  It also becomes quite Linux-centric.  This may not be the right way to go. :-(

Steven

Peter Gutmann

Feb 16, 2015, 8:27:32 PM
to afl-...@googlegroups.com
Michal Zalewski <lca...@gmail.com> writes:

>I don't have $$$ for vtune.

You can download a 30-day free trial copy from Intel's web site. OTOH I
didn't find it that useful (probably because I'm not focusing on a few
critical-instruction hot spots); the Visual Studio profiler was more useful.

Peter.

Sami Liedes

Feb 17, 2015, 3:19:09 AM
to afl-users
On Mon, Feb 16, 2015 at 02:44:28PM -0800, Michal Zalewski wrote:
> > This trick will probably let the fuzzer find 32-bit constants fairly easily
> > for PNG fuzzing.
>
> So would extracting immediate values from testl / testq / cmpl / cmpq
> to build a dictionary at compile time, right?

I think the proposed method would also help with

unsigned long x = 0;
scanf("%lu", &x);
if (x == SOME_CONSTANT) ...

and even in cases where x is compared against a non-constant computed
somehow. Statically extracting values cannot help there.

Sami

peter....@gmail.com

Nov 22, 2016, 5:10:37 PM
to afl-users
Have you tried this with non-temporal stores? The technique is described here: https://www.cs.virginia.edu/kim/docs/upton09improving.pdf
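
For the buffer-append variant discussed earlier in the thread, the store
could look something like this (untested sketch; buffer allocation and the
flush path omitted, names invented):

#include <emmintrin.h>   /* SSE2 _mm_stream_si32 */

extern int  *__afl_trace_buf;   /* hypothetical append-only trace buffer */
extern long  __afl_trace_idx;

static inline void trace_append(int loc) {

  /* Non-temporal store: the trace entry bypasses the cache hierarchy, so
     logging doesn't evict the target's own working set. */
  _mm_stream_si32(&__afl_trace_buf[__afl_trace_idx++], loc);

}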