afl qemu mode

Tim Newsham

Feb 8, 2016, 12:26:53 AM
to afl-users
The AFL qemu patches use the following to generate a "random" value for each basic block:

  cur_loc  = (cur_loc >> 4) ^ (cur_loc << 8);
  cur_loc &= MAP_SIZE - 1;

this is quick but dirty.  The speed matters here since this runs every time the basic block is executed, but I am a bit skeptical of the quality of (x>>4)^(x<<8) as a hash function.
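For context, the hashed value feeds AFL's shared-memory edge map; in simplified form the per-execution logging path in the patch is roughly the following (details such as the instrumentation-ratio check omitted):

static unsigned char *afl_area_ptr;   /* shared-memory coverage map, MAP_SIZE bytes */
static target_ulong   prev_loc;       /* hashed location of the previous block */

static inline void afl_maybe_log(target_ulong cur_loc) {

  if (!afl_area_ptr) return;          /* not running under afl */

  cur_loc  = (cur_loc >> 4) ^ (cur_loc << 8);
  cur_loc &= MAP_SIZE - 1;

  afl_area_ptr[cur_loc ^ prev_loc]++; /* bump the (prev, cur) edge counter */
  prev_loc = cur_loc >> 1;            /* shift so A->B and B->A map differently */
}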

I think a better approach here might be to use a (deterministic) random function such as SHA1(cur_loc) & (MAP_SIZE - 1).  This would be much more expensive, but instead of executing it every time, the value could be generated once into the JIT-emitted code, amortizing the cost over however many times the basic block is executed per compilation.  Something like:

        tcg_gen_movi_tl(aflCurLoc_reg, hashAddr(pc));
        gen_helper_afl_maybe_log(aflCurLoc_reg);

where hashAddr(target_ulong addr) returns SHA1(addr) & (MAP_SIZE - 1), and gen_helper_afl_maybe_log generates a call to the afl helper in C.
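Roughly what I have in mind for hashAddr(), as a sketch assuming OpenSSL's SHA1() (the exact digest and library hardly matter since this only runs at translation time):

#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

/* Hypothetical hashAddr(): hash the block address once, at code-gen time,
   keeping only enough bits to index the coverage map. */
static uint32_t hashAddr(target_ulong addr) {
  unsigned char digest[SHA_DIGEST_LENGTH];
  uint32_t h;

  SHA1((const unsigned char *)&addr, sizeof(addr), digest);
  memcpy(&h, digest, sizeof(h));      /* take the first four digest bytes */

  return h & (MAP_SIZE - 1);
}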

Tim

Peter Gutmann

Feb 8, 2016, 3:27:09 AM
to afl-...@googlegroups.com
Tim Newsham <tim.n...@gmail.com> writes:

>I think a better approach here might be to use a (deterministic) random
>function such as SHA1(cur_loc) & (MAP_SIZE - 1). This would be much more
>expensive, but instead of executing this every time, it could be generated
>once into the JIT-emitted code, amortizing the cost over how many times the
>basic block is executed per compilation.

You'd get the same results as SHA-1 much faster using any kind of hash
function that provides good avalanche characteristics (avalanche = mixing of
bits, as a simplified explanation). SHA-1 is a cryptographic hash function
that provides properties like preimage resistance, which is pretty unnecessary
here. What I would do is use something like the finaliser function from
MurmurHash:

h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;

or for 64-bit systems:

h ^= h >> 33;
h *= 0xff51afd7ed558ccd;
h ^= h >> 33;
h *= 0xc4ceb9fe1a85ec53;
h ^= h >> 33;

That's almost as fast as the original but gets much better avalanche.
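Wrapped up as functions, those are the MurmurHash3 finalisers (fmix32/fmix64):

#include <stdint.h>

/* MurmurHash3 32-bit finaliser */
static inline uint32_t fmix32(uint32_t h) {
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}

/* MurmurHash3 64-bit finaliser */
static inline uint64_t fmix64(uint64_t h) {
  h ^= h >> 33;
  h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53ULL;
  h ^= h >> 33;
  return h;
}

In the AFL patch that would be something like cur_loc = fmix32(cur_loc) & (MAP_SIZE - 1);.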

Peter.

Andrew Griffiths

Feb 8, 2016, 12:53:35 PM
to afl-users
Hello,

Question for my own education - would your suggestion change based on the following observations:

1) A lot of architectures require 4-byte instruction alignment (so a basic block will often start at an address where % 4 == 0), though it might be worth using those bits when compiling for an arch that doesn't have that requirement.
2) Compilers often place functions at paragraph alignment (% 16 == 0), and a function entry is the start of a new basic block.
3) (Assuming a 32-bit address space) the top byte is often fixed (for a static binary, probably the top 10-12 bits are fixed).

(There are some other requirements that may not be immediately obvious - e.g. predictability between the child and parent process, which, from memory, rules out using random().)

Based on those - is there a more optimal avalanche/hash function in this scenario?

Changing the hash function to something better but slower is acceptable - all the heavy lifting is disassembly / reassembly work, so it wouldn't be noticeable at all :-)


Michal Zalewski

Feb 8, 2016, 1:32:50 PM
to afl-users
Yo all,

> Based on those - is there a more optimal avalanche/hash function in this
> scenario?

The original "algorithm" is basically meant to be as fast as possible,
because we end up executing that code a lot, and literally, every CPU
cycle counts. We don't need any crypto properties or even a real
avalanche effect - the only thing that would be bad is if we end
up with far more collisions than we would with a stronger hash.

Replacing it with something that costs comparably few CPU cycles is
fine, but I would be wary of adding much more unless we see clear
evidence that the current approach is badly broken.

Tim, are you seeing any evidence of that? I was doing some
benchmarking with targets such as libpng and wasn't seeing huge
differences between QEMU and native maps.

/mz

Tim Newsham

Feb 8, 2016, 5:39:18 PM
to afl-users, pgu...@cs.auckland.ac.nz
On Sunday, February 7, 2016 at 10:27:09 PM UTC-10, Peter Gutmann wrote:

You'd get the same results as SHA-1 much faster using any kind of hash
function that provides good avalanche characteristics (avalanche = mixing of
bits, as a simplified explanation).  SHA-1 is a cryptographic hash function


Fair enough.. but I think if you only perform the hash when code-genning a
basic block, the performance of the hash function will hardly matter.
 

That's almost as fast as the original but gets much better avalanche.


Thank you for the Murmur hash info..
 


Tim Newsham

Feb 8, 2016, 5:48:06 PM
to afl-users
Tim, are you seeing any evidence of that? I was doing some
benchmarking with targets such as libpng and wasn't seeing huge
differences between QEMU and native maps.

No. Honestly, this might be a bit of premature optimization. As it turns out, I had implemented my own qemu edge-coverage support, doing the SHA hash in the code gen, and only later came back and looked at what AFL's qemu support does. (I'm planning to switch from SHA to MurmurHash.)

For perf reasons I would think doing it in the code generator, rather than recomputing the hash on each basic block execution, would be a big win.. but I haven't measured. I.e. each basic block gets two instructions added at the top: "mov immediate somereg, #hashconstant" and "call helper afl_maybe_log".
 

Peter Gutmann

Feb 9, 2016, 9:37:45 PM
to afl-...@googlegroups.com
afl-...@googlegroups.com <afl-...@googlegroups.com> writes:

>Based on those - is there a more optimal avalanche/hash function in this
>scenario?

No. You either get good avalanche or you don't: the definition is that a
single changed input bit should change 50% of the output bits, so it doesn't
matter what the input is.
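If you want to check that empirically, a quick sketch of an avalanche test (using the fmix32 finaliser from my earlier message) is to flip each input bit over a sample of values and see what fraction of the output bits change on average; a good mixer comes out close to 50%:

#include <stdint.h>
#include <stdio.h>

static uint32_t fmix32(uint32_t h) {
  h ^= h >> 16; h *= 0x85ebca6b;
  h ^= h >> 13; h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}

int main(void) {
  uint64_t changed = 0, total = 0;

  for (uint32_t x = 1; x < 100000; x += 7) {          /* arbitrary input sample */
    for (int bit = 0; bit < 32; bit++) {
      uint32_t diff = fmix32(x) ^ fmix32(x ^ (1u << bit));
      changed += __builtin_popcount(diff);            /* output bits that flipped */
      total   += 32;
    }
  }
  printf("average output bits changed: %.1f%%\n", 100.0 * changed / total);
  return 0;
}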

Replying to mz's comment, given that fmix() from Murmurhash is going to be,
at most, a few cycles slower than the current approach, it can't hurt to use
it.

The longer reply is that you'd need to benchmark things to see if it really is
an issue, but I'd just use fmix().

Peter.

Tim Newsham

Feb 11, 2016, 1:46:14 AM
to afl-users
Attached is an implementation of what I had in mind.  It's based heavily on the original AFL qemu patch and patches the same version of qemu.  It splits the log function into two halves -- generating a hash based on the address, done at code-generation time, and the logging, which happens during execution.  Unfortunately this requires patching the target-specific code (as far as I know) instead of the generic code, but the patch itself is portable and could be applied to each target without further porting.  I also took the opportunity to patch in the hash function that Peter Gutmann recommended (with the old hash function ifdef'd in).

Quick tests with afl-showmap and test-instr.c seem to indicate it's working, but this is not heavily tested yet.

Btw, this version of qemu is dated.. it wouldn't be hard to revamp this for the latest version of qemu if desired...

Tim
qemu-afl-patch.txt

Tim Newsham

Mar 12, 2016, 9:09:14 PM
to afl-users
On Wednesday, February 10, 2016 at 8:46:14 PM UTC-10, Tim Newsham wrote:
Attached is an implementation of what I had in mind. 

I finally got some time to do some measurements.  I'm using a slightly different version than the patch I provided above, but based on the same concept.
Summary: the new version runs about 25% faster and measures slightly more edges (indicating a bug in the original afl-qemu-trace!).

To recap the patch's concept -- instead of having qemu compute the hash each time a basic block is executed, the code generator emits code with the hash hardwired in as an immediate, which is then passed to the trace function.
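In toy form (illustrative names, not the actual patch code), the split is:

#include <stdint.h>

#define MAP_SIZE (1 << 16)

static uint8_t  afl_area[MAP_SIZE];
static uint32_t prev_loc;

static uint64_t fmix64(uint64_t h) {
  h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
  h ^= h >> 33;
  return h;
}

/* "Translation time": run once per basic block; the result is baked into
   the generated code as an immediate operand. */
static uint32_t block_hash(uint64_t pc) {
  return (uint32_t)(fmix64(pc) & (MAP_SIZE - 1));
}

/* "Execution time": the helper called from the generated code; this is the
   only per-execution cost. */
static void log_edge(uint32_t cur_loc) {
  afl_area[cur_loc ^ prev_loc]++;
  prev_loc = cur_loc >> 1;
}

In qemu terms, block_hash() is evaluated while the TB is being generated, and the emitted code just does the equivalent of "mov reg, #hash" followed by a call to the logging helper.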

I compared the edge coverage maps of my version to the original and found that my version was capturing more edges.  This surprised me at first, so I spent some time tracking down why -- it turns out that the stock afl/qemu version does not record an edge when a basic block branches back to itself.  I did not explore why, but I imagine it has to do with where the AFL patch was added into cpu-exec.c.  The new version captures these the same way as any other edge, since the logging happens in the JIT'd code.  This means the new version records a superset of the edges recorded by the stock afl/qemu version.

NOTE: This seems like a bug in the current afl-qemu-trace, and should probably be investigated further!

I measured the performance using "/usr/bin/lame /tmp/test.wav" [*].  I measured a baseline without qemu, then measured with "afl-qemu-trace" (which doesn't perform any actual tracing) and with "afl-showmap -Q" (which uses qemu and does perform tracing).  The results are:

             real    user   sys
base    -   0.406   0.401 0.004

not tracing:
orig    -  15.591  15.583 0.00      - 38x
gen     -  12.947  12.935 0.004     - 32x   20% faster

tracing:
orig    - 290.677 290.467 0.087     - 716x
gen     - 218.020 217.826 0.094     - 537x  25% faster


The original version of afl/qemu ran 716x slower than the baseline and the new version ran 537x slower than the baseline.  The new version ran 25% faster than the original while tracing and 20% faster than the original when not tracing.

Tim


[*] Actual commands:
  time /usr/bin/lame /tmp/test.wav
  time ./afl-qemu-trace /usr/bin/lame /tmp/test.wav
  time ./afl-showmap -o out1 -Q -- ./afl-qemu-trace /usr/bin/lame /tmp/test.wav

Tim Newsham

Mar 15, 2016, 3:21:32 PM
to afl-users

NOTE: This seems like a bug in the current afl-qemu-trace, and should probably be investigated further!

I investigated the afl-qemu-trace weirdness I saw earlier -- afl/qemu traces basic blocks before they are executed, and before it checks for various exceptional conditions.  If an exception occurs, the afl/qemu tracer will be invoked several times for the same basic block, leading to some spurious self-edges.

This can be fixed by adjusting the AFL_QEMU_CPU_SNIPPET2 macro and placing it in a better location: after exceptions have been dealt with and after the block has actually executed (since the block can be aborted and restarted in some situations):

----
#define AFL_QEMU_CPU_SNIPPET2(env, pc) do { \
    if(pc == afl_entry_point) { \
      afl_setup(); \
      afl_forkserver(env); \
    } \
    afl_maybe_log(pc); \
  } while (0)

----
diff --git a/afl/qemu_mode/qemu/cpu-exec.c b/afl/qemu_mode/qemu/cpu-exec.c
index 04142d7..96cfc9e 100644
--- a/afl/qemu_mode/qemu/cpu-exec.c
+++ b/afl/qemu_mode/qemu/cpu-exec.c
@@ -198,6 +198,7 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, ui
 #endif /* DEBUG_DISAS */
 
     cpu->can_do_io = 0;
+    target_ulong pc = env->eip;
     next_tb = tcg_qemu_tb_exec(env, tb_ptr);
     cpu->can_do_io = 1;
     trace_exec_tb_exit((void *) (next_tb & ~TB_EXIT_MASK),
@@ -216,7 +217,11 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, u
             assert(cc->set_pc);
             cc->set_pc(cpu, tb->pc);
         }
+    } else {
+        /* we executed it, trace it */
+        AFL_QEMU_CPU_SNIPPET2(env, pc);
     }
+
     if ((next_tb & TB_EXIT_MASK) == TB_EXIT_REQUESTED) {
         /* We were asked to stop executing TBs (probably a pending
          * interrupt. We've now stopped, so clear the flag.
 

Tim Newsham

Mar 16, 2016, 7:21:01 PM
to afl-users
On Tuesday, March 15, 2016 at 9:21:32 AM UTC-10, Tim Newsham wrote:
I investigated the afl-qemu-trace weirdness I saw earlier -- afl/qemu traces basic blocks before they are executed, and before it checks for various exceptional conditions.  If an exception occurs, the afl/qemu tracer will be invoked several times for the same basic block, leading to some spurious self-edges.


I found one other change that's required to get deterministic traces from qemu: this block should be commented out of cpu-exec.c.  It doesn't come up often, but every once in a while TB chaining causes slight differences in traces from run to run.  With this change and the previous one, I'm seeing the same edge traces from run to run.

#ifdef ALLOW_CHAIN  /* chaining complicates AFL logging */
                /* see if we can patch the calling TB. When the TB
                   spans two pages, we cannot safely do a direct
                   jump. */
                if (next_tb != 0 && tb->page_addr[1] == -1) {
                    tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
                                next_tb & TB_EXIT_MASK, tb);
                }
#endif


 

Tim Newsham

Mar 17, 2016, 1:57:49 AM
to afl-users, pgu...@cs.auckland.ac.nz
On Sunday, February 7, 2016 at 10:27:09 PM UTC-10, Peter Gutmann wrote:

or for 64-bit systems:

  h ^= h >> 33;
  h *= 0xff51afd7ed558ccd;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53;
  h ^= h >> 33; 

That's almost as fast as the original but gets much better avalanche.


I recorded an exact basic-block trace for some test program and extracted approx 50k unique edges from it (so a bit bigger than your average AFL run), then tried out both the original hash function and this improved hash function on the resulting edges.  (Note: using hash(src) ^ (hash(sink)>>1) as the edge hash function for both.)  The new hash function has considerably fewer collisions.  Here are the simulation results:

original hash function:
----
map size 15 32768
edge collisions 24581 of 50252 -- 48.9%
node collisions 6912 of 24195 -- 28.6%
map size 16 65536
edge collisions 15741 of 50252 -- 31.3%
node collisions 4553 of 24195 -- 18.8%
map size 17 131072
edge collisions 9552 of 50252 -- 19.0%
node collisions 2820 of 24195 -- 11.7%
map size 18 262144
edge collisions 6295 of 50252 -- 12.5%
node collisions 2135 of 24195 -- 8.8%
map size 19 524288
edge collisions 4538 of 50252 -- 9.0%
node collisions 1993 of 24195 -- 8.2%
map size 20 1048576
edge collisions 3781 of 50252 -- 7.5%
node collisions 1989 of 24195 -- 8.2%
map size 21 2097152
edge collisions 2136 of 50252 -- 4.3%
node collisions 945 of 24195 -- 3.9%
map size 22 4194304
edge collisions 1270 of 50252 -- 2.5%
node collisions 408 of 24195 -- 1.7%
map size 23 8388608
edge collisions 777 of 50252 -- 1.5%
node collisions 159 of 24195 -- 0.7%
map size 24 16777216
edge collisions 543 of 50252 -- 1.1%
node collisions 0 of 24195 -- 0.0%


improved hash function:
-----
map size 15 32768
edge collisions 24517 of 50252 -- 48.8%
node collisions 7027 of 24195 -- 29.0%
map size 16 65536
edge collisions 15155 of 50252 -- 30.2%
node collisions 3939 of 24195 -- 16.3%
map size 17 131072
edge collisions 8540 of 50252 -- 17.0%
node collisions 2092 of 24195 -- 8.6%
map size 18 262144
edge collisions 4549 of 50252 -- 9.1%
node collisions 1074 of 24195 -- 4.4%
map size 19 524288
edge collisions 2425 of 50252 -- 4.8%
node collisions 541 of 24195 -- 2.2%
map size 20 1048576
edge collisions 1201 of 50252 -- 2.4%
node collisions 286 of 24195 -- 1.2%
map size 21 2097152
edge collisions 606 of 50252 -- 1.2%
node collisions 143 of 24195 -- 0.6%
map size 22 4194304
edge collisions 301 of 50252 -- 0.6%
node collisions 80 of 24195 -- 0.3%
map size 23 8388608
edge collisions 170 of 50252 -- 0.3%
node collisions 38 of 24195 -- 0.2%
map size 24 16777216
edge collisions 60 of 50252 -- 0.1%
node collisions 18 of 24195 -- 0.1% 
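For reference, a minimal sketch of this kind of collision tally (assuming the unique node addresses and unique (src, sink) edge pairs have already been extracted from the trace):

#include <stdint.h>
#include <stdlib.h>

/* Swap in either mixer here: the original (x >> 4) ^ (x << 8) or fmix64. */
static uint64_t hash(uint64_t x) {
  x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x;
}

/* Edge key as described above: hash(src) ^ (hash(sink) >> 1). */
static uint64_t edge_key(uint64_t src, uint64_t sink) {
  return hash(src) ^ (hash(sink) >> 1);
}

/* Count how many of the n (already unique) keys land in an occupied slot
   of a map with 2^bits entries. */
static unsigned count_collisions(const uint64_t *keys, size_t n, unsigned bits) {
  size_t map_size = (size_t)1 << bits;
  uint8_t *seen = calloc(map_size, 1);
  unsigned collisions = 0;

  for (size_t i = 0; i < n; i++) {
    uint64_t slot = keys[i] & (map_size - 1);
    if (seen[slot]) collisions++;
    else            seen[slot] = 1;
  }

  free(seen);
  return collisions;
}

/* The tables above correspond to running this kind of count on the node
   hashes and on the edge keys for bits = 15 through 24. */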

