Fuzzing Binary-Only Code - perf limitations


Edward Williamson

Feb 1, 2015, 1:30:12 AM
to afl-...@googlegroups.com
Hi AFL users,

I was investigating the ability to use AFL on binary-only code over winter break, and I think some of my findings might be useful.

The good news is that the current QEMU-trace tool seems to outperform everything I attempted.

With that out of the way, I looked at writing a PIN tool, a DynamoRIO tool, and AFL integration with perf and Intel BTS (branch trace store). As it seems you've already found, PIN and DynamoRIO are prohibitively slow, so I won't spend much time on them.

In case no one else has looked at using perf and the Intel BTS mechanism, I found that there were a few issues with the approach:

1) There is no filtering, so the CPU records all branches and we have no way of restricting tracing to the module in which execution is taking place. One advantage of my DynamoRIO tool was a "blacklist" of prohibited modules (libc.so.6, libdynamorio.so, etc.), so we could get branch information from the binary and any interesting libraries it loaded with very little user interaction, which I'm aware is one of AFL's design goals. It also meant dynamic linking or PIE wasn't an issue in the way it would be with BTS. It's too bad DynamoRIO was too slow. (A sketch of the perf setup follows this list.)

2) perf would often fail to record all of the branches if it couldn't flush the branch cache quickly enough, even on relatively modern Intel chips (< 2 years old). It might be possible to build a realtime kernel to deal with this, but that sounds terrible. Someone improved the performance of this with some data structure changes in a patch to the Linux kernel in 2012 or so, but that would have to be ported to the current kernel and doesn't promise to fix the situation. (I don't have the link handy ATM.) And we want to avoid custom kernels for obvious reasons.

3) The overhead of writing the trace to disk kills the performance gain you'd hope for from collecting branch info in hardware; even when perf didn't drop tuples, it was very slow - less than 10 execs/second.
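
For reference, the setup looks roughly like this (a minimal sketch, not my actual harness; whether the kernel services the request via BTS or LBR depends on the hardware and the exact attributes, and either way there's no per-module filter):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_branch_tracing(pid_t child)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size               = sizeof(attr);
    attr.type               = PERF_TYPE_HARDWARE;
    attr.config             = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    attr.sample_period      = 1;        /* record every branch */
    attr.sample_type        = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY | PERF_SAMPLE_BRANCH_USER;
    attr.disabled           = 1;
    attr.exclude_kernel     = 1;        /* no finer-grained filter exists */

    /* perf_event_open(2) has no glibc wrapper. */
    return syscall(__NR_perf_event_open, &attr, child, -1, -1, 0);
}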

For these reasons I would urge anyone who is curious about hardware tracing to wait for something like Intel Processor Trace, which seems very exciting. It already has support in the Linux kernel, so we just have to wait for the first iteration of these processors to be released. (see https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing)

Congratulations on the QEMU implementation. It seems to really be the best answer for now, and it works quite well for me in some initial testing.

Regards,
Edward Williamson

Michal Zalewski

Feb 1, 2015, 3:15:26 AM
to afl-users
Hey,

> In case no one else has looked at using perf and the Intel BTS mechanism, I
> found that there were a few issues with the approach:

Thanks, this is useful!

I've been thinking about BTS, but the spotty user-space support on
Linux and essentially no support on other OSes seemed a bit
discouraging at this point.

> It's too bad DynamoRIO was too slow.

PIN may be ill-suited for our needs, but DynamoRIO may be more viable,
at least to get in the same ballpark as QEMU. I think Andrew played
with it a bit.

Cheers,
/mz

Edward Williamson

Feb 1, 2015, 11:25:25 AM
to afl-...@googlegroups.com
> DynamoRIO may be more viable

I agree. I found that DynamoRIO was much faster, clocking in at around
100 execs/second on small binaries. I wrote in the fork server code, but it
will need some tinkering to get forking working properly before we can tell
whether that yields a speedup.
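
For anyone else attempting this, the fork server protocol the client has to reproduce is small. Roughly (fd numbers per AFL's config.h; error handling trimmed):

#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define FORKSRV_FD 198            /* control pipe; FORKSRV_FD + 1 is status */

static void forkserver_loop(void)
{
    uint32_t msg = 0;

    /* Phase 1: a four-byte "hello" tells afl-fuzz we are alive. */
    if (write(FORKSRV_FD + 1, &msg, 4) != 4) exit(1);

    for (;;) {
        pid_t child;
        int status;

        /* Block until afl-fuzz requests a new run. */
        if (read(FORKSRV_FD, &msg, 4) != 4) exit(1);

        child = fork();
        if (child < 0) exit(1);

        if (!child) {
            /* Child: drop the server fds and resume the target. */
            close(FORKSRV_FD);
            close(FORKSRV_FD + 1);
            return;
        }

        /* Parent: report the child pid, then its exit status. */
        if (write(FORKSRV_FD + 1, &child, 4) != 4) exit(1);
        if (waitpid(child, &status, 0) < 0) exit(1);
        if (write(FORKSRV_FD + 1, &status, 4) != 4) exit(1);
    }
}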

It might be interesting to compare QEMU, DynamoRIO, and afl-instrumented
speeds more rigorously. DynamoRIO does have finer control over which
modules are traced compared to QEMU, so that may be advantageous in
some situations.

-Edward

Steven Vittitoe

Feb 1, 2015, 11:33:57 AM
to afl-...@googlegroups.com
Nice work Edward!

This was awhile ago, but I've had good results with BTF (the branch trap flag) on Windows. You can filter by module by turning off BTF when you step outside a target module, then setting a software breakpoint on the last interesting return from a stack backtrace. I've also seen this done with IDA integration. I've never used it on Linux though, so YMMV.

Yet another way to go is a hypervisor-based solution. I like this because it means we can apply AFL to kernel-land. Xen has really good support in libvmi (https://github.com/libvmi/libvmi) - I am looking at this now.

Steve



Andrew Griffiths

Feb 1, 2015, 6:36:08 PM
to afl-users


On Sun, Feb 1, 2015, 12:15 AM Michal Zalewski <lca...@gmail.com> wrote:


> > It's too bad DynamoRIO was too slow.
>
> PIN may be ill-suited for our needs, but DynamoRIO may be more viable,
> at least to get in the same ballpark as QEMU. I think Andrew played
> with it a bit.

DynamoRIO seems slow. I've implemented the fork server code, and it was clocking in at 60 execs a second. Something seems nondeterministic: some fuzz runs sat at 10 execs a second; after a ctrl-C and restart, it would sit at 30 execs a second; restart again, and it would sit at 60 a second. I haven't debugged that, it's just an observation.

QEMU with just fork server support was around 130 execs a second.

So, good news on the DynamoRIO side: the authors have filed bugs for translate-but-don't-execute support, so it might be possible to get performance closer to the current QEMU implementation. They're interested in checking their performance against QEMU, so there might be some easy wins there.

There are some things I can do to make the code faster, however. I need to play with the persistence support more (it appears to cache translations to disk), and perhaps try inlining the instrumentation call. But from what I've read, that should be optimized automatically for me by the -opt_cleancall option.

I'll continue playing with it to see how good I can get it.

If we can get DynamoRIO support to a decent speed, it would be interesting if someone ported AFL to Windows 😁 Just don't expect lcamtuf to do it, or support it :)



Ben Nagy

Feb 12, 2015, 1:54:08 AM
to afl-...@googlegroups.com
> On Monday, February 2, 2015 at 8:36:08 AM UTC+9, Andrew Griffiths wrote:
> If we can get DynamoRIO support to a decent speed, it would be interesting if someone ported AFL to Windows :) Just don't expect lcamtuf to do it, or support it :)

I think your biggest issue is going to be porting your forkserver approach. If anyone can do THAT in a portable manner then I will buy them about a hundred beers, because it would be useful for a ton of other tools.

Cheers,

ben

Parker Thompson

Feb 12, 2015, 1:58:35 AM
to afl-...@googlegroups.com

Yeah. I agree 100% with Ben. My attempts to build a fork server inside a PIN tool for aflpin failed gloriously with internal errors from PIN such as 'thread not found', so I would love to see that work in DynamoRIO.

-Parker


Andrew Griffiths

Feb 12, 2015, 10:22:32 AM
to afl-...@googlegroups.com

Forkserver for DynamoRIO under Linux works well. I've written code to do it; it's just not as fast as the QEMU approach... just yet. It's about half the speed of what QEMU mode manages without the translate-but-don't-execute mode.

I've tried passing various options to DynamoRIO to improve performance, without much luck.

I've spoken to the DynamoRIO authors, and they're interested in implementing a translate-but-don't-execute option (which is what makes QEMU mode pretty fast without any other optimization on my part*).

Additionally, they've indicated they want to spend some time comparing the performance of DynamoRIO against QEMU, and seeing if there are any techniques they can borrow or improve upon.

As for Windows (hi Ben, BTW), that would be the most difficult part, since there's no native fork(), from what I recall.

http://stackoverflow.com/questions/985281/what-is-the-closest-thing-windows-has-to-fork/985525#985525

So it's probably feasible, if you have suitable Windows development skills.

Finishing up afl network support is probably a more achievable goal for myself at the moment.

* One idea: keep count of the expected order of basic block translation and ensure those blocks are in the quick-lookup hash table, instead of the current approach where they are pushed to the end. Another idea: use shared memory instead of read/write system calls for message passing.

That performance increase is probably minimal compared to the decode, translate, encode steps of binary translation, however.
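
For the shared memory idea, AFL already exposes its trace bitmap via SysV shm, so a client can attach that region directly instead of shuttling bytes over pipes. A minimal sketch (the env var name comes from AFL's config.h, SHM_ENV_VAR):

#include <stdint.h>
#include <stdlib.h>
#include <sys/shm.h>

static uint8_t *afl_area;

static int afl_map_shm(void)
{
    char *id_str = getenv("__AFL_SHM_ID");   /* set by afl-fuzz */
    if (!id_str) return -1;                  /* not running under afl-fuzz */

    afl_area = (uint8_t *)shmat(atoi(id_str), NULL, 0);
    return afl_area == (void *)-1 ? -1 : 0;
}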

Ben Nagy

Feb 12, 2015, 6:30:42 PM
to afl-...@googlegroups.com
On Fri, Feb 13, 2015 at 12:22 AM, 'Andrew Griffiths' via afl-users
<afl-...@googlegroups.com> wrote:
> Forkserver for DynamoRIO under Linux works well. I've written code to do it;
> it's just not as fast as the QEMU approach... just yet. It's about half the
> speed of what QEMU mode manages without the translate-but-don't-execute mode.

So I have heard DynamoRIO is kind of flaky on OSX, but I'd like to
give it a shot nevertheless. Is this code available anywhere? I looked
at fixing up the linux-user stuff because I got all excited when I saw
you say bsd support might just be a quick fix to bsd-user, but then I
found out about darwin-user ( which died years ago and never actually
worked ). DynamoRIO might be slow but I suspect that it is the only
game in town for closed-source OSX apps.

> As for Windows (hi Ben, BTW), that would be the most difficult part, since
> there's no native fork(), from what I recall.
>
> http://stackoverflow.com/questions/985281/what-is-the-closest-thing-windows-has-to-fork/985525#985525
>
> So it's probably feasible, if you have suitable Windows development skills.

I think "suitable" is a pretty high bar. So there's the old posix
ZwCreateProcess() route you mentioned, but after that there's still
the work of notifying the rest of "normal" userspace about the process
so you can get debug ports, csrss etc etc. There's an old Windows
Internals book whose name I forget that had code for Win2003 or
something, but it also involved lots of manual tinkering with PEB/TEB
structures, which is going to be a portability "issue". Anyway, yes,
definitely possible, but LOTS of internals knowledge probably needed.

PS: Hi! I loved your linux-user trick. So good. :)

Cheers,

ben

Andrew Griffiths

Feb 12, 2015, 9:27:16 PM
to afl-users
On Thu, Feb 12, 2015 at 3:30 PM, Ben Nagy <b...@iagu.net> wrote:

> So I have heard DynamoRIO is kind of flaky on OSX, but I'd like to
> give it a shot nevertheless. Is this code available anywhere?

I've attached the DynamoRIO code from when I last touched it. Interested in hearing about how well it works - send patches to lcamtuf so he can include it in the next release :)

Note that if the target is amenable to it, you could place the fork server away from _start and closer to the file parsing, saving redundant / uninteresting work (just hard-code the address of the basic block to start at, I guess). Threading would be the biggest blocker, I think.
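
Something along these lines with the DynamoRIO API, I'd imagine (sketch only; the address is a per-target placeholder, and afl_start_forkserver() stands in for the routine in the attached client.c):

#include "dr_api.h"

#define FORK_POINT ((void *)0x8048abc)   /* placeholder: start of parsing */

extern void afl_start_forkserver(void);

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb,
         bool for_trace, bool translating)
{
    /* Insert a clean call at the chosen block so the fork server starts
     * right before the interesting work; the callee should guard against
     * firing twice if the block re-executes. */
    if (tag == FORK_POINT)
        dr_insert_clean_call(drcontext, bb, instrlist_first(bb),
                             (void *)afl_start_forkserver,
                             false /* no fpstate save */, 0);
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_init(client_id_t id)
{
    dr_register_bb_event(event_bb);
}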

According to the afl docs, fork() on MacOS X is slow. I wonder if there's a Mach way of implementing fork() that's faster / better. Something to ask nemo^W^Wresearch at some stage.

client.c

Michal Zalewski

Feb 12, 2015, 11:34:34 PM
to afl-users
> According to the afl docs, fork() on MacOS X is slow. I wonder if there's a
> Mach way of implementing fork() that's faster / better. Something to ask
> nemo^W^Wresearch at some stage.

It appears that fork() on Darwin is kinda messed up - see "caveats" here:

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fork.2.html

It essentially seems not far off from vfork() on
vaguely-POSIX-compliant OSes. I was told that their libc compensates
for this a bit by cleaning up the state to make normal unix programs
behave, but this is (a) very slow, and (b) breaks when you depend on
Darwin-specific libraries that don't do equivalent cleanup and will
just crash. The bottom line is that afl-fuzz in a Linux VM runs
several times faster than when executed in the host OS.

At this point, given that I don't have a MacOS X box to routinely play
with, that the performance is bad, that the user experience of fuzzing
some binaries is partly broken, and that I needed a significant number
of kludges to address a host of other oddball issues [*]... in
retrospect, I'm not sure that supporting the platform was the right
call.

In terms of useful features, the top priorities I can think of are:

1) Network support, if we can get it working nicely,

2) Performance improvements for the compile-time instrumentation. For
example, if we did it as a GCC or clang plugin, we could likely improve
performance by instrumenting more conservatively and by not having to
save all registers, etc. I think there are significant gains here (see
the sketch below this list).

3) Making the binary-only instrumentation better. I'm not sure there's
a lot to be gained by moving between QEMU and DR or something else.
One obvious option would be relying on static translation by
disassembling the binary and putting it back together just once. I've
been told that mcsema may be worth looking at.

4) Making the fork server better. One option may be preforking, but
this breaks the one-fuzzer-takes-one-core deal. Another would be
auto-detecting a more distant forkserver init location by watching how
many instrumented locations we can skip without changing the observed
behavior of the binary. (That last part is particularly interesting!)

5) Test cases and dictionaries! We particularly need a good PDF dictionary.

6) Perhaps improvements to fuzzing strategies. For example, radamsa
had this nice idea of trying to determine if a particular chunk of the
file looks like text or binary data and applying slightly different
mutations depending on that. We get instrumentation feedback to see if
any of this would yield better coverage per number of execs done.
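
To put (2) in perspective, the injected instrumentation is conceptually tiny (this is essentially the pseudo-code from AFL's technical notes, recast as C); most of the overhead is in the register save/restore around it:

#include <stdint.h>

extern uint8_t shared_mem[65536];     /* the SHM trace bitmap (MAP_SIZE) */
static uint16_t prev_location;

/* What each instrumented location effectively executes. CUR is a
 * compile-time random constant unique to the branch point. */
#define AFL_LOG(CUR) do {                       \
        shared_mem[(CUR) ^ prev_location]++;    \
        prev_location = (CUR) >> 1;             \
    } while (0)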

/mz

[*] Say: an outdated version of 'as' in Xcode that doesn't work with
newer versions of clang and flat out crashes with some inputs;
multiple differences in clang-emitted code compared to Linux and *BSD,
necessitating a patchwork of ifdefs in afl-as.c; differences in how
relocations are done in Mach-O binaries compared to ELF; presence of a
crash reporter without a clear C-accessible API to query its state;
etc.

Andrew Griffiths

Feb 13, 2015, 12:22:38 AM
to afl-users
On Thu, Feb 12, 2015 at 8:34 PM, Michal Zalewski <lca...@gmail.com> wrote:

> 1) Network support, if we can get it working nicely,


In progress. Changed my approach, broke it, needs fixing again. But... see below.

> 2) Performance improvements for the compile-time instrumentation. For
> example, if we did it as a GCC or clang plugin, we could likely improve
> performance by instrumenting more conservatively and by not having to
> save all registers, etc. I think there are significant gains here.
>
> 3) Making the binary-only instrumentation better. I'm not sure there's
> a lot to be gained by moving between QEMU and DR or something else.
> One obvious option would be relying on static translation by
> disassembling the binary and putting it back together just once. I've
> been told that mcsema may be worth looking at.


I've played with using mcsema to do exactly that. I gave up on it pretty quickly, as I couldn't get their example code and instructions to work under Linux. Perhaps Windows is their better-supported target; no idea. I want to go back to it later when I have more time.

Plus I suspect mcsema is probably just an experimental approach if/when it works; it seems a little fiddly and at odds with the simplicity goals.

> 4) Making the fork server better. One option may be preforking, but
> this breaks the one-fuzzer-takes-one-core deal. Another would be
> auto-detecting a more distant forkserver init location by watching how
> many instrumented locations we can skip without changing the observed
> behavior of the binary. (That last part is particularly interesting!)

Oh, my changed approach for network support will fix that for you. So, sneak peek:

__AFL_WANTED=0: for stdin

__AFL_WANTED=F:/etc/passwd

and it'll start the forkserver on read/recv/recvfrom/recvmsg on the related fd.
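
Conceptually it's an interposed read(), something like this (a simplified sketch, assuming an LD_PRELOAD-style shim; the helper names here are made up):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

extern void afl_start_forkserver(void);   /* hypothetical */
extern int  afl_wanted_fd(int fd);        /* hypothetical: fd matches __AFL_WANTED? */

ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t);
    static int fired;

    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

    /* First touch of the watched fd: the fork point lands right before
     * the target consumes input, so startup work happens only once. */
    if (!fired && afl_wanted_fd(fd)) {
        fired = 1;
        afl_start_forkserver();
    }
    return real_read(fd, buf, count);
}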
 
 
> 5) Test cases and dictionaries! We particularly need a good PDF dictionary.
>
> 6) Perhaps improvements to fuzzing strategies. For example, radamsa
> had this nice idea of trying to determine if a particular chunk of the
> file looks like text or binary data and applying slightly different
> mutations depending on that. We get instrumentation feedback to see if
> any of this would yield better coverage per number of execs done.
>
> /mz
>
> [*] Say: an outdated version of 'as' in Xcode that doesn't work with
> newer versions of clang and flat out crashes with some inputs;
> multiple differences in clang-emitted code compared to Linux and *BSD,
> necessitating a patchwork of ifdefs in afl-as.c; differences in how
> relocations are done in Mach-O binaries compared to ELF; presence of a
> crash reporter without a clear C-accessible API to query its state;
> etc.

Michal Zalewski

Feb 13, 2015, 12:33:05 AM
to afl-users
> Oh, my changed approach for network support will fix that for you. So, sneak
> peek:
>
> __AFL_WANTED=0: for stdin
> __AFL_WANTED=F:/etc/passwd
>
> and it'll start the forkserver on read/recv/recvfrom/recvmsg on the related fd.

Cool! But it will still explode with threads, or when a non-instrumented
binary is used to read from the file, right?

Named pipes are actually a nice way to stop on read / write, except if
the program does stat() or something like that to check file type,
read length, etc; or if it wants to mmap() input.
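
e.g., a minimal sketch of the fuzzer side, assuming the target just open()s and read()s the path we hand it:

#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

int feed_one_case(const uint8_t *buf, size_t len)
{
    /* The target was given /tmp/afl_input as its "file"; it blocks in
     * open()/read() until we supply a test case. */
    mkfifo("/tmp/afl_input", 0600);              /* EEXIST is harmless */

    int fd = open("/tmp/afl_input", O_WRONLY);   /* blocks until a reader */
    if (fd < 0) return -1;
    write(fd, buf, len);
    close(fd);
    return 0;
}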

/mz

Ben Nagy

Feb 13, 2015, 12:45:35 AM
to afl-...@googlegroups.com
> 1) Network support, if we can get it working nicely,

IMHO this is hard. Like crazy hard, unless you totally separate your
instrumentation, generation and delivery units. Additionally, network
stuff is, generally speaking, super fast, so you'll have less
incremental win over dumber fuzzing approaches. On top of that, you've
got, as you've obviously already thought about, threads and children
EVERYWHERE at the server side. Having said that, looking forward to
being proved wrong :)

> 3) Making the binary-only instrumentation better. I'm not sure there's
> a lot to be gained by moving between QEMU and DR or something else.
> One obvious option would be relying on static translation by
> disassembling the binary and putting it back together just once. I've
> been told that mcsema may be worth looking at.

Robust static instrumentation on closed source binaries is a unicorn,
but yes, that idea is pretty exciting. On top of that, it would be
(more) portable, if you could just have a spec for instrumentation
output and a way to deliver it to the generator. On the Windows side
there's Detours and friends, on the unix side there might be some kind
of LLVM round-tripping. I know that at least one research project
productised a recompiler based on that approach.

Thinking more strategically, though, IMVHO this is an area where
incremental gains are hard, and the use-case is more restricted.

> 4) Making the fork server better. One option may be preforking, but
> this breaks the one-fuzzer-takes-one-core deal. Another would be
> auto-detecting a more distant forkserver init location by watching how
> many instrumented locations we can skip without changing the observed
> behavior of the binary. (That last part is particularly interesting!)

I wonder if this could even be presented to the user? Handwaving,
showmap should be able to pretty quickly isolate trace heads that
never vary, and suggest a new fork location based on that, which the
user could pass to the fuzzer on startup?

> 5) Test cases and dictionaries! We particularly need a good PDF dictionary.

Well I've got a _bad_ one? I did tokenise the spec, but it's still
hard to do automatically. Going through pdftotext was a bust for the
tables that contain all the /CrazyPDFCommand entries. TBH just sitting
down for another 2-3 hours is probably the best angle.

I know I sound like a broken record, but this is the kind of thing
that strangers on the Internet will DO FOR YOU if you put the project
on github :) People like to help and this kind of thing is low-hanging
fruit.

In general I think that afl as a corpus creator is one of the biggest
features. It combines "found on the internet" set minimising with
I-can't-believe-it's-not-generational 'exhaustive' cases. Anything
that would improve the features and/or marketing of this angle I think
is great. It means that while I'm waiting for magic "works on all
closed source targets everywhere" I can still get a big win by running
an afl farm to improve the corpus quality of my existing toolchains
for free.

> 6) Perhaps improvements to fuzzing strategies. For example, radamsa
> had this nice idea of trying to determine if a particular chunk of the
> file looks like text or binary data and applying slightly different
> mutations depending on that. We get instrumentation feedback to see if
> any of this would yield better coverage per number of execs done.

I think that was one of my feature reqs ;)

Is this different from the effector maps you were talking about, or is
that feature still TODO? You're talking intra-file strategy changes,
yes?

Cheers,

ben

Michal Zalewski

Feb 13, 2015, 1:22:11 AM
to afl-users
>> 1) Network support, if we can get it working nicely,
>
> IMHO this is hard. Like crazy hard, unless you totally separate your
> instrumentation, generation and delivery units.

Yeah, we had a discussion about this not long ago; check out:
https://groups.google.com/forum/#!topic/afl-users/qDa2ccHyUKQ

I was always fairly skeptical. Andrew is hopeful :-)

> Additionally, network stuff is, generally speaking, super fast, so you'll
> have less incremental win over dumber fuzzing approaches.

Wouldn't the relative advantage be essentially the same no matter how
many execs/sec you get? It does come down to how easy it is to
exhaustively cover everything through random mutations. If it's easy
(e.g., for trivial file formats or network services), then AFL doesn't
give you much. Thankfully, that's not very common =)

>> 4) Making the fork server better. One option may be preforking, but
>> this breaks the one-fuzzer-takes-one-core deal. Another would be
>> auto-detecting a more distant forkserver init location by watching how
>> many instrumented locations we can skip without changing the observed
>> behavior of the binary. (That last part is particularly interesting!)
>
> I wonder if this could even be presented to the user? Handwaving,
> showmap should be able to pretty quickly isolate trace heads that
> never vary, and suggest a new fork location based on that, which the
> user could pass to the fuzzer on startup?

I'd rather not have people copy values between tools if we can help
it; but either way, I think the simplest way to implement it without
adding extra code in the critical path would be something like:

1) Have a global counter initialized by reading __AFL_SKIP_TO in the
SHM setup code and set to -1 if getenv fails,

2) If __AFL_SKIP_TO >= 0, decrease counter, return without initializing
SHM or storing tuple data; otherwise, proceed with setup (sketched below),

3) Sequentially look for the largest __AFL_SKIP_TO for which two fork
server cycles return the same trace.
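
Concretely, a hypothetical sketch of (1) and (2) - none of this exists in afl today, and the helper names are invented:

#include <stdint.h>
#include <stdlib.h>

extern void __afl_setup(void);          /* hypothetical: SHM + fork server init */
extern void __afl_store(uint16_t loc);  /* hypothetical: tuple recording */

static long skip_to = -1;               /* -1: getenv failed, init right away */
static int  afl_live;

__attribute__((constructor))
static void __afl_read_skip_env(void)
{
    char *v = getenv("__AFL_SKIP_TO");  /* step 1 */
    if (v) skip_to = atol(v);
}

void __afl_maybe_log(uint16_t cur_loc)
{
    if (!afl_live) {
        if (skip_to >= 0) {             /* step 2: skip this location */
            skip_to--;
            return;
        }
        __afl_setup();                  /* counter spent: do the real setup */
        afl_live = 1;
    }
    __afl_store(cur_loc);
}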

However, this would fail with variable behavior binaries, including V8
and several other slow targets :-/ There is also some risk of
overshooting; if you skip so much that you end up at the end of main,
subsequent runs will look the same. Hence the "sequentially" part,
even though binary search would be much faster.

An arguably better option would be to set up SHM and start recording
instrumentation right away, and then just move the fork server
location alone. That said, this would require an extra conditional /
register call in the critical path, with some performance cost.

>> 6) Perhaps improvements to fuzzing strategies. For example, radamsa
>> had this nice idea of trying to determine if a particular chunk of the
>> file looks like text or binary data and applying slightly different
>> mutations depending on that. We get instrumentation feedback to see if
>> any of this would yield better coverage per number of execs done.
>
> Is this different from the effector maps you were talking about, or is
> that feature still TODO? You're talking intra-file strategy changes,
> yes?

Effector maps are in and work very well, but they are a binary on-off
switch. We either do deterministic fuzzing on a particular section of
the file or we don't. The deterministic steps are the same no matter
if the underlying chunk looks like binary or text. It may very well be
that the distinction doesn't really matter [1], but we can certainly
test and compare. The coverage metric gives you instant feedback.

[1] In essence, when afl-fuzz sees "15" as a string in the input file,
it will probably try "14", "16", "-5", "05", "95", "15555555555555",
and a bunch of other things even though it doesn't know that it's a
decimal representation of an integer. Radamsa tries to detect that and
handle it specially, which may prompt it to try a couple other
known-tricky values.

/mz

rob...@swiecki.net

Feb 26, 2015, 1:33:43 PM
to afl-...@googlegroups.com
Hi, FWIW I've recently implemented BTS inside honggfuzz (0.5) - https://code.google.com/p/honggfuzz/

The CPU performance penalty is currently around 10% of wall-time, and around 200%-250% of sys/user time, depending on how one counts it.

Example:
W/O BTS (just mangling a 4096-byte input file from /dev/urandom and feeding it to djpeg):
$ time ~/src/honggfuzz/honggfuzz -F4096 -N 100 -s -q -n1 -r 0.0 -f in/ -- ./djpeg.static | grep ' 0m'

real 0m3.982s
user 0m0.031s
sys 0m0.127s

With BTS (starting with the same 4096-byte input file, and modifying it in memory); the strategy here is to maximize the number of edges (branch pairs):

$ time ~/src/honggfuzz/honggfuzz -Dp -F4096 -N 100 -s -q -n1 -r 0.0 -f in/ -- ./djpeg.static | grep ' 0m'

real 0m4.365s
user 0m0.096s
sys 0m0.337s

So, just FYI - in case you are still considering BTS as an additional feedback source for AFL.

The specific code is here: 

Michal Zalewski

Feb 26, 2015, 1:50:33 PM
to afl-users, rob...@swiecki.net
> Hi, FWIW I've recently implemented BTS inside honggfuzz (0.5) -
> https://code.google.com/p/honggfuzz/

Cool! I'll try it out.

> The CPU performance penalty is currently around 10% of wall-time, and around
> 200%-250% of sys/user time, depending on how one counts it.

> $ time ~/src/honggfuzz/honggfuzz -F4096 -N 100 -s -q -n1 -r 0.0 -f in/ --
> ./djpeg.static | grep ' 0m'

Hmm, but that's mostly just benchmarking execve(), right? The
benchmarks I used for testing instrumentation in afl were essentially
a single run of something like gzip or djpeg, but with a large input
file (so you get a good sense of the actual computational overhead,
not the cost of creating and cleaning up a process and doing all the
context switches).

Cheers,
/mz

Robert Święcki

Feb 26, 2015, 2:38:55 PM
to Michal Zalewski, afl-users
2015-02-26 19:50 GMT+01:00 Michal Zalewski <lca...@gmail.com>:
>> Hi, FWIW I've recently implemented BTS inside honggfuzz (0.5) -
>> https://code.google.com/p/honggfuzz/
>
> Cool! I'll try it out.
>
>> The CPU performance penalty is currently around 10% of wall-time, and around
>> 200%-250% of sys/user time, depending on how one counts it.
>
>> $ time ~/src/honggfuzz/honggfuzz -F4096 -N 100 -s -q -n1 -r 0.0 -f in/ --
>> ./djpeg.static | grep ' 0m'
>
> Hmm, but that's mostly just benchmarking execve(), right?

That's possible.

> The
> benchmarks I used for testing instrumentation in afl were essentially
> a single run of something like gzip or djpeg, but with a large input
> file (so you get a good sense of the actual computational overhead,
> not the cost of creating and cleaning up a process and doing all the
> context switches).

I tested it (with a modified binary, so it doesn't actually change the
contents of the input, which would randomize the results) now with 100
runs of zlib-flate (deflate() from libz) on a 10MB file from /dev/urandom
(which doesn't compress for obvious reasons, but it probably doesn't
matter here).

W/O BTS (just execve and processing):
$ time ~/src/honggfuzz/honggfuzz -F20000000 -N 100 -s -q -n1 -r 0.0 -f
test -- /usr/bin/zlib-flate -uncompress

real 0m15.989s
user 0m0.723s
sys 0m2.062s

With BTS
$ time ~/src/honggfuzz/honggfuzz -Dp -F20000000 -N 100 -s -q -n1 -r
0.0 -f test -- /usr/bin/zlib-flate -uncompress
real 0m42.496s
user 0m41.471s
sys 0m6.814s

So, this time the results are way different. The kernel CPU penalty
seems acceptable and in line with previous results (3x increase).

The user time increase is associated with analyzing the number of
edges in userland (maintaining the list of branches and
de-duplicating it). In the first run this was avoided, as it's just
simple file mangling. However, if a tool performs analysis of branches
it must, well, analyze them, and it's up to its implementation how
fast or slow that is :).

But I don't rule out that a lot of this CPU time goes to simply
copying the data around from the mmap() buffer. So, no guarantees! :)
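
For the curious, the consumer side is the standard perf mmap dance, roughly (sketch; record parsing omitted):

#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Drain whatever the kernel has published since our last visit. 'pg' is
 * the first page of the perf mmap; data_sz is the ring size in bytes. */
void drain_ring(struct perf_event_mmap_page *pg, size_t data_sz,
                void (*consume)(const uint8_t *, size_t))
{
    const uint8_t *data = (const uint8_t *)pg + sysconf(_SC_PAGESIZE);
    uint64_t head = pg->data_head;
    __sync_synchronize();               /* pairs with the kernel's barrier */

    uint64_t tail = pg->data_tail;
    while (tail != head) {
        size_t off   = tail % data_sz;
        size_t chunk = head - tail;
        if (chunk > data_sz - off)
            chunk = data_sz - off;      /* split at the wrap point */
        consume(data + off, chunk);     /* this copying step adds up */
        tail += chunk;
    }

    pg->data_tail = head;               /* hand the space back */
}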

--
Robert Święcki