Source-to-source transform vs. compiler-based instrumentation?

38 views
Skip to first unread message

thepudds

unread,
May 11, 2019, 11:32:47 AM5/11/19
to Golang Fuzzing
I am starting a new thread on item 10 from the bigger "List of possible modifications to the March 2017 proposal" thread:


In this post, I just:

 * Quote what was said on item 10 in that bigger thread.
 * I had asked a related question in that thread about the new `-trimpath` support, and Josh answered, which I also quote below.
 * I also grabbed a sample quote from Josh from the source-to-source vs. SSA related discussion in #29430.
 
Here are those excerpts:

-------------------------------------------------------------------
10. Source-to-source transform vs. compiler-based instrumentation?
-------------------------------------------------------------------
Primary discussion in:
https://github.com/golang/go/issues/29430
 
Does it make sense for the proposal to suggest a perhaps more likely to be accepted approach of source-to-source implementation to start, and save modifying the compiler for possible future work? The results would not be as good and might have more corner cases, so perhaps not.
 
Also, now that the new 1.13 `-trimpath` support landed in #16860, a related question from that same issue in comment:
https://github.com/golang/go/issues/29430#issuecomment-485797818
 
"I think I might understand how that helps with caching the build results of the transformed code (where that transform code might be in a temporary directory).
When it comes to avoiding always needing to repeatedly do the s2s transform itself, is the thinking that if the piece of code that wants to do s2s lives within cmd/go (such as a possible future `go test -fuzz=.`), then that piece of code could directly use the cmd/go/internal/cache internal API to cache the s2s transform results?"
-------------------------------------------------------------------

Josh replied to the `-trimpath` question:

-------------------------------------------------------------------
Your understanding is correct: Adding -trimpath would speed up the 'go 
build' part. 
Right now, the instrumentation itself is pretty cheap and fast. 
(Although I do have ideas for deeper analysis I'd like to do that 
would change that.) So while it would be nice to be able to cache 
instrumented code, I don't think it is a priority. 
The other significant build time cost is in go/packages. I believe 
that the tools team is actively working on optimizing go/packages. 
I'm not sure that this is an area that needs more attention in the 
near term, once the -trimpath piece is sorted out. I have a local 
change that checks the go version and adds -trimpath as appropriate; I 
will turn it into a PR once I can confirm it actually moves the 
needle, which it does not right now, due to a bug. 
-------------------------------------------------------------------

Dmitry also replied to the broader "Source-to-source transform vs. compiler-based instrumentation" question:

-------------------------------------------------------------------
Pain point. 
Compiler instrumentation will be simpler, faster, easier to maintain 
_and_ much simpler to integrate with other build systems. Go team says 
they want this on the side as s2s, plugged via go tool, but they also 
badly want this in bazel and blaze, which are separate huge Java 
systems that are very hard to modify and extend. 
I thing we could do is to prototype the compiler pass for this. To 
show that it works and is simple. Josh, do you know how stable the SSA 
pass API now to maintain a pass on the side for some time? 
-------------------------------------------------------------------

In the #29430 discussion (https://github.com/golang/go/issues/29430), Josh made several points and floated multiple ideas. In general I think he was looking to see if there is a reasonable non-SSA based solution. I won't inline everything he wrote there, but here is one hopefully somewhat representative quote from Josh from #29430:

-------------------------------------------------------------------
But as someone who works on the compiler, I don't want more special things in it, I want less. They increase the testing surface area, make other work harder, etc.
And as someone who writes tools, I want the toolchain to be more flexible and welcoming to tools, not to wonder "how does test coverage do it?" and discover the answer is that it is special.
I want go-fuzz to be built-in to the toolchain. [...] But I also think it is worth trying to face some of the challenges it raises head on, in a general way. If it turns out there are no good answers available, sure, let's make it special. But let's at least try first.
-------------------------------------------------------------------

Regards,
thepudds

thepudds

unread,
May 11, 2019, 11:35:10 AM5/11/19
to Golang Fuzzing
Hi all,

Some related questions that might help the "SSA or source-to-source" conversation some.

Some of these more specific questions might be tough to answer definitively, but even top-of-mind answers might help move thinking forward, or at least trigger some discussion.

Q1. What is a rough count for lines of code that do the source-to-source transformation in go-fuzz today?
      * For example, there are about 900 lines of code in cover.go: https://github.com/dvyukov/go-fuzz/blob/master/go-fuzz-build/cover.go
      * There is another 800 or so lines of code in https://github.com/dvyukov/go-fuzz/blob/master/go-fuzz-build/main.go, but that is mostly higher level than source-to-source transformation.
      * So is an approximate answer to Q1 roughly 900 lines of code?
  
Q2. Is there an SSA pass that exists today that is not extremely intertwined with other things that happens to do something at least somewhat similar to what a theoretical fuzzing instrumentation SSA pass would need to do?

      * For example, is the race detector SSA pass a reasonable example, or maybe there are better examples?

Q3. Based on how things work in SSA today, would a theoretical fuzzing instrumentation SSA pass most likely primarily be imperative Go code, or primarily be SSA rewrite rules, or something else?

Q4. Roughly how stable is the SSA pass API now to maintain a pass on the side for some time? 

      * This is a question from Dmitry in the "List of possible modifications to the March 2017 proposal" thread. 
    
Q5. Is there maybe an order-of-magnitude guess as to how many lines of code a first cut fuzzing instrumentation SSA pass could end up being based on the current SSA infrastructure that exists today? (From a sustainability perspective, 50 vs. 500 vs. 5,000 lines of code likely would make a difference).

Q6. Is there an opportunity to do first something a bit more generic in the SSA world that would then translate to the fuzzing-specific pieces ending up being much smaller?

Regards,
thepudds

Romain Baugue

unread,
May 11, 2019, 1:05:54 PM5/11/19
to Golang Fuzzing
 Given my general illiteracy on this kind of subjects, I will just leave a note so you aren't surprised by me not taking an active part in this discussion. I'll try to get to speed on this, but there are plenty of other subjects I have to read about, so this probably won't happen overnight.

Josh Bleecher Snyder

unread,
May 11, 2019, 1:24:29 PM5/11/19
to thepudds, Golang Fuzzing
> Q1. What is a rough count for lines of code that do the source-to-source transformation in go-fuzz today?
> * For example, there are about 900 lines of code in cover.go: https://github.com/dvyukov/go-fuzz/blob/master/go-fuzz-build/cover.go
> * There is another 800 or so lines of code in https://github.com/dvyukov/go-fuzz/blob/master/go-fuzz-build/main.go, but that is mostly higher level than source-to-source transformation.
> * So is an approximate answer to Q1 roughly 900 lines of code?

To a first approximation, cover.go contains three things:

* Coverage instrumentation
* Sonar instrumentation
* Literal extraction

Literal extraction is read-only, and IMHO does not make sense to put
into the compiler pass. It is also, incidentally, the place where I
have plans to improve (some vague, some already written and ready to
be cleaned up and turned into a PR at some point).

So it is less than 900 LOC.

Note also that 'go test' does code coverage with an s2s transformation
very similar to go-fuzz. (go-fuzz has fallen behind, because it
doesn't use cmd/internal/edit.) So there are opportunities for shared
code there. So possibly it is much less than 900 LOC.


> Q2. Is there an SSA pass that exists today that is not extremely intertwined with other things that happens to do something at least somewhat similar to what a theoretical fuzzing instrumentation SSA pass would need to do?
>
> * For example, is the race detector SSA pass a reasonable example, or maybe there are better examples?

For coverage, no, there isn't a similar existing pass today. Since
coverage only does memory writes, it is probably not too hard to write
and maintain a pass for, probably at the generic (arch-agnostic)
level. One challenge is figuring out the details of when to do a
write. The SSA form's CFG doesn't perfectly match the source code CFG:
Some branches get optimized away, and other branches get inserted
(handling nil pointer checks, bounds checks, write barrier checks,
inlining, etc.). This might be fine, but it will require some thought
and experimentation to (a) find the right places to put the writes and
(b) decide what the correct position information is for a single block
in the CFG that may span a wide, disconnected set of positions. Also,
the backend of the compiler is concurrent, so we'll need to put some
thought into how to name the writes. Also, we need to figure out where
the compiler should write the coverage position output file, how that
interacts with cmd/go, etc.

The story on sonar instrumentation is a bit uglier. We don't yet have
a mechanism to introduce function calls during an SSA pass. Because of
that, race detector instrumentation actually happens during
construction of the SSA form. We could do the same, I suppose, for
sonar instrumentation, but it is likely to have lots of challenging
corner cases. The race detector only cares about memory reads and
writes, which are easier to funnel through a single pinch point.
Comparisons are spread out a bunch more. Also, because SSA
construction occurs before the SSA passes, we'd end up
coverage-covering our sonar instrumentation. Another sticking point is
that sonar instrumentation uses interfaces. But handling of interfaces
(including allocation as needed) occurs prior to the Node-to-SSA
conversion, so we would need to change our sonar instrumentation to
use only concrete types. Even then we might end up with some sticky
spots.

None of this is to say that it is impossible...just that it is not as
simple as porting what we have to a new domain.


> Q3. Based on how things work in SSA today, would a theoretical fuzzing instrumentation SSA pass most likely primarily be imperative Go code, or primarily be SSA rewrite rules, or something else?

I don't think any of this could be done with rewrite rules. I suspect
you would need code in several places, including cmd/go.


> Q4. Roughly how stable is the SSA pass API now to maintain a pass on the side for some time?

As commented above, for the coverage piece only, probably stable
enough. For the other pieces, less so. (Or so I think...I haven't
tried.)


> Q5. Is there maybe an order-of-magnitude guess as to how many lines of code a first cut fuzzing instrumentation SSA pass could end up being based on the current SSA infrastructure that exists today? (From a sustainability perspective, 50 vs. 500 vs. 5,000 lines of code likely would make a difference).

I suspect that the LOC would be similar, depending on how complicated
the glue pieces are (cmd/go) and how much work goes into getting the
corner cases right.

However, working inside the compiler is generally much more fraught.
If you make an s2s mistake, the compiler usually tells you. If you
make a mistake inside the compiler, you find out later, possibly much
later, when things subtly misbehave. It also substantially raises the
bar for new contributors, who need to understand the compiler as
opposed to needing to understand just an AST.


> Q6. Is there an opportunity to do first something a bit more generic in the SSA world that would then translate to the fuzzing-specific pieces ending up being much smaller?

I think coverage is the best bet here, particularly since in theory it
could also be used by cmd/cover. In fact, I'd be tempted to start
there, since cmd/cover (test coverage) is already internal magic.

But as I have said before, I think this should stay s2s. :)


>>> Pain point.
>>> Compiler instrumentation will be simpler, faster, easier to maintain
>>> _and_ much simpler to integrate with other build systems. Go team says
>>> they want this on the side as s2s, plugged via go tool, but they also
>>> badly want this in bazel and blaze, which are separate huge Java
>>> systems that are very hard to modify and extend.

A minor rebuttal: There are multiple Go compilers.

Plus, now that go-fuzz uses go/packages, the input side is
build-system-agnostic. It should work out of the box with any system
that has a go/packages driver, which I believe bazel and blaze do. It
is possible there is more work to do on the 'go build' side, but that
could be done without touching Java. If someone wants this to work
with bazel/blaze, I would hope they would try to get it to work; it
might not be that bad.

-josh

Josh Bleecher Snyder

unread,
May 13, 2019, 1:58:51 AM5/13/19
to thepudds, Golang Fuzzing
> Q2. Is there an SSA pass that exists today that is not extremely intertwined with other things that happens to do something at least somewhat similar to what a theoretical fuzzing instrumentation SSA pass would need to do?
>
>       * For example, is the race detector SSA pass a reasonable example, or maybe there are better examples?

For coverage, no, there isn't a similar existing pass today. Since
coverage only does memory writes, it is probably not too hard to write
and maintain a pass for, probably at the generic (arch-agnostic)
level. One challenge is figuring out the details of when to do a
write. The SSA form's CFG doesn't perfectly match the source code CFG:
Some branches get optimized away, and other branches get inserted
(handling nil pointer checks, bounds checks, write barrier checks,
inlining, etc.). This might be fine, but it will require some thought
and experimentation to (a) find the right places to put the writes and
(b) decide what the correct position information is for a single block
in the CFG that may span a wide, disconnected set of positions. Also,
the backend of the compiler is concurrent, so we'll need to put some
thought into how to name the writes.

Another “naming” question: What do we do with inlined code? Currently, inlined code gets identical instrumentation when inlined and when not inlined, because we are s2s. SSA happens after inlining, so we would either need to accept that executing the same code at different inlining sites produces new coverage, or find some kind of robust content-addressing approach.

-josh 


Dmitry Vyukov

unread,
May 16, 2019, 10:45:02 AM5/16/19
to Josh Bleecher Snyder, thepudds, Golang Fuzzing
Running after inlining should be fine. Maybe even better. We do this
in clang/gcc instrumentation.
Reply all
Reply to author
Forward
0 new messages