[llvm-dev] [RFC] A new intrinsic, `llvm.blackbox`, to explicitly prevent constprop, die, etc optimizations


Richard Diamond via llvm-dev

Nov 2, 2015, 6:57:55 PM11/2/15
to llvm...@lists.llvm.org
Hey all,

I'd like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc.


# Background/Motivation

In Rust we have a crate called `test` which provides a function, `black_box`, which is designed to be a no-op function that prevents constprop, die, etc from interfering with tests/benchmarks but otherwise doesn't negatively affect resulting machine code quality. `test` currently implements this function by using inline asm, which marks a pointer to the argument as used by the assembly. 
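For context, a minimal sketch of the inline-asm technique in today's Rust syntax (the implementation at the time used the older unstable `asm!` form, so treat this as an illustration rather than the exact `test` crate code):

```rust
use std::arch::asm;

// A no-op "read" of the value via empty inline asm: the optimizer must
// assume the pointed-to value is used, so it cannot const-fold or
// dead-code-eliminate the computation that produced it.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        // `{0}` only appears inside an asm comment, so no code is
        // emitted, but the pointer operand still counts as used.
        asm!("/* {0} */", in(reg) &dummy as *const T);
    }
    dummy
}
```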

At the IR level, this creates an alloca, stores its argument to it, calls the no-op inline asm with the alloca pointer, and then returns a load of the alloca. Obviously, `mem2reg` would normally optimize this sort of pattern away; however, the deliberate use of the no-op asm prevents other desirable optimizations (such as the aforementioned `mem2reg` pass) a little too well.

Existing and upcoming virtual ISA targets also don't have this luxury (PNaCl/JS and WebAssembly, respectively). For these kinds of targets, Rust's `test` currently forbids inlining of `black_box`, which crudely achieves the same effect. This is undesirable for any target because of the associated call overhead.

The IR for `test::black_box::<i32>` is currently (it gets inlined, as desired, so I've omitted the function signature):

```llvm
  %dummy.i = alloca i32, align 4
  %2 = bitcast i32* %dummy.i to i8*
  call void @llvm.lifetime.start(i64 4, i8* %2) #1
; Here, the value operand was the original argument to `test::black_box::<i32>`
  store i32 2, i32* %dummy.i, align 4
  call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* %dummy.i) #1, !srcloc !0
  %3 = load i32, i32* %dummy.i, align 4
  call void @llvm.lifetime.end(i64 4, i8* %2) #1
```

This could be better.

# Solution

Add a new intrinsic, called `llvm.blackbox`, which accepts a value of any type and returns a value of the same type. As with many other intrinsics, this intrinsic shall remain unknown to all optimizations, before and during codegen. Specifically, this intrinsic should prevent all optimizations which operate by assuming properties of the value passed to the intrinsic. Once the last optimization pass (of any kind) has finished, each call can be RAUW'd with its argument.

TableGen def:

```tablegen
def int_blackbox : Intrinsic<[llvm_any_ty], [LLVMMatchType<0>]>;
```

Thus, using the previous example, `%3` would become:
```llvm
  %3 = call i32 @llvm.blackbox.i32(i32 2)
```


Thoughts and suggestions welcome.

Thanks,
Richard Diamond

Sanjoy Das via llvm-dev

Nov 2, 2015, 8:20:05 PM11/2/15
to Richard Diamond, llvm...@lists.llvm.org
Why does this need to be an intrinsic (as opposed to a generic "unknown
function" to LLVM)?

Secondly, have you looked into a volatile store / load to an alloca?
That should work with PNaCl and WebAssembly.

E.g.

define i32 @blackbox(i32 %arg) {
entry:
%p = alloca i32
store volatile i32 10, i32* %p ;; or store %arg
%v = load volatile i32, i32* %p
ret i32 %v
}

-- Sanjoy

> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Richard Diamond via llvm-dev

Nov 2, 2015, 8:24:06 PM11/2/15
to Sanjoy Das, llvm...@lists.llvm.org
On Mon, Nov 2, 2015 at 7:19 PM, Sanjoy Das <san...@playingwithpointers.com> wrote:
Why does this need to be an intrinsic (as opposed to generic "unknown function" to llvm)?

Secondly, have you looked into a volatile store / load to an alloca? That should work with PNaCl and WebAssembly.

E.g.

define i32 @blackbox(i32 %arg) {
 entry:
  %p = alloca i32
  store volatile i32 10, i32* %p  ;; or store %arg
  %v = load volatile i32, i32* %p
  ret i32 %v
}

That volatility would have a negative performance impact.

Richard Diamond

Daniel Berlin via llvm-dev

Nov 2, 2015, 10:16:24 PM11/2/15
to Richard Diamond, llvm-dev
I'm very unclear on why you think a generic black box intrinsic will have any different performance impact ;-)


I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the disease?

IE you say "I'd like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?
Benchmarks that can be const prop'd/etc away are often meaningless.  Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".
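Such a pass-by-pass harness could be as simple as the following sketch (file names are hypothetical; the `opt` pass flags use the legacy 2015-era syntax):

```shell
# Emit unoptimized IR from the frontend, run only the passes you care
# about, and code-gen the result -- no black-box tricks needed.
rustc --emit=llvm-ir -C no-prepopulate-passes bench.rs -o bench.ll
opt -mem2reg -S bench.ll -o bench.opt.ll   # run mem2reg but not constprop
llc bench.opt.ll -o bench.s
```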


Krzysztof Parzyszek via llvm-dev

Nov 3, 2015, 8:29:31 AM11/3/15
to llvm...@lists.llvm.org
On 11/2/2015 5:57 PM, Richard Diamond via llvm-dev wrote:
>
> Add a new intrinsic, called `llvm.blackbox`, which accepts a value of
> any type and returns a value of the same type. As with many other
> intrinsics, this intrinsic shall remain unknown to all optimizations,
> before and during codegen. Specifically, this intrinsic should prevent
> all optimizations which operate by assuming properties of the value
> passed to the intrinsic. Once the last optimization pass (of any kind)
> is finished, all calls can be RAUW its argument.

This would not prevent dead code elimination from removing it. The
intrinsic would need to have some sort of a side-effect in order to be
preserved in all cases. Are you concerned about cases where the user of
the intrinsic is dead?

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Owen Anderson via llvm-dev

Nov 3, 2015, 2:18:12 PM11/3/15
to Richard Diamond, llvm...@lists.llvm.org
To add on to what Danny and Krzysztof have said, this proposal doesn’t make a lot of sense to me.  You want this intrinsic to inhibit (some) optimizations, but you simultaneously want it not to have a performance impact.  Those are contradictory goals.  Worse, the proposal doesn’t specify what optimizations should/should not be allowed for this intrinsic, since apparently you want at least some applied.  Is CSE allowed? DCE?  PRE?

—Owen

Richard Diamond via llvm-dev

Nov 3, 2015, 3:29:53 PM11/3/15
to Daniel Berlin, llvm-dev
On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dbe...@dberlin.org> wrote:
I'm very unclear on why you think a generic black box intrinsic will have any different performance impact ;-)


I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the disease?

IE you say "I'd like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).
 
Benchmarks that can be const prop'd/etc away are often meaningless. 
 
A benchmark that's completely removed is even more meaningless, and the developer may not even know it's happening. I'm not saying this intrinsic will make all benchmarks meaningful (and I can't); I'm saying that it would be useful in Rust in ensuring that tests/benches aren't invalidated simply because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".

True, but why would you want to force that speed bump onto other developers? I'd argue that's more hacky than the inline asm.

Richard Diamond via llvm-dev

Nov 3, 2015, 3:48:53 PM11/3/15
to Owen Anderson, llvm...@lists.llvm.org
On Tue, Nov 3, 2015 at 1:18 PM, Owen Anderson <resi...@mac.com> wrote:
To add on to what Danny and Krzysztof have said, this proposal doesn’t make a lot of sense to me.  You want this intrinsic to inhibit (some) optimizations, but you simultaneously want it not to have a performance impact.  Those are contradictory goals.  Worse, the proposal doesn’t specify what optimizations should/should not be allowed for this intrinsic, since apparently you want at least some applied.  Is CSE allowed? DCE?  PRE?


I apologize for the confusion. I don't think the goals are contradictory. We're talking about code the developer *specifically* doesn't want optimized away, but otherwise doesn't care about what optimization transforms are employed. So yes, I want it to inhibit some optimizations, but without otherwise having a performance impact outside of the obviously prevented optimizations. 

PRE would be fine, as long as the expression in question doesn't make a call to this intrinsic.

Diego Novillo via llvm-dev

Nov 3, 2015, 3:51:07 PM11/3/15
to Richard Diamond, llvm-dev
I don't see how this is any different from volatile markers on loads/stores or memory barriers or several other optimizer blocking devices.  They generally end up crippling the optimizers without much added benefit.

Would it be possible to stop the code motion you want to block by explicitly exposing data dependencies?  Or simply disabling some optimizations with pragmas?


Diego.

Daniel Berlin via llvm-dev

Nov 3, 2015, 4:15:36 PM11/3/15
to Richard Diamond, llvm-dev
On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <wic...@vitalitystudios.com> wrote:


On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dbe...@dberlin.org> wrote:
I'm very unclear on why you think a generic black box intrinsic will have any different performance impact ;-)


I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the disease?

IE you say "I'd like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

For what, exact, well defined goal?

Trying to make certain specific optimizations not work does not seem like a goal unto itself.
It's a thing you are doing to achieve something else, right?
(Because if not, it has a very well defined and well supported solution - set up a pass manager that runs the passes you want)

What is the something else?

IE what is the problem that led you to consider this solution.

 
Benchmarks that can be const prop'd/etc away are often meaningless. 
 
A benchmark that's completely removed is even more meaningless, and the developer may not even know it's happening.

Write good benchmarks?

No, seriously, i mean, you want benchmarks that test what users will see when the compiler works, not benchmarks that test what users would see if they were to suddenly turn off parts of the optimizers ;)
 
I'm not saying this intrinsic will make all benchmarks meaningful (and I can't), I'm saying that it would be useful in Rust in ensuring that tests/benches aren't invalidated simply because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".

True, but why would you want to force that speed bump onto other developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?
It's a completely normal test harness? 

That's in fact, why llvm uses it as a test harness?

I guess i don't see why an intrinsic with not well defined semantics, used in weird ways to try to outsmart some but not all optimizations, is "less hacky" than a harness that says "hey, i want to see the effects of running mem2reg and code gen on this, without running constprop. So i'm just going to run mem2reg and codegen on this, and see the results!".
Because the former is just a way to try to magic the compiler, and the second expresses exactly what you want.


Philip Reames via llvm-dev

Nov 3, 2015, 6:09:51 PM11/3/15
to Daniel Berlin, Richard Diamond, llvm-dev
The common use case I've seen for a black box like construct is when writing microbenchmarks.  In particular, you're generally looking for a way to "sink" the result of a computation without having that sink outweigh the cost of the thing you're trying to measure.

Common alternate approaches are to use a volatile store (so that it can't be eliminated or sunk out of loops) or a call to an external function with a cheap calling convention. 

As an example:
int a = 5; // initialization is not visible to compiler
int b = 7;
void add_two_globals() {
  sink(a+b);
}

If what I'm looking into is the code generation around addition, this is a very useful way of testing the entire compiler - frontend, middle end, and backend.

I'll note that we use such a framework extensively. 

What I'm not clear on is why this needs to be an intrinsic.  Why does a call to an external function or a volatile store not suffice? 

Philip

Björn Steinbrink

Nov 3, 2015, 7:04:58 PM11/3/15
to Philip Reames, llvm-dev
2015-11-04 0:09 GMT+01:00 Philip Reames via llvm-dev <llvm...@lists.llvm.org>:
> What I'm not clear on is why this needs to be an intrinsic. Why does a call
> to an external function or a volatile store not suffice?

I wonder the same. Richard, maybe we just need something more specific
in Rust? Like something that only clobbers memory? Or just the
variable? Seems like we could do with specialized `black_box`
functions. AIUI our `black_box` got extended to prevent more
optimizations as it became obvious that the compiler could still
"defeat" it. Maybe we need to take a step back and say "ok, we'll have
to think a bit harder and decide how hard we'll be on the optimizer"?
Is there anything that speaks against that and requires an intrinsic?

Cheers,
Björn

Richard Diamond via llvm-dev

Nov 6, 2015, 11:31:31 AM11/6/15
to Diego Novillo, llvm-dev
On Tue, Nov 3, 2015 at 2:50 PM, Diego Novillo <dnov...@google.com> wrote:
I don't see how this is any different from volatile markers on loads/stores or memory barriers or several other optimizer blocking devices.  They generally end up crippling the optimizers without much added benefit.

Volatile must touch memory (right?). Memory is slow.


Would it be possible to stop the code motion you want to block by explicitly exposing data dependencies?  Or simply disabling some optimizations with pragmas?


Code motion would be fine in theory, though as has been proposed, this intrinsic would prevent it (because there isn't an attribute that doesn't allow dead code removal but still permits reordering, as far as I'm aware). Rust doesn't have pragmas, and besides, that would also affect the whole module (or the whole crate, to use Rust's vernacular), whereas this intrinsic would be used in a much more targeted manner (ie at the SSA value level) by the developer and leave the rest of the module unmolested.

Richard Diamond via llvm-dev

Nov 6, 2015, 11:35:25 AM11/6/15
to Daniel Berlin, llvm-dev
On Tue, Nov 3, 2015 at 3:15 PM, Daniel Berlin <dbe...@dberlin.org> wrote:


On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <wic...@vitalitystudios.com> wrote:


On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dbe...@dberlin.org> wrote:
I'm very unclear on why you think a generic black box intrinsic will have any different performance impact ;-)


I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the disease?

IE you say "I'd like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

For what, exact, well defined goal? 

Trying to make certain specific optimizations not work does not seem like a goal unto itself.
It's a thing you are doing to achieve something else, right?
(Because if not, it has a very well defined and well supported solution - set up a pass manager that runs the passes you want)

What is the something else?

IE what is the problem that led you to consider this solution.

I apologize if I'm not being clear enough. This contrived example
```rust
#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        (0..1000).fold(0, |old, new| old ^ new);
    });
}
```
is completely optimized away. Granted, IRL production (ignoring the question of why this code was ever used in production in the first place) this optimization is desired, but here it leads to bogus measurements (ie 0ns per iteration). By using `test::black_box`, one would have

```rust
#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(1000);  // optional
        test::black_box((0..n).fold(0, |old, new| old ^ new));
    });
}
```
and the microbenchmark wouldn't have bogus 0ns measurements anymore.

Now, as I stated in the proposal, `test::black_box` currently uses no-op inline asm to "read" from its argument in a way the optimizations can't see. Conceptually, this seems like something that should be modelled in LLVM's IR rather than by hacks higher up the IR food chain because the root problem is caused by LLVM's optimization passes (most of the time this code optimization is desired, just not here). Plus, it seems others have used other tricks to achieve similar effects (ie volatile), so why shouldn't there be something to model this behaviour?
 
Benchmarks that can be const prop'd/etc away are often meaningless. 
 
A benchmark that's completely removed is even more meaningless, and the developer may not even know it's happening.

Write good benchmarks?

No, seriously, i mean, you want benchmarks that test what users will see when the compiler works, not benchmarks that test what users would see if they were to suddenly turn off parts of the optimizers ;)

But users are also not testing how fast deterministic code which LLVM is completely removing can go. This intrinsic prevents LLVM from correctly thinking the code is deterministic (or that a value isn't used) so that measurements are (at the very least, the tiniest bit) meaningful.

I'm not saying this intrinsic will make all benchmarks meaningful (and I can't), I'm saying that it would be useful in Rust in ensuring that tests/benches aren't invalidated simply because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".

True, but why would you want to force that speed bump onto other developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?
It's a completely normal test harness? 

That's in fact, why llvm uses it as a test harness?

I mean I wouldn't write a harness or some other type of workaround for something like this: Rust doesn't seem to be the first to have encountered this issue, thus it is nonsensical to require every project using LLVM to have a separate harness or other workaround so they don't run into this issue. LLVM's own documentation suggests that adding an intrinsic is the best choice moving forward anyway: "Adding an intrinsic function is far easier than adding an instruction, and is transparent to optimization passes. If your added functionality can be expressed as a function call, an intrinsic function is the method of choice for LLVM extension." (from http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

At any rate, I apologize for my original hand-waviness; I am young and inexperienced.

Joerg Sonnenberger via llvm-dev

Nov 6, 2015, 11:36:13 AM11/6/15
to llvm...@lists.llvm.org
On Fri, Nov 06, 2015 at 10:31:23AM -0600, Richard Diamond via llvm-dev wrote:
> On Tue, Nov 3, 2015 at 2:50 PM, Diego Novillo <dnov...@google.com> wrote:
>
> > I don't see how this is any different from volatile markers on
> > loads/stores or memory barriers or several other optimizer blocking
> > devices. They generally end up crippling the optimizers without much added
> > benefit.
> >
>
> Volatile must touch memory (right?). Memory is slow.

No, it just must not be optimised away. The CPU is still free to cache
it.

Joerg

Richard Diamond via llvm-dev

Nov 6, 2015, 11:37:02 AM11/6/15
to Philip Reames, llvm-dev
On Tue, Nov 3, 2015 at 5:09 PM, Philip Reames <list...@philipreames.com> wrote:
The common use case I've seen for a black box like construct is when writing microbenchmarks.  In particular, you're generally looking for a way to "sink" the result of a computation without having that sink outweigh the cost of the thing you're trying to measure.

Common alternate approaches are to use a volatile store (so that it can't be eliminated or sunk out of loops) or a call to an external function with a cheap calling convention. 

As an example:
int a = 5; // initialization is not visible to compiler

This intrinsic is designed to provide exactly this (assuming `__builtin_blackbox` is also added to clang):

int a = __builtin_blackbox(5);
 
int b = 7;
void add_two_globals() {
  sink(a+b);

And:

__builtin_blackbox(a+b);
 
}

If what I'm looking into is the code generation around addition, this is a very useful way of testing the entire compiler - frontend, middle end, and backend.

I'll note that we use such a framework extensively. 

What I'm not clear on is why this needs to be an intrinsic.  Why does a call to an external function or a volatile store not suffice? 

Because Rust's `test::black_box` is generic, it can't be an external function. Also, code using this will actually be executed, so it's also not good enough to leave it undefined.

Richard Diamond via llvm-dev

Nov 6, 2015, 11:42:21 AM11/6/15
to Joerg Sonnenberger, llvm-dev
On Fri, Nov 6, 2015 at 10:35 AM, Joerg Sonnenberger via llvm-dev <llvm...@lists.llvm.org> wrote:
On Fri, Nov 06, 2015 at 10:31:23AM -0600, Richard Diamond via llvm-dev wrote:
> On Tue, Nov 3, 2015 at 2:50 PM, Diego Novillo <dnov...@google.com> wrote:
>
> > I don't see how this is any different from volatile markers on
> > loads/stores or memory barriers or several other optimizer blocking
> > devices.  They generally end up crippling the optimizers without much added
> > benefit.
> >
>
> Volatile must touch memory (right?). Memory is slow.

No, it just must not be optimised away. The CPU is still free to cache
it.

Hm, okay. Well, at any rate it isn't guaranteed; I will do some tests to check anyway. 

Daniel Berlin via llvm-dev

Nov 6, 2015, 12:27:43 PM11/6/15
to Richard Diamond, llvm-dev

Great!

You should then test that this happens, and additionally write a test that can't be optimized away, since the above is apparently not a useful microbenchmark for anything but the compiler ;-)

Seriously though, there are basically three cases (with a bit of handwaving)

1. You want to test that the compiler optimizes something in a certain way.  In the above example, without anything else, you actually want to test that the compiler optimizes this away completely.
This doesn't require anything except using something like FileCheck and producing IR at the end of Rust's optimization pipeline.

2. You want to make the above code into a benchmark, and ensure the compiler is required to keep the number and relative order of certain operations.
Use volatile for this.

Volatile is not what you seem to think it is; you may be thinking of it in terms of what people use it for in C/C++.
volatile in llvm has a well defined meaning:
http://llvm.org/docs/LangRef.html#volatile-memory-accesses

3. You want to get the compiler to only do certain optimizations to your code.

Yes, you have to either write a test harness (even if that test harness is "your normal compiler, with certain flags passed"), or use ours, for that ;-)


It seems like you want #2, so you should use volatile.

But don't conflate #2 and #3.

As said:
If you want the compiler to only do certain things to your code, you should tell it to only do those things by giving it a pass pipeline that only does those things.  Nothing else is going to solve this problem well.

If you want the compiler to do every optimization it knows to your code, but want it to maintain the number and relative order of certain operations, that's volatile.

Mehdi Amini via llvm-dev

Nov 6, 2015, 1:18:02 PM11/6/15
to Richard Diamond, llvm-dev
Now, as I stated in the proposal, `test::black_box` currently uses no-op inline asm to "read" from its argument in a way the optimizations can't see. Conceptually, this seems like something that should be modelled in LLVM's IR rather than by hacks higher up the IR food chain because the root problem is caused by LLVM's optimization passes (most of the time this code optimization is desired, just not here). Plus, it seems others have used other tricks to achieve similar effects (ie volatile), so why shouldn't there be something to model this behavior?

How would black_box be different from existing mechanism (inline asm, volatile, …)? 
If the effect on the optimizer is not different then there is no reason to introduce a new intrinsic just for the sake of it. It has some cost: any optimization has to take this into account.

On this topic, I think Chandler’s talk at CppCon seems relevant: https://www.youtube.com/watch?v=nXaxk27zwlk

 
Benchmarks that can be const prop'd/etc away are often meaningless. 
 
A benchmark that's completely removed is even more meaningless, and the developer may not even know it's happening. 

Write good benchmarks?

No, seriously, i mean, you want benchmarks that test what users will see when the compiler works, not benchmarks that test what users would see if they were to suddenly turn off parts of the optimizers ;)

But users are also not testing how fast deterministic code which LLVM is completely removing can go. This intrinsic prevents LLVM from correctly thinking the code is deterministic (or that a value isn't used) so that measurements are (at the very least, the tiniest bit) meaningful.

I'm not saying this intrinsic will make all benchmarks meaningful (and I can't), I'm saying that it would be useful in Rust in ensuring that tests/benches aren't invalidated simply because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".

True, but why would you want to force that speed bump onto other developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?
It's a completely normal test harness? 

That's in fact, why llvm uses it as a test harness?

I mean I wouldn't write a harness or some other type of workaround for something like this: Rust doesn't seem to be the first to have encountered this issue, thus it is nonsensical to require every project using LLVM to have a separate harness or other workaround so they don't run into this issue. LLVM's own documentation suggests that adding an intrinsic is the best choice moving forward anyway: "Adding an intrinsic function is far easier than adding an instruction, and is transparent to optimization passes. If your added functionality can be expressed as a function call, an intrinsic function is the method of choice for LLVM extension." (from http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

The doc is about if you *need* to extend LLVM, then you should try with intrinsic instead of adding an instruction, it is the “need” part that is not clear here. The doc also states that an intrinsic is transparent to optimization passes, but it is not the case here since you want to prevent optimizations from happening (and you haven’t really specified how to decide what can an optimization do around this intrinsic, because if you don’t teach the optimizer about it, it will treat it as an external function call).

— 
Mehdi


Sean Silva via llvm-dev

Nov 6, 2015, 10:03:10 PM11/6/15
to Richard Diamond, llvm-dev
I still don't understand what you are trying to test with this.

Are you trying to measure e.g. the performance of

xor %eax, %eax
1:
xor %esi, %eax
dec %esi
jnz 1b

?

are you trying to measure whether the compiler will vectorize this? Are you trying to test how well the compiler will vectorize this? Are you trying to measure the compiler's unrolling heuristics? Are you trying to see if the (0..n).fold(...) machinery gets lowered to a loop? Are you trying to see if the compiler will reduce it to (n & ((n&1)-1)) ^ ((n ^ (n >> 1))&1) or a similar closed form expression? (I'm sure that's not the simplest one; just one I cooked up)

I'm honestly curious.

-- Sean Silva
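As an aside, the fold in question does have a simple closed form. A quick sketch checking the standard XOR-of-a-range identity (my verification, not Sean's expression):

```rust
// XOR of 0..n (exclusive), exactly as the Rust fold computes it.
fn xor_fold(n: u32) -> u32 {
    (0..n).fold(0, |old, new| old ^ new)
}

// Closed form: XOR of 0..=m cycles with m % 4 as (m, 1, m + 1, 0).
fn xor_closed(n: u32) -> u32 {
    if n == 0 {
        return 0;
    }
    let m = n - 1; // the fold covers 0..=m
    match m % 4 {
        0 => m,
        1 => 1,
        2 => m + 1,
        _ => 0,
    }
}
```

In particular `(0..1000).fold(0, |old, new| old ^ new)` is just 0, which is exactly why the whole benchmark folds away.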
 

Now, as I stated in the proposal, `test::black_box` currently uses no-op inline asm to "read" from its argument in a way the optimizations can't see. Conceptually, this seems like something that should be modelled in LLVM's IR rather than by hacks higher up the IR food chain because the root problem is caused by LLVM's optimization passes (most of the time this code optimization is desired, just not here). Plus, it seems others have used other tricks to achieve similar effects (ie volatile), so why shouldn't there be something to model this behaviour?
 
Benchmarks that can be const prop'd/etc away are often meaningless. 
 
A benchmark that's completely removed is even more meaningless, and the developer may not even know it's happening.

Write good benchmarks?

No, seriously, i mean, you want benchmarks that tests what users will see when the compiler works, not benchmarks that test what users see if the were to suddenly turn off parts of the optimizers ;)

But users are also not testing how fast deterministic code which LLVM is completely removing can go. This intrinsic prevents LLVM from correctly thinking the code is deterministic (or that a value isn't used) so that measurements are (at the very least, the tiniest bit) meaningful.

I'm not saying this intrinsic will make all benchmarks meaningful (and I can't), I'm saying that it would be useful in Rust in ensuring that tests/benches aren't invalidated simply because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run "the passes you want on it" instead of "trying to stop certain passes from doing things to it".

True, but why would you want to force that speed bump onto other developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?
It's a completely normal test harness? 

That's in fact, why llvm uses it as a test harness?

I mean I wouldn't write a harness or some other type of workaround for something like this: Rust doesn't seem to be the first to have encountered this issue, thus it is nonsensical to require every project using LLVM to have a separate harness or other workaround so they don't run into this issue. LLVM's own documentation suggests that adding an intrinsic is the best choice moving forward anyway: "Adding an intrinsic function is far easier than adding an instruction, and is transparent to optimization passes. If your added functionality can be expressed as a function call, an intrinsic function is the method of choice for LLVM extension." (from http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

At any rate, I apologize for my original hand-waviness; I am young and inexperienced.

Philip Reames via llvm-dev

unread,
Nov 9, 2015, 1:20:53 PM11/9/15
to Richard Diamond, llvm-dev
This statement doesn't make sense to me.  What does generic have to do with being external?  And why can't an external function be defined?

Philip

Steven Stewart-Gallus via llvm-dev

unread,
Nov 9, 2015, 4:23:38 PM11/9/15
to llvm...@lists.llvm.org
Hello, I think what you want are intrinsics similar to the following macros, right?

#define PUBLISH_WRITES_TO_VAR(X) __asm__ __volatile__("":: "m"((X)))
#define OBSERVE_WRITES_TO_VAR(X) __asm__ __volatile__("": "=m"((X)))

Thank you,
Steven Stewart-Gallus

Alex Elsayed via llvm-dev

unread,
Nov 9, 2015, 9:00:17 PM11/9/15
to llvm...@lists.llvm.org
On Fri, 06 Nov 2015 09:27:32 -0800, Daniel Berlin via llvm-dev wrote:

<snip>

I think the fundamental thing you're missing is that benchmarks are an
exercise in if/then:

*If* a user exercises this API, *then* how well would it perform?

Of course, in the case of a user, the data could come from anywhere, and
go anywhere - the terminal, a network socket, whatever.

However, in a benchmark, all the data comes from (and goes to) places
the compiler can see.

Thus, it's necessary to make the compiler _pretend_ the data came from
and goes to a "black box", in order for the benchmarks to even *remotely*
resemble what they're meant to test.

This is actually distinct from #1, #2, _and_ #3 above - quite simply,
what is needed is a way to simulate a "real usage" scenario without
actually contacting the external world.

Daniel Berlin via llvm-dev

unread,
Nov 9, 2015, 9:04:29 PM11/9/15
to Alex Elsayed, llvm-dev

I think the fundamental thing you're missing is that benchmarks are an
exercise in if/then:

I don't believe i'm actually missing anything.
 

*If* a user exercises this API, *then* how well would it perform?

In this case, he exercised the API, and the compiler optimized it away.
If he wants to test whether the API, exercised in some other way, will be optimized, he should test that.

Your argument that this can't be tested is *almost always* false.  It's true sometimes, but that's *actually pretty rare*.


Of course, in the case of a user, the data could come from anywhere, and
go anywhere - the terminal, a network socket, whatever.

However, in a benchmark, all the data comes from (and goes to) places the
compiler can see.

This is not necessarily true, but as I said, the way around this is to use volatile.
 

Thus, it's necessary to make the compiler _pretend_ the data came from
and goes to a "black box", in order for the benchmarks to even *remotely*
resemble what they're meant to test.


This is actually distinct from #1, #2, _and_ #3 above - quite simply,
what is needed is a way to simulate a "real usage" scenario without
actually contacting the external world.


No. It is not distinct in any way, and you haven't shown why it is.
It is exactly "I want this operation to happen exactly X times, regardless of whether the compiler thinks it is necessary or can be removed".

That is volatile.

Reid Kleckner via llvm-dev

unread,
Nov 9, 2015, 10:46:48 PM11/9/15
to Daniel Berlin, llvm-dev, Alex Elsayed
One thing that volatile doesn't do is escape results that have been written to memory.

The proposed blackbox intrinsic is modeled as reading and writing any pointed to memory, which is useful.

I also think blackbox will be a lot easier for people to use than empty volatile inline asm and volatile loads and stores. That alone seems worth something. :)

Daniel Berlin via llvm-dev

unread,
Nov 9, 2015, 11:58:49 PM11/9/15
to Reid Kleckner, llvm-dev, Alex Elsayed
On Mon, Nov 9, 2015 at 7:46 PM, Reid Kleckner <r...@google.com> wrote:
One thing that volatile doesn't do is escape results that have been written to memory.


Honestly, i'd probably rather see attributes or something than this intrinsic.

That said ....
 
 
The proposed blackbox intrinsic is modeled as reading and writing any pointed to memory, which is useful.


The proposed intrinsic does not have a well-defined set of semantics.
If it did, and those semantics made sense, I think you'd see a lot less pushback.
 
I also think blackbox will be a lot easier for people to use than empty volatile inline asm and volatile loads and stores. That alone seems worth something. :)


Yes, but at the same time, it seems very "Do What I Mean".

Those kinds of intrinsics rarely turn out to be sane and maintainable ;-)

Jeroen Dobbelaere via llvm-dev

unread,
Nov 10, 2015, 5:04:37 AM11/10/15
to Richard Diamond, llvm...@lists.llvm.org

Hi Richard,

 

why don't you use an inline assembly that returns your argument in a register?

For example:

----
int foo(int a, int b)
{
  int c = a + b + 10;
  __asm__ volatile ("" : "=r"(c) : "0"(c) : "memory");

  return c + 20;
}
----

results in (note that the +10 and +20 were not combined):

----
foo:                                    # @foo
        .cfi_startproc
# BB#0:
        leal    10(%rdi,%rsi), %eax
        #APP
        #NO_APP
        addl    $20, %eax
        retq
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc
----

At llvm-ir level, it looks like:

----
define i32 @foo(i32 %a, i32 %b) #0 {
  %1 = add i32 %a, 10
  %2 = add i32 %1, %b
  %3 = tail call i32 asm sideeffect "", "=r,0,~{memory},~{dirflag},~{fpsr},~{flags}"(i32 %2) #1, !srcloc !1
  %4 = add nsw i32 %3, 20
  ret i32 %4
}
----

Greetings,

Jeroen Dobbelaere

 

 

 


Reid Kleckner via llvm-dev

unread,
Nov 11, 2015, 1:32:23 PM11/11/15
to Daniel Berlin, llvm-dev, Alex Elsayed
I think the idea is to model the intrinsic as a normal external function call:
- Can read/write escaped memory
- Escapes pointer args
- Functionattrs cannot infer anything about it
- Returns a pointer which may alias any escaped data

It's obvious how this works at the IR level, but I'm not sure what would happen in the backend. If you compile the intrinsic to nothing but a virtual register copy, MI optimizations will kick in. You might get away with compiling it to a volatile store/load, and hope that the time for high-level memory optimizations like GVN is over.

Daniel Berlin via llvm-dev

unread,
Nov 11, 2015, 1:41:34 PM11/11/15
to Reid Kleckner, llvm-dev, Alex Elsayed
On Wed, Nov 11, 2015 at 10:32 AM, Reid Kleckner <r...@google.com> wrote:
I think the idea is to model the intrinsic as a normal external function call: 
- Can read/write escaped memory
- Escapes pointer args
- Functionattrs cannot infer anything about it
- Returns a pointer which may alias any escaped data

As you point out so nicely, there is already a list of stuff that external function calls may do, but we may be able to prove things about them anyway due to attributes, etc.  

So it's not just an external function call, it's a super-magic one.

Now, can we handle that?
Sure.

For example,  i can move external function calls if i can prove things about their dependencies, and the above list is not sufficient to prevent me from moving (or PRE'ing) most of the blackbox calls that just take normal non-pointer args.
Is that going to be okay?

(Imagine, for example, LTO modes where i can guarantee i have the entire program, etc.
You still want blackbox to be magically special in these modes, even though nothing else is).

Reid Kleckner via llvm-dev

unread,
Nov 11, 2015, 2:06:53 PM11/11/15
to Daniel Berlin, llvm-dev, Alex Elsayed
On Wed, Nov 11, 2015 at 10:41 AM, Daniel Berlin <dbe...@dberlin.org> wrote:
On Wed, Nov 11, 2015 at 10:32 AM, Reid Kleckner <r...@google.com> wrote:
I think the idea is to model the intrinsic as a normal external function call: 
- Can read/write escaped memory
- Escapes pointer args
- Functionattrs cannot infer anything about it
- Returns a pointer which may alias any escaped data

As you point out so nicely, there is already a list of stuff that external function calls may do, but we may be able to prove things about them anyway due to attributes, etc.  

So it's not just an external function call, it's a super-magic one.

Right, an external function, with a definition that the compiler will never find.
 
Now, can we handle that?
Sure.

For example,  i can move external function calls if i can prove things about their dependencies, and the above list is not sufficient to prevent me from moving (or PRE'ing) most of the blackbox calls that just take normal non-pointer args.
Is that going to be okay?

(Imagine, for example, LTO modes where i can guarantee i have the entire program, etc.
You still want blackbox to be magically special in these modes, even though nothing else is).

Sure, the compiler can reorder all the memory accesses to non-escaped memory as it sees fit across the barrier. That's part of the normal modelling of external calls.

I don't know how you could CSE it, though. Any call you can't reason about can always use inline asm to talk to external devices or issue a write syscall. I don't know how you could practically deploy a super-duper LTO mode that doesn't allow that as part of its model.

The following CFG simplification would be legal, as it also fits the normal model of an external call:
if (cond) y = llvm.blackbox(x)
else y = llvm.blackbox(x)
-->
y = llvm.blackbox(x)

I don't see how this is special. It just provides an overloaded intrinsic whose definition we promise to never reason about. Other than that it follows the same familiar rules that function calls do.

Daniel Berlin via llvm-dev

unread,
Nov 11, 2015, 2:13:54 PM11/11/15
to Reid Kleckner, llvm-dev, Alex Elsayed
On Wed, Nov 11, 2015 at 11:06 AM, Reid Kleckner <r...@google.com> wrote:
On Wed, Nov 11, 2015 at 10:41 AM, Daniel Berlin <dbe...@dberlin.org> wrote:
On Wed, Nov 11, 2015 at 10:32 AM, Reid Kleckner <r...@google.com> wrote:
I think the idea is to model the intrinsic as a normal external function call: 
- Can read/write escaped memory
- Escapes pointer args
- Functionattrs cannot infer anything about it
- Returns a pointer which may alias any escaped data

As you point out so nicely, there is already a list of stuff that external function calls may do, but we may be able to prove things about them anyway due to attributes, etc.  

So it's not just an external function call, it's a super-magic one.

Right, an external function, with a definition that the compiler will never find.
 
Now, can we handle that?
Sure.

For example,  i can move external function calls if i can prove things about their dependencies, and the above list is not sufficient to prevent me from moving (or PRE'ing) most of the blackbox calls that just take normal non-pointer args.
Is that going to be okay?

(Imagine, for example, LTO modes where i can guarantee i have the entire program, etc.
You still want blackbox to be magically special in these modes, even though nothing else is).

Sure, the compiler can reorder all the memory accesses to non-escaped memory as it sees fit across the barrier. That's part of the normal modelling of external calls.

I don't know how you could CSE it, though.

You'd have to make sure everything you ever built in LLVM handled this particular intrinsic specially.
 
Any call you can't reason about 
can always use inline asm to talk to external devices or issue a write syscall.

Heck, i could even reason about inline asm if i wanted to ;-).

My point is that this call is super special compared to all other calls, and literally everything in LLVM has to understand that.
The likelihood of subtle bugs being introduced in functionality (IE analysis/etc doing the wrong thing because it is not special cased) seems super high to me.
 
I don't know how you could practically deploy a super-duper LTO mode that doesn't allow that as part of its model.
 
Sure.

The following CFG simplification would be legal, as it also fits the normal model of an external call:
if (cond) y =llvm.blackbox(x)
else y = llvm.blackbox(x)
-->
y = llvm.blackbox(x)

I don't see how this is special. It just provides an overloaded intrinsic whose definition we promise to never reason about. Other than that it follows the same familiar rules that function calls do.

You have now removed some conditional evaluation and jumps; those would normally take benchmark time.
Why is that okay?


Alex Elsayed via llvm-dev

unread,
Nov 11, 2015, 4:01:58 PM11/11/15
to llvm...@lists.llvm.org
On Wed, 11 Nov 2015 11:13:43 -0800, Daniel Berlin via llvm-dev wrote:
<snip for gmane>

> Heck, i could even reason about inline asm if i wanted to ;-).
>
> My point is that this call is super special compared to all other
> calls,
> and literally everything in LLVM has to understand that.
> The likelihood of subtle bugs being introduced in functionality (IE
> analysis/etc doing the wrong thing because it is not special cased)
> seems super high to me.

I do agree this is a concern.



>> I don't know how you could practically deploy a super-duper LTO mode
>> that doesn't allow that as part of its model.
>>
>>
> Sure.
>
>
>> The following CFG simplification would be legal, as it also fits the
>> normal model of an external call:
>> if (cond) y =llvm.blackbox(x)
>> else y = llvm.blackbox(x)
>> -->
>> y = llvm.blackbox(x)
>>
>> I don't see how this is special. It just provides an overloaded
>> intrinsic whose definition we promise to never reason about. Other than
>> that it follows the same familiar rules that function calls do.
>>
>>
> You have now removed some conditional evaluation and jumps. those
> would normally take benchmark time.
> Why is that okay?

Because the original post in terms of wanting to inhibit specific
optimizations was a flawed way of describing the problem.

Reid's explanation of "an external function that LLVM is not allowed to
reason about the body of" is a much better explanation, as a good
benchmark will place llvm.blackbox() exactly where real code would call,
say, getrandom() (on input) or printf() (on output).

However, as the function call overhead of said external function isn't
part of the _developer's_ code, and isn't something they can make faster
in case of slow results, it's not relevant to the benchmarks - thus,
using an _actual_ external function is suboptimal, even leaving aside
that with LTO and such, llvm may STILL infer things about such
functions, obviating the benchmark.

Perhaps the best explanation is that it's about *simulating the
existence* of a "perfectly efficient" external world.

James Molloy via llvm-dev

unread,
Nov 11, 2015, 4:12:49 PM11/11/15
to Alex Elsayed, llvm...@lists.llvm.org
Hi,

I don't understand why in your benchmarks you're not just using an external function, and then have an empty benchmark that just calls that external function so you can know what overhead that adds. A bunch of benchmark suites do this already, and chandler mentions it in the talk posted earlier.

James

Daniel Berlin via llvm-dev

unread,
Nov 11, 2015, 4:15:59 PM11/11/15
to Alex Elsayed, llvm-dev


Reid's explanation of "an external function that LLVM is not allowed to
reason about the body of" is a much better explanation, as a good
benchmark will place llvm.blackbox() exactly where real code would call,
say, getrandom() (on input) or printf() (on output).


However, as the function call overhead of said external function isn't
part of the _developer's_ code,

This isn't call overhead though.
It's a conditional and two calls someone wrote in some benchmark code.
That's not call overhead ;-)

It's just that i've proven the condition has no side effects and doesn't matter, so i eliminated it.

Thus, I'm trying to ask the question: "Will the use case really still be served if we let us eliminate these conditionals as useless, when the whole point is to let people test the overhead of things the compiler wanted to eliminate because it thinks they are useless"
;-)

Alex Elsayed via llvm-dev

unread,
Nov 11, 2015, 4:21:38 PM11/11/15
to llvm...@lists.llvm.org
On Wed, 11 Nov 2015 13:15:45 -0800, Daniel Berlin via llvm-dev wrote:


>>
>>
>> Reid's explanation of "an external function that LLVM is not allowed to
>> reason about the body of" is a much better explanation, as a good
>> benchmark will place llvm.blackbox() exactly where real code would
>> call,
>> say, getrandom() (on input) or printf() (on output).
>>
>>
>
>> However, as the function call overhead of said external function isn't
>> part of the _developer's_ code,
>
>
> This isn't call overhead though.
> It's a conditional and two calls someone wrote in some benchmark code.
> That's not call overhead ;-)

I meant the prologue/epilogue of the external function, but James'
response is relevant there.

> It's just that i've proven the condition has no side effects and doesn't
> matter, so i eliminated it.

Yes. That's perfectly fine. You could do the exact same thing with
getrandom()'s result, or printf() calls.

That was my point.

> Thus, I'm trying to ask the question: "Will the use case really still be
> served if we let us eliminate these conditionals as useless, when the
> whole point is to let people test the overhead of things the compiler
> wanted to eliminate because it thinks they are useless"
> ;-)

And my answer was "Yes, emphatically so, as you're continually restating
what I consider a deeply flawed summation of what it's trying to solve."

> the whole point is to let people test the overhead of things the
> compiler wanted to eliminate because it thinks they are useless"

This is _incorrect_.

The point is to _model the behavior of the benchmarked code AS IF the
data goes to and comes from a place we know nothing about_.

These are fundamentally different things, which is why I keep restating
it.

Sean Silva via llvm-dev

unread,
Nov 11, 2015, 10:14:39 PM11/11/15
to Alex Elsayed, llvm-dev
Can you show a real benchmark that users have tried to write where the call overhead of actually using an external function call is measurable? A no-op function call is going to take maybe a dozen cycles max (inside a loop, so good branch prediction etc.). Anything where a dozen cycles is measurable by comparison basically can't be reasoned about at the C++ level (you are basically benchmarking at asm level at that point, so just write it in asm).

More generally, there's only been (I think) 1 concrete example given in this thread (the xor fold thing). Could you please give like 5 distinct real-world examples? That would help us get a feel for the real motivation here and why an external function call wouldn't work. (also, presumably this is a consistent problem that has been cropping up in practice if you are going to the length of wanting to add an IR intrinsic that, as Daniel points out, has implications throughout the compiler)

-- Sean Silva

Joerg Sonnenberger via llvm-dev

unread,
Nov 12, 2015, 9:31:16 AM11/12/15
to llvm...@lists.llvm.org
On Wed, Nov 11, 2015 at 07:14:28PM -0800, Sean Silva via llvm-dev wrote:
> Can you show a real benchmark that users have tried to write where the call
> overhead of actually using an external function call is measurable?

This is the wrong question. The correct question is: What useful
benchmark cannot trivially factor out the overhead of the external
function call. Yes, if you do microbenchmarking it can be measurable.
But the point is that the overhead should be extremely predictable and
stable. As such, it can be easily calibrated and removed from the cost
of whatever you are really trying to measure. Given that the
instrumentation in general has some latency, you won't get around
calibration anyway.

Joerg

Richard Diamond via llvm-dev

unread,
Nov 16, 2015, 12:00:11 PM11/16/15
to llvm-dev

Hey all,

I apologize for my delay with my reply to you all (two tests last week, with three more coming this week).

I appreciate all of your inputs. Based on the discussion, I’ve refined the scope and purpose of llvm.blackbox, at least as it pertains to Rust’s desired use case. Previously, I left the intrinsic only vaguely specified; based on the resulting comments, I’ve arrived at a better-defined intrinsic.

Specifically:

  • Calls to the intrinsic must not be removed;
  • Calls may be duplicated;
  • No assumptions about the return value may be made from its argument (including pointer arguments, meaning the returned pointer is may-alias for all queries);
  • It must be assumed that every byte of the pointer element type of the argument will be read (if the argument is a pointer); and
  • These rules must be maintained all the way to machine code emission, at which point the first rule is necessarily broken (the call is lowered to nothing).

All other optimizations are fair game.

The above is a bit involved to be sure, and seeing how this intrinsic isn’t critical, I’m fine with leaving it at the worst case (ie read/write mem + other side effects) for now.

Why?

Alex summed it up best: “[..] it’s about simulating the existence of a “perfectly efficient” external world.” This intrinsic would serve as an aid for benchmarking, ensuring benchmark code is still relevant after optimizations are performed on it, and is an attempt to create a dedicated escape hatch to be used in place of the alternatives I’ve listed below.

Alternatives

In no particular order:

  • Volatile stores

Not ideal for benchmarking (isn’t guaranteed to cache), nonetheless I made an attempt to measure the effects on Rustc’s set of benchmarks. However, I found an issue with rustc which blocks attempts to measure the effect: https://github.com/rust-lang/rust/issues/29663.

  • Inline asm which “uses” a pointer to the value

Rust’s current solution. Needs stack space.

  • Inline asm which returns the value

Won’t work for any type bigger than a register; at least not without adding a rustc intrinsic to make the asm operate piecewise on the register-sized component types of the type. And how is rustc to know exactly which types are register sized or smaller? rustc mostly leaves such knowledge to LLVM.

Good idea, but the needed logistics would make it ugly.

  • Mark test::black_box as noinline

Also not ideal because of the mandatory function call overhead.

  • External function

Impossible for Rust: generics are monomorphised into the crate in which they are used (ie the resulting function in IR won’t ever be external to the module using it). Also, Rust doesn’t allow function overloading, so C++-style explicit specialization is out. This option also suffers from the same aforementioned call overhead.

Again, comments are welcome.
Richard Diamond

James Molloy via llvm-dev

unread,
Nov 16, 2015, 1:03:45 PM11/16/15
to Richard Diamond, llvm-dev
Hi Richard,

You don't appear to have addressed my suggestion to not require a perfect external world, instead to measure the overhead of an imperfect world (by using an empty benchmark) and subtracting that from the measured benchmark score.

Besides which, absolute benchmark results are more often than not totally useless - the really important part of benchmarking is relative differences. Certainly in my experience I've never needed to care about absolute numbers, and I wonder why you do.

Cheers,

James

Dmitri Gribenko via llvm-dev

unread,
Nov 16, 2015, 10:00:17 PM11/16/15
to James Molloy, llvm-dev
On Mon, Nov 16, 2015 at 10:03 AM, James Molloy via llvm-dev
<llvm...@lists.llvm.org> wrote:
> You don't appear to have addressed my suggestion to not require a perfect
> external world, instead to measure the overhead of an imperfect world (by
> using an empty benchmark) and subtracting that from the measured benchmark
> score.

In microbenchmarks, performance is not additive. You can't compose
two pieces of code and predict that the benchmark results will be the
sum of the individual measurements.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <grib...@gmail.com>*/

Sean Silva via llvm-dev

unread,
Nov 16, 2015, 11:56:01 PM11/16/15
to Dmitri Gribenko, llvm-dev
On Mon, Nov 16, 2015 at 6:59 PM, Dmitri Gribenko via llvm-dev <llvm...@lists.llvm.org> wrote:
On Mon, Nov 16, 2015 at 10:03 AM, James Molloy via llvm-dev
<llvm...@lists.llvm.org> wrote:
> You don't appear to have addressed my suggestion to not require a perfect
> external world, instead to measure the overhead of an imperfect world (by
> using an empty benchmark) and subtracting that from the measured benchmark
> score.

In microbenchmarks, performance is not additive.  You can't compose
two pieces of code and predict that the benchmark results will be the
sum of the individual measurements.

This sounds like an argument against microbenchmarks in general, rather than against James's point. James' point assumes that you are doing a meaningful benchmark (since, presumably, you are trying to understand the relative performance in your application).

-- Sean Silva

Daniel Berlin via llvm-dev

unread,
Nov 16, 2015, 11:57:15 PM11/16/15
to Richard Diamond, llvm-dev
"Not ideal for benchmarking (isn’t guaranteed to cache),"

Could you clarify what you mean by this?


Dmitri Gribenko via llvm-dev

unread,
Nov 17, 2015, 12:07:56 AM11/17/15
to Sean Silva, llvm-dev
On Mon, Nov 16, 2015 at 8:55 PM, Sean Silva <chiso...@gmail.com> wrote:
>
>
> On Mon, Nov 16, 2015 at 6:59 PM, Dmitri Gribenko via llvm-dev
> <llvm...@lists.llvm.org> wrote:
>>
>> On Mon, Nov 16, 2015 at 10:03 AM, James Molloy via llvm-dev
>> <llvm...@lists.llvm.org> wrote:
>> > You don't appear to have addressed my suggestion to not require a
>> > perfect
>> > external world, instead to measure the overhead of an imperfect world
>> > (by
>> > using an empty benchmark) and subtracting that from the measured
>> > benchmark
>> > score.
>>
>> In microbenchmarks, performance is not additive. You can't compose
>> two pieces of code and predict that the benchmark results will be the
>> sum of the individual measurements.
>
>
> This sounds like an argument against microbenchmarks in general, rather than
> against James's point. James' point assumes that you are doing a meaningful
> benchmark (since, presumably, you are trying to understand the relative
> performance in your application).

Any benchmarking, and especially microbenchmarking, should not be
primarily about measuring the relative performance change. It is a
small scientific experiment, where you don't just get numbers -- you
need to have an explanation of why you are getting these numbers. And,
especially for microbenchmarks, having an explanation, and a way to
validate it, as well as one's assumptions, is critical.

In large system benchmarks performance is not additive either -- when
you have multiple subsystems, cores and queues. But this does not
mean that system-level benchmarks are not useful. As with any benchmark,
they need interpretation.

Sean Silva via llvm-dev

unread,
Nov 17, 2015, 12:34:53 AM11/17/15
to Dmitri Gribenko, llvm-dev
I agree with all this, but I don't understand how adding an external function call would interfere at all. I guess I could rephrase what I was saying as "if having an external function call prevents the benchmark from performing its purpose, it is hard to believe that any conclusions coming from such a benchmark would be applicable to real code". The only exceptions I can think of are extremely low-level asm measurements (which are written in asm anyway, so this whole discussion of llvm.blackbox is irrelevant).

-- Sean Silva

Dmitri Gribenko via llvm-dev

unread,
Nov 17, 2015, 1:11:54 AM11/17/15
to Sean Silva, llvm-dev

The thing is, the black box function one can implement in the language
would not be a perfect substitute for the real producer or consumer.

I don't know about Rust, but in other high-level languages,
implementing a black box as a generic function might cause an overhead
due to the way generic functions are implemented, higher than the
overhead of a regular function call. For example, the value might
need to be moved to the heap to be passed as an unconstrained generic
parameter. This wouldn't be the case in real code, where the function
would be non-generic, and possibly even inlined into the callee.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <grib...@gmail.com>*/

Sean Silva via llvm-dev

unread,
Nov 17, 2015, 10:34:02 PM11/17/15
to Dmitri Gribenko, llvm-dev
How does an intrinsic at IR-level avoid this? If the slowdown from generic-ness is happening at the language/frontend semantic level, then an IR intrinsic doesn't seem like it would help.

-- Sean Silva

Dmitri Gribenko via llvm-dev

unread,
Nov 17, 2015, 11:19:43 PM11/17/15
to Sean Silva, llvm-dev

If such a function can be implemented with an IR-level intrinsic, it
can be inlined (removing all the call overhead), but still keep the
opaque semantics.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <grib...@gmail.com>*/
