[llvm-dev] [RFC] Adding range metadata to array subscripts.

Clement Courbet via llvm-dev

unread,

Mar 24, 2021, 5:15:21 AM3/24/21

to llvm-dev

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM can do better alias analysis. I have a proposal as D99248 (clang part) and D99247 (llvm part). But there are other possible options that I'm detailing below.

Consider the following code, adapted from brotli:

```

struct Histogram {

int values[256];

int total;

};

Histogram DoIt(const int* image, int size) {

Histogram histogram;

for (int i = 0; i < size; ++i) {

++histogram.values[image[i]]; // (A)

++histogram.total; // (B)

}

return histogram;

}

```

In this code, the compiler does not know anything about the values of images[i], so it assumes that 256 is a possible value for it. In that case, (A) would change the value of histogram.total, so (B) has to load, add one and store [godbolt].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use values to form a pointer to total and dereference it. What valid C/C++ code is allowed to do with values is:

- Form any pointer in [values, values + 256].
- Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has no notion of types. In particular, given an LLVM aggregate definition:

```
%struct.S = type { [42 x i32], i32, i32 }
```

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep reference] representing indexing into the [42 x i32] array to load the i32 member at index 2. It is also valid for %i to be 43 (though not 44 if an inbound GEP is used).

So clang has to give LLVM more information about the C/C++ rules.

IR representation:
LLVM has several ways of representing ranges of values:
- !range metadata can be attached to integer call and load instructions to indicate the allowed range of values of the result. LLVM's ValueTracking provides a function for querying the range for any llvm::Variable.
- The llvm.assume intrinsic takes a boolean condition that can also be used by ValueTracking to infer range of values.
- The inrange attribute of GEP can be used to indicate C-like semantics for the structure field marked with the inrange attribute. It can only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for standalone GEPs defining instructions. relevant discussion.

Alternatives:
(1) Annotate each array subscript index value with a range, e.g.:

```
%i = i64 …
%ri = call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %ri
...
!0 = !{i64 0, i64 42}

```
(2) (variant of 1) relax the constraint that !range metadata can only be set on call and load instructions, and set the !range metadata on the index expression. We still need annotations for function parameters though:

```
%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %i
...
!0 = !{i64 0, i64 42}

```
This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations.
(4) Extend inrange to non-constantexprs GEPs. It is unclear how this will interfere with optimizations.

On the clang side:

The clang part is quite trivial as the infrastructure is already in place to emit dynamic ubsan guards: D99248

On the LLVM Side:

Alternatives:
(A) - (annotation or assume options) Simply enable scev-aa which knows how to handle value ranges in general. IIUC it's not enabled in clang because it has issues with invalidation when code changes, and is therefore not cacheable. This makes it too slow to be practical.
(B) - (annotation or assume options) Teach BasicAA to honor !range metadata (D99247)
(C) - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

I was leaning towards (1) and (B) because:

- BasicAA already has basic support for value range analysis (zero/nonzero), this is a small and natural extension.

- The BasicAA improvement already benefits some existing code (as evidenced by the test changes in D99247)

- Using range metadata rather than the `inrange` attribute means that BasicAA will automatically benefit from improvements in value tracking in the future.

Opinions ?

Nuno Lopes via llvm-dev

unread,

Mar 24, 2021, 7:00:27 AM3/24/21

to Clement Courbet, llvm-dev

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

Just my 2c.

Nuno

Clement Courbet via llvm-dev

unread,

Mar 24, 2021, 7:19:56 AM3/24/21

to Nuno Lopes, llvm-dev, Peter Collingbourne

On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunop...@sapo.pt> wrote:

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right?

Agreed, but inrange was not extended to non-const geps. I'm not sure exactly why. From what I understand of the previous discussions, there were non-trivial issues with supporting `inrange` for non-const geps, though it's not exactly clear to me what these were. Peter, what was the blocker for that ?

It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.
In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

Yes, for the purpose of BasicAA `inrange` is a special case of range, where the range happens to be the size of the underlying `inrange` field.

Johannes Doerfert via llvm-dev

unread,

Mar 24, 2021, 9:20:25 AM3/24/21

to Clement Courbet, llvm-dev

I really like encoding more (range) information in the IR,
more thoughts inlined.

On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
> Hi everyone,
>
> tl;dr: I would like to teach clang to output range metadata so that LLVM
> can do better alias analysis. I have a proposal as D99248

> <https://reviews.llvm.org/D99248> (clang part) and D99247
> <https://reviews.llvm.org/D99247> (llvm part). But there are other possible

> options that I'm detailing below.
>
> Consider the following code, adapted from brotli

> <https://en.wikipedia.org/wiki/Brotli>:

>
> ```
>
> struct Histogram {
>
> int values[256];
>
> int total;
>
> };
>
> Histogram DoIt(const int* image, int size) {
>
> Histogram histogram;
>
> for (int i = 0; i < size; ++i) {
>
> ++histogram.values[image[i]]; // (A)
>
> ++histogram.total; // (B)
>
> }
>
> return histogram;
>
> }
> ```
>
> In this code, the compiler does not know anything about the values of
> images[i], so it assumes that 256 is a possible value for it. In that case,
> (A) would change the value of histogram.total, so (B) has to load, add one

> and store [godbolt <https://godbolt.org/z/KxE343>].

>
> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
> values to form a pointer to total and dereference it. What valid C/C++ code
> is allowed to do with values is:
> - Form any pointer in [values, values + 256].
> - Form and dereference any pointer in [values, values + 256)
>
> Note that the LLVM memory model is much laxer than that of C/C++. It has no
> notion of types. In particular, given an LLVM aggregate definition:
>
> ```
> %struct.S = type { [42 x i32], i32, i32 }
> ```
>
> It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
> reference] representing indexing into the [42 x i32] array to load the i32
> member at index 2. It is also valid for %i to be 43 (though not 44 if an
> inbound GEP is used).
> So clang has to give LLVM more information about the C/C++ rules.
>

> *IR representation:*

> LLVM has several ways of representing ranges of values:

> - *!range* metadata can be attached to integer call and load instructions

> to indicate the allowed range of values of the result. LLVM's ValueTracking
> provides a function for querying the range for any llvm::Variable.

> - The *llvm.assume* intrinsic takes a boolean condition that can also be

> used by ValueTracking to infer range of values.

> - The *inrange* attribute of GEP can be used to indicate C-like semantics

> for the structure field marked with the inrange attribute. It can only be
> used for GEP constantexprs (ie.e. GEPs defined inline), but not for
> standalone GEPs defining instructions. relevant discussion

> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>
> Alternatives:
> *(1) *Annotate each array subscript index value with a range, e.g.:

> ```
> %i = i64 …
> %ri = call i64 @llvm.annotation.i64(%index), !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %ri
> ...
> !0 = !{i64 0, i64 42}
> ```

> *(2) *(variant of 1) relax the constraint that !range metadata can only be

> set on call and load instructions, and set the !range metadata on the index
> expression. We still need annotations for function parameters though:
> ```
> %i = i64 … , !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %i
> ...
> !0 = !{i64 0, i64 42}
> ```
> This is slightly more compact.
>

> *(3)* Same as (1), with llvm.assume. This feels inferior to annotations.
> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this will
> interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than `llvm.assume`. If you want to avoid the extra
instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
which is in line with our move towards operand bundle use.

SCEV should be thought about this (as well), unsure what the problem
is you describe below. If BasicAA needs to know, sure.

~ Johannes

>
> *On the clang side*:

> The clang part is quite trivial as the infrastructure is already in place

> to emit dynamic ubsan guards: D99248 <https://reviews.llvm.org/D99248>
>
> *On the LLVM Side:*
> Alternatives:
> *(A)* - (annotation or assume options) Simply enable scev-aa which knows

> how to handle value ranges in general. IIUC it's not enabled in clang
> because it has issues with invalidation when code changes, and is therefore
> not cacheable. This makes it too slow to be practical.

> *(B) *- (annotation or assume options) Teach BasicAA to honor !range
> metadata (D99247 <https://reviews.llvm.org/D99247>)
> *(C)* - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

>
> I was leaning towards (1) and (B) because:
> - BasicAA already has basic support for value range analysis
> (zero/nonzero), this is a small and natural extension.
> - The BasicAA improvement already benefits some existing code (as

> evidenced by the test changes in D99247 <https://reviews.llvm.org/D99247>)

> - Using range metadata rather than the `inrange` attribute means that
> BasicAA will automatically benefit from improvements in value tracking in
> the future.
>
> Opinions ?
>
>

> _______________________________________________
> LLVM Developers mailing list
> llvm...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Clement Courbet via llvm-dev

unread,

Mar 24, 2021, 10:07:02 AM3/24/21

to Johannes Doerfert, llvm-dev

Thanks, I did not know about that. I've just tried it but it appears that tags have to be attribute names, and `!range` is not a valid attribute, it's a metadata node. Is there a way to encode this ?

SCEV should be thought about this (as well), unsure what the problem
is you describe below. If BasicAA needs to know, sure.

scev-aa already knows how to use range information. If we add a !range metadata in clang right now and use SCEV, there is nothing to do on the LLVM side. My point was that scev-aa is not enabled in the default pipeline, so we might as well teach BasicAA about this cheap case.

Actually I think it makes sense to teach BasicAA about range information anyway (D99247) given that it could already be useful in cases like:

```

define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(%struct.Histogram) align 4 %0, i32 %1, i8 zeroext %2) local_unnamed_addr #0 {
...
%6 = zext i8 %2 to i64
%7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0, i32 0, i64 %6
```

Where ValueTracking could easily deduce that %6 is in [0,255].

Johannes Doerfert via llvm-dev

unread,

Mar 24, 2021, 11:16:27 AM3/24/21

to Clement Courbet, llvm-dev

We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.

>
>> SCEV should be thought about this (as well), unsure what the problem
>> is you describe below. If BasicAA needs to know, sure.
>>
> scev-aa already knows how to use range information. If we add a !range
> metadata in clang right now and use SCEV, there is nothing to do on the
> LLVM side. My point was that scev-aa is not enabled in the default
> pipeline, so we might as well teach BasicAA about this cheap case.
>
> Actually I think it makes sense to teach BasicAA about range information
> anyway (D99247) given that it could already be useful in cases like:
>
> ```
> define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(

> %struct.Histogram) align 4 %0, i32 %1, *i8 zeroext %2*) local_unnamed_addr
> #0 {
> ...
> *%6 = zext i8 %2 to i64*

> %7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0

> , i32 0, *i64 %6*

> ```
>
> Where ValueTracking could easily deduce that %6 is in [0,255].

Sounds reasonable.

~ Johannes

Florian Hahn via llvm-dev

unread,

Mar 24, 2021, 1:48:04 PM3/24/21

to Johannes Doerfert, llvm-dev

+1 on trying to use assume, rather than adding another way.

But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?

Cheers,

Florian

Peter Collingbourne via llvm-dev

unread,

Mar 24, 2021, 2:58:01 PM3/24/21

to Clement Courbet, llvm-dev, Peter Collingbourne

On Wed, Mar 24, 2021 at 4:20 AM Clement Courbet via llvm-dev <llvm...@lists.llvm.org> wrote:

On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunop...@sapo.pt> wrote:
In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right?

Agreed, but inrange was not extended to non-const geps. I'm not sure exactly why. From what I understand of the previous discussions, there were non-trivial issues with supporting `inrange` for non-const geps, though it's not exactly clear to me what these were. Peter, what was the blocker for that ?

I don't recall any non-trivial issues. Reading the previous discussion, the only issues were mechanical ones, i.e. we would need an IR representation of inrange for non-constant GEPs together with printing/parsing/bitcode support.

Peter

_______________________________________________

LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--

Peter

Johannes Doerfert via llvm-dev

unread,

Mar 24, 2021, 3:32:10 PM3/24/21

to Florian Hahn, llvm-dev

It would make users explicit and we will have non-attribute bundles anyway.
I find it also "conceptually nicer", would you prefer explicit instructions?

~ Johannes

> Cheers,
> Florian

Clement Courbet via llvm-dev

unread,

Mar 25, 2021, 9:53:27 AM3/25/21

to Johannes Doerfert, llvm-dev

To summarize here is how each proposal would look like, along with a list of pros and cons for each.

%struct.S = type { i32, [2 x i32], i32 }

inrange

define void @with_inrange(%struct.S* %s, i32 %in_array) {
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, inrange i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

(Note that this is not currently valid IR, we need to make `inrange` work on standalone GEP instructions)

pros:

- compact, no extra instructions

cons:

- The range information is local to the GEP instruction, and cannot be easily used for other purposes, e.g. SCEV (the ValueTracker is not aware of it)

annotation with range metadata

define void @with_annotation(%struct.S* %s, i32 %index) {
%in_array = call i32 @llvm.annotation.i32(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

pros:

- a single added instruction

- the ValueTracker already knows about range metadata

- deriving the range in ValueTracker involves a single metadata lookup.

- The range information will be available for other purposes (e.g. SCEV).

cons:

- one extra instruction

- llvm.annotation is not widely used

assume
define void @with_assume(%struct.S* %s, i32 %in_array) {
%cmp1 = icmp sge i32 %in_array, 0
%cmp2 = icmp slt i32 %in_array, 2
%cmp = and i1 %cmp1, %cmp2
call void @llvm.assume(i1 %cmp)
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

pros:

- The range information will be available for other purposes (e.g. SCEV).

cons:

- several extra instructions

- the ValueTracker cannot currently derive a range from this

- Deriving the range can be costly

assume with bundle
define void @with_assume_bundle(%struct.S* %s, i32 %in_array) {
call void @llvm.assume(i1 true) ["range"(i32 %in_array, i32 0, i32 2)]
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

(Note that this is not currently valid IR, we need to make bundle tags work for non-attributes)

pros:

- The range information will be available for other purposes (e.g. SCEV).

- ? (see below)

cons:

- one extra instruction

- ? (see below)

@johannes For the last option, I'm also not sure what becomes of the "range" tag in the assume bundle once we remove the assertion ? Does it become attached to the value (%in_array), in which case we have the advantages of range metadata (option 2), or is this still attached to the assume, with the disadvantages of option 3 ?

Clement Courbet via llvm-dev

unread,

Mar 25, 2021, 10:42:35 AM3/25/21

to Nuno Lopes, llvm-dev

On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunop...@sapo.pt> wrote:

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.
In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

Roman Lebedev via llvm-dev

unread,

Mar 25, 2021, 10:46:33 AM3/25/21

to Clement Courbet, llvm-dev

On Thu, Mar 25, 2021 at 5:42 PM Clement Courbet via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
>
>
> On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunop...@sapo.pt> wrote:
>>
>> In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.
>>
>> In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.
>>
>>
>
>
> So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

Are you aware of upcoming Opaque Pointers?
There won't be any multi-dimensional GEP's after that.

Roman

Clement Courbet via llvm-dev

unread,

Mar 25, 2021, 12:23:22 PM3/25/21

to Roman Lebedev, llvm-dev

On Thu, Mar 25, 2021 at 3:46 PM Roman Lebedev <lebed...@gmail.com> wrote:

On Thu, Mar 25, 2021 at 5:42 PM Clement Courbet via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
>
>
> On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunop...@sapo.pt> wrote:
>>
>> In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.
>>
>> In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.
>>
>>
>
>
> So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341
Are you aware of upcoming Opaque Pointers?
There won't be any multi-dimensional GEP's after that.

I'm not familiar with that ? Does that mean that any N-dimensional GEP will become a chain of N 1-dimensional GEPs ?

Michael Kruse via llvm-dev

unread,

Mar 25, 2021, 12:31:19 PM3/25/21

to Clement Courbet, llvm-dev

Hi,

there might be some overlap with this proposal:
https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html

Michael

Am Do., 25. März 2021 um 08:53 Uhr schrieb Clement Courbet via
llvm-dev <llvm...@lists.llvm.org>:

Michael Kruse via llvm-dev

unread,

Mar 25, 2021, 12:33:28 PM3/25/21

to Clement Courbet, llvm-dev

Am Do., 25. März 2021 um 09:42 Uhr schrieb Clement Courbet via
llvm-dev <llvm...@lists.llvm.org>:

> So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

I think this is only for the internal representation. According to the
reference (https://llvm.org/docs/LangRef.html#id230), it is allowed
for each dimension.

Michael

Johannes Doerfert via llvm-dev

unread,

Mar 25, 2021, 1:11:50 PM3/25/21

to Clement Courbet, llvm-dev

On 3/25/21 8:53 AM, Clement Courbet wrote:
> To summarize here is how each proposal would look like, along with a list
> of pros and cons for each.
>
> %struct.S = type { i32, [2 x i32], i32 }
>

> *inrange*

> define void @with_inrange(%struct.S* %s, i32 %in_array) {
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> inrange i32 %in_array
> %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
> ret void
> }
>
> (Note that this is not currently valid IR, we need to make `inrange` work
> on standalone GEP instructions)
> pros:
> - compact, no extra instructions
> cons:
> - The range information is local to the GEP instruction, and cannot be
> easily used for other purposes, e.g. SCEV (the ValueTracker is not aware of
> it)

This seems fine to me, SCEV and others can learn about this.
I assume we'll want the range `llvm.assume` anyway but that doesn't
necessarily preclude this.

> *annotation with range metadata*

> define void @with_annotation(%struct.S* %s, i32 %index) {
> %in_array = call i32 @llvm.annotation.i32(%index), !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
> %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
> ret void
> }
>
> pros:
> - a single added instruction
> - the ValueTracker already knows about range metadata
> - deriving the range in ValueTracker involves a single metadata lookup.
> - The range information will be available for other purposes (e.g. SCEV).
> cons:
> - one extra instruction
> - llvm.annotation is not widely used

This is my least favorite. It introduces a completely new concept
for something that should be attached to the GEP or assume.

>
>
> *assume*

> define void @with_assume(%struct.S* %s, i32 %in_array) {
> %cmp1 = icmp sge i32 %in_array, 0
> %cmp2 = icmp slt i32 %in_array, 2
> %cmp = and i1 %cmp1, %cmp2
> call void @llvm.assume(i1 %cmp)
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
> %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
> ret void
> }
>
> pros:
> - The range information will be available for other purposes (e.g. SCEV).
> cons:
> - several extra instructions
> - the ValueTracker cannot currently derive a range from this
> - Deriving the range can be costly

> <https://github.com/llvm/llvm-project/blob/dd388ba3e0b0a5f06565d0bcb6e1aebb5daac065/llvm/lib/Analysis/ValueTracking.cpp#L638>

This is the old assume way and OK. I don't think the cost of
looking at assumptions is a good argument because we should
come up with solutions there anyway. The extra instructions
and uses are the real downside here.

>
> *assume with bundle*

> define void @with_assume_bundle(%struct.S* %s, i32 %in_array) {
> call void @llvm.assume(i1 true) ["range"(i32 %in_array, i32 0, i32 2)]
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
> %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
> ret void
> }
>
> (Note that this is not currently valid IR, we need to make bundle tags work
> for non-attributes)
> pros:
> - The range information will be available for other purposes (e.g. SCEV).
> - ? (see below)
> cons:
> - one extra instruction
> - ? (see below)
>
> @johannes For the last option, I'm also not sure what becomes of the
> "range" tag in the assume bundle once we remove the assertion ? Does it
> become attached to the value (%in_array), in which case we have the
> advantages of range metadata (option 2), or is this still attached to the
> assume, with the disadvantages of option 3 ?

So, as I said before, we'll have to make bundles work for non-attribute
tags anyway, that should not be a negative point here. The extra instruction
is there but we already know how to combine multiple `llvm.assume`s into
one.

The tag of the bundle describes the kind of assumption. In general, it could
be pretty much anything, the user needs to know how to interpret the
arguments.

All arguments of a bundle are "attached" to the tag, think of the tag as a
function name of an uninterpreted function. The "range" one could look
like this
```
void range(int v, int lb, int ub) {
__builtin_assume(lb <= v && v < ub);
}
```

Later I'm confident we even want such explicit assumption functions, so
you can
write the following, given the "range" definition above is in the module.
`call void @llvm.assume(i1 true) ["assumption_fn"(@range, i32 %in_array,
i32 0, i32 2)]`
For more justification see:
https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

I'm not sure I understand the questions properly. It basically is range
metadata attached
to an arbitrary value for a fixed program point. Could you elaborate on
your comments wrt.
the connection to option 2 and 3, please.

~ Johannes

Florian Hahn via llvm-dev

unread,

Mar 27, 2021, 2:31:08 PM3/27/21

to Johannes Doerfert, llvm-dev

One disadvantage of using a bundle (or !range metadata) is that we treat ranges for certain values in a special way and differently to how we treat range information expressed by the user e.g. via conditions (or builtin assume). This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).

Using instruction potentially also allows us to specify more complex ranges, in relation to other values.

But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.

Cheers,

Florian

Johannes Doerfert via llvm-dev

unread,

Mar 27, 2021, 4:37:18 PM3/27/21

to Florian Hahn, llvm-dev

I don't think this is necessarily accurate. We can, and already do
(https://godbolt.org/z/MaMEb1Koo),
generate bundles from conditions. If we can interpret a condition, why
could we not rewrite it into a bundle?
I'm not sure why this is any different form other normalization we do.
(And bundles have benefits over implicit
instruction encodings, for example use tracking and #instructions.)

> This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).
> Using instruction potentially also allows us to specify more complex ranges, in relation to other values.

I don't see how bundles would restrict us in any way. I mean, if we want
to express property XYZ for %v and %q,
`llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and it is
arguably as generic as you want it to be.

> But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.

I think bundles, and more generic assumptions, are what we need in the
future. I still believe we should use them
to encode information in assertions [0], among other things, without
running into the risk of having side-effects
that influence the compilation.

~ Johannes

[0] https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

James Courtier-Dutton via llvm-dev

unread,

Mar 28, 2021, 4:50:37 AM3/28/21

to Johannes Doerfert, llvm-dev

Hi,

char* test_fill(int size) {
char *test1 = malloc(size)
for (n = 0; n <= size; n++) {
test1[n] = 'A';
}
}

Would it be worth making the "range" information a little richer and
be able to use algebraic expressions as well as numeric ranges.
Note: the above example code has an off by one overflow, and it would
be helpful if one could catch that at compile time.
In this case, it could catch that n must be less than size, and not
less than or equal to size.
Thus putting the range value on the test1 pointer as being from
address of test1 to test1 + (size - 1)

This can only be achieved if algebraic expressions are used for
ranges, and not just constant values.
Actual use cases can get much more complicated with for example,
non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

Another useful thing to catch at compile time, would be a warning that
a pointer is being dereferenced, and we were not able to apply a range
expression to it. I.e. warn about unbounded dereferences.

I think it would be useful to at least consider how we would capture
this more complex range information/metadata in LLVM IR.

Kind Regards

James

> >>>> On 3/24/21 9:06 AM, Clement Courbet wrote:
> >>>>> On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
> >>>>> johannes...@gmail.com> wrote:
> >>>>>
> >>>>>> I really like encoding more (range) information in the IR,
> >>>>>> more thoughts inlined.
> >>>>>>
> >>>>>> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
> >>>>>>> struct Histogram {
> >>>>>>>
> >>>>>>> int values[256];
> >>>>>>>
> >>>>>>> int total;
> >>>>>>>
> >>>>>>> };
> >>>>>>>
> >>>>>>> Histogram DoIt(const int* image, int size) {
> >>>>>>>
> >>>>>>> Histogram histogram;
> >>>>>>>
> >>>>>>> for (int i = 0; i < size; ++i) {
> >>>>>>>
> >>>>>>> ++histogram.values[image[i]]; // (A)
> >>>>>>>
> >>>>>>> ++histogram.total; // (B)
> >>>>>>>
> >>>>>>> }
> >>>>>>>
> >>>>>>> return histogram;
> >>>>>>>
> >>>>>>> }

Johannes Doerfert via llvm-dev

unread,

Mar 28, 2021, 11:44:31 AM3/28/21

to James Courtier-Dutton, llvm-dev

On 3/28/21 3:49 AM, James Courtier-Dutton wrote:
> Hi,
>
> char* test_fill(int size) {
> char *test1 = malloc(size)
> for (n = 0; n <= size; n++) {
> test1[n] = 'A';
> }
> }
>
> Would it be worth making the "range" information a little richer and
> be able to use algebraic expressions as well as numeric ranges.
> Note: the above example code has an off by one overflow, and it would
> be helpful if one could catch that at compile time.
> In this case, it could catch that n must be less than size, and not
> less than or equal to size.
> Thus putting the range value on the test1 pointer as being from
> address of test1 to test1 + (size - 1)
>
> This can only be achieved if algebraic expressions are used for
> ranges, and not just constant values.
> Actual use cases can get much more complicated with for example,
> non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.
>
> Another useful thing to catch at compile time, would be a warning that
> a pointer is being dereferenced, and we were not able to apply a range
> expression to it. I.e. warn about unbounded dereferences.
>
> I think it would be useful to at least consider how we would capture
> this more complex range information/metadata in LLVM IR.

I think what you want is the max object extend attribute, formerly
known as max object size when we only wanted to track an upper bound
next revision shall also include the lower one:
https://reviews.llvm.org/D87975

If we allow values instead of only constants you can "properly"
generate warnings, using SCEV to determine the range of `n` above.

That said, in operand bundles we can generally allow non-constant
values, e.g., `["range"(%p, i32 0, i32 %N)]`

~ Johannes

Clement Courbet via llvm-dev

unread,

Mar 29, 2021, 1:43:32 AM3/29/21

to Johannes Doerfert, llvm-dev

On Sun, Mar 28, 2021 at 5:44 PM Johannes Doerfert via llvm-dev <llvm...@lists.llvm.org> wrote:

On 3/28/21 3:49 AM, James Courtier-Dutton wrote:
> Actual use cases can get much more complicated with for example,
> non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

This can actually be represented by !range metadata, but only for constant values.

That said, in operand bundles we can generally allow non-constant
values, e.g., `["range"(%p, i32 0, i32 %N)]`

That's a good point and it essentially disqualifies !range metadata if we want to one day be able to deal with values..

> This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled

I don't think this is really an issue, because whatever the choice of representation is, all these variants should be using `ValueTracking` to abstract the representation anyway.

> I'm not sure I understand the questions properly. It basically is range metadata attached to an arbitrary value for a fixed program point. Could you elaborate on your comments wrt. the connection to option 2 and 3, please.

It's just that I was not sure about the internal representation, but you've answered by question: It's internally a different concept from !range metadata, so I'll need to teach ValueTracking about it.

Clement Courbet via llvm-dev

unread,

Mar 29, 2021, 2:09:02 AM3/29/21

to Michael Kruse, llvm-dev

On Thu, Mar 25, 2021 at 5:33 PM Michael Kruse <llv...@meinersbur.de> wrote:

Am Do., 25. März 2021 um 09:42 Uhr schrieb Clement Courbet via
llvm-dev <llvm...@lists.llvm.org>:
> So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

I think this is only for the internal representation. According to the
reference (https://llvm.org/docs/LangRef.html#id230), it is allowed
for each dimension.

This means that we would have to store one bit per index instead of a single index (right now the implementation is limiting the number of indices to 64). That means that we can support at most 6 indices with SubclassOptionalData. That might be sufficient for the most common cases (not representing the inrange data is not a correctness issue), but we have to be aware of that limitation.

Florian Hahn via llvm-dev

unread,

Mar 30, 2021, 1:25:54 PM3/30/21

to Johannes Doerfert, llvm-dev

On Mar 27, 2021, at 20:37, Johannes Doerfert <johannes...@gmail.com> wrote:
On 3/27/21 1:30 PM, Florian Hahn wrote:
On Mar 24, 2021, at 19:32, Johannes Doerfert <johannes...@gmail.com> wrote:
On 3/24/21 12:47 PM, Florian Hahn wrote:
On Mar 24, 2021, at 15:16, Johannes Doerfert via llvm-dev <llvm...@lists.llvm.org> wrote:

We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.
+1 on trying to use assume, rather than adding another way.

But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?
It would make users explicit and we will have non-attribute bundles anyway.
I find it also "conceptually nicer", would you prefer explicit instructions?
One disadvantage of using a bundle (or !range metadata) is that we treat ranges for certain values in a special way and differently to how we treat range information expressed by the user e.g. via conditions (or builtin assume).

I don't think this is necessarily accurate. We can, and already do (https://godbolt.org/z/MaMEb1Koo),
generate bundles from conditions. If we can interpret a condition, why could we not rewrite it into a bundle?
I'm not sure why this is any different form other normalization we do. (And bundles have benefits over implicit
instruction encodings, for example use tracking and #instructions.)

I’m not arguing that it is not possible to do everything with assume bundles. I am saying that we end up with at least 2 ways to encode the same information, so we need to handle 2 parallel encodings (i.e. we always have to handle conditions from the program control flow, which is represented via instructions)

I think the !nonnull example you shared is illustrates the extra work passes will have to do. For example, a couple of passes know how to generically handle information from assumes, exactly because the conditions used for the assumes are not special and they already have to handle the same conditions for branches. If we instead convert the condition to a special bundle, all those passes will need updating to properly interpret !nonnull (and future bundles). Examples include SCCP, NewGVN, parts of SCEV.

FTR I think assume bundles are great to express interesting properties!

I am just trying to highlight some potential drawbacks when it comes to ranges or other properties we can express directly in LLVM IR already. I am sure it would be possible to add some extra abstraction to make it easier to update the relevant passes, it’s just a cost to consider.

This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).
Using instruction potentially also allows us to specify more complex ranges, in relation to other values.

I don't see how bundles would restrict us in any way. I mean, if we want to express property XYZ for %v and %q,
`llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and it is arguably as generic as you want it to be.

I agree that it is possible to encode more interesting properties with assume bundles, but wouldn’t we end up duplicating all existing compare predicates for example? And for something like %x + %y < %x we would either fall back to instructions again or come up with a way to encode that in bundles as well. If we still use IR instructions for more complex expressions, we’d still need a way to exclude the ‘assume-only’ uses.

But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.

I think bundles, and more generic assumptions, are what we need in the future. I still believe we should use them
to encode information in assertions [0], among other things, without running into the risk of having side-effects
that influence the compilation.

Again, I am not saying we shouldn’t, just that there are some potential drawbacks in some cases.

Cheers,

Florian

Johannes Doerfert via llvm-dev

unread,

Mar 30, 2021, 1:51:43 PM3/30/21

to Florian Hahn, llvm-dev

On 3/30/21 12:25 PM, Florian Hahn wrote:
>
>> On Mar 27, 2021, at 20:37, Johannes Doerfert <johannes...@gmail.com> wrote:
>> On 3/27/21 1:30 PM, Florian Hahn wrote:
>>>> On Mar 24, 2021, at 19:32, Johannes Doerfert <johannes...@gmail.com> wrote:
>>>> On 3/24/21 12:47 PM, Florian Hahn wrote:
>>>>>> On Mar 24, 2021, at 15:16, Johannes Doerfert via llvm-dev <llvm...@lists.llvm.org> wrote:
>>>>>> We need to get rid of that assertion. There are other non-attributes
>>>>>> to be used in assume operand bundles in the (near) future, so the this
>>>>>> work has to be done anyway.
>>>>> +1 on trying to use assume, rather than adding another way.
>>>>>
>>>>> But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?
>>>> It would make users explicit and we will have non-attribute bundles anyway.
>>>> I find it also "conceptually nicer", would you prefer explicit instructions?
>>> One disadvantage of using a bundle (or !range metadata) is that we treat ranges for certain values in a special way and differently to how we treat range information expressed by the user e.g. via conditions (or builtin assume).

>> I don't think this is necessarily accurate. We can, and already do (https://godbolt.org/z/MaMEb1Koo <https://godbolt.org/z/MaMEb1Koo>),

>> generate bundles from conditions. If we can interpret a condition, why could we not rewrite it into a bundle?
>> I'm not sure why this is any different form other normalization we do. (And bundles have benefits over implicit
>> instruction encodings, for example use tracking and #instructions.)
>>
> I’m not arguing that it is not possible to do everything with assume bundles. I am saying that we end up with at least 2 ways to encode the same information, so we need to handle 2 parallel encodings (i.e. we always have to handle conditions from the program control flow, which is represented via instructions)
>
> I think the !nonnull example you shared is illustrates the extra work passes will have to do. For example, a couple of passes know how to generically handle information from assumes, exactly because the conditions used for the assumes are not special and they already have to handle the same conditions for branches. If we instead convert the condition to a special bundle, all those passes will need updating to properly interpret !nonnull (and future bundles). Examples include SCCP, NewGVN, parts of SCEV.
>
> FTR I think assume bundles are great to express interesting properties!
>
> I am just trying to highlight some potential drawbacks when it comes to ranges or other properties we can express directly in LLVM IR already. I am sure it would be possible to add some extra abstraction to make it easier to update the relevant passes, it’s just a cost to consider.

You are right,t sometimes we will need more code to "also" handle the
bundles, especially when the same conditions can occur in regular code
as well.
My point was that we probably want a canonical assumption representation
and bundles generally have more benefits over explicit encodings. This might
require us to teach passes a new encoding but we will then use the new
encoding for all assumptions of a certain kind and start generating
those right
away wherever possible.

>
>>> This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).
>>> Using instruction potentially also allows us to specify more complex ranges, in relation to other values.
>> I don't see how bundles would restrict us in any way. I mean, if we want to express property XYZ for %v and %q,
>> `llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and it is arguably as generic as you want it to be.
>>
> I agree that it is possible to encode more interesting properties with assume bundles, but wouldn’t we end up duplicating all existing compare predicates for example? And for something like %x + %y < %x we would either fall back to instructions again or come up with a way to encode that in bundles as well. If we still use IR instructions for more complex expressions, we’d still need a way to exclude the ‘assume-only’ uses.

For "common assumption" I would strongly suggest "known bundles", e.g.,
for frequent kinds of inequalities maybe.
For "complex assumptions" I would prefer we do outlined assumptions to
deal with the uses problem while also *gaining*
expressiveness. What I mean was described in the email [0] under design
idea 2) and came up a few times since.
In addition to complex instruction based encodings of assumptions it
allows us to deal with calls and side-effects properly.
I would use such an encoding and teach the Attributor to transfer
knowledge about the arguments from the assumption to the
outlined assumption function and back as new "known bundles". So we
normalize and specialize in the outlined function and
transfer back what we know other passes can actually digest.

~ Johannes

[0] https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

>

>>> But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.
>> I think bundles, and more generic assumptions, are what we need in the future. I still believe we should use them
>> to encode information in assertions [0], among other things, without running into the risk of having side-effects
>> that influence the compilation.
> Again, I am not saying we shouldn’t, just that there are some potential drawbacks in some cases.
>
> Cheers,
> Florian

Michael Kruse via llvm-dev

unread,

Mar 30, 2021, 1:54:16 PM3/30/21

to Clement Courbet, llvm-dev

Am Mo., 29. März 2021 um 01:08 Uhr schrieb Clement Courbet <cou...@google.com>:
> This means that we would have to store one bit per index instead of a single index (right now the implementation is limiting the number of indices to 64). That means that we can support at most 6 indices with SubclassOptionalData. That might be sufficient for the most common cases (not representing the inrange data is not a correctness issue), but we have to be aware of that limitation.

We can also introduce a new field into
GetElementPtrInst/GetElementPtrConstantExpr. Since it's
variable-sized, use of llvm::TrailingObjects (llvm::User calls it
co-allocation) might be necessary.

Florian Hahn via llvm-dev

unread,

Mar 30, 2021, 5:39:36 PM3/30/21

to Clement Courbet, llvm-dev

On Mar 29, 2021, at 06:43, Clement Courbet via llvm-dev <llvm...@lists.llvm.org> wrote:

On Sun, Mar 28, 2021 at 5:44 PM Johannes Doerfert via llvm-dev <llvm...@lists.llvm.org> wrote:

On 3/28/21 3:49 AM, James Courtier-Dutton wrote:
> Actual use cases can get much more complicated with for example,
> non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

This can actually be represented by !range metadata, but only for constant values.

That said, in operand bundles we can generally allow non-constant
values, e.g., `["range"(%p, i32 0, i32 %N)]`

That's a good point and it essentially disqualifies !range metadata if we want to one day be able to deal with values..

> This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled

I don't think this is really an issue, because whatever the choice of representation is, all these variants should be using `ValueTracking` to abstract the representation anyway.

There are some passes for which ValueTracking in the current form is not suitable, e.g. because they propagate information forward, rather than querying backwards on-demand as in ValueTracking.

I don’t doubt that we can come up with a reasonable abstraction to make dealing with this easier, especially for the constant/plain value ranges.

Anyways, it feels like the discussion got a bit sidetracked (sorry!) and I think we should probably focus on the main proposal at hand. I’d recommend to go with whatever is supported well on current main for this proposal and discuss the assume bundles separately.