[llvm-dev] [RFC] Introducing a byte type to LLVM

153 views
Skip to first unread message

George Mitenkov via llvm-dev

unread,
Jun 4, 2021, 11:24:45 AM6/4/21
to llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte type to LLVM to fix miscompilations due to load type punning. Please see the proposal below. It would be great to hear the feedback/comments/suggestions!


Motivation

==========

char and unsigned char are considered to be universal holders in C. They can access raw memory and are used to implement memcpy. i8 is the LLVM’s counterpart but it does not have such semantics, which is also not desirable as it would disable many optimizations.


We therefore propose to introduce a new byte type that would have a raw-memory access semantics and can be differentiated from i8. This can help to fix unsound optimizations and make the lowering of char, unsigned char or std::byte correct. Currently, we denote the byte type as b<N>, where N is the number of bits.


In our semantics, byte type carries provenance and copying bytes does not escape pointers, thereby benefiting alias analysis. If the memory contains poison bits, then the result of loading a byte from it is bitwise poison.



Benefits

========

The small calls to memcpy and memmove that are transformed by Instcombine into integer load/store pairs would be correctly transformed into the loads/stores of the new byte type no matter what underlying value is being copied (integral value or a pointer value).

  1. The new byte type solves the problem of copying the padded data, with no contamination of the loaded value due to poison bits of the padding.

  2. When copying pointers as bytes implicit pointer-to-integer casts are avoided.The current setting of performing a memcpy using i8s leads to miscompilations (example: bug report 37469) and is bad for alias analysis.



Lowering of char, unsigned char and std::byte

======================================

For any function that takes char, unsigned char or std::byte as arguments, we lower these to the b8 instead of i8. The motivation comes from the fact that any pointer can be explicitly copied in bytes, and each byte can be passed as an argument to the copying function. Since a copy like that can escape a pointer, we need to generate the byte type to avoid the issue.


Example

void foo(unsigned char arg1, char arg2) {...}

            =>

void @foo(zeroext b8 %arg1, signext b8 %arg2) {...}

 

std::byte is defined as enum class byte : unsigned char {} and therefore all bitwise operations over it can be lowered equivalently to operations on unsigned char (operation is performed over zero-extended to i32 operands, and the result is truncated back).



==============================

Byte Type Semantics (Version 1)

==============================


[allowed] alloca/load/store

=====================

The byte type is allowed to be allocated, stored and loaded. No semantic changes needed.


[allowed] zext/sext/trunc

====================

In order to easily adapt the current replacement of i8 with b8, we want to extend zext/sext/trunc instructions’ semantics to allow the first operand as a byte type as well. This will reduce the number of instructions needed for the cast.


Example

trunc i32 %int to b8

zext i8 %char to b32


[modified] bitcast

==============

We modify the semantics of the bitcast to allow casts from and to the byte types. If the byte has a value of the same kind (integral or a pointer) as the other type, then the bitcast is a noop. Otherwise, the result is poison.


Example

bitcast i<N> %val to b<N>

bitcast i<N>* %val to b64

bitcast b<N> %byte to i<N>     [byte has an integral value => noop]

 

[new] bytecast

============

During IR generation, we cannot deduce whether the byte value contains a pointer or an integral value, and therefore we cannot create a bitcast.


Example

int @cast(char c) { return (int) c; }

Current

Proposed

i32 @cast(i8 signext %c) {

  %1 = alloca i8, align 1

  store i8 %c, i8* %1, align 1

  %2 = load i8, i8* %1, align 1

  %3 = sext i8 %2 to i32

  ret i32 %3

}

i32 @cast(b8 %c) {

  %1 = alloca b8

  store b8 %c, b8* %11

  %2 = load b8, b8* %1

  %3 =    ?    b8 %2 to i32

  ret i32 %3

}

In this example, the operation  ?  can be sext (b8 is i8) or can be ptrtoint if the byte has a pointer value.


We therefore introduce a new bytecast instruction. The frontend will always produce a bytecast, which can then be optimized into a more specific cast if necessary. Bytecast operands must have the same bitwidth (or size).


The semantics of bytecast instruction is:


bytecast b<N> %byte to T  

The kind of the value of the byte

T

Semantics

Pointer

Integral

ptrtoint

Pointer

Pointer

bitcast

Integral

Integral

integral casts (zext, sext, etc.)

Integral

Pointer

inttoptr

 

Essentially, this instruction is a selection of the right casts. Once the compiler has been able to prove the kind of the byte’s value, the bytecast can be replaced with appropriate cast or become a noop.


Example

bytecast b64 %bytes to i32*

bytecast b8 %byte to i8

 

[disabled] and/or/xor

=================

We do not allow bitwise operations over byte types because it’s not trivial to define all cases, particularly the ones involving pointers. If these operations are useful for optimizations and/or some frontends, semantics for these can be considered separately.


If the operation is desired on the frontend level, then the default generated code can always cast the byte type to an integer to operate on integer values. 

 

[disabled] arithmetic

=================

Performing arithmetic operations over the byte type is similarly not allowed (A valid question is what does it mean to add to bytes of raw memory?). If we want to perform arithmetic, we need to cast a byte to an integer (via sext/zext/trunc explicitly).

 

[modified] comparison

==================

We allow performing comparisons, as we may potentially want to compare the ordering of the memory instances, check for null, etc. Comparison is also needed since char types are commonly compared. We define the following semantic of the byte type comparison.


Case 1: The values of the bytes have the same kinds: compare the values.

Case 2: The values of the bytes have the different kinds: do a bytecast of the non-integral type to the other type and compare the integral values.



Example

=======

unsigned char sum(unsigned char a, unsigned char b) { return a + b; }


Current

Proposed

zeroext i8 @sum(i8 zeroext %a, i8 zeroext %b) {

  %1 = alloca i8

  %2 = alloca i8

  store i8 %a, i8* %1

  store i8 %b, i8* %2

  %3 = load i8, i8* %1

  %4 = zext i8 %3 to i32

  %5 = load i8, i8* %2

  %6 = zext i8 %5 to i32

  %7 = add nsw i32 %4, %6

  %8 = trunc i32 %7 to i8

  ret i8 %8

}

zeroext b8 @sum(b8 zeroext %a, b8 zeroext %b) {

  %1 = alloca b8

  %2 = alloca b8

  store b8 %a, b8* %1

  store b8 %b, b8* %2

  %3 = load b8, b8* %1

  %4 = bytecast b8 %3 to i8

  %5 = zext i8 %4 to i32

  %6 = load b8, b8* %2

  %7 = bytecast b8 %6 to i8

  %8 = zext i8 %7 to i32

  %9 = add nsw i32 %5, %8

  %10 = trunc i32 %9 to b8

  ret b8 %10

}





Thanks,

George

Johannes Doerfert via llvm-dev

unread,
Jun 4, 2021, 12:03:42 PM6/4/21
to George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
Hi George,

On 6/4/21 10:24 AM, George Mitenkov via cfe-dev wrote:
> Hi all,
>
> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
> type to LLVM to fix miscompilations due to load type punning. Please see
> the proposal below. It would be great to hear the
> feedback/comments/suggestions!
>
>
> Motivation
> ==========
>
> char and unsigned char are considered to be universal holders in C. They
> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
> counterpart but it does not have such semantics, which is also not
> desirable as it would disable many optimizations.
>
> We therefore propose to introduce a new byte type that would have a raw-memory
> access semantics and can be differentiated from i8. This can help to fix
> unsound optimizations and make the lowering of char, unsigned char or
> std::byte correct. Currently, we denote the byte type as b<N>, where N is
> the number of bits.
>
> In our semantics, byte type carries provenance and copying bytes does not
> escape pointers, thereby benefiting alias analysis. If the memory contains
> poison bits, then the result of loading a byte from it is bitwise poison.

Could you elaborate on the motivation a bit. I'm not sure I follow.
Maybe examples would be good. The only one I found is in
https://bugs.llvm.org/show_bug.cgi?id=37469. For that one I'm
wondering how much we would in reality loose if we make the required
ptr2int explicit when people cast a pointer to an integer. In general,
explicit encoding seems to me preferable and overall less work (new
types kinda scare me TBH).

~ Johannes


>
>
> Benefits
> ========
>
> The small calls to memcpy and memmove that are transformed by Instcombine
> into integer load/store pairs would be correctly transformed into the
> loads/stores of the new byte type no matter what underlying value is being
> copied (integral value or a pointer value).
>

> 1.


>
> The new byte type solves the problem of copying the padded data, with no
> contamination of the loaded value due to poison bits of the padding.

> 2.


>
> When copying pointers as bytes implicit pointer-to-integer casts are
> avoided.The current setting of performing a memcpy using i8s leads to
> miscompilations (example: bug report 37469

> <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias

> _______________________________________________
> cfe-dev mailing list
> cfe...@lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

George Mitenkov via llvm-dev

unread,
Jun 4, 2021, 12:35:37 PM6/4/21
to Johannes Doerfert, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org
Hi Johannes,

Sure! The underlying problem is that raw-memory access handlers are treated
as integers, while they are not really integers. Especially std::byte that specifically
states that it has raw-memory access semantics. This semantic mismatch can make
AA wrong and a pointer to escape.

Consider the following LLVM IR that copies a pointer:

%src8 = bitcast i8** %src to i8*

%dst8 = bitcast i8** %dst to i8*

call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)

%load = load i8*, i8** %dst

%addr = ptrtoint i8* %load to i64

ret i64 %addr


If we optimize the call to memcpy, then the IR becomes


%src64 = bitcast i8** %src to i64*

%dst64 = bitcast i8** %dst to i64*

%addr = load i64, i64* %src64, align 1

store i64 %addr, i64* %dst64, align 1

ret i64 %addr


Since there is no "data" type in LLVM like a byte, the ptrtoint is optimized out and the

pointer escapes. If we have used a pair of byte load/store that would not happen.

The mentioned bug is just an example, but this pattern can be

replicated in other cases that optimize memcpy as integer load/stores and drop ptrtoint,

undermining alias analysis. To illustrate further, consider loading a pointer like:


%v1 = load i8*, %p

%v2 = load i8*, %p


Instcombine optimizes this into


%v1 = load i64, %p

%v2 = inttoptr %v1


changing the underlying provenance (again, not good for AA). If we had a byte type instead,

then we could

1. preserve provenance

2. differentiate between integers and the raw data copied


This might be beneficial for some backends (e.g CHERI) that want to differentiate between

true integers and bytes that can be integers. The optimizations for known integers

can become more aggressive in this case.


Thanks,

George

Johannes Doerfert via llvm-dev

unread,
Jun 4, 2021, 12:56:40 PM6/4/21
to George Mitenkov, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org

On 6/4/21 11:35 AM, George Mitenkov wrote:
> Hi Johannes,
>
> Sure! The underlying problem is that raw-memory access handlers are treated
> as integers, while they are not really integers. Especially std::byte that
> specifically
> states that it has raw-memory access semantics. This semantic mismatch can
> make
> AA wrong and a pointer to escape.

I would assume being conservative is a possibility here, right?
So if someone wants to type-pun a pointer we would do very little
optimizations around it.


>
> Consider the following LLVM IR that copies a pointer:
>
> %src8 = bitcast i8** %src to i8*
>
> %dst8 = bitcast i8** %dst to i8*
>
> call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)
>
> %load = load i8*, i8** %dst
>
> %addr = ptrtoint i8* %load to i64
>
> ret i64 %addr
>
>
> If we optimize the call to memcpy, then the IR becomes
>
>
> %src64 = bitcast i8** %src to i64*
>
> %dst64 = bitcast i8** %dst to i64*
>
> %addr = load i64, i64* %src64, align 1
>
> store i64 %addr, i64* %dst64, align 1
>
> ret i64 %addr
>
>
> Since there is no "data" type in LLVM like a byte, the ptrtoint is
> optimized out and the pointer escapes. If we have used a pair of byte load/store that would not
> happen.

The pointer (load %src) does escape in this example, doesn't it?

I don't get why "that would not happen" with byte load/store and why
that would be correct at all.


> The mentioned bug is just an example, but this pattern can be
>
> replicated in other cases that optimize memcpy as integer load/stores and
> drop ptrtoint,
>
> undermining alias analysis. To illustrate further, consider loading a
> pointer like:
>
>
> %v1 = load i8*, %p
>
> %v2 = load i8*, %p
>
>
> Instcombine optimizes this into
>
>
> %v1 = load i64, %p
>
> %v2 = inttoptr %v1
>
>
> changing the underlying provenance (again, not good for AA).

This example looks somehow broken (I'll assuming %v1 has integer type).

I agree this is not a great optimization, but is it wrong? We should have
done the opposite (ptr2int for v1) but even then I'm not sure what AA could
derive here. We load a pointer and then we escape it into an integer. If the
ptr2int is not removed, I'm not sure what AA should do about this. I'm also
not sure how practical this is. Do we have some idea how often such
cases exist?


> If we had a
> byte type instead,
>
> then we could
>
> 1. preserve provenance
>
> 2. differentiate between integers and the raw data copied
>
>
> This might be beneficial for some backends (e.g CHERI) that want to
> differentiate between
>
> true integers and bytes that can be integers. The optimizations for known
> integers
>
> can become more aggressive in this case.
>

It sounds to me like the problem is our optimization in the presence of
int2ptr and ptr2int are (still) unsound. We should at least not remove
ptr2int,
is that it?

That said, I still don't get how a byte type helps for these examples.
More concretely:

ref 1) How does a byte type preserve provenance? If it is somehow
implicitly attached
       would that not mean we need to now be careful doing
optimizations on bytes (as we
       should now be careful doing optimizations around p2i and i2p)?
Literally shifting
       the problem from looking for p2i and i2p to looking for bytecast?

ref 2) What optimization can I do if I have the differentiation of
integers and raw data?
       Can we somehow get an idea of the optimization potential?


~ Johannes

John McCall via llvm-dev

unread,
Jun 4, 2021, 2:25:50 PM6/4/21
to George Mitenkov, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org

On 4 Jun 2021, at 11:24, George Mitenkov wrote:

Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
type to LLVM to fix miscompilations due to load type punning. Please see
the proposal below. It would be great to hear the
feedback/comments/suggestions!


Motivation
==========

char and unsigned char are considered to be universal holders in C. They
can access raw memory and are used to implement memcpy. i8 is the LLVM’s
counterpart but it does not have such semantics, which is also not
desirable as it would disable many optimizations.

I don’t believe this is correct. LLVM does not have an innate
concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

John.

John McCall via llvm-dev

unread,
Jun 4, 2021, 2:49:34 PM6/4/21
to John McCall, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org

If this is all related to https://bugs.llvm.org/show_bug.cgi?id=37469,
I don’t think anything about i8 is the ultimate problem there.

John.

Nuno Lopes via llvm-dev

unread,
Jun 4, 2021, 3:06:36 PM6/4/21
to John McCall, George Mitenkov, llvm...@lists.llvm.org, Clang Dev, Ralf Jung

Its debatable whether LLVM considers memory to be typed or not. If we dont consider memory to be typed, then *all* integer load operations have to be considered as potentially escaping pointers. Example:
store i32* %p, i32** %q
%q2 = bitcast i32** %q to i64*
%v = load i64* %q2

This program stores a pointer and then loads it back as an integer. So theres an implicit pointer-to-integer cast, which escapes the pointer. If we allow this situation to happen, then the alias analysis code is broken, as well as several optimizations. LLVM doesnt consider loads as potential pointer escape sites. It would probably be a disaster (performance wise) if it did!

 

The introduction of the byte type allow us to make all pointer <-> integer casts explicit, so that we dont have to make all integer loads as escaping. It also allow us to say that LLVM optimizations are correct, and we just need to create a few new optimization to get rid of the extra bytecast instructions when they are provably not needed.

TBAA is unrelated to the problem we are trying to solve here.

 

Nuno

Nuno Lopes via llvm-dev

unread,
Jun 4, 2021, 3:17:30 PM6/4/21
to Johannes Doerfert, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
It's true: an alternative to introducing a new byte type is to make alias analysis more conservative. But I think it's way worse than you think.
Making AA conservative means that every single integer load must be considered to be escaping a pointer. Unless you know the last stored value was an integer, you would have to conservatively assume that the load may be escaping some pointer (even if only partially). This sounds to me like terrible. Type punning through memory is probably very uncommon, and it seems we would be slowing down all programs because of such uncommon case.

Therefore I feel it's best if we introduce a way to explicitly mark potential type punning situations, such that AA (and optimizers) only need to be conservative when needed. It's not the case that right now we can detect the few potential type punning cases and be conservative just there. We would need to be conservative in e.g. every function that loads an argument pointer.
I agree it's scary to introduce a new type in the IR; we don't do that very often. But we need to fix the bugs in LLVM.

Nuno


~ Johannes

>> On 6/4/21 10:24 AM, George Mitenkov via cfe-dev wrote:
>>> Hi all,
>>>
>>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>>> type to LLVM to fix miscompilations due to load type punning. Please see
>>> the proposal below. It would be great to hear the
>>> feedback/comments/suggestions!
>>>
>>>
>>> Motivation
>>> ==========
>>>
>>> char and unsigned char are considered to be universal holders in C. They
>>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>>> counterpart but it does not have such semantics, which is also not
>>> desirable as it would disable many optimizations.
>>>

John McCall via llvm-dev

unread,
Jun 4, 2021, 3:54:04 PM6/4/21
to Nuno Lopes, llvm...@lists.llvm.org, Ralf Jung, Clang Dev

Huh? It’s not the load that’s the problem in your example, it’s the store. If we have a pointer that’s currently known to not alias anything, and that pointer gets stored somewhere, and we can’t analyze all the uses of where it’s stored, then yes, anything that could potentially load from that memory might load the pointer, and so the pointer can’t just be assumed to not alias anything anymore. We don’t need subtle reasoning about bitcasts for that, that’s just basic conservative code analysis.

If you want to be doing optimizations where we opaquely assume that certain loads don’t carry pointers, then you need frontend involvement for that; you can’t just assume that some random i64 doesn’t carry a pointer. And I’ve gotta tell you, some random i64 is allowed to carry a pointer in C, so you’re not likely to get much mileage out of that. If you make this change, all that’s going to happen is that C-ish frontends will have to make all their integer types b8, b32, b64, etc. to indicate that they can carry pointer data, and then you’ll be right back where you were in terms of optimizability. The frontends that can make stronger statements about integers can probably all make much stronger promises about aliasing than you’d get from this analysis anyway.

The introduction of the byte type allow us to make all pointer <-> integer casts explicit, so that we don’t have to make all integer loads as escaping. It also allow us to say that LLVM optimizations are correct, and we “just” need to create a few new optimization to get rid of the extra bytecast instructions when they are provably not needed.

It sounds an awful lot like you’re trying to optimize a different language than the one you’ve got. I can certainly see how this constraint on inttoptr might be useful, but it’s not a constraint that LLVM has historically documented or enforced, and you haven’t made a very compelling case for why it’s worth such a major and incompatible change to the language.

John.

Johannes Doerfert via llvm-dev

unread,
Jun 4, 2021, 5:11:54 PM6/4/21
to Nuno Lopes, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
I think the lack of a end-to-end example is making this harder (for me)
to understand.


On 6/4/21 2:17 PM, Nuno Lopes wrote:
> It's true: an alternative to introducing a new byte type is to make alias analysis more conservative. But I think it's way worse than you think.
> Making AA conservative means that every single integer load must be considered to be escaping a pointer.

What pointer is escaping if I have an integer load. I'm sorry for my
questions but I assume I'm not the only one that might
want to understand this properly. So far, I've assumed if I store a
pointer away w/o knowing what happens to the memory the
pointer escapes. This is our current model and it's not that bad.


> Unless you know the last stored value was an integer, you would have to conservatively assume that the load may be escaping some pointer (even if only partially). This sounds to me like terrible. Type punning through memory is probably very uncommon, and it seems we would be slowing down all programs because of such uncommon case.
>
> Therefore I feel it's best if we introduce a way to explicitly mark potential type punning situations, such that AA (and optimizers) only need to be conservative when needed. It's not the case that right now we can detect the few potential type punning cases and be conservative just there. We would need to be conservative in e.g. every function that loads an argument pointer.
> I agree it's scary to introduce a new type in the IR; we don't do that very often. But we need to fix the bugs in LLVM.

What doesn't add up for me is that one the one side the argument
"everything needs to be conservative" w/o byte types, but at the same
time we can somehow determine when to use byte types locally? If the
latter would not be the case, how would we know when we need a byte type?

Why would we not be conservative/explicit about type punning whenever we
would introduce byte types with this proposal? I unfortunately don't
understand what the byte types give us. My best guess is that they
explicitly mark byte operations on pointers, which we could do otherwise
I assume (explicit p2i, i2p). If there is more to it, I think we need an
end to end example that shows how we cannot make type punning explicit
with current means and a new type is conceptually better.

~ Johannes

Joshua Cranmer via llvm-dev

unread,
Jun 4, 2021, 6:27:55 PM6/4/21
to John McCall, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
On 6/4/2021 2:25 PM, John McCall via cfe-dev wrote:
I don’t believe this is correct. LLVM does not have an innate

concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

John.

I've never been thoroughly involved in any of the actual optimizations here, but it seems to me that there is a soundness hole in the LLVM semantics that we gloss over when we say that LLVM doesn't have typed memory.

Working backwards from what a putative operational semantics of LLVM might look like (and I'm going to ignore poison/undef because it's not relevant), I think there is agreement that integer types in LLVM are purely bitvectors. Any value of i64 5 can be replaced with any other value of i64 5 no matter where it came from. At the same time, when we have pointers involved, this is not true. Two pointers may have the same numerical value (e.g., when cast to integers), but one might not be replaceable with the other because there's other data that might not be the same. So in operational terms, pointers have both a numerical value and a bag of provenance data (probably other stuff, but let's be simple and call it provenance).

Now we have to ask what the semantics of converting between integers and pointers are. Integers, as we've defined, don't have provenance data. So an inttoptr instruction has to synthesize that provenance somehow. Ideally, we'd want to grab that data from the ptrtoint instruction that generated the integer, but the semantics of integers means we can only launder that data globally, so that an inttoptr has the union of all of the provenance data that was ever fed into an inttoptr (I suspect the actual semantics we use is somewhat more precise than this in that it only considers those pointers that point to still-live data, which doesn't invalidate anything I'm about to talk about).

Okay, what about memory? I believe what most people intend to mean when they say that LLVM's memory is untyped is that a load or store of any type is equivalent to first converting it to an integer and then storing the integer into memory. E.g. these two functions are semantically equivalent:

define void @foo(ptr %mem, i8* %foo) {
  store i8* %foo, ptr %mem
}
define void @bar(ptr %mem, i8* %foo) {
  %asint = ptrtoint i8* %foo to i64 ; Or whatever pointer size you have
  store i64 %asint, ptr %mem
}

In other words, we are to accept that every load and store instruction of a pointer has an implicit inttoptr or ptrtoint attached to it. But as I mentioned earlier, pointers have this extra metadata attached to it that is lost when converting to an integer. Under this strict interpretation of memory, we *lose* that metadata every time a pointer is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the following two functions are *not* semantically equivalent in that model:

define i8* @basic(i8* %in) {
  ret i8* %in
}
define i8* @via_alloc(i8* %in) {
  %mem = alloca i8*
  store i8* %in, i8** %mem
  %out = load i8*, i8** %mem
  ret i8* %out
}

In order to allow these two functions to be equivalent, we have to let the load of a pointer recover the provenance data stored by the store of the pointer, and nothing more general. If either one of those were instead an integer load or store, then no provenance data can be communicated, so the integer and the pointer loads *must* be nonequivalent (although loading an integer instead of a pointer would presumably be a pessimistic transformation).

In short, pointers have pointery bits that aren't reflected in a bitvector representation an integer has. LLVM has some optimizations that assume that loads and stores only have bitvector manipulation semantics, while other optimizations (and most of the frontends) expect that loads and stores will preserve the pointery bits. And when these interact with each other, it's undoubtedly possible that the pointery bits get lost along the way.

-- 
Joshua Cranmer

Chris Lattner via llvm-dev

unread,
Jun 6, 2021, 12:27:12 AM6/6/21
to John McCall, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org
I completely agree with John.  “i8” in LLVM doesn’t carry any implications about aliasing (in fact, LLVM pointers are going towards being typeless).  Any such thing occurs at the accesses, and are part of TBAA.

I’m opposed to adding a byte type to LLVM, as such semantic carrying types are entirely unprecedented, and would add tremendous complexity to the entire system.

-Chris

James Courtier-Dutton via llvm-dev

unread,
Jun 6, 2021, 3:55:54 AM6/6/21
to Chris Lattner, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung
Hi,

I would also oppose adding a byte type, but mainly because the bug
report mentioned (https://bugs.llvm.org/show_bug.cgi?id=37469) is not
a bug at all.
The example in the bug report is just badly written C code.
Specifically:

int main() {
int A[4], B[4];
printf("%p %p\n", A, &B[4]);
if ((uintptr_t)A == (uintptr_t)&B[4]) {
store_10_to_p(A, &B[4]);
printf("%d\n", A[0]);
}
return 0;
}

"int B[4];" allows values between 0 and 3 only, and referring to 4 in
&B[4] is undef, so in my view, it is correctly optimised out which is
why it disappears in -O3.

Kind Regards

James


On Sun, 6 Jun 2021 at 05:26, Chris Lattner via cfe-dev

James Courtier-Dutton via llvm-dev

unread,
Jun 6, 2021, 5:02:19 AM6/6/21
to Chris Lattner, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung
Also, the comment below is wrong. At this point, arr3 is equivalent to
arr2, which is q.

// Now arr3 is equivalent to arr1, which is p.
int *r;
memcpy(&r, (unsigned char *)arr3, sizeof(r));
// Now r is p.
*p = 1;
*r = 10;

Madhur Amilkanthwar via llvm-dev

unread,
Jun 6, 2021, 5:33:14 AM6/6/21
to James Courtier-Dutton, llvm-dev, Ralf Jung, cfe-dev@lists.llvm.org Developers
HI George,

I don't think this is scalable model to add a new type just to benefit
an analysis and draw specific conclusions from it. I can argue that
if b8 is proposed then why not b16 for half? Why not b32 for some
other reason? This won't stop just there and one can go beyond
and introduce types to benefit domain specific languages.

Given the problem, I'd say we should think about a way to annotate
types with attribute or metadata or flags which optimizations can use
to do a better job. The attribute/metadata could carry the semantic
meaning for the type. Frontends can generate this "type attribute/metadata"
and optimizations can choose to use this extra information to do the
better job. It would be a hint though and not a mandate for optimizations.
This approach is very similar to attributes in LLVM IR and just like an IR
function can have attributes, a type can also posses attributes/metadata.
(Whether it should be an attribute or metadata is a choice but
that would not deviate from the purpose).

This approach is far more adoptable and convincing than introducing
a whole new type which would be massive complexity for the type system.


--
Disclaimer: Views, concerns, thoughts, questions, ideas expressed in this mail are of my own and my employer has no take in it. 
Thank You.
Madhur D. Amilkanthwar

Hal Finkel via llvm-dev

unread,
Jun 6, 2021, 11:52:26 AM6/6/21
to Chris Lattner, John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung

I'll take this opportunity to point out that, at least historically, the reason why a desire to optimize around ptrtoint keeps resurfacing is because:

 1. Common optimizations introduce them into code that did not otherwise have them (SROA, for example, see convertValue in SROA.cpp).

 2. They're generated by some of the ABI code for argument passing (see clang/lib/CodeGen/TargetInfo.cpp).

 3. They're present in certain performance-sensitive code idioms (see, for example, ADT/PointerIntPair.h).

It seems to me that, if there's design work to do in this area, one should consider addressing these now-long-standing issues where we introduce ptrtoint by replacing this mechanism with some other one.

 -Hal

George Mitenkov via llvm-dev

unread,
Jun 6, 2021, 2:11:43 PM6/6/21
to Madhur Amilkanthwar, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung
Hi Madhur,

I can argue that if b8 is proposed then why not b16 
for half? Why not b32 for some other reason? 
We do propose to have b<N> for different Ns, as per proposal:
we denote the byte type as b<N>, where N is the number of bits.
Maybe that was not explicit enough. But note that the only byte bit width produced
by the frontend is b8 (from char, unsigned char/std::byte). The other bit widths
come from LLVM optimizations, such as already mentioned memcpy:

%src8 = bitcast i8** %src to i8*

%dst8 = bitcast i8** %dst to i8*

call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false) 


is transformed (by instcombine) into 


%src64 = bitcast i8** %src to i64*

%dst64 = bitcast i8** %dst to i64*

%val = load i64, i64* %src64

store i64 %val, i64* %dst64


What we propose is to have roughly

%src64 = bitcast i8** %src to b64*

%dst64 = bitcast i8** %dst to b64*

%val = load b64, b64* %src64

store b64 %val, b64* %dst64


Having this just copies memory as-is, and cannot introduce implicit

ptr2int/int2ptr casts.


Given the problem, I'd say we should think about a way to annotate
types with attribute or metadata or flags which optimizations can use
to do a better job. The attribute/metadata could carry the semantic 
meaning for the type. Frontends can generate this "type attribute/metadata" 
and optimizations can choose to use this extra information to do the
better job. It would be a hint though and not a mandate for optimizations.
This approach is very similar to attributes in LLVM IR and just like an IR
function can have attributes, a type can also posses attributes/metadata. 
(Whether it should be an attribute or metadata is a choice but 
that would not deviate from the purpose).
I see your point and it is definitely easier to fix everything with metadata/attributes.
I am concerned with this approach because it just postpones a bigger problem.
There are already lots of metadata, attributes and IR/optimizations have become
complicated enough. Instead of fixing a problem while we can, we just add attributes,
then more attributes... I believe that at some point this will just become legacy and
impossible to fix. 

Thanks,
George

Nuno Lopes via llvm-dev

unread,
Jun 6, 2021, 2:26:12 PM6/6/21
to Chris Lattner, John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung

Let me try to explain the issue around typed memory. Its orthogonal to TBAA. Plus I agree the allocation type is irrelevant; its the type of the load/store operations that matters.

 

Fact 1: LLVM needs an universal holder type in its IR.

Why?

  • Frontends load data from memory and dont necessarily know the type they are loading. When you load a char in C, it can be part of a pointer or an integer, for example. Clang has no way to know.
  • Thats necessary to express implementations of memcpy in IR, as it copies raw data without knowing what the type is.
  • LLVM lowers small memcpys into load/store.

 

I hope you agree LLVM needs a type that can contain any other type.

 

 

Fact 2: optimizations for this universal holder type need to be correct for any LLVM IR type.

Since this type can hold anything, optimizations that apply to this type must be correct for integers, pointers, etc. Otherwise, the results would be incorrect if the data was cast to the wrong type.

 

The corollary of this fact is that GVN is not correct for the universal holder type in general. Thats because GVN is not correct for pointers in general (though correct in some cases). GVN doesnt work for pointers since just because 2 pointers compare equal, it doesnt mean they have the same provenance. As LLVMs AA uses provenance information, this information must be preserved.

 

Hopefully you agree with everything I wrote so far. If not please stop me and ask.

 

 

Ok, so whats the solution?

We have two candidates for this universal holder type:

  • integer (i<x>)
  • byte (b<x>), a new dedicated type

 

The advantage of going with integers is that they already exist. The disadvantage? To make LLVM correct, we would need to disable GVN in most cases or stop using provenance information in AA (make pointers == integers).

This sounds pretty bad to me due to the expect performance hit.

 

Introducing a new type requires work, yes. But it allows us to keep all integer optimizations as aggressive as we want, and only throttle down when encountering the byte type. Another benefit is that loading a byte from memory doesnt escape any pointer, while with the first solution you need to consider all stored pointers as escaped when loading an integer.

Introducing the byte type allow us to fix all the integer/pointer casts bugs in LLVM IR. And it enables more optimizations as we wont need to be careful because of the LLVM IR bug.

 

 

A 3rd option is to keep everything as-is. That is, keep wrong optimizations in LLVM and keep adding hacks to limit the miscompilations in real-world programs.

That has bitten us multiple times. Last time someone told me my miscompilation example was academic, LLVM ended up miscompiling itself, and then miscompiled Googles code. Only this year we are planning to fully fix that bug (we have a hack right now).

Last week Alive2 caught a miscompilation in the Linux kernel, in the network stack. The optimization got the pointer arithmetic wrong. Pretty scary, and the bug may have security implications.

 

Anyway, just to argue that leaving things as-is is not an option. We have momentum and 3 GSoC students ready to help to fix the last few design bugs in LLVM IR.

 

I know this stuff is complex; please ask questions if something is not clear or if you disagree with anything I wrote above.

 

Thanks,

Nuno

 

 

 

From: Chris Lattner via llvm-dev
Sent: 06 June 2021 05:27
To: John McCall <rjmc...@apple.com>
Cc: llvm...@lists.llvm.org; Ralf Jung <ju...@mpi-sws.org>; cfe...@lists.llvm.org
Subject: Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

 

On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe...@lists.llvm.org> wrote:On 4 Jun 2021, at 11:24, George Mitenkov wrote:

Nuno Lopes via llvm-dev

unread,
Jun 6, 2021, 2:31:26 PM6/6/21
to James Courtier-Dutton, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org
Hi James,

Your comment is incorrect because:
int x[n];
x[n]; // legal expression in C

This is necessary to allow idioms where you iterate over an array using pointer arithmetic, like:
for (int *p = x; p != x + n; ++p) { ... }

So, AFAICT, the bug report you mention is perfectly valid.
Plus LLVM IR's pointer arithmetic may go out-of-bounds (getelementptr without inbounds). It's important to not mix C and LLVM IR semantics. They have different goals.

Nuno

-----Original Message-----
From: llvm-dev <llvm-dev...@lists.llvm.org> On Behalf Of James Courtier-Dutton via llvm-dev
Sent: 06 June 2021 10:02
To: Chris Lattner <clat...@nondot.org>
Cc: llvm-dev <llvm...@lists.llvm.org>; cfe...@lists.llvm.org Developers <cfe...@lists.llvm.org>; Ralf Jung <ju...@mpi-sws.org>
Subject: Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

> On Sun, 6 Jun 2021 at 05:26, Chris Lattner via cfe-dev

> <cfe...@lists.llvm.org> wrote:
> >
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe...@lists.llvm.org> wrote:On 4 Jun 2021, at 11:24, George Mitenkov wrote:
> >
> > Hi all,
> >
> > Together with Nuno Lopes and Juneyoung Lee we propose to add a new
> > byte type to LLVM to fix miscompilations due to load type punning.
> > Please see the proposal below. It would be great to hear the
> > feedback/comments/suggestions!
> >
> >
> > Motivation
> > ==========
> >
> > char and unsigned char are considered to be universal holders in C.
> > They can access raw memory and are used to implement memcpy. i8 is
> > the LLVM’s counterpart but it does not have such semantics, which is
> > also not desirable as it would disable many optimizations.
> >
> > I don’t believe this is correct. LLVM does not have an innate
> > concept of typed memory. The type of a global or local allocation is
> > just a roundabout way of giving it a size and default alignment, and
> > similarly the type of a load or store just determines the width and
> > default alignment of the access. There are no restrictions on what
> > types can be used to load or store from certain objects.
> >
> > C-style type aliasing restrictions are imposed using tbaa metadata,
> > which are unrelated to the IR type of the access.
> >
> > I completely agree with John. “i8” in LLVM doesn’t carry any implications about aliasing (in fact, LLVM pointers are going towards being typeless). Any such thing occurs at the accesses, and are part of TBAA.
> >
> > I’m opposed to adding a byte type to LLVM, as such semantic carrying types are entirely unprecedented, and would add tremendous complexity to the entire system.
> >
> > -Chris
> >

Nuno Lopes via llvm-dev

unread,
Jun 6, 2021, 2:47:42 PM6/6/21
to Johannes Doerfert, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
I think Joshua gave a very nice motivation already. I'll just complement with an alias analysis example. (plus see my reply to Chris)

define @f() {
%p = alloca i8
%q = alloca i8**
store i8 42, %p
store i8* %p, i8** %q

%pp = some ptr arithmetic; full may-alias
%v = load i64, %pp
call @g(%v)

%w = load i8, %p ; call we store-forward & RAUW 42?
}

So, we pass an integer to some function. If we don't distinguish between pointers and integers (current memory model), we need to assume that %v may carry the address of %p, as the load may have done an implicit pointer-to-integer cast. Therefore, pointer %p is escaped.
Thus, in this model we cannot do store forwarding of 42 to the load of %w. Except that LLVM doesn't consider integer loads as escaping pointers. This is wrong, and that's one of the reasons why we miscompile integer/pointer casts. The other reason is what I explained in the reply to John & Chris, which is that some integer optimizations aren't correct for pointers, so we cannot represent pointers as integers (without appropriate p2i/i2p casts).

Another model is to say that integers can only carry integers. Above there's no pointer-to-integer cast of %p, hence %v cannot have the address of %p, and thus store-forwarding can happen.
This model requires a "universal holder" different than (ab)using integers.

Introducing a byte type means we only need to careful with char loads, but not with short, int, long, etc loads. Clang would lower char loads to byte loads, and the remaining integer types straight to i<n> loads. So we can go full speed on integers, and throttle down just on chars, which is all we can do because the semantics of chars in C.

Marshall Clow via llvm-dev

unread,
Jun 6, 2021, 2:52:31 PM6/6/21
to James Courtier-Dutton, James Courtier-Dutton via cfe-dev, llvm-dev, Ralf Jung
On Jun 6, 2021, at 12:54 AM, James Courtier-Dutton via cfe-dev <cfe...@lists.llvm.org> wrote:
>
> Hi,
>
> I would also oppose adding a byte type, but mainly because the bug
> report mentioned (https://bugs.llvm.org/show_bug.cgi?id=37469) is not
> a bug at all.
> The example in the bug report is just badly written C code.
> Specifically:
>
> int main() {
> int A[4], B[4];
> printf("%p %p\n", A, &B[4]);
> if ((uintptr_t)A == (uintptr_t)&B[4]) {
> store_10_to_p(A, &B[4]);
> printf("%d\n", A[0]);
> }
> return 0;
> }
>
> "int B[4];" allows values between 0 and 3 only, and referring to 4 in
> &B[4] is undef, so in my view, it is correctly optimised out which is
> why it disappears in -O3.

Taking the address of the “one-past-the end” element of an array is perfectly legal in both C and C++.
*Dereferencing* that pointer is UB.

— Marshall

David Blaikie via llvm-dev

unread,
Jun 6, 2021, 2:58:17 PM6/6/21
to Nuno Lopes, llvm-dev, Ralf Jung, Clang Dev
Perhaps a small concrete example of the GVN (or otherwise, if there
are smaller examples in other optimizations) transformation that you
find to be invalid - showing how it's important for integers, and
invalid for pointers (smallest example that manifests that invalidity
in an observable way) might help ground the conversation a bit.

Nuno Lopes via llvm-dev

unread,
Jun 6, 2021, 3:16:09 PM6/6/21
to David Blaikie, llvm-dev, Ralf Jung, Clang Dev
Of course, let's look at a simple example with GVN:
if (p == q) {
*(p-1) = 0;
}

If p/q are integers, then it's perfectly fine to replace p with q or vice-versa inside the 'if'. GVN may decide to produce this:
if (p == q) {
*(q-1) = 0;
}

Now consider that p/q are pointers. In particular they are the following:
char a[n];
char q[n];
char *p = a + n;

inlining the definitions, originally we had:
if (a+n == q) {
*(a+n-1) = 0;
}

The program is well-defined. However, GVN produces a ill-defined program:
if (a+n == q) {
*(q-1) = 0;
}

While a+n may have the same address as q, their provenance is different. You can dereference a[n-1], but you can't dereference q[-1], even if they have exactly the same address.

Pointer equalities can only be propagated if both pointers have the same provenance information. If two pointers are inbounds and compare equal, then they have the same provenance, for example. That's a case when GVN for pointers is correct.

Another class of issues is related with alias analysis and implicit casts done by integer loads. AA doesn't consider these implicit casts, which is incorrect. When we load an integer, we don't know if we are actually loading an integer or a "raw byte", as there's no way to distinguish. So, to be correct, we need to consider all integer loads as escaping pointers (unless we introduce a way to distinguish "raw byte" loads from integer loads). I gave one such example in my last reply to Johannes.

Please let me know if the example above isn't clear. Happy to elaborate further :)

Thanks,
Nuno

James Courtier-Dutton via llvm-dev

unread,
Jun 6, 2021, 3:30:52 PM6/6/21
to Nuno Lopes, llvm-dev, Ralf Jung, Clang Dev
See inline

On Sun, 6 Jun 2021 at 20:15, Nuno Lopes via cfe-dev
<cfe...@lists.llvm.org> wrote:
> Now consider that p/q are pointers. In particular they are the following:
> char a[n];
> char q[n];
> char *p = a + n;
>
> inlining the definitions, originally we had:

> if (a+n == q) { <- This is the problem. This being true or false is undefined. One compiler might make them equal, another will not.
> *(a+n-1) = 0;

Joerg Sonnenberger via llvm-dev

unread,
Jun 6, 2021, 5:25:49 PM6/6/21
to llvm...@lists.llvm.org, James Courtier-Dutton via cfe-dev
On Sun, Jun 06, 2021 at 11:52:13AM -0700, Marshall Clow via llvm-dev wrote:
> On Jun 6, 2021, at 12:54 AM, James Courtier-Dutton via cfe-dev <cfe...@lists.llvm.org> wrote:
> >
> > Hi,
> >
> > I would also oppose adding a byte type, but mainly because the bug
> > report mentioned (https://bugs.llvm.org/show_bug.cgi?id=37469) is not
> > a bug at all.
> > The example in the bug report is just badly written C code.
> > Specifically:
> >
> > int main() {
> > int A[4], B[4];
> > printf("%p %p\n", A, &B[4]);
> > if ((uintptr_t)A == (uintptr_t)&B[4]) {
> > store_10_to_p(A, &B[4]);
> > printf("%d\n", A[0]);
> > }
> > return 0;
> > }
> >
> > "int B[4];" allows values between 0 and 3 only, and referring to 4 in
> > &B[4] is undef, so in my view, it is correctly optimised out which is
> > why it disappears in -O3.
>
> Taking the address of the “one-past-the end” element of an array is perfectly legal in both C and C++.
> *Dereferencing* that pointer is UB.

Not just dereferencing, but also comparing it to a random other point?

Joerg

Harald van Dijk via llvm-dev

unread,
Jun 6, 2021, 6:19:24 PM6/6/21
to James Courtier-Dutton, Nuno Lopes, llvm-dev, Clang Dev, Ralf Jung
On 06/06/2021 20:29, James Courtier-Dutton via llvm-dev wrote:
> See inline
>
> On Sun, 6 Jun 2021 at 20:15, Nuno Lopes via cfe-dev
> <cfe...@lists.llvm.org> wrote:
>> Now consider that p/q are pointers. In particular they are the following:
>> char a[n];
>> char q[n];
>> char *p = a + n;
>>
>> inlining the definitions, originally we had:
>> if (a+n == q) { <- This is the problem. This being true or false is undefined. One compiler might make them equal, another will not.
>> *(a+n-1) = 0;
>> }

That is not the problem. It is true that whether a+n and q compare equal
is unspecified, but that is irrelevant. *If* they compare equal, then
the if block must execute as written, else it must not. The only valid
options are: either a[n-1] gets set to zero, or a[n-1] is left
unmodified. Any transformation to undefined IR is invalid.

You can change the example to something that has fully defined
behaviour, even, simply by:

if (a+n == q) {
*(a+n-1) = 0;

} else {
*(a+n-1) = 0;
}

If LLVM fails to notice that the two branches are equivalent (allowing
the branch to be eliminated entirely), you still have the exact same
problem with the first branch potentially getting mis-optimised.

Cheers,
Harald van Dijk

Juneyoung Lee via llvm-dev

unread,
Jun 7, 2021, 12:25:58 AM6/7/21
to Joerg Sonnenberger, llvm-dev, James Courtier-Dutton via cfe-dev
According to C17 it is legal:

6.5.9.p6

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.


--

Juneyoung Lee
Software Foundation Lab, Seoul National University

Mehdi AMINI via llvm-dev

unread,
Jun 7, 2021, 1:38:07 AM6/7/21
to Juneyoung Lee, llvm-dev, James Courtier-Dutton via cfe-dev
On Sun, Jun 6, 2021 at 9:25 PM Juneyoung Lee via llvm-dev <llvm...@lists.llvm.org> wrote:
According to C17 it is legal:

6.5.9.p6

Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.

This is interesting, I'm curious (we're a bit off-topic though) how is the part about "that happens to immediately follow the first array object in the address space" defined?

Juneyoung Lee via llvm-dev

unread,
Jun 7, 2021, 4:50:12 AM6/7/21
to Mehdi AMINI, Joerg Sonnenberger, llvm-dev, James Courtier-Dutton via cfe-dev
Hi Mehdi,

I couldn't find a paragraph explicitly specifying how objects are located from the text.
It seems it is deliberately underspecified to allow compilations to exotic target machines.
However, I could find a few paragraphs that look interesting:

6.2.4p2: An object exists, has a constant address,33) and retains its last-stored value throughout its lifetime.
33) The term “constant address” means that two pointers to the object constructed at possibly different times will compare equal. The address may be different during two different executions of the same program.

6.3.2.3.p5 68)
The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.

7.22.3p1
... Each such allocation shall yield a pointer to an object disjoint from any other object. The pointer returned points to the start (lowest byte address) of the allocated space. ...



Going back to the bug report, I don't think these paragraphs affect the validity of the example. 

Thanks,
Juneyoung

James Courtier-Dutton via llvm-dev

unread,
Jun 7, 2021, 7:58:58 AM6/7/21
to George Mitenkov, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung
On Fri, 4 Jun 2021 at 17:35, George Mitenkov via cfe-dev
<cfe...@lists.llvm.org> wrote:
>
> Hi Johannes,
>
> Sure! The underlying problem is that raw-memory access handlers are treated
> as integers, while they are not really integers. Especially std::byte that specifically
> states that it has raw-memory access semantics. This semantic mismatch can make
> AA wrong and a pointer to escape.
>
> Consider the following LLVM IR that copies a pointer:
You are making an assumption here. By just looking at the IR code
here, I don't think you can really be
sure what the type of the thing being copied is.

> %src8 = bitcast i8** %src to i8*
> %dst8 = bitcast i8** %dst to i8*
> call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)
> %load = load i8*, i8** %dst
> %addr = ptrtoint i8* %load to i64
> ret i64 %addr
>
> If we optimize the call to memcpy, then the IR becomes
Same here, by just looking at the IR code here, I don't think you can
really be sure what the type of the thing being copied is.

> %src64 = bitcast i8** %src to i64*
> %dst64 = bitcast i8** %dst to i64*
> %addr = load i64, i64* %src64, align 1
> store i64 %addr, i64* %dst64, align 1
> ret i64 %addr
>

One can do bitcasts etc, to obscure the actual type of the bytes being copied.
In both those examples, 8 bytes are copied, and the same value is
returned. So the end program will function the same when run.
Essentially, there is not enough information in the above code to
determine if the 8 bytes copied are part of a pointer or not.
For AA analysis, I would say, more information is needed.

One can only really be sure what type those bytes are, and that they
are a pointer when they are actually used as a pointer argument to a
LOAD or STORE.
There are some other operations that can also be used to infer whether
it is a pointer or not, but the LOAD/STORE is the simplest example.

Kind Regards

James

Jeroen Dobbelaere via llvm-dev

unread,
Jun 7, 2021, 8:16:06 AM6/7/21
to Nuno Lopes, David Blaikie, llvm-dev, Ralf Jung
IMHO, the ptr_provenance support that is part of the full restrict patches [0]
can likely help for these cases ? The allow to explicitly associate the ptr_provenance
of a load/store instruction.

In the past LLVM AA Tech Calls, we decided to pull those out of the full restrict
patches and prepare them for inclusion. (Which I still have on my list of todo's)

Greetings,

Jeroen Dobbelaere
[0] https://reviews.llvm.org/D68488 [PATCH 03/27] [noalias] [IR] Introduce ptr_provenance for LoadInst/StoreInst
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-
> > https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/llvm-
> dev__;!!A4F2R9G_pg!OaorXrTDpbE6DTSL_1JEUI55JMoefP2WL3XL_8iPWUad0B4K4HEnbYNE32W
> YUj8oOMYIgVLz$
>
> _______________________________________________
> cfe-dev mailing list
> cfe...@lists.llvm.org
> https://urldefense.com/v3/__https://lists.llvm.org/cgi-
> bin/mailman/listinfo/cfe-
> dev__;!!A4F2R9G_pg!OaorXrTDpbE6DTSL_1JEUI55JMoefP2WL3XL_8iPWUad0B4K4HEnbYNE32W
> YUj8oOCB56bnt$

George Mitenkov via llvm-dev

unread,
Jun 7, 2021, 8:26:07 AM6/7/21
to James Courtier-Dutton, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung
The purpose of an example is to make an assumption about what IR we have, and to show that it becomes wrong after optimization. I am not sure I get your comment here.

Same here, by just looking at the IR code here, I don't think you can
really be sure what the type of the thing being copied is.
That is exactly the point. We do not know what type we are copying - it may be an integer, or it may be a pointer. Importantly, we can see that the ptr2int instruction disappeared, and the optimized code returning the integer actually can escape the pointer. Using a byte type instead of i64 removes implicit pointer casts and therefore helps AA to catch this case.

One can do bitcasts etc, to obscure the actual type of the bytes being copied.
In both those examples, 8 bytes are copied, and the same value is
returned. So the end program will function the same when run.
Essentially, there is not enough information in the above code to
determine if the 8 bytes copied are part of a pointer or not.
For AA analysis, I would say, more information is needed.
This is just an example of a wrong optimization. It is important because bugs like this appear partially because of this exact optimization:
The mecpy is replaced with load/store pairs and store forwarding happens incorrectly. I haven't shown it in full, and hence there may be a bit of confusion. I am happy to elaborate more if this is still unclear!

Thanks,
George

Johannes Doerfert via llvm-dev

unread,
Jun 7, 2021, 10:56:44 AM6/7/21
to George Mitenkov, James Courtier-Dutton, llvm-dev, cfe-dev@lists.llvm.org Developers, Ralf Jung

On 6/7/21 7:25 AM, George Mitenkov wrote:
> The purpose of an example is to make an assumption about what IR
> we have, and to show that it becomes wrong after optimization. I am not
> sure I get your comment here.

FWIW, I didn't understand the comment either.


> Same here, by just looking at the IR code here, I don't think you can
>> really be sure what the type of the thing being copied is.
> That is exactly the point. We do not know what type we are copying - it may
> be an integer, or it may be a pointer. Importantly, we can see that the
> ptr2int instruction disappeared, and the optimized code returning the
> integer actually can escape the pointer. Using a byte type instead of i64
> removes implicit pointer casts and therefore helps AA to catch this case.

This is a fundamental issue I have with this example and the explanation:
What does "catch this case" mean? The pointer does escape, what is
there to catch/improve?


> One can do bitcasts etc, to obscure the actual type of the bytes being
>> copied.
>> In both those examples, 8 bytes are copied, and the same value is
>> returned. So the end program will function the same when run.
>> Essentially, there is not enough information in the above code to
>> determine if the 8 bytes copied are part of a pointer or not.
>> For AA analysis, I would say, more information is needed.
> This is just an example of a wrong optimization. It is important because
> bugs like this appear partially because of this exact optimization:
> https://bugs.llvm.org/show_bug.cgi?id=37469
> The mecpy is replaced with load/store pairs and store forwarding happens
> incorrectly. I haven't shown it in full, and hence there may be a bit of
> confusion. I am happy to elaborate more if this is still unclear!

I still believe the ptr2int in there is the problem, or better our
handling of it. That said, I have not understand what the byte type
actually would do for this example, or at least not what it would
do that is different from the proper handling of ptr2int.

~ Johannes

Johannes Doerfert via llvm-dev

unread,
Jun 7, 2021, 11:12:16 AM6/7/21
to Nuno Lopes, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
On 6/6/21 1:47 PM, Nuno Lopes wrote:
> I think Joshua gave a very nice motivation already.

I don't dispute that but I am still not understanding the need for
bytes. None of the examples I have seen so far
clearly made the point that it is the byte types that provide a
substantial benefit. The AA example below does neither.


> I'll just complement with an alias analysis example. (plus see my reply to Chris)
>
> define @f() {
> %p = alloca i8
> %q = alloca i8**
> store i8 42, %p
> store i8* %p, i8** %q
>
> %pp = some ptr arithmetic; full may-alias
> %v = load i64, %pp
> call @g(%v)
>
> %w = load i8, %p ; call we store-forward & RAUW 42?
> }
>
> So, we pass an integer to some function. If we don't distinguish between pointers and integers (current memory model), we need to assume that %v may carry the address of %p, as the load may have done an implicit pointer-to-integer cast. Therefore, pointer %p is escaped. Thus, in this model we cannot do store forwarding of 42 to the load of %w.

The conclusion that we cannot forward 42 is true if %pp has access to %p
or %q, which is not clear in the above code.
If no use of %p or %q somehow makes it's way into %pp, nor a load of %q,
%pp can be shown to be independent of %q already, regardless if we
already perform this reasoning or not.

> Except that LLVM doesn't consider integer loads as escaping pointers. This is wrong, and that's one of the reasons why we miscompile integer/pointer casts. The other reason is what I explained in the reply to John & Chris, which is that some integer optimizations aren't correct for pointers, so we cannot represent pointers as integers (without appropriate p2i/i2p casts).

At least me (and probably others) are arguing for the appropriate and
explicit p2i/i2p casts instead of a entirely new type that does the same
job.


> Another model is to say that integers can only carry integers. Above there's no pointer-to-integer cast of %p, hence %v cannot have the address of %p, and thus store-forwarding can happen.
> This model requires a "universal holder" different than (ab)using integers.

Storing %p into %q causes %p to be escaped by default, IMHO. With the
newest definition of escapes, we allow analyses to prove %q is not used
in "a weird way" and therefore we can pretend %p is not escaped. This
effectively requires us to look at all the loads of %q and assume they
are potential copies of %p. If such a load is (potentially) an integer,
you'd need to decide if the (potential) integer use can escape %p, same
as you would decide any other load of %q would. I fail to see the need
for byte types here.


> Introducing a byte type means we only need to careful with char loads, but not with short, int, long, etc loads. Clang would lower char loads to byte loads, and the remaining integer types straight to i<n> loads. So we can go full speed on integers, and throttle down just on chars, which is all we can do because the semantics of chars in C.

This will cause a lot of side effects, in addition to the work required
to make this happen in the first place.
Even if I assume we really want byte types for AA, which I am not sure
of given the lack of a convincing example, we should determine what the
impact would be here, e.g., restrict some optimizations in the presence
of i8 operation and get some idea of the cost.

Harald van Dijk via llvm-dev

unread,
Jun 7, 2021, 1:40:12 PM6/7/21
to George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung
Hi,

On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
  1. When copying pointers as bytes implicit pointer-to-integer casts are avoided.The current setting of performing a memcpy using i8s leads to miscompilations (example: bug report 37469) and is bad for alias analysis.

While trying to understand what exactly is going wrong in this bug I think I stumbled upon incomplete documentation. At https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that

  • A memory access must be done through a pointer associated with the address range.
  • A pointer value is associated with the addresses associated with any value it is based on.
  • A pointer value formed by an instruction is based on another pointer value the instruction is getelementptr, bitcast, or inttoptr.

What is not mentioned is what a pointer value formed by a load instruction is based on. If a pointer value formed by a load instruction is not associated with any address range, then it cannot be used to load or store any value. Clearly that is not wanted.

If memory is completely untyped, then a load of a pointer value needs to be equivalent to a load of an integer value followed by inttoptr. In that model, it is not possible to replace a load of a pointer value by a previously stored value, it is only possible to replace it by inttoptr(ptrtoint(previously stored value)). In that model, the problem is that LLVM does replace a load of a pointer value by a previously stored value, and the bug can be fixed (at a cost of reduced optimisations of other code) by making sure to insert ptrtoint and inttoptr as needed, including for any pointer members of structs etc.

Your proposal is based on an interpretation of the problem using a memory model where memory is not completely untyped. In your memory model, what address range is a pointer value that came from a load instruction associated with?

Cheers,
Harald van Dijk

Nicolai Hähnle via llvm-dev

unread,
Jun 8, 2021, 7:02:28 PM6/8/21
to Harald van Dijk, llvm-dev, cfe-dev, Ralf Jung
Hi,

I second this.

I feel like the entire discussion is extremely confused and confusing because there is lack of clarity about what values are in fact carried by the types i<N>, ptr (assuming untyped pointers already), and the proposed b<N>. There are statements about "whether a b<N> contains a pointer or an integer" without a clear definition of what any of that really means.

By the way: the discussion of GVN states that i<N> do not carry provenance information, but LangRef appears to contradict this, because it states (in the Pointer Aliasing section) that "A pointer value formed by an inttoptr is based on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value." The only way to make this statement operational is for the operand of the inttoptr to carry provenance information; this definitely needs to be addressed by this proposal.

My educated guess is that what the proposal really wants to say is that an i<N> value is poison or an integer without pointer provenance (i.e., an i<N> value is never "based on" a pointer) while a b<N> value is poison or an integer that may have provenance (i.e., it may be "based on" one or more pointers). Everything else should follow from this core definition. With this phrasing, I can see a coherent picture emerging that can be discussed without a lot of the confusion in this thread (though still _some_ confusion, as the topic is inherently subtle).

@George, Nuno, and Juneyoung: is that a good lens through which to see the proposal? If no, why not? If yes, I believe there are some immediate implications and questions, such as:

1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just define them as the corresponding i<N> op plus the union of provenance of the operands. This allows consistent implementation of char/unsigned char as b8, without having to jump back and forth between b8 and i8 all the time.

2. What's the provenance of a b<N> `select`?

3. A b<N>-to-i<N> conversion necessarily loses all provenance information.

4. What's the provenance of the result of an i<N>-to-b<N> conversion? The only possible choices are "empty" or "full" provenance, where "full" provenance means that the value is "based on" _all_ pointers. Depending on the details of allowed instruction signatures, we may want to allow both choices. "Full" provenance gives us something that's useful as a step towards inttoptr. "Empty" provenance allows us to lift i<N> to b<N> for arithmetic operations without being overly conservative (e.g., consider the somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic on b<N> types in IR -- we want the result to have the provenance of the base pointer, which enters the arithmetic as a b<N> with provenance; adding the various i<N> offset operands should have no effect on provenance).

5. Memory is untyped but can carry provenance; i.e., it consists of b8. Load/store of i<N> performs implicit i-to-b/b-to-i conversions, respectively. Which kind of i-to-b conversion is used during a store? The "empty" one or the "full" one? There are several reasonable options for how to address this, and a choice must be made.

6. (How) are pointer types fundamentally different from b<N> types of the correct size? (By this I mean: is there any interesting difference in the values that these types can carry? Ignore surface differences like the fact that GEP traditionally goes with pointers while `add` goes with integer types -- we could have a GEP instruction on a correctly sized b<N>)

It seems pretty clear that there is still a lot of work to be done on this proposal before we can be sufficiently confident that it won't just add new correctness problems elsewhere. Besides, the cost of adding a new family of types is very high, and I am not (yet?) convinced that it pays for itself. But with the above I can at least see better where you're coming from.

Cheers,
Nicolai


Cheers,
Harald van Dijk

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


--
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.

James Courtier-Dutton via llvm-dev

unread,
Jun 8, 2021, 7:36:49 PM6/8/21
to Nicolai Hähnle, llvm-dev, cfe-dev, Ralf Jung
On Wed, 9 Jun 2021 at 00:02, Nicolai Hähnle via cfe-dev
<cfe...@lists.llvm.org> wrote:
> Hi,
> ...

> On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> I second this.
>
> I feel like the entire discussion is extremely confused and confusing because there is lack of clarity about what values are in fact carried by the types i<N>, ptr (assuming untyped pointers already), and the proposed b<N>. There are statements about "whether a b<N> contains a pointer or an integer" without a clear definition of what any of that really means.
>

+1 on the confused and confusing point. :-)

Juneyoung Lee via llvm-dev

unread,
Jun 8, 2021, 11:28:26 PM6/8/21
to Nicolai Hähnle, Harald van Dijk, llvm-dev, cfe-dev, Ralf Jung
Hi Nicolai,
 
I feel like the entire discussion is extremely confused and confusing because there is lack of clarity about what values are in fact carried by the types i<N>, ptr (assuming untyped pointers already), and the proposed b<N>. There are statements about "whether a b<N> contains a pointer or an integer" without a clear definition of what any of that really means.

By the way: the discussion of GVN states that i<N> do not carry provenance information, but LangRef appears to contradict this, because it states (in the Pointer Aliasing section) that "A pointer value formed by an inttoptr is based on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value." The only way to make this statement operational is for the operand of the inttoptr to carry provenance information; this definitely needs to be addressed by this proposal.

My educated guess is that what the proposal really wants to say is that an i<N> value is poison or an integer without pointer provenance (i.e., an i<N> value is never "based on" a pointer) while a b<N> value is poison or an integer that may have provenance (i.e., it may be "based on" one or more pointers). Everything else should follow from this core definition. With this phrasing, I can see a coherent picture emerging that can be discussed without a lot of the confusion in this thread (though still _some_ confusion, as the topic is inherently subtle).

@George, Nuno, and Juneyoung: is that a good lens through which to see the proposal?

Yes, our understanding is that only values of pointer types carry provenance whereas non-pointer types (i<N>, float, ...) don't carry provenance.
We believe that when people were writing optimizations on non-pointer types (instcombine, reassociate, gvn, etc) they did not assume the existence of provenance in such types.
As mentioned, it contradicts with LangRef's wording about inttoptr though.

I agree that clarifying these is a good first step to avoid confusion. Specifically, via pinning down what people agree on in LangRef.
I think it is good to start with removing the inttoptr case from LangRef's based-on part and adding relevant texts.
IMO, existing alias analysis and optimizations were already avoiding uses of inttoptr's based-on information (experts' comments needed to double check).
Also, to provide alternatives for ptrtoint/inttoptr round trip with better analysis, llvm.ptrmask and clang-tidy's performance-no-int-to-ptr were added as well.

If people agree, I'll write a draft for this and share it via phabricator.

Thanks,
Juneyoung

Chris Lattner via llvm-dev

unread,
Jun 9, 2021, 12:03:35 PM6/9/21
to Hal Finkel, llvm-dev, cfe...@lists.llvm.org, Ralf Jung
On Jun 6, 2021, at 8:52 AM, Hal Finkel <hal.fin...@gmail.com> wrote:
I'll take this opportunity to point out that, at least historically, the reason why a desire to optimize around ptrtoint keeps resurfacing is because:

 1. Common optimizations introduce them into code that did not otherwise have them (SROA, for example, see convertValue in SROA.cpp).

 2. They're generated by some of the ABI code for argument passing (see clang/lib/CodeGen/TargetInfo.cpp).

 3. They're present in certain performance-sensitive code idioms (see, for example, ADT/PointerIntPair.h).

It seems to me that, if there's design work to do in this area, one should consider addressing these now-long-standing issues where we introduce ptrtoint by replacing this mechanism with some other one.

I completely agree.  These all have different solutions, I’d prefer to tackle them one by one.  

-Chris

Philip Reames via llvm-dev

unread,
Jun 9, 2021, 3:07:15 PM6/9/21
to Chris Lattner, John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung


On 6/5/21 9:26 PM, Chris Lattner via llvm-dev wrote:
On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe...@lists.llvm.org> wrote:On 4 Jun 2021, at 11:24, George Mitenkov wrote:

Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
type to LLVM to fix miscompilations due to load type punning. Please see
the proposal below. It would be great to hear the
feedback/comments/suggestions!


Motivation
==========

char and unsigned char are considered to be universal holders in C. They
can access raw memory and are used to implement memcpy. i8 is the LLVM’s
counterpart but it does not have such semantics, which is also not
desirable as it would disable many optimizations.

I don’t believe this is correct. LLVM does not have an innate


concept of typed memory. The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access. There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using tbaa
metadata, which are unrelated to the IR type of the access.

I completely agree with John.  “i8” in LLVM doesn’t carry any implications about aliasing (in fact, LLVM pointers are going towards being typeless).  Any such thing occurs at the accesses, and are part of TBAA.

I’m opposed to adding a byte type to LLVM, as such semantic carrying types are entirely unprecedented, and would add tremendous complexity to the entire system.

I agree with both John and Chris here.

I've read through the discussion in this thread, and have yet to be convinced there is a problem, much less than this is a good solution.  I'm open to being convinced of those two things, but the writeup in this thread doesn't do it.  There's snippet of examples downthread which might be convincing, but there's objections raised around language semantics which I find very hard to follow.  The fragmentation of the thread really doesn't help. 

I would suggest the OP take some of the motivating examples, write up a web-page with the examples and their interpretation, then revisit the topic.  In particular, I strongly suggest anticipating incorrect interpretation/objections and explicitly addressing them.

I'll also note that the use of the term capture w.r.t a *load* downthread makes absolutely no sense to me.  Stores capture, not loads. 

Philip




Hal Finkel via llvm-dev

unread,
Jun 9, 2021, 6:58:12 PM6/9/21
to Philip Reames, Chris Lattner, John McCall, llvm...@lists.llvm.org, Ralf Jung, cfe...@lists.llvm.org

I agree.

Also, while there certainly is a problem combining GVN with our use-def-chain-based pointer-aliasing/provenance semantics, it is in no way clear to me how a byte type helps resolve that problem.

 -Hal

Hal Finkel via llvm-dev

unread,
Jun 9, 2021, 7:15:06 PM6/9/21
to Chris Lattner, llvm-dev, cfe...@lists.llvm.org, Ralf Jung

I agree, these different problems have three different solutions. Also,
let me add that I see three quasi-separable discussions here (accounting
for past discussions on the same topic):

 1. Do we have a consistency problem with how we treat pointers and
their provenance information? The answer here is yes (see, e.g., the GVN
examples from this thread).

 2. Do we need to do more than be as conservative as possible around
ptrtoint/inttoptr usages? This is relevant because trying to be clever
here is often where inconsistencies around our pointer semantics are
exposed, although it's not always the case that problems involve
inttoptr. Addressing the points I raised above will lessen the
motivation to be more aggressive here (although, in itself, that will
not fix the semantic inconsistencies around pointers).

 3. Does introducing a byte type help resolve the semantic issues
around pointers? I don't yet understand why this might help.

Thanks again,

Juneyoung Lee via llvm-dev

unread,
Jun 10, 2021, 5:05:26 AM6/10/21
to Hal Finkel, Chris Lattner, llvm-dev, James Courtier-Dutton via cfe-dev, Ralf Jung
I created https://reviews.llvm.org/D104013 - please feel free to leave comments.

Thanks,
Juneyoung 

Richard Smith via llvm-dev

unread,
Jun 10, 2021, 8:26:01 PM6/10/21
to Nuno Lopes, llvm-dev, Ralf Jung, Clang Dev
On Sun, 6 Jun 2021 at 12:15, Nuno Lopes via cfe-dev <cfe...@lists.llvm.org> wrote:
Of course, let's look at a simple example with GVN:
if (p == q) {
  *(p-1) = 0;
}

If p/q are integers, then it's perfectly fine to replace p with q or vice-versa inside the 'if'.

If we start with:

char a[n];
char b[n];
long p = (long)(a+n);
long q = (long)b;
if (p == q) {
   *((char*)p-1) = 0;
}

... what prevents us from observing the same issue?

Nicolai Hähnle via llvm-dev

unread,
Jun 11, 2021, 1:47:53 AM6/11/21
to Hal Finkel, llvm-dev, Ralf Jung, cfe-dev
I have written a longer article that resulted as a byproduct of thinking through the problem space of this proposal: https://nhaehnle.blogspot.com/2021/06/can-memcpy-be-implemented-in-llvm-ir.html

What happened is that I ended up questioning some really fundamental things, like, can we even implement memcpy? :) The answer is a qualified Yes, but I found it to be a good framework for thinking about the fundamentals of what is discussed here, so I published this in the hope that others find it useful.

tl;dr: This discussion is ultimately all about pointer provenance. There is a gap in the expressiveness of LLVM IR when it comes to that, with surprising consequences for memcpy (and similar operations). From an aesthetics point of view, filling this gap has a lot of appeal, and the "byte" proposal points in that direction. However, I have some issues with the details of the proposal, and it is so intrusive that it needs to be justified by more than just aesthetics.

The correctness issues in the problem space can be solved by much less intrusive means. The justification for the more intrusive means would be better alias analysis, but I don't think this case has been built well enough so far. We should also consider alternatives (though I don't think there are any that are truly simple).

Apart from that, we need to be much more precise in our documentation of pointer provenance in LangRef (e.g.: what does llvm.memcpy do, exactly -- the mentioned bug 37469 could technically be a bug in the loop idiom recognizer!), and I like the idea of an `unrestrict(p)` instruction as a simpler and more evocative spelling of `inttoptr(ptrtoint(p))`.

I would also like to better understand how this interacts with the C99 "restrict" work that Jeroen pointed out. Overall, this is an important discussion to have but I feel we're only at the very beginning.

tl;dr of the tl;dr: It's complicated :)

Cheers,
Nicolai

Harald van Dijk via llvm-dev

unread,
Jun 11, 2021, 4:47:14 AM6/11/21
to Richard Smith, Nuno Lopes, llvm-dev, Clang Dev, Ralf Jung
On 11/06/2021 01:25, Richard Smith via llvm-dev wrote:
> On Sun, 6 Jun 2021 at 12:15, Nuno Lopes via cfe-dev
> <cfe...@lists.llvm.org <mailto:cfe...@lists.llvm.org>> wrote:
>
> Of course, let's look at a simple example with GVN:
> if (p == q) {
>   *(p-1) = 0;
> }
>
> If p/q are integers, then it's perfectly fine to replace p with q or
> vice-versa inside the 'if'.
>
>
> If we start with:
>
> char a[n];
> char b[n];
> long p = (long)(a+n);
> long q = (long)b;
> if (p == q) {
>    *((char*)p-1) = 0;
> }
>
> ... what prevents us from observing the same issue?

This is (or at least should be) okay, I think, because p is
ptrtoint(a+n), q is ptrtoint(b), the store is to inttoptr(p)-1, and we
should already know not to fold inttoptr(ptrtoint(p)) to p. Per
https://llvm.org/docs/LangRef.html#pointeraliasing, "A pointer value

formed by an inttoptr is based on all pointer values that contribute
(directly or indirectly) to the computation of the pointer’s value." The

"or indirectly" part of that includes the p == q branch condition.[*]
This means that (char*)p compares equal to a+n, (char*)q compares equal
to b, but if p == q, then (char*)p and (char*)q are equivalent and
either may be used to access a or b.

Cheers,
Harald van Dijk

[*] We have to consider all pointer values that contributed before
optimisations, which we do not track, so unless that changes and we do
somehow track it, effectively we have to consider all pointer values
regardless of whether they contributed to handle the possibility that
the p == q check may be optimised away.

Nuno Lopes via llvm-dev

unread,
Jun 11, 2021, 7:18:24 AM6/11/21
to Nicolai Hähnle, Hal Finkel, llvm-dev, Clang Dev, Ralf Jung

Thank you for the writeup!

I think it summarizes the problem & the solution space pretty well and clearly. Well worth a read for anyone interested in contributing to the discussion.

 

Nuno

Ralf Jung via llvm-dev

unread,
Jun 13, 2021, 11:09:45 AM6/13/21
to Nuno Lopes, Chris Lattner, John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org
Hi all,

to add to what Nuno said, I came up with a series of optimizations which
demonstrate that dead store elimination, integer "substitution" (GVN replacing
'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in
conflict with each other: the current LLVM semantics are inconsistent and can
lead to miscompilations.

I posted them here as Rust code:
<https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>.
A direct translation to C wouldn't be correct due to strict aliasing, but if you
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
I hope this example demonstrates that *something* needs to give. This is not
just a small bug somewhere, it is a case of the implicit assumptions made by
different optimization passes / instructions being mutually incompatible.

Kind regards,
Ralf

On 06.06.21 20:26, Nuno Lopes wrote:
> Let me try to explain the issue around typed memory. It’s orthogonal to TBAA.
> Plus I agree the allocation type is irrelevant; it’s the type of the load/store
> operations that matters.
>
> Fact 1: LLVM needs an “universal holder” type in its IR.
>
> Why?
>

> * Frontends load data from memory and don’t necessarily know the type they are


> loading. When you load a char in C, it can be part of a pointer or an
> integer, for example. Clang has no way to know.

> * That’s necessary to express implementations of memcpy in IR, as it copies


> raw data without knowing what the type is.

> * LLVM lowers small memcpys into load/store.


>
> I hope you agree LLVM needs a type that can contain any other type.
>
> Fact 2: optimizations for this “universal holder” type need to be correct for
> any LLVM IR type.
>
> Since this type can hold anything, optimizations that apply to this type must be
> correct for integers, pointers, etc. Otherwise, the results would be incorrect
> if the data was cast to the “wrong” type.
>
> The corollary of this fact is that GVN is not correct for the “universal holder”
> type in general. That’s because GVN is not correct for pointers in general
> (though correct in some cases). GVN doesn’t work for pointers since just because
> 2 pointers compare equal, it doesn’t mean they have the same provenance. As
> LLVM’s AA uses provenance information, this information must be preserved.
>
> Hopefully you agree with everything I wrote so far. If not please stop me and ask.
>
> Ok, so what’s the solution?
>
> We have two candidates for this “universal holder” type:
>

> * integer (i/<x>/)
> * byte (b/<x>/), a new dedicated type


>
> The advantage of going with integers is that they already exist. The
> disadvantage? To make LLVM correct, we would need to disable GVN in most cases
> or stop using provenance information in AA (make pointers == integers).
>
> This sounds pretty bad to me due to the expect performance hit.
>
> Introducing a new type requires work, yes. But it allows us to keep all integer
> optimizations as aggressive as we want, and only throttle down when encountering
> the byte type. Another benefit is that loading a byte from memory doesn’t escape
> any pointer, while with the first solution you need to consider all stored
> pointers as escaped when loading an integer.
>
> Introducing the byte type allow us to fix all the integer/pointer casts bugs in
> LLVM IR. And it enables more optimizations as we won’t need to be careful
> because of the LLVM IR bug.
>

> A 3^rd option is to keep everything as-is. That is, keep wrong optimizations in

> LLVM and keep adding hacks to limit the miscompilations in real-world programs.
>
> That has bitten us multiple times. Last time someone told me my miscompilation
> example was academic, LLVM ended up miscompiling itself, and then miscompiled
> Google’s code. Only this year we are planning to fully fix that bug (we have a
> hack right now).
>
> Last week Alive2 caught a miscompilation in the Linux kernel, in the network
> stack. The optimization got the pointer arithmetic wrong. Pretty scary, and the
> bug may have security implications.
>
> Anyway, just to argue that leaving things as-is is not an option. We have
> momentum and 3 GSoC students ready to help to fix the last few design bugs in
> LLVM IR.
>
> I know this stuff is complex; please ask questions if something is not clear or
> if you disagree with anything I wrote above.
>
> Thanks,
>
> Nuno
>

> *From:* Chris Lattner via llvm-dev
> *Sent:* 06 June 2021 05:27
> *To:* John McCall <rjmc...@apple.com>
> *Cc:* llvm...@lists.llvm.org; Ralf Jung <ju...@mpi-sws.org>; cfe...@lists.llvm.org
> *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
>
> On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe...@lists.llvm.org
> <mailto:cfe...@lists.llvm.org>> wrote:On 4 Jun 2021, at 11:24, George Mitenkov

--
Website: https://people.mpi-sws.org/~jung/

Ralf Jung via llvm-dev

unread,
Jun 13, 2021, 11:22:30 AM6/13/21
to Nicolai Hähnle, Harald van Dijk, llvm-dev, cfe-dev
Hi,

> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just
> define them as the corresponding i<N> op plus the union of provenance of the
> operands. This allows consistent implementation of char/unsigned char as b8,
> without having to jump back and forth between b8 and i8 all the time.

FWIW, "char" addition happens at "int" type due to integer promotion. So there
is no problem with back and forth here.

"Union" of provenance is currently not an operation that is required to model
LLVM IR, so your proposal would necessitate adding such a concept. It'll be
interesting to figure out how "getelementptr inbounds" behaves on
multi-provenance pointers...

> 6. (How) are pointer types fundamentally different from b<N> types of the
> correct size? (By this I mean: is there any interesting difference in the values
> that these types can carry? Ignore surface differences like the fact that GEP
> traditionally goes with pointers while `add` goes with integer types -- we could
> have a GEP instruction on a correctly sized b<N>)

I'm not saying I have the answer here, but one possible difference might arise
with "mixing bytes from different pointers". Say we are storing pointer "ptr1"
directly followed by "ptr2" on a 64bit machine, and now we are doing an
(unalinged) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes
of ptr2. This is certainly a valid value for b64. Is it also a valid value at
pointer type, and if yes, which provenance does it have?

Kind regards,
Ralf

Ralf Jung via llvm-dev

unread,
Jun 13, 2021, 11:27:15 AM6/13/21
to Johannes Doerfert, Nuno Lopes, George Mitenkov, llvm...@lists.llvm.org, cfe...@lists.llvm.org
Hi Johannes,

>> I think Joshua gave a very nice motivation already.
>
> I don't dispute that but I am still not understanding the need for bytes. None
> of the examples I have seen so far
> clearly made the point that it is the byte types that provide a substantial
> benefit. The AA example below does neither.

I hope <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html> makes a
convincing case that under the current semantics, when one does an "i64" load of
a value that was stored at pointer type, we have to say that this load returns
poison. In particular, saying that this implicitly performs a "ptrtoint" is
inconsistent with optimizations that are probably too important to be changed to
accommodate this implicit "ptrtoint".

If/once we agree on that, "byte" is a fairly natural proposal IMO: we need some
type at which to perform a load that does *not* turn some kinds of data into
poison. "byte" is that type.

Kind regards,
Ralf

--
Website: https://people.mpi-sws.org/~jung/

John McCall via llvm-dev

unread,
Jun 14, 2021, 1:34:21 AM6/14/21
to Ralf Jung, llvm...@lists.llvm.org, cfe...@lists.llvm.org

On 13 Jun 2021, at 11:26, Ralf Jung wrote:

> Hi Johannes,
>
>>> I think Joshua gave a very nice motivation already.
>>
>> I don't dispute that but I am still not understanding the need for
>> bytes. None of the examples I have seen so far
>> clearly made the point that it is the byte types that provide a
>> substantial benefit. The AA example below does neither.
>
> I hope
> <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html>
> makes a convincing case that under the current semantics, when one
> does an "i64" load of a value that was stored at pointer type, we have
> to say that this load returns poison. In particular, saying that this
> implicitly performs a "ptrtoint" is inconsistent with optimizations
> that are probably too important to be changed to accommodate this
> implicit "ptrtoint".

I think it is fact rather obvious that, if this optimization as
currently written is indeed in conflict with the current semantics, it
is the optimization that will have to give. If the optimization is too
important for performance to give up entirely, we will simply have to
find some more restricted pattern that wee can still soundly optimize.

Perhaps the clearest reason is that, if we did declare that integer
types cannot carry pointers and so introduced byte types that could, C
frontends would have to switch to byte types for their integer types,
and so we would immediately lose this supposedly important optimization
for C-like languages, and so, since optimizing C is very important, we
would immediately need to find some restricted pattern under which we
could soundly apply this optimization to byte types. That’s assuming
that this optimization is actually significant, of course.

John.

Nicolai Hähnle via llvm-dev

unread,
Jun 14, 2021, 2:18:00 AM6/14/21
to Ralf Jung, llvm-dev, cfe-dev
Hi Ralf,

On Sun, Jun 13, 2021 at 5:09 PM Ralf Jung via llvm-dev <llvm...@lists.llvm.org> wrote:
to add to what Nuno said, I came up with a series of optimizations which
demonstrate that dead store elimination, integer "substitution" (GVN replacing
'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in
conflict with each other: the current LLVM semantics are inconsistent and can
lead to miscompilations.

I posted them here as Rust code:
<https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>.
A direct translation to C wouldn't be correct due to strict aliasing, but if you
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
I hope this example demonstrates that *something* needs to give. This is not
just a small bug somewhere, it is a case of the implicit assumptions made by
different optimization passes / instructions being mutually incompatible.

Agreed. Reading your example as LLVM IR, I think we have a choice to make: either memory is typed or it isn't. By "typed" I mean that an integer load following a pointer store or vice versa returns poison or is immediate UB.

If memory is typed, then your original example program has UB in at least two places:

1. A pointer stored via storage_ptr is loaded via storage_usize, used in a comparison, and the comparison result is branched on. This would be UB.
2. An integer is stored via storage_usize and then loaded via storage_ptr (pointer load) and then dereferenced. This would also be UB.
 
If memory is untyped and you want the original program to have defined behavior, my first thought is that the resolution that makes the most sense is to say that removing the store to *storage_usize is incorrect because that store is not truly redundant: it changes the pointer provenance that is associated with the underlying memory. The other optimizations in that chain of optimizations seem of higher value (keeping in mind that *most* dead store elimination is still correct, it's just this particular narrow case which isn't).

Cheers,
Nicolai

Nicolai Hähnle via llvm-dev

unread,
Jun 14, 2021, 2:23:02 AM6/14/21
to John McCall, llvm-dev, cfe-dev, Ralf Jung
On Mon, Jun 14, 2021 at 7:34 AM John McCall via llvm-dev <llvm...@lists.llvm.org> wrote:


On 13 Jun 2021, at 11:26, Ralf Jung wrote:

> Hi Johannes,
>
>>> I think Joshua gave a very nice motivation already.
>>
>> I don't dispute that but I am still not understanding the need for
>> bytes. None of the examples I have seen so far
>> clearly made the point that it is the byte types that provide a
>> substantial benefit. The AA example below does neither.
>
> I hope
> <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html>
> makes a convincing case that under the current semantics, when one
> does an "i64" load of a value that was stored at pointer type, we have
> to say that this load returns poison. In particular, saying that this
> implicitly performs a "ptrtoint" is inconsistent with optimizations
> that are probably too important to be changed to accommodate this
> implicit "ptrtoint".

I think it is fact rather obvious that, if this optimization as
currently written is indeed in conflict with the current semantics, it
is the optimization that will have to give.  If the optimization is too
important for performance to give up entirely, we will simply have to
find some more restricted pattern that wee can still soundly optimize.

I tend to agree. I don't think Ralf's example alone is convincing evidence that pointer-load of integer-store must be poison, i.e. memory must be typed.

FWIW, the least important optimization in that example's chain, and the one that is most obviously incorrect in an untyped memory world, is eliminating a store of a previously loaded value. How much would we actually lose if we disable this particular optimization? Note that this is only a small special case of dead store elimination. The more common case where there are two stores to memory in a row, and the first one is eliminated, is still correct.

Cheers,
Nicolai

 

Perhaps the clearest reason is that, if we did declare that integer
types cannot carry pointers and so introduced byte types that could, C
frontends would have to switch to byte types for their integer types,
and so we would immediately lose this supposedly important optimization
for C-like languages, and so, since optimizing C is very important, we
would immediately need to find some restricted pattern under which we
could soundly apply this optimization to byte types.  That’s assuming
that this optimization is actually significant, of course.

John.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Nicolai Hähnle via llvm-dev

unread,
Jun 14, 2021, 2:29:33 AM6/14/21
to Ralf Jung, llvm-dev, cfe-dev
Hi Ralf,

On Sun, Jun 13, 2021 at 5:22 PM Ralf Jung <ju...@mpi-sws.org> wrote:
> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just
> define them as the corresponding i<N> op plus the union of provenance of the
> operands. This allows consistent implementation of char/unsigned char as b8,
> without having to jump back and forth between b8 and i8 all the time.

FWIW, "char" addition happens at "int" type due to integer promotion. So there
is no problem with back and forth here.

"Union" of provenance is currently not an operation that is required to model
LLVM IR, so your proposal would necessitate adding such a concept. It'll be
interesting to figure out how "getelementptr inbounds" behaves on
multi-provenance pointers...

True, something needs to be said about that. The main question is whether "jumping" between different objects that are both in the provenance set is poison or not. Ultimately, the goal of provenance is to help alias analysis, so that's what should be driving that choice.
 

> 6. (How) are pointer types fundamentally different from b<N> types of the
> correct size? (By this I mean: is there any interesting difference in the values
> that these types can carry? Ignore surface differences like the fact that GEP
> traditionally goes with pointers while `add` goes with integer types -- we could
> have a GEP instruction on a correctly sized b<N>)

I'm not saying I have the answer here, but one possible difference might arise
with "mixing bytes from different pointers". Say we are storing pointer "ptr1"
directly followed by "ptr2" on a 64bit machine, and now we are doing an
(unalinged) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes
of ptr2. This is certainly a valid value for b64. Is it also a valid value at
pointer type, and if yes, which provenance does it have?

This kind of example is why I was implicitly assuming that we must have a "provenance union" operation anyway, whether we like it or not. I suppose the alternative is to say that pointers formed in this way, whether directly or indirectly, are poison, but I have my doubts whether this is feasible. What happens with pointer arithmetic where you start out with two pointers of different provenance, convert to integer in the source language, subtract them, use the result further in some way, and for some reason all steps are performed with "byte" types in LLVM IR?

Cheers,
Nicolai



Kind regards,
Ralf

Ralf Jung via llvm-dev

unread,
Jun 14, 2021, 7:04:43 AM6/14/21
to John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org
Hi,

>>> I don't dispute that but I am still not understanding the need for bytes.
>>> None of the examples I have seen so far
>>> clearly made the point that it is the byte types that provide a substantial
>>> benefit. The AA example below does neither.
>>
>> I hope <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html> makes
>> a convincing case that under the current semantics, when one does an "i64"
>> load of a value that was stored at pointer type, we have to say that this load
>> returns poison. In particular, saying that this implicitly performs a
>> "ptrtoint" is inconsistent with optimizations that are probably too important
>> to be changed to accommodate this implicit "ptrtoint".
>
> I think it is fact rather obvious that, if this optimization as currently
> written is indeed in conflict with the current semantics, it is the optimization
> that will have to give.  If the optimization is too important for performance to
> give up entirely, we will simply have to find some more restricted pattern that
> wee can still soundly optimize.

That is certainly a reasonable approach.
However, judging from how reluctant LLVM is to remove optimizations that are
much more convincingly wrong [1], my impression was that it is easier to
complicate the semantics than to remove an optimization that LLVM already performs.

[1]: https://bugs.llvm.org/show_bug.cgi?id=34548,
https://bugs.llvm.org/show_bug.cgi?id=35229;
see https://www.ralfj.de/blog/2020/12/14/provenance.html for a
more detailed explanation

> Perhaps the clearest reason is that, if we did declare that integer types cannot
> carry pointers and so introduced byte types that could, C frontends would have
> to switch to byte types for their integer types, and so we would immediately
> lose this supposedly important optimization for C-like languages, and so, since
> optimizing C is very important, we would immediately need to find some
> restricted pattern under which we could soundly apply this optimization to byte
> types.  That’s assuming that this optimization is actually significant, of course.

At least C with strict aliasing enabled (i.e., standard C) only needs to use the
byte type for "(un)signed char". The other integer types remain unaffected.
There is no arithmetic on these types ("char + char" is subject to integer
promotion), so the IR overhead would consist in a few "bytecast" instructions
next to / replacing the existing sign extensions that convert "char" to "int"
before performing the arithmetic.

Kind regards,
Ralf

Ralf Jung via llvm-dev

unread,
Jun 14, 2021, 7:11:57 AM6/14/21
to Nicolai Hähnle, llvm-dev, cfe-dev
Hi Nicolai,

> to add to what Nuno said, I came up with a series of optimizations which
> demonstrate that dead store elimination, integer "substitution" (GVN replacing
> 'a' by 'b' after an 'a == b'), and provenance-based alias analysis, are in
> conflict with each other: the current LLVM semantics are inconsistent and can
> lead to miscompilations.
>
> I posted them here as Rust code:
> <https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806
> <https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>>.
>
> A direct translation to C wouldn't be correct due to strict aliasing, but if
> you
> imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then
> I hope this example demonstrates that *something* needs to give. This is not
> just a small bug somewhere, it is a case of the implicit assumptions made by
> different optimization passes / instructions being mutually incompatible.
>
>
> Agreed. Reading your example as LLVM IR, I think we have a choice to make:
> either memory is typed or it isn't. By "typed" I mean that an integer load
> following a pointer store or vice versa returns poison or is immediate UB.
>
> If memory is typed, then your original example program has UB in at least two
> places:
>
> 1. A pointer stored via storage_ptr is loaded via storage_usize, used in a
> comparison, and the comparison result is branched on. This would be UB.
> 2. An integer is stored via storage_usize and then loaded via storage_ptr
> (pointer load) and then dereferenced. This would also be UB.

Agreed. However, to "fix" the example, it is enough to make one of these UB. So,
the second part could still be easily allowed -- converting integers to pointers
is lossless, so I do not think there is a problem with performing that
conversion "implicitly" as part of a load at ptr type.

> If memory is untyped and you want the original program to have defined behavior,
> my first thought is that the resolution that makes the most sense is to say that
> removing the store to *storage_usize is incorrect because that store is not
> truly redundant: it changes the pointer provenance that is associated with the
> underlying memory. The other optimizations in that chain of optimizations seem
> of higher value (keeping in mind that *most* dead store elimination is still
> correct, it's just this particular narrow case which isn't).

I first was confused what you mean by "narrow" here; this would affect *all*
cases of 'removing a store of the value just loaded' since we have no way to
know if that value might be a pointer.

However, judging from your other email, you seem to consider this a narrow case,
compared to removing the first of two consecutive stores to the same location.
That's fair, I am not nearly enough of a compiler person to evaluate these
trade-offs. :) I can just provide guidance by suggesting formal semantics that
support a given set of optimizations, and pointing out when a combination of
optimizations becomes internally inconsistent.

Kind regards,
Ralf

> > *To:* John McCall <rjmc...@apple.com <mailto:rjmc...@apple.com>>
> > *Cc:* llvm...@lists.llvm.org <mailto:llvm...@lists.llvm.org>; Ralf Jung
> <ju...@mpi-sws.org <mailto:ju...@mpi-sws.org>>; cfe...@lists.llvm.org
> <mailto:cfe...@lists.llvm.org>


> > *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
> >
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev
> <cfe...@lists.llvm.org <mailto:cfe...@lists.llvm.org>

> > <mailto:cfe...@lists.llvm.org <mailto:cfe...@lists.llvm.org>>> wrote:On

> Website: https://people.mpi-sws.org/~jung/ <https://people.mpi-sws.org/~jung/>


> _______________________________________________
> LLVM Developers mailing list

> llvm...@lists.llvm.org <mailto:llvm...@lists.llvm.org>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

John McCall via llvm-dev

unread,
Jun 14, 2021, 12:08:35 PM6/14/21
to Ralf Jung, llvm...@lists.llvm.org, cfe...@lists.llvm.org

The semantics you seem to want are that LLVM’s integer types cannot carry information from pointers. But I can cast a pointer to an integer in C and vice-versa, and compilers have de facto defined the behavior of subsequent operations like breaking the integer up (and then putting it back together), adding numbers to it, and so on. So no, as a C compiler writer, I do not have a choice; I will have to use a type that can validly carry pointer information for integers in C.

Since you seem to find this sort of thing compelling, please note that even a simple assignment like char c2 = c1 technically promotes through int in C, and so int must be able to carry pointer information if char can.

John.

Cranmer, Joshua via llvm-dev

unread,
Jun 14, 2021, 4:29:47 PM6/14/21
to John McCall, Ralf Jung, llvm...@lists.llvm.org, cfe...@lists.llvm.org

In case anyone reading this thread is unaware, there is currently a proposal before the C standards committee that goes into more details on a future provenance model for C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf. It actually details a few different provenance models, especially several variations on integers-do-not-have-provenance, mainly around what provenance is reconstructed on an inttoptr cast. While I’m not on the standards committee, I believe that something along the lines of this proposal will become the official C provenance rules.

 

While LLVM isn’t required to adopt C rules wholesale, the basic form of the proposal already matches applied LLVM semantics better than our own language reference (we do not consider integers to carry provenance). There are several points in the proposal where extra annotations are suggested to effect various provenance properties (e.g., the ability to write word-level memcpy), and offhand, it looks like the byte proposal would be a viable vehicle for those extra annotations, although I’m not entirely persuaded that it’s the best vehicle.

John McCall via llvm-dev

unread,
Jun 14, 2021, 5:40:07 PM6/14/21
to Cranmer, Joshua, llvm...@lists.llvm.org, cfe...@lists.llvm.org, Ralf Jung

On 14 Jun 2021, at 16:29, Cranmer, Joshua wrote:

In case anyone reading this thread is unaware, there is currently a proposal before the C standards committee that goes into more details on a future provenance model for C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf. It actually details a few different provenance models, especially several variations on integers-do-not-have-provenance, mainly around what provenance is reconstructed on an inttoptr cast. While I’m not on the standards committee, I believe that something along the lines of this proposal will become the official C provenance rules.

While LLVM isn’t required to adopt C rules wholesale, the basic form of the proposal already matches applied LLVM semantics better than our own language reference (we do not consider integers to carry provenance). There are several points in the proposal where extra annotations are suggested to effect various provenance properties (e.g., the ability to write word-level memcpy), and offhand, it looks like the byte proposal would be a viable vehicle for those extra annotations, although I’m not entirely persuaded that it’s the best vehicle.

To be clear, while that TR represents enormous progress (and I agree that we should be closely following that discussion), it openly acknowledges that it doesn’t know how representation copies are supposed to work. It’s clear that library calls like memcpy are supposed to preserve provenance exactly (rather than doing the more conservative thing that round-tripping through an integer requiring), but code that simply does the same thing inline needs to have the same effect, and nobody knows how that’s supposed to happen if integers don’t carry provenance.

John.

Juneyoung Lee via llvm-dev

unread,
Jun 15, 2021, 1:49:38 AM6/15/21
to John McCall, Ralf Jung, llvm-dev, James Courtier-Dutton via cfe-dev
On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <llvm...@lists.llvm.org> wrote:

The semantics you seem to want are that LLVM’s integer types cannot carry information from pointers. But I can cast a pointer to an integer in C and vice-versa, and compilers have de facto defined the behavior of subsequent operations like breaking the integer up (and then putting it back together), adding numbers to it, and so on. So no, as a C compiler writer, I do not have a choice; I will have to use a type that can validly carry pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer types not carry pointer information does not necessarily mean that dereferencing a pointer casted from integer is UB.

For example, the definition of cast_ival_to_ptrval at the n2676 proposal shows that a pointer's provenance is reconstructed from an integer.

(Whether n2676's cast_ival_to_ptrval can be also used for LLVM's inttoptr semantics is a different question, though) 

Since you seem to find this sort of thing compelling, please note that even a simple assignment like char c2 = c1 technically promotes through int in C, and so int must be able to carry pointer information if char can.

IIUC integer promotion is done when it is used as an operand of arithmetic ops or switch's condition, so I think assignment operation is okay.

Juneyoung 


_______________________________________________

John McCall via llvm-dev

unread,
Jun 15, 2021, 3:07:12 AM6/15/21
to Juneyoung Lee, llvm-dev, James Courtier-Dutton via cfe-dev, Ralf Jung

On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:

On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
llvm...@lists.llvm.org> wrote:

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer casted from integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance
(if any). I don’t understand how that can possibly give us more
flexibility to optimize integers.

Since you seem to find this sort of thing compelling, please note that
even a simple assignment like char c2 = c1 technically promotes through
int in C, and so int must be able to carry pointer information if char
can.

IIUC integer promotion is done when it is used as an operand of arithmetic
ops or switch's condition, so I think assignment operation is okay.

Hmm, I was misremembering the rule, you’re right.

John.

Juneyoung Lee via llvm-dev

unread,
Jun 15, 2021, 1:30:07 PM6/15/21
to John McCall, llvm-dev, James Courtier-Dutton via cfe-dev, Ralf Jung
On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmc...@apple.com> wrote:

On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:

On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
llvm...@lists.llvm.org> wrote:

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer casted from integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance
(if any). I don’t understand how that can possibly give us more
flexibility to optimize integers.

When two objects are adjacent, and an integer is exactly pointing to the location between them, its provenance cannot be properly recovered.

int x[1], y[1];
llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);
int *p = (int*)(intptr_t)&x[1];
// Q: Is p's provenance x or y?

If it is expected that '*(p-1)' is equivalent to *x, p's provenance should be x.
However, based on llvm.assume, optimizations on integers can replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the bug report).
Then, '*(p-1)' suddenly becomes out-of-bounds access, which is UB.
So, p's provenance isn't simply x or y; it should be something that can access both x and y.

This implies that, unless there is a guarantee that all allocated objects are one or more bytes apart, there is no type that can perfectly store a pointer byte.
memcpy(x, y, 8) isn't equivalent to 'v=load i64 y;store i64 v, x' because v already lost the pointer information.


The pointer information is perfectly stored in a byte type. But, arithmetic property-based optimizations such as the above one are not correct anymore.
Here is an example with a byte-type version:

int x[1], y[1];
// byte_8 is a 64-bits byte type
llvm.assume((byte_8)&x[0] == 0x100 && (byte_8)&y[0] == 0x104);
int *p = (int*)(byte_8)&x[1];
// p's provenance is alway x.

For a byte type, equality comparison is true does not mean that the two values are precisely equal.
Since (byte_8)&x[1] and (byte_8)&y[0] have different provenances, replacing one with another must be avoided.
Instead, we can guarantee that p is precisely equivalent to &x[1].
Another benefit is that optimizations on integers do not need to suffer from these pointer thingy anymore;
e.g., the optimization on llvm.assume above can survive and it does not need to check whether an integer variable is derived from a pointer value.
 

Since you seem to find this sort of thing compelling, please note that
even a simple assignment like char c2 = c1 technically promotes through
int in C, and so int must be able to carry pointer information if char
can.

IIUC integer promotion is done when it is used as an operand of arithmetic
ops or switch's condition, so I think assignment operation is okay.

Hmm, I was misremembering the rule, you’re right.

John.

Ralf Jung via llvm-dev

unread,
Jun 15, 2021, 3:07:53 PM6/15/21
to Nicolai Hähnle, llvm-dev, cfe-dev
Hi Nicolai,

> > 6. (How) are pointer types fundamentally different from b<N> types of the
> > correct size? (By this I mean: is there any interesting difference in the
> values
> > that these types can carry? Ignore surface differences like the fact that
> GEP
> > traditionally goes with pointers while `add` goes with integer types --
> we could
> > have a GEP instruction on a correctly sized b<N>)
>
> I'm not saying I have the answer here, but one possible difference might arise
> with "mixing bytes from different pointers". Say we are storing pointer "ptr1"
> directly followed by "ptr2" on a 64bit machine, and now we are doing an
> (unalinged) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes
> of ptr2. This is certainly a valid value for b64. Is it also a valid value at
> pointer type, and if yes, which provenance does it have?
>
>
> This kind of example is why I was implicitly assuming that we must have a
> "provenance union" operation anyway, whether we like it or not. I suppose the
> alternative is to say that pointers formed in this way, whether directly or
> indirectly, are poison, but I have my doubts whether this is feasible. What
> happens with pointer arithmetic where you start out with two pointers of
> different provenance, convert to integer in the source language, subtract them,
> use the result further in some way, and for some reason all steps are performed
> with "byte" types in LLVM IR?

My personal model here is that every *byte* independently can carry provenance.
When a pointer is loaded and the bytes have different provenance, the load is
fine -- but any load/store with that pointer will be UB. This is because once
there are different provenances, i.e. different allocated objects that this
pointer is supposed to point to, then at least one of them will be violated by
the access, so this is an "incorrect provenance" access. In other words, such
pointers are very similar to those created by going out-of-bounds with
"getelementptr" (no inbounds!): not poison, but still UB to load/store.

Regarding your question, there is no operation that takes two pointers as input,
so there is never a need to "union" provenance. Your sequence of operation will
involve integer roundtrips, and pointers cast from integers have a kind of
"wildcard" provenance; they may access any allocation. (getelementptr inbounds
on such pointers can restrict that; see
https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf for details. And now I am
wondering what happens when one mixes the bytes of several of these pointers
that have 'GEPi on integer pointer provenance'... but saying this is an invalid
provenance that may not to *any* loads/stores seems fine.)

Kind regards,
Ralf

>
> Cheers,
> Nicolai
>
>
>
> Kind regards,
> Ralf
>
>
>
> --
> Lerne, wie die Welt wirklich ist,
> aber vergiss niemals, wie sie sein sollte.

--
Website: https://people.mpi-sws.org/~jung/

Ralf Jung via llvm-dev

unread,
Jun 15, 2021, 3:16:06 PM6/15/21
to John McCall, llvm...@lists.llvm.org, cfe...@lists.llvm.org
Hi,

> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back together),
> adding numbers to it, and so on. So no, as a C compiler writer, I do not have a
> choice; I will have to use a type that can validly carry pointer information for
> integers in C.

Integers demonstrably do not carry provenance; see
<https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
explanation of why.
As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the original
provenance information is lost. This means that optimizing away such roundtrips
is incorrect, and indeed doing so leads to miscompilations
(https://bugs.llvm.org/show_bug.cgi?id=34548).

The key difference between int and byte is that ptr-byte-ptr roundtrips are
*lossless*, all the provenance is preserved. This means some extra optimizations
(such as removing these roundtrips -- which implicitly happens when a
redundant-store-after-load is removed), but also some lost optimizations (most
notably, "x == y" does not mean x and y are equal in all respects; their
provenance might still differ, so it is incorrect for GVN to replace one my the
other).

It's a classic tradeoff: we can *either* have lossless roundtrips *or* "x == y"
implies full equality of the abstract values. Having both together leads to
contradictions, which manifest as miscompilations. "byte" and "int" represent
the two possible choices here; therefore, by adding "byte", LLVM would close a
gap in the expressive power of its IR.

James Courtier-Dutton via llvm-dev

unread,
Jun 15, 2021, 4:10:59 PM6/15/21
to Ralf Jung, llvm-dev, cfe-dev@lists.llvm.org Developers
On Tue, 15 Jun 2021 at 20:16, Ralf Jung via llvm-dev
<llvm...@lists.llvm.org> wrote:
>
> Integers demonstrably do not carry provenance; see
> <https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
> explanation of why.

That article is nice in the way it is trying to describe what the problem is.
Extract from the article:
"How can we fix this?
To fix the problem, we will have to declare one of the three
optimizations incorrect and stop performing it. Speaking in terms of
the LLVM IR semantics, this corresponds to deciding whether pointers
and/or integers have provenance:
1) We could say both pointers and integers have provenance, which
invalidates the first optimization.
2) We could say pointers have provenance but integers do not, which
invalidates the second optimization.
3) We could say nothing has provenance, which invalidates the third
optimization."

This is really the part that I disagree with. There are more possible
alternatives than just 1, 2 and 3.

Kind Regards

James

Ralf Jung via llvm-dev

unread,
Jun 15, 2021, 4:16:00 PM6/15/21
to James Courtier-Dutton, llvm-dev, cfe-dev@lists.llvm.org Developers
Hi James,

>> Integers demonstrably do not carry provenance; see
>> <https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
>> explanation of why.
>
> That article is nice in the way it is trying to describe what the problem is.
> Extract from the article:
> "How can we fix this?
> To fix the problem, we will have to declare one of the three
> optimizations incorrect and stop performing it. Speaking in terms of
> the LLVM IR semantics, this corresponds to deciding whether pointers
> and/or integers have provenance:
> 1) We could say both pointers and integers have provenance, which
> invalidates the first optimization.
> 2) We could say pointers have provenance but integers do not, which
> invalidates the second optimization.
> 3) We could say nothing has provenance, which invalidates the third
> optimization."
>
> This is really the part that I disagree with. There are more possible
> alternatives than just 1, 2 and 3.

I am interested to hear those. :)
Certainly, we have to invalidate at least one of the optimizations -- if all 3
optimizations are valid, we have a contradiction.

Kind regards,
Ralf

David Blaikie via llvm-dev

unread,
Jun 15, 2021, 5:38:47 PM6/15/21
to Ralf Jung, llvm-dev, Clang Dev
On Tue, Jun 15, 2021 at 12:16 PM Ralf Jung via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi,

> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back together),
> adding numbers to it, and so on. So no, as a C compiler writer, I do not have a
> choice; I will have to use a type that can validly carry pointer information for
> integers in C.

Integers demonstrably do not carry provenance; see
<https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed
explanation of why.
As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the original
provenance information is lost. This means that optimizing away such roundtrips
is incorrect, and indeed doing so leads to miscompilations
(https://bugs.llvm.org/show_bug.cgi?id=34548).

The key difference between int and byte is that ptr-byte-ptr roundtrips are
*lossless*, all the provenance is preserved. This means some extra optimizations
(such as removing these roundtrips -- which implicitly happens when a
redundant-store-after-load is removed), but also some lost optimizations (most
notably, "x == y" does not mean x and y are equal in all respects; their
provenance might still differ, so it is incorrect for GVN to replace one my the
other).

It's a classic tradeoff: we can *either* have lossless roundtrips

I think an important part of explaining the motivation for "byte" would be an explanation/demonstration of what the cost of losing "lossless roundtrips" would be.
 

John McCall via llvm-dev

unread,
Jun 15, 2021, 7:56:40 PM6/15/21
to Juneyoung Lee, llvm-dev, Hal Finkel, James Courtier-Dutton via cfe-dev, Ralf Jung

On 15 Jun 2021, at 13:29, Juneyoung Lee wrote:

On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmc...@apple.com> wrote:

On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:

On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
llvm...@lists.llvm.org> wrote:

The semantics you seem to want are that LLVM’s integer types cannot carry
information from pointers. But I can cast a pointer to an integer in C and
vice-versa, and compilers have de facto defined the behavior of subsequent
operations like breaking the integer up (and then putting it back
together), adding numbers to it, and so on. So no, as a C compiler writer,
I do not have a choice; I will have to use a type that can validly carry
pointer information for integers in C.

int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer casted from integer is UB.

What exactly is the claimed formal property of byte types, then,
that integer types will lack? Because it seems to me that converting
from an integer gives us valid provenance in strictly more situations
than converting from bytes, since it reconstructs provenance if there’s
any object at that address (under still-debated restrictions),
while converting from bytes always preserves the original provenance

(if any). I don’t understand how that can possibly give us *more*
flexibility to optimize integers.

When two objects are adjacent, and an integer is exactly pointing to the
location between them, its provenance cannot be properly recovered.

int x[1], y[1];
llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] == 0x104);
int *p = (int*)(intptr_t)&x[1];
// Q: Is p's provenance x or y?

If it is expected that '*(p-1)' is equivalent to *x, p's provenance should
be x.
However, based on llvm.assume, optimizations on integers can
replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what happened in the
bug report).
Then, '*(p-1)' suddenly becomes out-of-bounds access, which is UB.
So, p's provenance isn't simply x or y; it should be something that can
access both x and y.

This is something that the TR does not really have a firm grasp on yet.

This implies that, unless there is a guarantee that all allocated objects
are one or more bytes apart, there is no type that can perfectly store a
pointer byte.
memcpy(x, y, 8) isn't equivalent to 'v=load i64 y;store i64 v, x' because v
already lost the pointer information .

Okay, so let me try to restate and summarize all this. I’ve CC’ed
a bunch of people back into this part of the thread.

C is moving towards a provenance model; you can find the details in
this committee TR that Joshua Cranmer linked:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf

This TR is very clearly a work in progress and contains many
digressions and several possible sets of rules with different
implications. I will try to briefly summarize.

Formally, all storage has a unique provenance: a notional unique
identifier for the specific allocation event of the storage, like a
local variable coming into scope, or a specific call to malloc.
Pointers formed to the storage formally carry that provenance (or
invalid provenance, or in some corner cases ambiguous provenance).
It is undefined behavior to do certain things with mismatched
provenances. For example, it is undefined behavior to access
an object through a pointer whose provenance doesn’t match the
current provenance of the storage containing that object. This is
a new way of formalizing the well-known rule that you can’t access
a local variable after it’s gone out of scope.

In the provenance model that looks most likely to be accepted, an
int-to-pointer cast blesses the resulting pointer with the current
provenance of the storage containing that address. There does not
have to be any sort of data dependence between taking the address
of the storage and the integer that’s used in the cast; it can
simply be a lucky guess. An integer could reasonably hold the
representation of any address in the program, and so an
int-to-pointer cast could bless its result with valid provenance
for any storage in memory. But if the pointed-to memory is
subsequently deallocated, the provenance of the pointer will
become out of date.

Now, that rule as I’ve stated it would be really bad. Allowing a
lucky guess to resolve to absolutely anything would almost
completely block the optimizer from optimizing memory. For example,
if a local variable came into scope, and then we called a function
that returned a different pointer, we’d have to conservatively
assume that that pointer might alias the local, even if the address
of the local was never even taken, much less escaped:

  int x = 0;
  int *p = guess_address_of_x();
  *p = 15;
  printf(“%d\n”, x); // provably 0?

So the currently favored proposal adds a really important caveat:
this blessing of provenance only works when a pointer with the
correct provenance has been “exposed”. There are several ways to
expose a pointer, including I/O, but the most important is casting
it to an integer.

Again, there’s no requirement of a data dependence between the
exposure and the int-to-pointer cast. If the program casts a
pointer to an integer, and an independently-produced integer
that happens to be the same value is later cast to a pointer,
and the storage hasn’t been reallocated in the meantime, the
resulting pointer will have the right provenance for the memory
and will be valid to use. This implies that pointer-to-int casts
(and other exposures) are semantically significant events in the
program. They don’t have side effects in the normal sense, but
they must be treated by the compiler just like things that do have
side effects: e.g. unless I’m missing something in the TR,
eliminating a completely unused pointer-to-int cast may make
later code UB.

And in fact, it turns out that this is crucially important for
optimization. If the optimizer wants to allow arbitrary
replacement of integers based on conditional equality, like
in GVN, then replacement totally breaks direct data dependence,
and you can easily be left with no remaining uses of a pointer-to-int
cast when the original code would have had a data dependence. So
you cannot reason about provenance through int-to-pointer casts:
the pointer can alias any storage whose provenance has potentially
been exposed, and the optimizer must be conservative about optimizing
memory that has potentially been exposed.

Of course, a lot of this isn’t new. If code passes a pointer to an
opaque function, then yeah, the optimizer has to assume that the
function might expose it by performing a pointer-to-int cast on it.
But that’s okay, because the optimizer already has to assume that
the function can escape the pointer in a more standard way. Exposure
is just another form of escape.

Everything I’ve been talking about so far is a C-level concept:
an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
in LLVM IR. But I think people have an expectation of what these
things mean in LLVM IR, and I haven’t seen it written out explicitly,
so let’s do that now.

The first assumption here is that int-to-pointer and pointer-to-int
casts in C will translate to inttoptr and ptrtoint operations
in IR. Now, this is already problematic, because those operations
do not currently have the semantics they need to have to make the
proposed optimization model sound. In particular:

  • ptrtoint does not have side-effects and can be dead-stripped
    when unused, which as discussed above is not okay.

  • ptrtoint on a constant is folded to a constant expression,
    not an instruction, which is therefore no longer anchored in the
    code and does not reliably record that the global may have escaped.
    (Unused constant expressions do not really exist, and we absolutely
    cannot allow them to affect the semantics of the IR.)

    Of course, this is only significant for globals that don’t already
    have to be treated conservatively because they might have other
    uses. That is, it only really matters for globals with, say,
    internal or private linkage.

  • inttoptr can be reordered with other instructions, which is
    not allowed because different points in a function may have
    different sets of storage with escaped provenance.

  • inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
    aspects of removing the inttoptr, this also potentially
    introduces UB because the original inttoptr “launders” the
    provenance of the pointer to the current provenance of the
    storage, whereas the original pointer may have stale provenance.

There are probably other problems. There are two ways that I see
of addressing all of this.

The first way is to take them on directly and change the core
semantic rules of LLVM IR. I think this is a huge change, but
maybe it’s necessary in order to get the semantics we want. We
will need to introduce some new IR features to make this possible,
both for performance and for expressivity. For example, the way
we currently express relative pointers in constant initializers
involves ptrtoint and sub constant expressions, so if we
remove ptrtoint constants, we’ll need something else. And
frontends will need to be much more cautious about not introducing
unnecessary int<->pointer casts because they’ll be heavily
pessimizing.

The second way is to say that inttoptr and ptrtoint are just
superficial type conversions and don’t affect provenance, then
add builtins that do affect provenance. But this leaves us still
tracking provenance through integers, so any optimization that’s
not valid on pointers will also not be valid on integers.

All of this is not even considering the need for byte types yet.
Here’s the thinking behind byte types as I understand it.

The claim is that, to make all of the above work, we need
int<->pointer conversions to be explicit in IR. If there’s some
other way of doing an int<->pointer conversion, we’re in trouble,
because maybe we won’t realize that the conversion has happened,
and so we won’t be appropriately conservative. And unfortunately
C provides some other ways to do these conversions, which happen
to all go through memory: copying representations around, doing
aliased loads and stores, whatever. So we need to have some way
to represent conversions that happen through these mechanisms.
And C says that these mechanisms don’t generally affect provenance,
so inserting inttoptr/ptrtoint casts is at the very least
imprecise because it launders provenance. So what we want is a
type that maintains the original provenance, and therefore is
subject to the same optimization restrictions as pointers.

I don’t find either side of this argument very convincing.

First, the compiler already has to be very conservative about
memory. If a pointer is stored into memory, we generally have
to treat it as having fully escaped: unless we can prove that the
memory isn’t loaded by other code, we have to assume that it is,
and that the loading code will do arbitrary other stuff. That
could include doing a pointer-to-int cast and sharing the pointer
back to us as an integer. Therefore, in the general case, our
ability to optimize when a pointer has escaped into memory is at
least as bad as if it had escaped via an int-to-pointer cast. If
we can nonetheless optimize, because we can reason about some of
the uses together or prove that there aren’t any other uses,
then okay, maybe we see that there’s an int<->pointer conversion.
But translating this to ptrtoint/inttoptr should be, at
worst, overly conservative; it’s not unsound, for reasons I’m
about to get into.

Second, adding casts through an integer type is always valid.
Doing so might block the optimizer, but it doesn’t break semantics.
If I have a program that does e.g *px = 15, and I change it to
do *(int*)(intptr_t)px = 15, my program has become well-defined
in strictly more situations: in any case, there must be valid
storage at px for this not to be UB, but previously px might
have had the wrong provenance, and now it never does as long as
the provenance for that storage has previously escaped.

If we find that we’re generating too many unnecessary casts
through integer types and it’s really blocking the optimizer too
much, then we should consider the best solutions to those problems
as they arise. It may be the case that we need a better solution
for type conversions introduced through manual memcpy-like code
so that we maintain the original provenance instead of adding
explicit int<->pointer conversions that launder provenance.
I don’t know that byte types are the best solution to that, but
we can consider them.

But this whole conversation about byte types seems to be coming
at it the wrong way around. We need to be centering the first
set of problems around int<->pointer casts.

John.

Andrew Trick via llvm-dev

unread,
Jun 19, 2021, 1:21:49 AM6/19/21
to John McCall, llvm-dev, Hal Finkel, James Courtier-Dutton via cfe-dev, Ralf Jung

On Jun 15, 2021, at 4:56 PM, John McCall via llvm-dev <llvm...@lists.llvm.org> wrote:

If we find that we’re generating too many unnecessary casts
through integer types and it’s really blocking the optimizer too
much, then we should consider the best solutions to those problems
as they arise. It may be the case that we need a better solution
for type conversions introduced through manual memcpy-like code
so that we maintain the original provenance instead of adding
explicit int<->pointer conversions that launder provenance.
I don’t know that byte types are the best solution to that, but
we can consider them.

We can't retroactively weaken inttoptr semantics relative to what frontends rely on--it should continue to have the semantics of a pointer cast in C, which need to be handled more conservatively in LLVM.

We can easily express the weaker memcpy semantics as "!raw" metadata on the load and inttoptr instructions. (I apologize if that was already mentioned in this thread.)

Introducing a new type to communicate pointer semantics seems inconsistent. LLVM types primarily communicate ABI requirements from the frontend to the target. The semantics of pointer loads, and integer conversions can be specified on those operations and generated when the compiler emits byte-wise copies as opposed to semantic casts. We can expose a clang attribute or builtin for manual pointer copying.

-Andy

Ralf Jung via llvm-dev

unread,
Jun 20, 2021, 6:23:12 AM6/20/21
to David Blaikie, llvm-dev, Clang Dev
Hi David,

> Integers demonstrably do not carry provenance; see
> <https://www.ralfj.de/blog/2020/12/14/provenance.html
> <https://www.ralfj.de/blog/2020/12/14/provenance.html>> for a detailed
> explanation of why.
> As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the
> original
> provenance information is lost. This means that optimizing away such roundtrips
> is incorrect, and indeed doing so leads to miscompilations
> (https://bugs.llvm.org/show_bug.cgi?id=34548

> <https://bugs.llvm.org/show_bug.cgi?id=34548>).


>
> The key difference between int and byte is that ptr-byte-ptr roundtrips are
> *lossless*, all the provenance is preserved. This means some extra
> optimizations
> (such as removing these roundtrips -- which implicitly happens when a
> redundant-store-after-load is removed), but also some lost optimizations (most
> notably, "x == y" does not mean x and y are equal in all respects; their
> provenance might still differ, so it is incorrect for GVN to replace one my the
> other).
>
> It's a classic tradeoff: we can *either* have lossless roundtrips
>
>
> I think an important part of explaining the motivation for "byte" would be an
> explanation/demonstration of what the cost of losing "lossless roundtrips" would be.

I am not entirely sure where you are going with this question. Currently LLVM
assumes *both* lossless ptr-int-ptr roundtrips *and* it goes GVN based on "x ==
y" (on integers). This is simply inconsistent and demonstrably leads to
miscompilations. One of them needs to be lost.
Consensus in LLVM seems to be that one would rather lose lossless roundtrip than
GCN based on "==". I am not an expert in these trade-offs among optimizations.
All I can do is cast some light on where the edge of the design space for a
correct set of optimizations lies. I will leave it to others to decide where in
that design space they'd rather be.

Kind regards,
Ralf

>
> *or* "x == y"
> implies full equality of the abstract values. Having both together leads to
> contradictions, which manifest as miscompilations. "byte" and "int" represent
> the two possible choices here; therefore, by adding "byte", LLVM would close a
> gap in the expressive power of its IR.
>
> Kind regards,
> Ralf
> _______________________________________________
> LLVM Developers mailing list

> llvm...@lists.llvm.org <mailto:llvm...@lists.llvm.org>
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>

--
Website: https://people.mpi-sws.org/~jung/

David Chisnall via llvm-dev

unread,
Jun 20, 2021, 11:56:04 AM6/20/21
to Ralf Jung, llvm-dev, cfe-dev
On 13 Jun 2021, at 16:22, Ralf Jung via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> "Union" of provenance is currently not an operation that is required to model LLVM IR, so your proposal would necessitate adding such a concept. It'll be interesting to figure out how "getelementptr inbounds" behaves on multi-provenance pointers...

Union provenance is required if you want an XOR linked list to be valid. These are pretty rare, but there are some idioms (including the calculation of per-CPU storage in the FreeBSD kernel) that depend on multi-provenance semantics.

CHERI systems, such as the Morello boards that Arm is shipping early next year, provide a hardware-enforced single-provenance semantics, which might provide some inspiration for this discussion:

In 64-bit CHERI implementations, memory capabilities are a 128-bit type protected by a tag bit (in memory and registers) that signifies that it has been derived from one of the capabilities provided in a register on hardware reset. Any operation that would violate the montonicity of rights (e.g. overwriting a single byte in a valid capability in memory) clears the tag bit, destroying its provenance and causing a trap if you try to use it as the base for a load or store instruction. When compiling from C-family languages, the memory capability is the hardware type to which all pointer types are lowered.

Our clang port to target these architectures defines a new built-in type, `__intcap_t`, which is used to represent `intptr_t`. When we emit LLVM IR, we lower this to an LLVM IR pointer type, not an integer type. All C operations on `__intcap_t` are defined to take provenance from the left operand, with a warning if we can statically show that this is probably wrong.

In our model, at the IR level, `ptrtoint` is fine, but the integer does not carry provenance. `inttoptr` is not permitted[1]. If code wants to extract an address from a pointer for comparison or for hashing (for example), that’s fine, but it can’t turn the integer back into a pointer directly. If pointers flow around as integers in the C sources, the may-be-a-pointer-in-C types are lowered to pointer types in the IR. This would be easier if C arithmetic operations were defined on pointer types. we currently have to use an IR intrinsic to get the address, then do the arithmetic, then reapply the result to the pointer type, and try to fold that again in the back end.

We have found that large codebases require a very small amount of porting to support this mode, but they *do* require some. This is not a 100% compatible mode with existing codebases and so a single-provenance model for LLVM IR (at least, as the only option) would not be acceptable.

David

[1] Well, kind of. It gives you a capability relative to the default data capability, but in the ABI where all pointers are capabilities then the default data capability is likely to be invalid. Optimisations may not introduce `inttoptr`.

Ralf Jung via llvm-dev

unread,
Jun 20, 2021, 12:27:05 PM6/20/21
to David Chisnall, llvm-dev, cfe-dev
Hi,

>> "Union" of provenance is currently not an operation that is required to model LLVM IR, so your proposal would necessitate adding such a concept. It'll be interesting to figure out how "getelementptr inbounds" behaves on multi-provenance pointers...
>
> Union provenance is required if you want an XOR linked list to be valid.

I don't think that's true? These will use integer-pointer casts (I assume, since
XOR on pointer types is not a thing), and pointers created from integers have a
kind of "wildcard" provenance -- no union. (One *can* use something more
fine-grained for pointers created from integers, but I think for LLVM that would
have more downsides than upsides. In particular, it would make inttoptr harder
to reorder around other instructions).

Kind regards,
Ralf

> These are pretty rare, but there are some idioms (including the calculation of per-CPU storage in the FreeBSD kernel) that depend on multi-provenance semantics.
>
> CHERI systems, such as the Morello boards that Arm is shipping early next year, provide a hardware-enforced single-provenance semantics, which might provide some inspiration for this discussion:
>
> In 64-bit CHERI implementations, memory capabilities are a 128-bit type protected by a tag bit (in memory and registers) that signifies that it has been derived from one of the capabilities provided in a register on hardware reset. Any operation that would violate the montonicity of rights (e.g. overwriting a single byte in a valid capability in memory) clears the tag bit, destroying its provenance and causing a trap if you try to use it as the base for a load or store instruction. When compiling from C-family languages, the memory capability is the hardware type to which all pointer types are lowered.
>
> Our clang port to target these architectures defines a new built-in type, `__intcap_t`, which is used to represent `intptr_t`. When we emit LLVM IR, we lower this to an LLVM IR pointer type, not an integer type. All C operations on `__intcap_t` are defined to take provenance from the left operand, with a warning if we can statically show that this is probably wrong.
>
> In our model, at the IR level, `ptrtoint` is fine, but the integer does not carry provenance. `inttoptr` is not permitted[1]. If code wants to extract an address from a pointer for comparison or for hashing (for example), that’s fine, but it can’t turn the integer back into a pointer directly. If pointers flow around as integers in the C sources, the may-be-a-pointer-in-C types are lowered to pointer types in the IR. This would be easier if C arithmetic operations were defined on pointer types. we currently have to use an IR intrinsic to get the address, then do the arithmetic, then reapply the result to the pointer type, and try to fold that again in the back end.
>
> We have found that large codebases require a very small amount of porting to support this mode, but they *do* require some. This is not a 100% compatible mode with existing codebases and so a single-provenance model for LLVM IR (at least, as the only option) would not be acceptable.
>
> David
>
> [1] Well, kind of. It gives you a capability relative to the default data capability, but in the ABI where all pointers are capabilities then the default data capability is likely to be invalid. Optimisations may not introduce `inttoptr`.
>

--
Website: https://people.mpi-sws.org/~jung/

Juneyoung Lee via llvm-dev

unread,
Jun 21, 2021, 2:15:27 AM6/21/21
to John McCall, llvm-dev, Hal Finkel, Ralf Jung
Hi,
Sorry for my late reply, and thank you for sharing great summaries & ideas. I'll leave my thoughts below.

On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmc...@apple.com> wrote:

Okay, so let me try to restate and summarize all this. I’ve CC’ed

a bunch of people back into this part of the thread.

C is moving towards a provenance model; you can find the details in
this committee TR that Joshua Cranmer linked:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf

This TR is very clearly a work in progress and contains many
digressions and several possible sets of rules with different
implications. I will try to briefly summarize.

<snip>

Now, that rule as I’ve stated it would be really bad. Allowing a
lucky guess to resolve to absolutely anything would almost
completely block the optimizer from optimizing memory. For example,
if a local variable came into scope, and then we called a function
that returned a different pointer, we’d have to conservatively
assume that that pointer might alias the local, even if the address
of the local was never even taken, much less escaped:

  int x = 0;
  int *p = guess_address_of_x();
  *p = 15;
  printf(“%d\n”, x); // provably 0?

So the currently favored proposal adds a really important caveat:
this blessing of provenance only works when a pointer with the
correct provenance has been “exposed”. There are several ways to
expose a pointer, including I/O, but the most important is casting
it to an integer.

This is a valid point. If one wants to formally show the correctness of this kind of memory optimization this problem should be tackled.
I think n2676's 'Allocation-address nondeterminism' (p. 27) paragraph addresses this issue.
The underlying idea is that the address of an allocated object is assumed to be non-deterministically chosen, causing any guessed accesses to raise undefined behavior in at least one execution.

Again, there’s no requirement of a data dependence between the
exposure and the int-to-pointer cast. If the program casts a
pointer to an integer, and an independently-produced integer
that happens to be the same value is later cast to a pointer,
and the storage hasn’t been reallocated in the meantime, the
resulting pointer will have the right provenance for the memory
and will be valid to use. This implies that pointer-to-int casts
(and other exposures) are semantically significant events in the
program. They don’t have side effects in the normal sense, but
they must be treated by the compiler just like things that do have
side effects: e.g. unless I’m missing something in the TR,
eliminating a completely unused pointer-to-int cast may make
later code UB.

And in fact, it turns out that this is crucially important for
optimization. If the optimizer wants to allow arbitrary
replacement of integers based on conditional equality, like
in GVN, then replacement totally breaks direct data dependence,
and you can easily be left with no remaining uses of a pointer-to-int
cast when the original code would have had a data dependence. So
you cannot reason about provenance through int-to-pointer casts:
the pointer can alias any storage whose provenance has potentially
been exposed, and the optimizer must be conservative about optimizing
memory that has potentially been exposed.

+1, due to this reason the casting semantics cannot be directly used for LLVM's ptrtoint/inttoptr. 

<snip>

All of these concerns are valid.

(I'm not sure whether this is a good place to introduce this, but) we actually have semantics for pointer castings tailored to LLVM (link).
In this proposal, ptrtoint does not have an escaping side effect; ptrtoint and inttoptr are scalar operations.
inttoptr simply returns a pointer which can access any object.
Combined with the address nondeterminism that is described above, unescaped objects can be effectively left untouched from other memory accesses.

Also, the following optimizations can be explained:
- The aliasing property of 'gep inbounds p' is supported: dereferencing 'gep inbounds p, 1' must raise UB if either p or p+1 isn't in bounds of p's object (provenance)
- 'gep (inttoptr i), idx' is equivalent to 'i + idx' (same value and same level of undefinedness)
- gep and gep inbounds is a scalar operation (can be freely reordered w.r.t ptrtoint/inttoptr/lifetime/free/...)
- gep's natural properties are supported: stripping off inbounds tag, 'gep (gep (inttoptr i), i1), i2' -> 'gep (inttoptr i), i1+i2'

There are a few transformations that become hard to explain, but perhaps the biggest one is 'inttoptr(ptrtoint p)' -> p.

<snip>

I don’t find either side of this argument very convincing.

First, the compiler already has to be very conservative about
memory. If a pointer is stored into memory, we generally have
to treat it as having fully escaped: unless we can prove that the
memory isn’t loaded by other code, we have to assume that it is,
and that the loading code will do arbitrary other stuff. That
could include doing a pointer-to-int cast and sharing the pointer
back to us as an integer. Therefore, in the general case, our
ability to optimize when a pointer has escaped into memory is at
least as bad as if it had escaped via an int-to-pointer cast. If
we can nonetheless optimize, because we can reason about some of
the uses together or prove that there aren’t any other uses,
then okay, maybe we see that there’s an int<->pointer conversion.
But translating this to ptrtoint/inttoptr should be, at
worst, overly conservative; it’s not unsound, for reasons I’m
about to get into.

Second, adding casts through an integer type is always valid.
Doing so might block the optimizer, but it doesn’t break semantics.
If I have a program that does e.g *px = 15, and I change it to
do *(int*)(intptr_t)px = 15, my program has become well-defined
in strictly more situations: in any case, there must be valid
storage at px for this not to be UB, but previously px might
have had the wrong provenance, and now it never does as long as
the provenance for that storage has previously escaped.

I agree. Transforming 'p' into 'inttoptr(ptrtoint p)' should not make the program undefined, and it can be used to address the correctness issue of these kinds of problems.

If we find that we’re generating too many unnecessary casts
through integer types and it’s really blocking the optimizer too
much, then we should consider the best solutions to those problems
as they arise. It may be the case that we need a better solution
for type conversions introduced through manual memcpy-like code
so that we maintain the original provenance instead of adding
explicit int<->pointer conversions that launder provenance.
I don’t know that byte types are the best solution to that, but
we can consider them.

But this whole conversation about byte types seems to be coming
at it the wrong way around. We need to be centering the first
set of problems around int<->pointer casts. 

John.

As the first step, I made a patch to LangRef for differentiation of int and ptr: https://reviews.llvm.org/D104013 . It is currently under review.
Maybe we can pursue two-track:
(1) gradually disabling the 'inttoptr(ptrtoint p) -> p' folding.
- For example, to fix https://bugs.llvm.org/show_bug.cgi?id=34548, disabling it when p's underlying object is alloca would be enough (I guess).
(2) inspecting how byte types can help revive optimizations.
- For example, loop idiom recognition on memcpy-like loops should be removed otherwise. Its performance impact should be checked.


Thanks,
Juneyoung

Juneyoung Lee via llvm-dev

unread,
Jun 21, 2021, 2:40:02 AM6/21/21
to Andrew Trick, llvm-dev, Ralf Jung, Hal Finkel, James Courtier-Dutton via cfe-dev
Hello Andrew,

> We can't retroactively weaken inttoptr semantics relative to what frontends rely on--it should continue to have the semantics of a pointer cast in C, which need to be handled more conservatively in LLVM.

I think it depends. If necessary, ptr/int cast semantics needs to be weakened, as done in signed overflow.
Signed overflow is undefined behavior in C, but it yields poison in LLVM because the C semantics prohibits free code motion of add nsw.
I believe its poison definition is successful and can still support useful optimizations.

Thanks,
Juneyoung

Ralf Jung via llvm-dev

unread,
Jun 21, 2021, 4:08:15 AM6/21/21
to Juneyoung Lee, John McCall, llvm-dev, Hal Finkel
Hi,

> Now, that rule as I’ve stated it would be really bad. Allowing a
> lucky guess to resolve to absolutely anything would almost
> completely block the optimizer from optimizing memory. For example,
> if a local variable came into scope, and then we called a function
> that returned a different pointer, we’d have to conservatively
> assume that that pointer might alias the local, even if the address
> of the local was never even taken, much less escaped:
>
> |int x = 0; int *p = guess_address_of_x(); *p = 15; printf(“%d\n”, x); //
> provably 0? |
>
> So the currently favored proposal adds a really important caveat:
> this blessing of provenance only works when a pointer with the
> correct provenance has been “exposed”. There are several ways to
> expose a pointer, including I/O, but the most important is casting
> it to an integer.
>
> This is a valid point. If one wants to formally show the correctness of this
> kind of memory optimization this problem should be tackled.
> I think n2676's 'Allocation-address nondeterminism' (p. 27) paragraph addresses
> this issue.
> The underlying idea is that the address of an allocated object is assumed to be
> non-deterministically chosen, causing any guessed accesses to raise undefined
> behavior in at least one execution.

I am confused -- that optimization is allowed without any reasoning about
non-determinism, because the address of x has never been exposed, right?

There are some optimizations that still require reasoning about non-determinism,
namely cases where the address *has* been exposed. This was recently discussed
on <https://lists.cam.ac.uk/mailman/listinfo/cl-c-memory-object-model> and at
least one WG14 member expressed the opinion that in this case, the optimization
is actually not allowed.
FWIW, I personally prefer a model that always uses non-determinism to justify
such optimizations, avoiding the need for any kind of "exposed" flag (such as
the paper Juneyoung referenced -- unsurprisingly so, since I am a coauthor of
that paper ;).

However, I should also add that the trade-offs in language design are different
for a surface language such as C and for an optimized IR such as LLVM's.

However, note that GVN integer replacement is a problem even without the
"exposed" flag, as demonstrated by the long-standing issues
https://bugs.llvm.org/show_bug.cgi?id=34548 and
https://bugs.llvm.org/show_bug.cgi?id=35229.

Kind regards,
Ralf

Chris F via llvm-dev

unread,
Jun 21, 2021, 12:56:17 PM6/21/21
to llvm...@lists.llvm.org

Sent from my iPhone

John McCall via llvm-dev

unread,
Jun 21, 2021, 3:07:58 PM6/21/21
to Juneyoung Lee, llvm-dev, Ralf Jung

Ah, that’s an interesting idea. I must have missed that section; I’m
afraid I only skimmed N2676 looking for the key points, because it’s
quite a long document.

To be clear, under the PVNI-ae family of semantics, that rule would
not be necessary to optimize x in my example because int-to-pointer
casts are not allowed to recreate provenance for x because it has
not been exposed. That rule does theoretically allow optimization of
some related examples where the address of x has been exposed,
because it lets the compiler try to reason about what happens after
exposure; it’s no longer true that exposure implies guessing are okay.

Unfortunately, though, I this non-determinism still doesn’t allow LLVM
to be anywhere near as naive about pointer-to-int casts as it is today.
The rule is intended to allow the compiler to start doing use-analysis
of exposures; let’s assume that this analysis doesn’t see any
un-analyzable uses, since of course it would need to conservatively
treat them as escapes. But if we can optimize uses of integers as if
they didn’t carry pointer data — say, in a function that takes integer
parameters — and then we can apply those optimized uses to integers
that concretely result from pointer-to-int casts — say, by inlining
that function into one of its callers — can’t we end up with a use
pattern for one or more of those pointer-to-int casts that no longer
reflects the fact that it’s been exposed? It seems to me that either
(1) we cannot do those optimizations on opaque integers or (2) we
need to record that we did them in a way that, if it turns out that
they were created by a pointer-to-int casts, forces other code to
treat that pointer as opaquely exposed.

<snip>

Everything I’ve been talking about so far is a C-level concept:
an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
in LLVM IR. But I think people have an expectation of what these
things mean in LLVM IR, and I haven’t seen it written out explicitly,
so let’s do that now.

The first assumption here is that int-to-pointer and pointer-to-int
casts in C will translate to inttoptr and ptrtoint operations
in IR. Now, this is already problematic, because those operations
do not currently have the semantics they need to have to make the
proposed optimization model sound. In particular:

-



ptrtoint does not have side-effects and can be dead-stripped
when unused, which as discussed above is not okay.

-



ptrtoint on a constant is folded to a constant expression,
not an instruction, which is therefore no longer anchored in the
code and does not reliably record that the global may have escaped.
(Unused constant expressions do not really exist, and we absolutely
cannot allow them to affect the semantics of the IR.)

Of course, this is only significant for globals that don’t already
have to be treated conservatively because they might have other
uses. That is, it only really matters for globals with, say,
internal or private linkage.

-



inttoptr can be reordered with other instructions, which is
not allowed because different points in a function may have
different sets of storage with escaped provenance.

-



inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
aspects of removing the inttoptr, this also potentially
introduces UB because the original inttoptr “launders” the
provenance of the pointer to the current provenance of the
storage, whereas the original pointer may have stale provenance.

All of these concerns are valid.

(I'm not sure whether this is a good place to introduce this, but) we
actually have semantics for pointer castings tailored to LLVM (link


In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
and inttoptr are scalar operations.
inttoptr simply returns a pointer which can access any object.

Skimming your paper, I can see how this works except that I don’t
see any way not to treat ptrtoint as an escape. And really I think
you’re already partially acknowledging that, because that’s the only
real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
p. If those are really just scalar operations that don’t expose
p in ways that might be disconnected from the uses of the inttoptr
then that reduction ought to be safe.

John.

Nicolai Hähnle via llvm-dev

unread,
Jun 22, 2021, 3:14:32 AM6/22/21
to John McCall, llvm-dev, Ralf Jung
I haven't read the paper yet, but I think the argument is: Assume p was obtained via an out-of-bounds GEP that happens to point at valid memory that belongs to a different object than p's provenance. In that case, dereferencing inttoptr(ptrtoint(p)) is defined behavior ("inttoptr simply returns a pointer which can access any object") while dereferencing p is UB.

Cheers,
Nicolai

 

John.

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


--

Ralf Jung via llvm-dev

unread,
Jun 22, 2021, 5:59:15 AM6/22/21
to John McCall, Juneyoung Lee, llvm-dev
Hi John,

> Unfortunately, though, I this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.

Definitely. There are limits to how naive one can be; beyond those limits,
miscompilations lurk. <https://www.ralfj.de/blog/2020/12/14/provenance.html>
explains this by showing such a miscompilation arising from three naive
optimizations being chained together.

> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int casts, forces other code to
> treat that pointer as opaquely exposed.

There is a third option: don't optimize away ptr-int-ptr roundtrips. Then you
can still do all the same optimizations on integers that LLVM does today,
completely naively -- the integer world remains "sane". Only the pointer world
has to be "strange".
(You can also not do things like GVN replacement of *pointer-typed* values, but
for values of integer types this remains unproblematic.)

I don't think it makes sense for LLVM to adopt an explicit "exposed" flag in its
semantics. Reasoning based on non-determinism works fine, and has the advantage
of keeping ptr-to-int casts a pure, side-effect-free operation. This is the
model we explored in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>, and
we were able to show quite a few of LLVM's standard optimizations correct
formally. Some changes are still needed as you noted, but those changes will be
required anyway even if LLVM were to adopt PNVI-ae:
- No removal of ptr-int-ptr roundtrips.
(https://bugs.llvm.org/show_bug.cgi?id=34548)
- No GVN replacement of pointer-typed values.
(https://bugs.llvm.org/show_bug.cgi?id=35229)

> (I'm not sure whether this is a good place to introduce this, but) we
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
>

> Skimming your paper, I can see how this works /except/ that I don’t


> see any way not to treat |ptrtoint| as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that |inttoptr(ptrtoint p)| can’t be reduced to

> |p|. If those are really just scalar operations that don’t expose


> |p| in ways that might be disconnected from the uses of the |inttoptr|
> then that reduction ought to be safe.

They are indeed just scalar operations, but the reduction is not safe.
The reason is that pointer-typed variables have values of the form "(addr,
provenance)". There is essentially an 'invisible' component in each pointer
value that tracks some additional information -- the "provenance" of the
pointer. Casting a ptr to an int removes that provenance. Casting an int to a
ptr picks a "default" provenance. So the overall effect of inttoptr(ptrtoint p)
is to turn "(addr, provenance)" into "(addr, DEFAULT_PROVENANCE)".
Clearly that is *not* a NOP, and hence performing the reduction actually changes
the result of this operation. Before the reduction, the resulting pointer had
DEFAULT_PROVENANCE; after the reduction, it maintains the original provenance of
"p". This can introduce UB into previously UB-free programs.

Kind regards,
Ralf

Juneyoung Lee via llvm-dev

unread,
Jun 22, 2021, 12:03:00 PM6/22/21
to John McCall, llvm-dev, Ralf Jung
IIUC the corresponding example would be:
int p;
intptr_t i = (intptr_t)&p;
f(i);
// Initially f(i) exposes p by storing i into somewhere, but optimization changed f to not store i anymore (e.g. i was replaced with a constant)

In this case, p was already exposed by '(intptr_t)p' I guess; the body of f(i) won't affect the exposedness of p.
It again relies on nondeterminism of the address of an object.

For example:
p = call malloc(4) ; p's address is assumed to be nondeterministically chosen
i = ptrtoint p
; i is never used
ret void

Since i is never used, this program cannot safely access p via address guessing + inttoptr.
This allows (more precise) escape analysis to conclude that malloc with one unused ptrtoint is never escaped.

Juneyoung

Hal Finkel via llvm-dev

unread,
Jun 22, 2021, 12:07:36 PM6/22/21
to Ralf Jung, John McCall, Juneyoung Lee, llvm-dev


Do we have any idea how large of an effect this might be? If we disable
GVN for all pointer-typed values? And is it really all GVN, or just
cases where you unify the equivalence classes based on some dominating
comparison operation? We should be careful here, perhaps, because LLVM's
GVN does a lot of plain-old CSE, store-to-load forwarding, etc. and we
should say specifically what would need to be disabled and in what contexts.

 -Hal

Ralf Jung via llvm-dev

unread,
Jun 22, 2021, 12:26:11 PM6/22/21
to Hal Finkel, John McCall, Juneyoung Lee, llvm-dev
Hi Hal,

What I mean is specifically GVN "exploiting" equality tests (icmp) of pointer
type. That doesn't work (https://bugs.llvm.org/show_bug.cgi?id=35229
"weaponizes" it to create a miscompilation, albeit in artificial code). I think
this optimization it is pretty much unsalvageable, except if you somehow know
that the two pointers have *identical* provenance -- the moment pointers have
any kind of provenance, just because icmp says they are equal does not mean we
can treat them as equivalent for GVN, since icmp cannot know if the provenance
of the two pointers is the same or not.

I assume that is what you mean by "unify the equivalence classes based on some
dominating comparison operation" -- at least that sounds about right. :)
(I am probably using some incorrect/sloppy terminology here, and I apologize for
that. My expertise is more in the area of formal language semantics than
compiler construction.)

GVN can still treat e.g. "GEPi p 4" as equivalent with another "GEPi p 4" --
pure operations with identical inputs produce identical outputs, even at pointer
type, and GEP/GEPi remain pure under this proposal. The issue is specifically
the interaction of GVN with icmp, not GVN alone.

Kind regards,
Ralf

--
Website: https://people.mpi-sws.org/~jung/

John McCall via llvm-dev

unread,
Jun 22, 2021, 1:10:50 PM6/22/21
to Juneyoung Lee, llvm-dev, Ralf Jung

some related examples where the address of x *has* been exposed,

Right, formally I think that’s true.

All of these concerns are valid.

(I'm not sure whether this is a good place to introduce this, but) we
actually have semantics for pointer castings tailored to LLVM (link
<https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
and inttoptr are scalar operations.
inttoptr simply returns a pointer which can access any object.

Skimming your paper, I can see how this works *except* that I don’t


see any way not to treat ptrtoint as an escape. And really I think
you’re already partially acknowledging that, because that’s the only
real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
p. If those are really just scalar operations that don’t expose
p in ways that might be disconnected from the uses of the inttoptr
then that reduction ought to be safe.

John.

It again relies on nondeterminism of the address of an object.

For example:
p = call malloc(4) ; p's address is assumed to be nondeterministically
chosen
i = ptrtoint p
; i is never used
ret void

Since i is never used, this program cannot safely access p via address
guessing + inttoptr.
This allows (more precise) escape analysis to conclude that malloc with one
unused ptrtoint is never escaped.

I think this is too local of an analysis.

Your proposal generally relies on certain optimizations not applying
to pointers because they mess up provenance as represented in
transitive use-dependencies. If those optimizations can be applied
to integers, you can lose use-dependencies in exactly the same way as
you can with pointers. Not doing the inttoptr(ptrtoint p)) -> p
reduction doesn’t matter at all, because in either case that value
has a use-dependency on p, whereas the actual problem is that there
is no longer a use-dependency on some other value.

For example, you have compellingly argued that it’s problematic to
do the reduction a == b ? a : b to b for pointer types. Suppose
I instead do this optimization on integers where a = ptrtoint A.
The result is now simply b. If I inttoptr that result and access
the memory, there will be no record that that access may validly
be to A. It does not help that the access may be represented
as inttoptr (ptrtoint B) for some B rather than just directly
to B, because there is no use-dependence on A. All there is
is an apparently unrelated and unused ptrtoint A.

Obviously we can avoid doing this optimization locally when we
see that the inputs result from ptrtoint, but that’s no more
than best-effort: we can do this optimization in a function which
we later inline in a caller that performs all the ptrtoint and
inttoptr casts.

John.

Ralf Jung via llvm-dev

unread,
Jun 22, 2021, 1:21:35 PM6/22/21
to John McCall, Juneyoung Lee, llvm-dev
Hi,

> Your proposal generally relies on certain optimizations not applying
> to pointers because they mess up provenance as represented in
> transitive use-dependencies. If those optimizations can be applied
> to integers, you can lose use-dependencies in exactly the same way as
> you can with pointers. Not doing the inttoptr(ptrtoint p)) -> p
> reduction doesn’t matter at all, because in either case that value
> has a use-dependency on p, whereas the actual problem is that there
> is no longer a use-dependency on some other value.

Note that "provenance" as we use it in this discussion is an *explicit
operational artifact* -- it exists as a concrete piece of state in the Abstract
Machine. That is very different from something that might just be used
internally in some kind of analysis.

There is no problem with "resetting" that provenance on a "inttoptr", and
basically forgetting about where the int comes from. Note that this is a
statement about an operation in the Abstract Machine, not merely a statement
about some analysis: this is not "forgetting" as in "safely overapproximating
the real set of possible behaviors", it is "forgetting" by *forcing* the
provenance to be some kind of wildcard/default provenance. All analyses then
have to correctly account for that.

> For example, you have compellingly argued that it’s problematic to
> do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
> I instead do this optimization on integers where |a = ptrtoint A|.

> The result is now simply |b|. If I |inttoptr| that result and access


> the memory, there will be no record that that access may validly

> be to |A|. It does not help that the access may be represented


> as |inttoptr (ptrtoint B)| for some |B| rather than just directly

> to |B|, because there is no use-dependence on |A|. All there is


> is an apparently unrelated and unused |ptrtoint A|.

So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B" being
replaced by "ptrtoint B"? I don't see any problem with that. Do you have a
concrete example?

> Obviously we can avoid doing this optimization locally when we

> see that the inputs result from |ptrtoint|, but that’s no more


> than best-effort: we can do this optimization in a function which
> we later inline in a caller that performs all the |ptrtoint| and
> |inttoptr| casts.

I agree "avoiding something locally" is way too fragile to be considered a
solution. I do not think it is needed, though.

Kind regards,
Ralf

John McCall via llvm-dev

unread,
Jun 22, 2021, 2:52:37 PM6/22/21
to Ralf Jung, llvm-dev

On 22 Jun 2021, at 13:21, Ralf Jung wrote:

Your proposal generally relies on certain optimizations not applying
to pointers because they mess up provenance as represented in
transitive use-dependencies. If those optimizations can be applied
to integers, you can lose use-dependencies in exactly the same way as
you can with pointers. Not doing the inttoptr(ptrtoint p)) -> p
reduction doesn’t matter at all, because in either case that value
has a use-dependency on p, whereas the actual problem is that there
is no longer a use-dependency on some other value.

Note that "provenance" as we use it in this discussion is an *explicit operational artifact* -- it exists as a concrete piece of state in the Abstract Machine. That is very different from something that might just be used internally in some kind of analysis.

There is no problem with "resetting" that provenance on a "inttoptr", and basically forgetting about where the int comes from. Note that this is a statement about an operation in the Abstract Machine, not merely a statement about some analysis: this is not "forgetting" as in "safely overapproximating the real set of possible behaviors", it is "forgetting" by *forcing* the provenance to be some kind of wildcard/default provenance. All analyses then have to correctly account for that.

But it’s not a truly wildcard provenance. At the very least, it’s
restricted to the set of provenances that have been exposed, and
my understanding is that Juneyoung’s non-determinism rule attempts
to readmit a use-dependency analysis and turn ptrtoint back into
a simple scalar operation.

For example, you have compellingly argued that it’s problematic to
do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
I instead do this optimization on integers where |a = ptrtoint A|.
The result is now simply |b|. If I |inttoptr| that result and access
the memory, there will be no record that that access may validly
be to |A|. It does not help that the access may be represented
as |inttoptr (ptrtoint B)| for some |B| rather than just directly
to |B|, because there is no use-dependence on |A|. All there is
is an apparently unrelated and unused |ptrtoint A|.

So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B" being replaced by "ptrtoint B"? I don't see any problem with that. Do you have a concrete example?

I think you can take any example where pointer-type restrictions
would be necessary to protect against miscompilation and turn it
into an example here by just inserting inttoptr and ptrtoint
appropriately. Quick example:

int A = 0x1;
int B = 0x2;
long a = (long) (A+1);
long b = (long) B;
long result = (a == b ? a : b);
if (a == b)
  *(((int*) result) - 1) = 0x4;
else
  *((int*) result) = 0x8;
printf(“%d %d\n”, A, B);

I submit that this program has unspecified but not undefined behavior,
with printing “1 8” and “4 2” being the only two valid outcomes.
But I think an optimizer which changes the fifth line to
long result = b; without leaving any trace behind could easily
compile this to print “1 2” because there would be nothing to
prevent the initialization of A from being forwarded to the
final load.

You can prevent this by noting that the provenance of A has been
xposed and allowing the inttoptr of result to alias A, but
that seems inconsistent with treating ptrtoint as a simple scalar
operation and allowing a use-analysis of a ptrtoint to restrict
which inttoptr casts are allowed to recreate provenance for the
ptrtoint operand.

If you want to keep treating ptrtoint as a scalar operation and
doing use-analyses on it, I think the most palatable option is to
recognize whenever you’re cutting a use-dependency and
conservatively record in IR that the original value has now been
exposed. So if you start with this:

  %eq = icmp eq i32 %a, %b
  %result = select i1 %eq, i32 %a, i32 %b

You have to transform it like this:

  %result = %b
  call void @llvm.expose.i32(i32 %a)

You should be able to remove these exposure events in a lot of
situations, but conservatively they’ll have to be treated as
escapes.

Most optimizations never cut use-dependencies on opaque values
like this and so won’t be affected.

John.

Ralf Jung via llvm-dev

unread,
Jun 22, 2021, 3:40:40 PM6/22/21
to John McCall, llvm-dev
Hi John,

> Note that "provenance" as we use it in this discussion is an *explicit
> operational artifact* -- it exists as a concrete piece of state in the
> Abstract Machine. That is very different from something that might just be
> used internally in some kind of analysis.
>
> There is no problem with "resetting" that provenance on a "inttoptr", and
> basically forgetting about where the int comes from. Note that this is a
> statement about an operation in the Abstract Machine, not merely a statement
> about some analysis: this is not "forgetting" as in "safely
> overapproximating the real set of possible behaviors", it is "forgetting" by
> *forcing* the provenance to be some kind of wildcard/default provenance. All
> analyses then have to correctly account for that.
>
> But it’s not a truly wildcard provenance. At the very least, it’s
> restricted to the set of provenances that have been exposed, and
> my understanding is that Juneyoung’s non-determinism rule attempts
> to readmit a use-dependency analysis and turn |ptrtoint| back into
> a simple scalar operation.

So, at least in the paper Juneyoung and me and a few other people wrote (the one
referenced a couple times in this thread already:
https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf), it is a wildcard provenance.
"exposed" is not a thing in the operational semantics, it is a theorem we can prove.

>
> For example, you have compellingly argued that it’s problematic to
> do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
> I instead do this optimization on integers where |a = ptrtoint A|.
> The result is now simply |b|. If I |inttoptr| that result and access
> the memory, there will be no record that that access may validly
> be to |A|. It does not help that the access may be represented
> as |inttoptr (ptrtoint B)| for some |B| rather than just directly
> to |B|, because there is no use-dependence on |A|. All there is
> is an apparently unrelated and unused |ptrtoint A|.
>
> So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B" being
> replaced by "ptrtoint B"? I don't see any problem with that. Do you have a
> concrete example?
>
> I think you can take any example where pointer-type restrictions
> would be necessary to protect against miscompilation and turn it
> into an example here by just inserting |inttoptr| and |ptrtoint|
> appropriately. Quick example:
>
> |int A = 0x1; int B = 0x2; long a = (long) (A+1); long b = (long) B; long result
> = (a == b ? a : b); if (a == b) *(((int*) result) - 1) = 0x4; else *((int*)
> result) = 0x8; printf(“%d %d\n”, A, B); |

(Sorry about the bad quote, not sure what Thunderbird does here.)
I assume you mean "&A" and "&B" in lines 3 and 4?


> I submit that this program has unspecified but not undefined behavior,
> with printing “1 8” and “4 2” being the only two valid outcomes.

Okay, I can follow that.

> But I think an optimizer which changes the fifth line to
> |long result = b;| without leaving any trace behind

so then we are at

int A = 0x1;
int B = 0x2;

long a = (long) (&A+1);
long b = (long) &B;
long result = b;


if (a == b)
*(((int*) result) - 1) = 0x4;
else
*((int*) result) = 0x8;
printf(“%d %d\n”, A, B);

> could easily
> compile this to print “1 2” because there would be nothing to
> prevent the initialization of |A| from being forwarded to the
> final load.

I don't think that's right, since "a" is still used in the "if", so a bit of
information about the address is leaked and you can't assume the address was not
guessed. The furthest you can go is

int A = 0x1;
int B = 0x2;

long a = (long) (&A+1);
long b = (long) &B;
if (a == b)
*(((int*) (long) &B) - 1) = 0x4;
else
*((int*) (long) &B) = 0x8;


printf(“%d %d\n”, A, B);

No need for an explicit 'exposed', I think.

That said, things get more tricky once you want to account for C 'restrict' /
LLVM 'noalias', and then it might be necessary to have explicit 'exposed'-like
operations. I haven't seen a properly worked-out model for this, but the
candidate models I saw would need something like this. So maybe it'd not be a
bad idea to have such an operation anyway... I just don't think it should have
any effect for the kind of programs we have been discussing here so far.

Kind regards,
Ralf

>
> You can prevent this by noting that the provenance of |A| has been

> xposed and allowing the |inttoptr| of |result| to alias |A|, but


> that seems inconsistent with treating |ptrtoint| as a simple scalar
> operation and allowing a use-analysis of a |ptrtoint| to restrict
> which |inttoptr| casts are allowed to recreate provenance for the
> |ptrtoint| operand.
>
> If you want to keep treating |ptrtoint| as a scalar operation and
> doing use-analyses on it, I think the most palatable option is to
> recognize whenever you’re cutting a use-dependency and
> conservatively record in IR that the original value has now been
> exposed. So if you start with this:
>
> |%eq = icmp eq i32 %a, %b %result = select i1 %eq, i32 %a, i32 %b |
>
> You have to transform it like this:
>
> |%result = %b call void @llvm.expose.i32(i32 %a) |
>
> You should be able to remove these exposure events in a lot of
> situations, but conservatively they’ll have to be treated as
> escapes.
>
> Most optimizations never cut use-dependencies on opaque values
> like this and so won’t be affected.
>
> John.
>

--
Website: https://people.mpi-sws.org/~jung/

John McCall via llvm-dev

unread,
Jun 22, 2021, 4:52:38 PM6/22/21
to Ralf Jung, llvm-dev

On 22 Jun 2021, at 15:40, Ralf Jung wrote:

Note that "provenance" as we use it in this discussion is an *explicit
operational artifact* -- it exists as a concrete piece of state in the
Abstract Machine. That is very different from something that might just be
used internally in some kind of analysis.

There is no problem with "resetting" that provenance on a "inttoptr", and
basically forgetting about where the int comes from. Note that this is a
statement about an operation in the Abstract Machine, not merely a statement
about some analysis: this is not "forgetting" as in "safely
overapproximating the real set of possible behaviors", it is "forgetting" by
*forcing* the provenance to be some kind of wildcard/default provenance. All
analyses then have to correctly account for that.

But it’s not a truly wildcard provenance. At the very least, it’s
restricted to the set of provenances that have been exposed, and
my understanding is that Juneyoung’s non-determinism rule attempts
to readmit a use-dependency analysis and turn |ptrtoint| back into
a simple scalar operation.

So, at least in the paper Juneyoung and me and a few other people wrote (the one referenced a couple times in this thread already: https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf), it is a wildcard provenance.
"exposed" is not a thing in the operational semantics, it is a theorem we can prove.

Okay, fine, I won’t say “exposure”. :) You intend to still be able to
prove that the reconstructed provenance cannot be that of certain
memory locations based on a comprehensive use analysis of those
locations.

For example, you have compellingly argued that it’s problematic to
do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
I instead do this optimization on integers where |a = ptrtoint A|.
The result is now simply |b|. If I |inttoptr| that result and access
the memory, there will be no record that that access may validly
be to |A|. It does not help that the access may be represented
as |inttoptr (ptrtoint B)| for some |B| rather than just directly
to |B|, because there is no use-dependence on |A|. All there is
is an apparently unrelated and unused |ptrtoint A|.

So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B" being
replaced by "ptrtoint B"? I don't see any problem with that. Do you have a
concrete example?

I think you can take any example where pointer-type restrictions
would be necessary to protect against miscompilation and turn it
into an example here by just inserting |inttoptr| and |ptrtoint|
appropriately. Quick example:

|int A = 0x1; int B = 0x2; long a = (long) (A+1); long b = (long) B; long result = (a == b ? a : b); if (a == b) *(((int*) result) - 1) = 0x4; else *((int*) result) = 0x8; printf(“%d %d\n”, A, B); |

(Sorry about the bad quote, not sure what Thunderbird does here.)
I assume you mean "&A" and "&B" in lines 3 and 4?

Yes, sorry.

I submit that this program has unspecified but not undefined behavior,
with printing “1 8” and “4 2” being the only two valid outcomes.

Okay, I can follow that.

But I think an optimizer which changes the fifth line to
|long result = b;| without leaving any trace behind

so then we are at

int A = 0x1;
int B = 0x2;
long a = (long) (&A+1);
long b = (long) &B;
long result = b;
if (a == b)
*(((int*) result) - 1) = 0x4;
else
*((int*) result) = 0x8;
printf(“%d %d\n”, A, B);

could easily
compile this to print “1 2” because there would be nothing to
prevent the initialization of |A| from being forwarded to the
final load.

I don't think that's right, since "a" is still used in the "if", so a bit of information about the address is leaked and you can't assume the address was not guessed.

Okay, so you have a baseline expectation that pointer comparisons will be treated as escapes. The reason I added a second comparison in my example is just because otherwise in the a != b case the access is out of bounds. So I guess you’re suggesting that it’s impossible to construct an example that avoids that problem that wouldn’t ultimately do something uneliminatable that would count as an escape?

That said, things get more tricky once you want to account for C 'restrict' / LLVM 'noalias', and then it might be necessary to have explicit 'exposed'-like operations. I haven't seen a properly worked-out model for this, but the candidate models I saw would need something like this. So maybe it'd not be a bad idea to have such an operation anyway... I just don't think it should have any effect for the kind of programs we have been discussing here so far.

Honestly, part of why I like this approach is that I think the basic idea — recognizing when we’re doing a dependency-breaking value replacement and using a different replacement sequence — also moves us towards a solution for consume dependencies.

John.

Nicolai Hähnle via llvm-dev

unread,
Jun 23, 2021, 1:27:52 AM6/23/21
to Ralf Jung, llvm-dev
On Tue, Jun 22, 2021 at 11:59 AM Ralf Jung via llvm-dev <llvm...@lists.llvm.org> wrote:
I don't think it makes sense for LLVM to adopt an explicit "exposed" flag in its
semantics. Reasoning based on non-determinism works fine, and has the advantage
of keeping ptr-to-int casts a pure, side-effect-free operation. This is the
model we explored in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>, and
we were able to show quite a few of LLVM's standard optimizations correct
formally. Some changes are still needed as you noted, but those changes will be
required anyway even if LLVM were to adopt PNVI-ae:
- No removal of ptr-int-ptr roundtrips.
(https://bugs.llvm.org/show_bug.cgi?id=34548)
- No GVN replacement of pointer-typed values.
(https://bugs.llvm.org/show_bug.cgi?id=35229)

I've read this paper now, and it makes good sense to me as something to adopt in LLVM.

I do have one question about a point that doesn't seem sufficiently justified, though. In the semantics of the paper, store-pointer-then-load-as-integer results in poison. This seems to be the root cause for being forced to introduce a "byte" type for correctness, but it is only really justified by an optimization that eliminates a store that writes back a previously loaded value. That optimization doesn't seem all that important (but feel free to point out why it is...), while introducing a "byte" type is a massive change. On the face of it, that doesn't seem like a good trade-off to me.

Has the alternative of allowing type punning through memory at the cost of removing that optimization been studied sufficiently elsewhere?

Cheers,
Nicolai

Juneyoung Lee via llvm-dev

unread,
Jun 23, 2021, 4:01:17 AM6/23/21
to Hal Finkel, llvm-dev, Nuno Lopes, Ralf Jung
About the impact of disabling GVN for pointers - for SPEC CPU and llvm-test-suite the slowdown wasn't significant (avg. <0.1%),
but I'm concerned that they are only a small number of programs written in C/C++ on and my experiment was done on only one architecture.
Certainly for programs using many pointers disabling GVN is likely to be problematic.

If we have an operation that is similar to launder, then the optimization can be almost salvaged.
Let's call it 'q = wildcard_provenance(p)'; it means that q can be used to access an object that is at (intptr_t)p but not necessarily p's object.
(In our memory model, wildcard_provenance(p) is equivalent to 'inttoptr(ptrtoint p)').

Replacing p with 'wildcard_provenance(p)' is correct because it makes the program more defined.
For example, load p raises UB if p is freed, but load wildcard_provenance(p) is still well-defined if a new live malloc is exactly placed at (intptr_t)p.
When lowered to MachineIR, wildcard_provenance(p) is simply a register copy.

Here is the list of valid optimizations:

1. GVN
if (p == q) {
  use(p);
  use(q);
}
=>
if (p == q) {
  use(p);
  use(wildcard_provenance(p));
}

2.
dereference(p);
dereference(wildcard_provenance(p));
=>
dereference(p);
dereference(p);

3.
store v, wildcard_provenance(p);
w = load p;
=>
store v, wildcard_provenance(p);
w = v;

4.
must-alias(p, wildcard_provenance(p))? // answer: true


I don't have a performance number for this yet because the operation did not exist when designing the memory model.

Juneyoung

Juneyoung Lee via llvm-dev

unread,
Jun 23, 2021, 4:36:54 AM6/23/21
to Nicolai Hähnle, llvm-dev, Ralf Jung
The transformation is analogous to removing memcpy-like code with the same dst and src.
Such code might not be written by humans frequently, but I believe C++'s template instantiation or optimizations like inlining can expose such a case.

Juneyoung


Cheers,
Nicolai

--
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.

Juneyoung Lee via llvm-dev

unread,
Jun 23, 2021, 4:38:12 AM6/23/21
to John McCall, llvm-dev, Ralf Jung
(Slightly out of topic, but) I'm interested in knowing about what the consume dependency problem issue is; do you have any suggested thread for this? 

John.

Ralf Jung via llvm-dev

unread,
Jun 23, 2021, 8:52:20 AM6/23/21
to Juneyoung Lee, Nicolai Hähnle, llvm-dev
Hi Nicolai,

> I've read this paper now, and it makes good sense to me as something to
> adopt in LLVM.

:)

> I do have one question about a point that doesn't seem sufficiently
> justified, though. In the semantics of the paper,
> store-pointer-then-load-as-integer results in poison. This seems to be the
> root cause for being forced to introduce a "byte" type for correctness, but
> it is only really justified by an optimization that eliminates a store that
> writes back a previously loaded value. That optimization doesn't seem all
> that important (but feel free to point out why it is...), while introducing
> a "byte" type is a massive change. On the face of it, that doesn't seem like
> a good trade-off to me.
>
> Has the alternative of allowing type punning through memory at the cost of
> removing that optimization been studied sufficiently elsewhere?
>
>
> The transformation is analogous to removing memcpy-like code with the same dst
> and src.
> Such code might not be written by humans frequently, but I believe C++'s
> template instantiation or optimizations like inlining can expose such a case.

To add to what Juneyoung said:
I don't think that experiment has been made. From what I can see, the
alternative you propose leads to an internally consistent model -- one "just"
has to account for the fact that a "load i64" might do some transformation on
the data to actually obtain an integer result (namely, it might to ptrtoint).

However, I am a bit worried about what happens when we eventually add proper
support for 'restrict'/'noalias': the only models I know for that one actually
make 'ptrtoint' have side-effects on the memory state (similar to setting the
'exposed' flag in the C provenance TS). I can't (currently) demonstrate that
this is *required*, but I also don't know an alternative. So if this remains the
case, and if we say "load i64" performs a ptrtoint when needed, then that would
mean we could not do dead load elimination any more as that would remove the
ptrtoint side-effect.

There also is the somewhat conceptual concern that LLVM ought to have a type
that can loslessly hold all kinds of data that exist in LLVM. Currently, that is
not the case -- 'iN' cannot hold data with provenance.

Kind regards,
Ralf

>
> Juneyoung
>
>
> Cheers,
> Nicolai
>
> --
> Lerne, wie die Welt wirklich ist,
> aber vergiss niemals, wie sie sein sollte.
>
>
>
> --
>
> Juneyoung Lee
> Software Foundation Lab, Seoul National University

--
Website: https://people.mpi-sws.org/~jung/

Jeroen Dobbelaere via llvm-dev

unread,
Jun 23, 2021, 10:17:41 AM6/23/21
to Ralf Jung, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Ralf,

[..]
>
> To add to what Juneyoung said:
> I don't think that experiment has been made. From what I can see, the
> alternative you propose leads to an internally consistent model -- one "just"
> has to account for the fact that a "load i64" might do some transformation on
> the data to actually obtain an integer result (namely, it might to ptrtoint).
>
> However, I am a bit worried about what happens when we eventually add proper
> support for 'restrict'/'noalias': the only models I know for that one actually
> make 'ptrtoint' have side-effects on the memory state (similar to setting the
> 'exposed' flag in the C provenance TS). I can't (currently) demonstrate that

For the 'c standard', it is undefined behavior to convert a restrict pointer to
an integer and back to a pointer type.

(At least, that is my interpretation of n2573 6.7.3.1 para 3:
Note that "based" is defined only for expressions with pointer types.
)

For the full restrict patches, we do not track restrict provenance across a
ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm sometimes
introduced these pairs; not sure if this is still valid).

Greetings,

Jeroen Dobbelaere

> this is *required*, but I also don't know an alternative. So if this remains
> the
> case, and if we say "load i64" performs a ptrtoint when needed, then that
> would
> mean we could not do dead load elimination any more as that would remove the
> ptrtoint side-effect.
>
> There also is the somewhat conceptual concern that LLVM ought to have a type
> that can loslessly hold all kinds of data that exist in LLVM. Currently, that
> is
> not the case -- 'iN' cannot hold data with provenance.
>
> Kind regards,
> Ralf

Ralf Jung via llvm-dev

unread,
Jun 23, 2021, 2:55:39 PM6/23/21
to John McCall, llvm-dev
Hi John,

> So, at least in the paper Juneyoung and me and a few other people wrote (the
> one referenced a couple times in this thread already:
> https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf

> <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>), it is a wildcard


> provenance.
> "exposed" is not a thing in the operational semantics, it is a theorem we
> can prove.
>
> Okay, fine, I won’t say “exposure”. :) You intend to still be able to
> prove that the reconstructed provenance cannot be that of certain
> memory locations based on a comprehensive use analysis of those
> locations.

Not really. You prove that the *address* can never be guessed by anything else.
For pointers with the wrong address, their provenance simply does not matter.

> I don't think that's right, since "a" is still used in the "if", so a bit of
> information about the address is leaked and you can't assume the address was
> not guessed.
>
> Okay, so you have a baseline expectation that pointer comparisons will be
> treated as escapes.

Yes. This expectation is based on "otherwise the analysis is wrong and I think I
can prove that". ;) You can't just ignore control dependencies and expect
things to work out.
Does LLVM currently not treat pointer comparisons as escapes? I naively assumed
it would because how could one even justify not doing so?^^

> The reason I added a second comparison in my example is just
> because otherwise in the |a != b| case the access is out of bounds. So I guess
> you’re suggesting that it’s impossible to construct an example that avoids that
> problem that wouldn’t ultimately do something uneliminatable that would count as
> an escape?

Yes, I think the theorems we proved for that paper establish this.
Of course, there is a chance that *some other* optimization LLVM does (not
covered by our theorems) is in conflict with that story.

> That said, things get more tricky once you want to account for C 'restrict'
> / LLVM 'noalias', and then it might be necessary to have explicit
> 'exposed'-like operations. I haven't seen a properly worked-out model for
> this, but the candidate models I saw would need something like this. So
> maybe it'd not be a bad idea to have such an operation anyway... I just
> don't think it should have any effect for the kind of programs we have been
> discussing here so far.
>
> Honestly, part of why I like this approach is that I think the basic idea —
> recognizing when we’re doing a dependency-breaking value replacement and using a
> different replacement sequence — also moves us towards a solution for |consume|
> dependencies.

What I don't like about it is that it is not operational (or I did not
understand it properly; I am not quite sure what the exact semantics of that
marker you want to insert is). I think we to get to a state where we have a
precise operational semantics / Abstract Machine for LLVM IR (along the lines of
what we have in that paper -- not the same semantics, of course, but the same
basic style of defining things). Ideally this would even be implemented in an
interpreter. Then we can attack the question of "is this analysis/optimization
correct" in a principled way.
(We have such an interpreter for Rust: https://github.com/rust-lang/miri/ . It's
been tremendously useful, not just for finding bugs.)

The operational version of this that I know is basically the 'exposed' flag of
the C provenance TS. Then 'ptrtoint' has a side-effect of setting that flag, and
can (in general) not be removed, even if its result is unused. For
'restrict'/'noalias' we need something more flexible, but I think it can be
done. (I've done something similar for a Rust memory model in
http://plv.mpi-sws.org/rustbelt/stacked-borrows/ .)
If you are okay with 'ptrtoint' having a side-effect, many things become a lot
simpler.
But you seem to imagine something more targeted, something that only affects
"dependency-breaking value replacements"... I don't have a good enough intuition
for what exactly that is, to really evaluate this option. Could you give an example?

Kind regards,
Ralf

Ralf Jung via llvm-dev

unread,
Jun 23, 2021, 3:09:22 PM6/23/21
to Jeroen Dobbelaere, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Jeroen,

>> To add to what Juneyoung said:
>> I don't think that experiment has been made. From what I can see, the
>> alternative you propose leads to an internally consistent model -- one "just"
>> has to account for the fact that a "load i64" might do some transformation on
>> the data to actually obtain an integer result (namely, it might to ptrtoint).
>>
>> However, I am a bit worried about what happens when we eventually add proper
>> support for 'restrict'/'noalias': the only models I know for that one actually
>> make 'ptrtoint' have side-effects on the memory state (similar to setting the
>> 'exposed' flag in the C provenance TS). I can't (currently) demonstrate that
>
> For the 'c standard', it is undefined behavior to convert a restrict pointer to
> an integer and back to a pointer type.
>
> (At least, that is my interpretation of n2573 6.7.3.1 para 3:
> Note that "based" is defined only for expressions with pointer types.
> )
>
> For the full restrict patches, we do not track restrict provenance across a
> ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm sometimes
> introduced these pairs; not sure if this is still valid).

Interesting. I assumed that doing ptr2int, then doing whatever you want with
that value (say, AES encrypt and then decrypt it), and then turning the same
value back into a pointer, must always produce a pointer that is "at least as
usable" as the one that we started with. I would interpret the parts of the
standard that talk about integer-pointer casts that way.
(That's the problem with axiomatic standards: it is very easy to have mutually
contradicting axioms...)

FWIW, Rust's use of LLVM 'noalias' pretty much relies on this. It would be
rather disastrous for Rust if 'noalias' pointers cannot be cast to integers,
cast back (potentially in a different function), and used.


The C standard definition of 'restrict' is based on hypothetical alternative
executions of the program with different inputs. I can't even imagine any
reasonable way to interpret that unambiguously, so honestly I don't see how that
is even a starting point for a precise formal definition that one could prove
theorems about.^^
The ideas colleagues and me discussed for this more evolved around the idea of
having more than one "provenance" for an allocation (so when a pointer is passed
to a function as 'restrict' argument, it gets a fresh "ID" into its provenance),
and then ensuring that the different provenances on one allocation are used
consistently. But then when you cast a ptr to an int you basically have to mark
that particular provenance as 'exposed' (losing all 'restrict' advantages) to
have any chance of handling the case of casting the int back to a ptr. That
seems fair to me honestly, if you cast a ptr to an int you cannot reasonably
expect alias analysis to make heads or tails of what you are doing. But then
'ptrtoint' has a side-effect and cannot be removed even if the result is unused.

Kind regards,
Ralf


>
> Greetings,
>
> Jeroen Dobbelaere
>
>> this is *required*, but I also don't know an alternative. So if this remains
>> the
>> case, and if we say "load i64" performs a ptrtoint when needed, then that
>> would
>> mean we could not do dead load elimination any more as that would remove the
>> ptrtoint side-effect.
>>
>> There also is the somewhat conceptual concern that LLVM ought to have a type
>> that can loslessly hold all kinds of data that exist in LLVM. Currently, that
>> is
>> not the case -- 'iN' cannot hold data with provenance.
>>
>> Kind regards,
>> Ralf
>

--
Website: https://people.mpi-sws.org/~jung/

Ralf Jung via llvm-dev

unread,
Jun 24, 2021, 3:01:53 AM6/24/21
to Jeroen Dobbelaere, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi again Jeroen,

>> However, I am a bit worried about what happens when we eventually add proper
>> support for 'restrict'/'noalias': the only models I know for that one actually
>> make 'ptrtoint' have side-effects on the memory state (similar to setting the
>> 'exposed' flag in the C provenance TS). I can't (currently) demonstrate that
>
> For the 'c standard', it is undefined behavior to convert a restrict pointer to
> an integer and back to a pointer type.
>
> (At least, that is my interpretation of n2573 6.7.3.1 para 3:
> Note that "based" is defined only for expressions with pointer types.
> )

After sleeping over it, I think I want to push back against this interpretation
a bit more strongly. Consider a program snippet like

int *out = (int*) decrypt(encrypt( (uintptr_t)in ));

It doesn't matter what "encrypt" and "decrypt" do, as long as they are inverses
of each other.
"out" is definitely of pointer type. And by the dependency-based definition of
the standard, it is the case that modifying "in" to point elsewhere would also
make "out" point elsewhere. Thus "out" is 'based on' "in". And hence it is okay
to use "out" to access the object "in" points to, even in the presence of
'restrict'.

Kind regards,
Ralf

>
> For the full restrict patches, we do not track restrict provenance across a
> ptr2int, except for the 'int2ptr(ptr2int %P)' (which we do, as llvm sometimes
> introduced these pairs; not sure if this is still valid).
>
> Greetings,
>
> Jeroen Dobbelaere
>
>> this is *required*, but I also don't know an alternative. So if this remains
>> the
>> case, and if we say "load i64" performs a ptrtoint when needed, then that
>> would
>> mean we could not do dead load elimination any more as that would remove the
>> ptrtoint side-effect.
>>
>> There also is the somewhat conceptual concern that LLVM ought to have a type
>> that can loslessly hold all kinds of data that exist in LLVM. Currently, that
>> is
>> not the case -- 'iN' cannot hold data with provenance.
>>
>> Kind regards,
>> Ralf
>

--
Website: https://people.mpi-sws.org/~jung/

Jeroen Dobbelaere via llvm-dev

unread,
Jun 24, 2021, 3:48:05 AM6/24/21
to Ralf Jung, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Ralf,

My interpretation (well not just mine, we did have discussions about this in our group)
wrt to restrict handling, is that the use of decrypt/encrypt
triggers undefined behavior. Aka, we are not forced to try to track the
restrict dependency for this case. That is important if you want to optimize
restrict annotated accesses vs not-annotated accesses.

At the time that I came up with the implementation, that was also a convenient fallback
to avoid some of the pitfalls. It made thinking about the solution 'easier'.
For our customers, getting the pointer based use cases working also had the highest priority.

Now that we are going over the different pieces of the implementation and see how we can use
them in a broader context, the situation is different: instead of just tracking
the 'restrict/noalias' provenance, we now want to use that part of the infrastructure to
track provenance in general. Because of that, it also makes sense to reconsider what 'policy'
we want to use. In that context, mapping a 'int2ptr' to a 'add_provenance(int2ptr(%Decrypt), null)'
indicating that it can point to anything makes sense, but is still orthogonal to the infrastructure.

For this particular example, it would also be nice if we could somehow indicate that the
'decrypt(encrypt(%P))' can only depend on %P. But that is another discussion.

Greetings,

Jeroen
> Website: https://urldefense.com/v3/__https://people.mpi-
> sws.org/*jung/__;fg!!A4F2R9G_pg!OxsbBsUqT_ORztvmiL8KMQVNFdMPVYluQbPvIfVWl8KHjQ
> dIXhSF65d6sByCus-4fqepGR7h$

Ralf Jung via llvm-dev

unread,
Jun 24, 2021, 4:26:18 AM6/24/21
to Jeroen Dobbelaere, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Jeroen,

> My interpretation (well not just mine, we did have discussions about this in our group)
> wrt to restrict handling, is that the use of decrypt/encrypt
> triggers undefined behavior.

Yes, that is exactly what I am pushing back against. :) I cannot see a reading
of the standard where this is UB. I also don't think it is the intention of the
standard to make this UB. Note that the line I showed could be very far away
from the 'restrict' annotation. Basically if this is UB then a 'restrict'
pointer cannot be passed to other functions unless we know exactly that they do
not do ptr-to-int casts.

> Now that we are going over the different pieces of the implementation and see how we can use
> them in a broader context, the situation is different: instead of just tracking
> the 'restrict/noalias' provenance, we now want to use that part of the infrastructure to
> track provenance in general. Because of that, it also makes sense to reconsider what 'policy'
> we want to use. In that context, mapping a 'int2ptr' to a 'add_provenance(int2ptr(%Decrypt), null)'
> indicating that it can point to anything makes sense, but is still orthogonal to the infrastructure.

That is not sufficient though. You also need to know that the provenance of the
'restrict'ed pointer can now be acquired by other pointers created literally
anywhere via int2ptr. *That* is what makes this so tricky, I think.

int foo(int *restrict x) {
*x = 0;
unk1();
assert(*x == 0); // can be optimized to 'true'
unk2((uintptr_t)x);
assert(*x == 0); // can *not* be optimized to 'true'
}

> For this particular example, it would also be nice if we could somehow indicate that the
> 'decrypt(encrypt(%P))' can only depend on %P. But that is another discussion.

It would be nice if one could express this in the surface language (C/Rust), but
I don't think we should allow LLVM to infer this -- that would basically require
tracking provenance through integers, which is not a good idea.
Put differently: as the various examples in this thread show, integers can
easily acquire "provenance" of other values simply by comparing them for
equality -- so in a sense, after "x == y" evaluates to true, now 'x' also has
the "provenance" of 'y'. I don't think we want obscure effects like this in the
semantics of the Abstract Machine. (I am not even convinced this can be done
consistently.)
So then what we are left with are those transformations that are correct without
extra support from the abstract machine. And since these dependencies can
entirely disappear from the source code through optimizations like GVN replacing
'x' by 'y', there are strong limits to what can be done here.

Kind regards,
Ralf

--
Website: https://people.mpi-sws.org/~jung/

Jeroen Dobbelaere via llvm-dev

unread,
Jun 24, 2021, 6:17:03 AM6/24/21
to Ralf Jung, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Ralf,

>
> > My interpretation (well not just mine, we did have discussions about this in
> our group)
> > wrt to restrict handling, is that the use of decrypt/encrypt
> > triggers undefined behavior.
>
> Yes, that is exactly what I am pushing back against. :) I cannot see a
> reading
> of the standard where this is UB. I also don't think it is the intention of
> the
> standard to make this UB. Note that the line I showed could be very far away
> from the 'restrict' annotation. Basically if this is UB then a 'restrict'
> pointer cannot be passed to other functions unless we know exactly that they
> do
> not do ptr-to-int casts.

Sure, this might be a liberal reading of that sentence wrt to restrict.
And that is how it is done today in the full restrict patches, but of course,
that does not mean that this is where we need to settle on when including the
functionality. It is good to have the reviews that steer us to a solution that
is more broadly applicable.

>
> > Now that we are going over the different pieces of the implementation and
> see how we can use
> > them in a broader context, the situation is different: instead of just
> tracking
> > the 'restrict/noalias' provenance, we now want to use that part of the
> infrastructure to
> > track provenance in general. Because of that, it also makes sense to
> reconsider what 'policy'
> > we want to use. In that context, mapping a 'int2ptr' to a
> 'add_provenance(int2ptr(%Decrypt), null)'
> > indicating that it can point to anything makes sense, but is still
> orthogonal to the infrastructure.
>
> That is not sufficient though. You also need to know that the provenance of
> the
> 'restrict'ed pointer can now be acquired by other pointers created literally
> anywhere via int2ptr. *That* is what makes this so tricky, I think.
>
> int foo(int *restrict x) {
> *x = 0;
> unk1();
> assert(*x == 0); // can be optimized to 'true'
> unk2((uintptr_t)x);
> assert(*x == 0); // can *not* be optimized to 'true'
> }

Also for restrict, escape analysis must be done. So also this case can be handled.

Greetings,

Jeroen Dobbelaere

Ralf Jung via llvm-dev

unread,
Jun 25, 2021, 3:41:18 AM6/25/21
to Jeroen Dobbelaere, Juneyoung Lee, Nicolai Hähnle, llvm...@lists.llvm.org
Hi Jeroen,

>>> My interpretation (well not just mine, we did have discussions about this in
>> our group)
>>> wrt to restrict handling, is that the use of decrypt/encrypt
>>> triggers undefined behavior.
>>
>> Yes, that is exactly what I am pushing back against. :) I cannot see a
>> reading
>> of the standard where this is UB. I also don't think it is the intention of
>> the
>> standard to make this UB. Note that the line I showed could be very far away
>> from the 'restrict' annotation. Basically if this is UB then a 'restrict'
>> pointer cannot be passed to other functions unless we know exactly that they
>> do
>> not do ptr-to-int casts.
>
> Sure, this might be a liberal reading of that sentence wrt to restrict.
> And that is how it is done today in the full restrict patches, but of course,
> that does not mean that this is where we need to settle on when including the
> functionality. It is good to have the reviews that steer us to a solution that
> is more broadly applicable.

Fair enough. The standard is certainly not as unambiguous as one would hope.

Having suffered from an endless stream of 'noalias' bugs on the Rust side, I am
very excited that this part of LLVM is being overhauled. :)
I was hoping at some point to delve into those restrict patches and try to
understand them from a PL/semantics perspective, but so far I haven't had the
time -- and it's also a large patchset, much of which naturally is about the
implementation (which I can't really follow) and not about the high-level
description of the LLVM IR spec that makes the new analyses correct.
When/if I find some time -- what would be a good starting point to try to
understand the concepts of those patches without having to understand the C++
details?

>>> Now that we are going over the different pieces of the implementation and
>> see how we can use
>>> them in a broader context, the situation is different: instead of just
>> tracking
>>> the 'restrict/noalias' provenance, we now want to use that part of the
>> infrastructure to
>>> track provenance in general. Because of that, it also makes sense to
>> reconsider what 'policy'
>>> we want to use. In that context, mapping a 'int2ptr' to a
>> 'add_provenance(int2ptr(%Decrypt), null)'
>>> indicating that it can point to anything makes sense, but is still
>> orthogonal to the infrastructure.
>>
>> That is not sufficient though. You also need to know that the provenance of
>> the
>> 'restrict'ed pointer can now be acquired by other pointers created literally
>> anywhere via int2ptr. *That* is what makes this so tricky, I think.
>>
>> int foo(int *restrict x) {
>> *x = 0;
>> unk1();
>> assert(*x == 0); // can be optimized to 'true'
>> unk2((uintptr_t)x);
>> assert(*x == 0); // can *not* be optimized to 'true'
>> }
>
> Also for restrict, escape analysis must be done. So also this case can be handled.

Sure, smarter analyses can handle the easy cases, but I was asking about what
part of the spec of these operations forces the analysis to work like that.
Defeating the analysis is not that hard, so here's another example:

static int foo(int *restrict x, uintptr_t y) {


*x = 0;
unk1();
assert(*x == 0); // can be optimized to 'true'

uintptr_t addr = (uintptr_t)x;
if (addr == y)
unk2(addr);


assert(*x == 0); // can *not* be optimized to 'true'
}

Now we do GVN integer replacement:

static int foo(int *restrict x, uintptr_t y) {


*x = 0;
unk1();
assert(*x == 0); // can be optimized to 'true'

uintptr_t addr = (uintptr_t)x;
if (addr == y)
unk2(y);


assert(*x == 0); // can *not* be optimized to 'true'
}

Now let us assume there is exactly one call site of this function (and foo is
static so we know it can't be called from elsewhere, or maybe we are doing LTO),
which looks like

foo(ptr, (uintptr_t)ptr);

This means we know that the "if" in "foo" will always evaluate to true, so we have

static int foo(int *restrict x, uintptr_t y) {


*x = 0;
unk1();
assert(*x == 0); // can be optimized to 'true'

uintptr_t addr = (uintptr_t)x;
unk2(y);


assert(*x == 0); // can *not* be optimized to 'true'
}

Now we can (seemingly) optimize away the "addr" variable entirely -- but at that
point, there is no clue left for escape analysis to know that "unk2" might
legally mutate "x".

That's why I am saying that with 'restrict', we have to treat ptr-to-int casts
as side-effecting, and cannot optimize them away even if their result is unused.
They *always* have an "escape" effect, no matter what happens with their result.

Kind regards,
Ralf

--
Website: https://people.mpi-sws.org/~jung/

It is loading more messages.
0 new messages