[llvm-dev] [RFC] Adding CPS call support

Kavon Farvardin via llvm-dev

Apr 17, 2017, 11:30:27 AM
to llvm-dev
Summary
=======

There is a need for dedicated continuation-passing style (CPS) calls in LLVM to
support functional languages. Herein I describe the problem and propose a
solution. Feedback and/or tips are greatly appreciated, as our goal is to
implement these changes so they can be merged into LLVM trunk.


Problem
=======

Implementations of functional languages like Haskell and ML (e.g., GHC and
Manticore) use a continuation-passing style (CPS) transformation in order to
manage the call stack explicitly. This is done prior to generating LLVM IR, so
the implicit call stack within LLVM is not used for call and return.

When making a non-tail call in CPS, we initialize a stack frame for the return
through our own stack pointer, and then pass that stack pointer to the callee
when we jump to it. It is here that we run into a problem in LLVM.

Consider the following CPS call to @bar and how it will return:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

define void @foo (i8** %sp, ...) {
someBlk:
; ...
; finish stack frame by writing return address
%retAddr = blockaddress(@foo, %retpt)
store i8* %retAddr, i8** %sp
; jump to @bar
tail call void @bar(i8** %sp, ...)

retpt: ; <- how can @bar "call" %retpt?
%sp2 = ???
%val = ???
; ...

}

define void @bar (i8** %sp, ...) {
; perform a return
%retAddr0 = load i8*, i8** %sp
%retAddr1 = bitcast i8* %retAddr0 to void (i8**, i64)*
%val = bitcast i64 1 to i64
; jump back to %retpt in @foo, passing %sp and %val
tail call void %retAddr1(i8** %sp, i64 %val)
}

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

There is currently no way to jump back to %retpt from another function, as block
addresses have restricted usage in LLVM [1]. Our main difficulty is that we
cannot jump to a block address without knowing its calling convention, i.e., the
particular machine registers (or memory locations) that the block expects
incoming values to be passed in.

The workaround we have been using in GHC for LLVM is to break apart every
function, placing the code for the continuation of each call into a new
function. We do this only so that we can store a function pointer instead of a
block address to our stack. This particularly gross transformation inhibits
optimizations in both GHC and LLVM, and we would like to remove the need for it.
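To make the workaround concrete, here is a hedged sketch of what splitting does to the @foo from the example above (the name @foo_retpt is hypothetical). The continuation block is hoisted into its own top-level function, so a function pointer can be stored to the stack instead of a block address:

```llvm
; Sketch only: the splitting workaround applied to @foo.
; %retpt becomes the top-level function @foo_retpt, whose address
; (an ordinary function pointer) is what gets written to our stack.
define void @foo(i8** %sp, ...) {
someBlk:
  ; ...
  ; store a function pointer, not a blockaddress
  store i8* bitcast (void (i8**, i64)* @foo_retpt to i8*), i8** %sp
  tail call void @bar(i8** %sp, ...)
  ret void
}

define void @foo_retpt(i8** %sp2, i64 %val) {
  ; body of the original %retpt block goes here
  ; ...
  ret void
}
```

Every live value must be threaded through @foo_retpt's arguments (or the stack frame), which is why the transformation obscures control flow from both GHC and LLVM.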


Proposal
========

I believe the lowest-impact method of fixing this problem with LLVM is the
following:

First, we add a special 'cps' call instruction marker to be used on non-tail
calls. Then, we use a specialized calling convention for these non-tail calls,
which fixes the returned values to specific locations in the machine code [2].

To help illustrate what's going on, let's rewrite the above example using the
proposed 'cps' call:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

define { ... } @foo (i8** %sp, ...) {
someBlk:
; ...
; finish stack frame by writing return address
%retAddr = blockaddress(@foo, %retpt)
store i8* %retAddr, i8** %sp
; jump to @bar
%retVals = cps call ghccc {i8**, i64} @bar (i8** %sp, ...)
br label %retpt

retpt:
%sp2 = extractvalue {i8**, i64} %retVals, 0
%val = extractvalue {i8**, i64} %retVals, 1
; ...

}

define {i8**, i64} @bar (i8** %sp, ...) {
; perform a return
%retAddr0 = load i8*, i8** %sp
%retAddr1 = bitcast i8* %retAddr0 to {i8**, i64} (i8**, i64)*
%val = bitcast i64 1 to i64
; jump back to %retpt in @foo, passing %sp and %val
tail call ghccc void %retAddr1(i8** %sp, i64 %val)

unreachable ; <- ideally this would be our terminator,
; but 'unreachable' breaks TCO, so we will
; emit a ret of the struct "returned" by the call.
}

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

The important point here is that the 'cps' marked call will lower to a jump. The
'cps' call marker means that the callee knows how to return using the arguments
explicitly passed to it, i.e., the stack pointer %sp. The callee cannot use a
'ret' instruction if it is 'cps' called.

Either before or during 'cps' call lowering, any instructions following the
'cps' call to @bar are sunk into the block %retpt, and the unconditional
branch to %retpt is deleted/ignored. We include that branch to preserve
control-flow information for LLVM IR optimization passes.

The 'extractvalue' instructions are what ensure the calling convention of
%retpt, since the fields of the struct %retVals are returned in physical
registers dictated by the (modified) ghccc convention. Those same physical
registers are used by the ghccc tail call in @bar when it jumps back to %retpt.
So, the call & return convention of ghccc ensures that everything matches up.


Interaction with LLVM
=====================

(1) Caller-saved Values

One may wonder how this would work if there were caller-saved values live across
the 'cps' call. But in our first example, which closely matches what CPS code
looks like, the call to @bar was in tail position. Thus, in the second example,
there are no caller-saved values for the 'cps' call to @bar, as all live values
were passed as arguments in the call.

This caller-saved part is a bit subtle. It works fine in my experience [2] when
@bar is a function not visible to LLVM. My impression is that even if @bar is
visible to LLVM, there is still no issue, but if you can think of any corner
cases, that would be great!

(2) Inlining

My gut feeling is that we cannot inline a 'cps' marked call-site without more
effort. This is because we might end up with something odd like this once the
dust settles:

%retAddr = blockaddress(@foo, %retpt)
%retAddr1 = bitcast i8* %retAddr to {i8**, i64} (i8**, i64)*
tail call ghccc {i8**, i64} %retAddr1(i8** %sp, ...)

We could add a pass that turns the above sequence into just an unconditional
branch to %retpt, using a phi-node to replace each 'extractvalue' instruction in
that block.
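Concretely, such a fix-up pass might rewrite the inlined sequence into something like the following sketch (the predecessor label %inlined and value %inlinedVal are hypothetical names for illustration):

```llvm
; Sketch only: after the fix-up, the tail call through the block address
; becomes an unconditional branch, and each 'extractvalue' in %retpt is
; replaced by a phi node over the values that were being passed.
  br label %retpt

retpt:
  %sp2 = phi i8** [ %sp, %inlined ]
  %val = phi i64 [ %inlinedVal, %inlined ]
  ; ...
```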

I'm not sure whether inlining in LLVM is important for us yet, as we tend to
inline quite a lot before generating LLVM IR. I don't think this additional
fix-up pass would be too difficult to implement if it's desired.


Implementation Sketch and Conclusion
====================================

My current plan is to add this special lowering of 'cps' calls during the
translation from LLVM IR to the SelectionDAG. I welcome any suggestions or tips
on the best way to approach this. An important goal for us is to merge this into
trunk since we do not want to bundle a special version of LLVM with GHC.

Please let me know soon if you have any objections to this feature.

Thanks for reading,
Kavon


References
==========

[1] http://llvm.org/docs/LangRef.html#blockaddress
[2] http://kavon.farvard.in/papers/ml16-cwc-llvm.pdf


_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Friedman, Eli via llvm-dev

Apr 17, 2017, 2:47:52 PM
to Kavon Farvardin, llvm-dev

I'm not following how explicitly representing the return address of a
call in the IR before isel actually solves any relevant issue. We
already pass the return address implicitly as an argument to every call;
you can retrieve it with llvm.returnaddress if you need it.

You can't branch across functions like you're proposing because the
stack and callee-save registers won't be in the right state. LLVM will
inevitably save values to the stack for calls which are not tail calls.
Therefore, this proposal simply doesn't work.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

Manuel Jacob via llvm-dev

Apr 17, 2017, 3:56:14 PM
to Kavon Farvardin, llvm-dev
Hi Kavon,

Is there a reason you can't use the algorithm from the paper "A
Correspondence between Continuation Passing Style and Static Single
Assignment Form" to convert your IR to LLVM's SSA IR?

-Manuel

Kavon Farvardin via llvm-dev

Apr 17, 2017, 6:52:55 PM
to Friedman, Eli, llvm-dev
(Sorry for the 2nd email Eli, I forgot to reply-all).

> I'm not following how explicitly representing the return address of a call in the IR before isel actually solves any relevant issue. We already pass the return address implicitly as an argument to every call; you can retrieve it with llvm.returnaddress if you need it.


Unfortunately the @llvm.returnaddress intrinsic does not solve the problem, as it only reads the return address pushed onto the LLVM stack by a call. We would then need a way to move the LLVM stack pointer back to where it was before the call, because a CPS call _must_ not grow the LLVM stack (it will never be popped!), so a 'call' instruction will not do.


> You can't branch across functions like you're proposing because the stack and callee-save registers won't be in the right state. LLVM will inevitably save values to the stack for calls which are not tail calls. Therefore, this proposal simply doesn't work.

It might be helpful to think of the CPS transformation as turning the program into a form such that it always moves "forward", in the sense that the tail calls _never_ return. A "return" is simply another tail call (i.e., a jump) to a return address, which looks no different than a function call.

Thus, we have no callee-saved registers to worry about since the tail calls never come back. In addition, there are no live values in the LLVM stack frame, since they are all passed in the CPS call (which uses a different stack). Thus, retpt receives all of its values from the struct it extracts from. While the 'cps' marked call appears to be non-tail, it will be lowered to a tail call.

It's also worth noting that this proposal is based on something that already works in LLVM, using a special assembly routine. We currently use this workaround in Manticore when a thread is preempted. But it has too much overhead for GHC, since the workaround would penalize general function calls. This is why support is necessary for GHC (and related compilers) in order to use LLVM effectively.

~kavon

Kavon Farvardin via llvm-dev

Apr 17, 2017, 7:04:20 PM
to Manuel Jacob, llvm-dev
> Is there a reason you can't use the algorithm from the paper "A Correspondence between Continuation Passing Style and Static Single Assignment Form" to convert your IR to LLVM's SSA IR?


Yes, there are a few reasons.

Undoing the CPS transformation earlier in the pipeline would mean that we are using LLVM's built-in stack. The special layout and usage of the stack in GHC is achieved through CPS, so it is baked into the compiler and garbage-collected runtime system.

~kavon

Friedman, Eli via llvm-dev

Apr 17, 2017, 9:32:45 PM
to Kavon Farvardin, llvm-dev
On 4/17/2017 3:52 PM, Kavon Farvardin wrote:
> (Sorry for the 2nd email Eli, I forgot to reply-all).
>
>> I'm not following how explicitly representing the return address of a call in the IR before isel actually solves any relevant issue. We already pass the return address implicitly as an argument to every call; you can retrieve it with llvm.returnaddress if you need it.
>
> Unfortunately the @llvm.returnaddress intrinsic does not solve the problem, as it only reads the return address pushed onto the LLVM stack by a call. We would then need a way to move the LLVM stack pointer back to where it was before the call, because a CPS call _must_ not grow the LLVM stack (it will never be popped!), so a 'call' instruction will not do.

Most architectures have a call instruction which does not push anything
onto the stack; e.g. on ARM, the "BL" instruction saves the return
address in LR. On other architectures, you can emulate this (for
example, you could lower an IR "call" to LEA+JMP on x86-64).

>> You can't branch across functions like you're proposing because the stack and callee-save registers won't be in the right state. LLVM will inevitably save values to the stack for calls which are not tail calls. Therefore, this proposal simply doesn't work.
> It might be helpful to think of the CPS transformation as turning the program into a form such that it always moves "forward", in the sense that the tail calls _never_ return. A "return" is simply another tail call (i.e., a jump) to a return address, which looks no different than a function call.
>
> Thus, we have no callee-saved registers to worry about since the tail calls never come back. In addition, there are no live values in the LLVM stack frame, since they are all passed in the CPS call (which uses a different stack). Thus, retpt receives all of its values from the struct it extracts from. While the 'cps' marked call appears to be non-tail, it will be lowered to a tail call.

The return address for the current function is fundamentally live across
any non-tail call. And LLVM will hoist other operations across non-tail
calls, and in the process introduce values which are live across calls.
You need to save those values somewhere. The key question is where.
Your proposal tries to address that by explicitly saving/restoring the
return address onto the GHC stack, but you're working at the wrong
level; you can't get a complete list of which values are live across
calls until register allocation.

Kavon Farvardin via llvm-dev

Apr 18, 2017, 6:47:59 AM
to Friedman, Eli, llvm-dev
> Most architectures have a call instruction which does not push anything onto the stack; e.g. on ARM, the "BL" instruction saves the return address in LR. On other architectures, you can emulate this (for example, you could lower an IR "call" to LEA+JMP on x86-64).

This is similar to what I was originally thinking, since the end goal of all of this is the following machine code for a call (I'll use an x86-64 example):

leaq _retpt(%rip), %scratchReg
movq %scratchReg, (%ghcSPReg)
jmp _bar

But, if we want to end up with the above machine code using a custom lowering of just the following IR instruction,

%retVals = cps call ghccc @bar (... args ...)

_without_ explicitly taking the block address and writing it to the GHC stack pointer beforehand, there are some things to consider:

1. How do we prevent IR optimizations from eliminating the %retpt block?

If we do not explicitly take the address of the %retpt block, a pass such as simplifycfg has no reason to preserve the block, and may merge it with its only predecessor.

2. Where in the GHC stack frame do we write the block address?

We could hardcode the fact that the stack pointer must point to the field where the return address should be written when making a 'cps' IR call. This is fine for GHC, but is a seemingly unnecessary restriction.

3. How do we know which argument is the GHC stack pointer?

Similarly, we could hardcode the fact that, say, the first argument of every 'cps' IR call is the GHC stack pointer. Then we know from the calling convention which register it is in. This is again fine for GHC, but reduces the generality of the feature.


> The return address for the current function is fundamentally live across any non-tail call.

I'm not quite sure what you mean by this.

While there may be a return address handed to the main function (and thus passed to every other function), it's either unused or not known to LLVM.


> And LLVM will hoist other operations across non-tail calls, and in the process introduce values which are live across calls. You need to save those values somewhere. The key question is where. Your proposal tries to address that by explicitly saving/restoring the return address onto the GHC stack, but you're working at the wrong level; you can't get a complete list of which values are live across calls until register allocation.

The part I had omitted from the proposal is the initialization of the rest of the GHC stack frame, which is laid out in GHC by finding all of the values that are live across the call before CPS conversion.

While the full list of values preserved by LLVM is not known until register allocation, there are some properties of the LLVM IR we will initially generate that are important to consider:

1. There are no IR values live across the non-tail 'cps' IR call, since they are all passed to the callee.
2. All IR values used after the 'cps' IR call come from the struct returned by that call.

Thus, it should not be possible to hoist instructions above this particular non-tail call, as they are all derived from the values returned in the struct. Any register spills allocated to the LLVM stack before the non-tail call are dead, as they have been copied to the GHC stack frame if they were live.

If you have a particular example in mind, that would be great.

~kavon

Friedman, Eli via llvm-dev

Apr 18, 2017, 1:59:27 PM
to Kavon Farvardin, llvm-dev
On 4/18/2017 3:47 AM, Kavon Farvardin wrote:
>> Most architectures have a call instruction which does not push anything onto the stack; e.g. on ARM, the "BL" instruction saves the return address in LR. On other architectures, you can emulate this (for example, you could lower an IR "call" to LEA+JMP on x86-64).
> This is similar to what I was originally thinking, since the end goal of all of this is the following machine code for a call (I'll use an x86-64 example):
>
> leaq _retpt(%rip), %scratchReg
> movq %scratchReg, (%ghcSPReg)
> jmp _bar
>
> But, if we want to end up with the above machine code using a custom lowering of just the following IR instruction,
>
> %retVals = cps call ghccc @bar (... args ...)
>
> _without_ explicitly taking the block address and writing it to the GHC stack pointer beforehand, there are some things to consider:
>
> 1. How do we prevent IR optimizations from eliminating the %retpt block?
>
> If we do not explicitly take the address of the %retpt block, a pass such as simplifycfg has no reason to preserve the block, and may merge it with its only predecessor.
If the return address is implicit, it's just the instruction immediately
after the call; it doesn't need to be explicitly represented in the CFG.

If you really do need to represent the address explicitly somehow, you
could use something like an "invoke", which has branch destinations
integrated into the call.


>> The return address for the current function is fundamentally live across any non-tail call.

> I'm not quite sure what you mean by this.
>
> While there may be a return address handed to the main function (and thus passed to every other function) it's either unused or not known to LLVM.

Any function which has a "ret" IR instruction must save the address to
return to somewhere. I guess you could generate code without any "ret"
instructions (tail-call with call+unreachable), but that's awkward
because you can't return to the non-CPS code which called into your code.


>> And LLVM will hoist other operations across non-tail calls, and in the process introduce values which are live across calls. You need to save those values somewhere. The key question is where. Your proposal tries to address that by explicitly saving/restoring the return address onto the GHC stack, but you're working at the wrong level; you can't get a complete list of which values are live across calls until register allocation.
> The part I had omitted from the proposal is the initialization of the rest of the GHC stack frame, which is laid out in GHC by finding all of the values that are live across the call before CPS conversion.
>
> While the full list of values preserved by LLVM are not known until register allocation, there are some properties of the LLVM IR we will initially generate that are important to consider:
>
> 1. There are no IR values live across the non-tail 'cps' IR call, since they are all passed to the callee.
> 2. All IR values used after the 'cps' IR call come from the struct returned by that call.
>
> Thus, it should not be possible to hoist instructions above this particular non-tail call, as they are all derived from the values returned in the struct. Any register spills allocated to the LLVM stack before the non-tail call are dead, as they have been copied to the GHC stack frame if they were live.
>
> If you have a particular example in mind that would be great.

If your code uses a large integer constant (like 0x123456789) multiple
times in the same function, LLVM will load it into a register and reuse
it across calls. Same for floating-point constants.

If you reference a global multiple times, LLVM will load the address of
that global into a register, and reuse it. Same for the address of a
function (if you're not directly calling it). And if LLVM can reason
about the value of a global (e.g. it's a constant), it could save that
value into a register.

If LLVM can prove through interprocedural optimization that a function
always returns the value of an argument, it will replace uses of the
return value of the call with uses of the argument.

And even if you could force LLVM not to keep any values live across your
calls, there would be no point: LLVM optimizations wouldn't be able to
tell which values are equivalent before and after the call. The end
result is exactly equivalent to generating multiple functions.

-----

A potential alternative to what you're proposing here is to perform the
transform which splits a function into multiple functions on LLVM IR, as
opposed to running it before generating IR. You might want to look at
the implementation of C++ coroutines
(http://llvm.org/docs/Coroutines.html) for inspiration.

Philip Reames via llvm-dev

Apr 18, 2017, 2:24:18 PM
to llvm...@lists.llvm.org

Can you give a bit more information on what this transformation breaks
optimization-wise? You're essentially explicitly representing the
continuation and converting all calls into tail calls with an explicit
continuation argument. I would expect the LLVM optimizer to do
reasonably well with such a representation.

Honestly, this proposed design seems seriously problematic. The
semantics around inlining alone are problematic enough to warrant
serious hesitation. I think you need to take a step back and describe
the original problem you're trying to solve and why the standard
conversion approaches don't work out. You'd need to strongly motivate a
change of this magnitude, and I really don't feel like you've done so
thus far.

Philip Reames via llvm-dev

Apr 18, 2017, 2:27:39 PM
to Kavon Farvardin, Manuel Jacob, llvm-dev

On 04/17/2017 04:04 PM, Kavon Farvardin via llvm-dev wrote:
>> Is there a reason you can't use the algorithm from the paper "A Correspondence between Continuation Passing Style and Static Single Assignment Form" to convert your IR to LLVM's SSA IR?
>
> Yes, there are a few reasons.
>
> Undoing the CPS transformation earlier in the pipeline would mean that we are using LLVM's built-in stack. The special layout and usage of the stack in GHC is achieved through CPS, so it is baked into the compiler and garbage-collected runtime system.

Can you give a bit more detail here? LLVM does provide support for
describing GC frame maps.

p.s. You're going to have to justify the design of the runtime a bit
here. Extending the IR to work around a buggy or poorly structured
runtime is not going to be sufficient justification. *Why* does the
runtime need the specific runtime stack structure used? What
alternatives exist, and why should those be rejected?

Kavon Farvardin via llvm-dev

Apr 18, 2017, 4:08:14 PM
to Philip Reames, llvm-dev
Before I try to respond to all of these points, let's consider limiting the scope of the proposed changes:

Instead of adding a 'cps' call marker, what if I were to add a custom lowering (during isel) for calls marked with the 'ghccc' calling convention? There is already language-specific lowering in isel for Swift, so I imagine this would be more acceptable?

~kavon

Friedman, Eli via llvm-dev

Apr 18, 2017, 4:51:03 PM
to Kavon Farvardin, Philip Reames, llvm-dev
The 'cps' marker itself isn't important. I'm much more concerned
that your proposal simply doesn't work in its current form: it will lead
to miscompiles and won't effectively expose optimization opportunities.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Reid Kleckner via llvm-dev

Apr 18, 2017, 6:00:24 PM
to Kavon Farvardin, Gor Nishanov, llvm-dev
This seems to be solving a problem very similar to C++ coroutines. You might find it helpful to read the past discussion of their design in LLVM.

What do you think is more important for GHC: getting good or highly tunable low-level code, or enabling mid-level optimizations such as GVN across calls? Is GHC using LLVM more as an optimizer or as a code generator?

If GHC's code generation pipeline leaves opportunities for LLVM mid-level optimization, then I think you want to pursue a design similar to C++ coroutines, where we leave the control flow graph "uninverted" until code generation. I might be making up terminology here, but I think of transforming a program representation to continuation-passing style as "inverting" the CFG. I'm not suggesting that you use the native stack to store VM data, I'm just suggesting that you hide the fact that CPS calls will really be tail calls and returns from LLVM until code generation.

Gor Nishanov via llvm-dev

Apr 18, 2017, 7:18:34 PM
to llvm...@lists.llvm.org

It does look similar to LLVM Coroutines. I was considering adding a few more intrinsics to support the CPS case, but decided to defer until we have use cases and people willing to do the work :).

Coroutine passes chop up the function into smaller pieces at suspend points; a CPS call can be represented as a suspend point.

Currently a suspend point looks like this:

  %sp1 = call token @llvm.coro.save()
  <some code may go here> ; if this is a call, it will be replaced with a tail call where the calling convention allows
  call i8 @llvm.coro.suspend(token %sp1)

Two extra intrinsics I was considering were:

  %addr = i8* @llvm.resume.addr(tok)    ; gives the address of an extracted continuation function
  <type> @llvm.resume.value.<type>(tok) ; gives a return value on resume

So this:

  int foo() {
    PartA...
    int r = bar(args);
    PartB...
  }

can be represented as:

  int foo() {
    PartA
    %sp1 = call token @llvm.coro.save()
    %resume_addr = call @llvm.coro.resume.addr(%sp1)    call bar(args, %resume_addr)
    call @llvm.coro.suspend(%sp1)
    %r = call @llvm.coro.resume.value.<int>(%sp1)
    PartB
  }

After foo() is optimized, coroutine passes will extract the continuation into a separate function:

  foo$resume1(int r) {
    PartB
  }

and the suspend point will be replaced with a jump to bar().

Gor Nishanov via llvm-dev

Apr 18, 2017, 7:21:55 PM
to Reid Kleckner, Kavon Farvardin, llvm-dev

It does look similar. I was considering adding a few more intrinsics to support the CPS case, but decided to defer until we have use cases and people willing to do the work :).

> This seems to be solving a problem very similar to C++ coroutines. You might find it helpful to read the past discussion of their design in LLVM.

Kavon Farvardin via llvm-dev

Apr 19, 2017, 6:49:30 AM
to Gor Nishanov, llvm-dev
> Coroutine passes chop up the function into smaller pieces at suspend points; a CPS call can be represented as a suspend point.

Yes. I think LLVM's coroutines seem too heavyweight for our use case (function call/return), but I haven't seen the machine code for my (probably wrong) example below to confirm. It seems to always crash llc during 'X86 DAG->DAG Instruction Selection' on function '@foo', saying:

Unknown type!
UNREACHABLE executed at lib/IR/ValueTypes.cpp:285!

;;;;;;;;;;;;;;;
declare i8 @llvm.coro.suspend(token, i1)
declare token @llvm.coro.save(i8*)
declare void @llvm.coro.resume(i8*)
declare token @llvm.coro.id(i32, i8*, i8*, i8*)
declare i32 @llvm.coro.size.i32()
declare i8* @malloc(i32 %size)
declare i8* @llvm.coro.begin(token, i8*)
declare i1 @llvm.coro.end(i8*, i1)

define i32 @foo(i32 %x) {
; we don't actually want LLVM to place values in this frame,
; as we wouldn't have info about GC pointers.
allocFrame:
  %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
  %size = call i32 @llvm.coro.size.i32()
  %alloc = call i8* @malloc(i32 %size)
  %hdl = call noalias i8* @llvm.coro.begin(token %id, i8* %alloc)
  br label %entry

entry:
  %save = call token @llvm.coro.save(i8* %hdl)
  %never_returns = tail call i32 @bar(i8* %hdl)
  %ignored = call i8 @llvm.coro.suspend(token %save, i1 true)
  %unused = call i1 @llvm.coro.end(i8* %hdl, i1 false)
  ret i32 17
}

define i32 @bar(i8* %hdl) {
  ; it seems we cannot pass any values to this handle
  ; unless we add an additional intrinsic
  call void @llvm.coro.resume(i8* %hdl)
  ret i32 20 ; <- presumably, this is not reachable.
}
;;;;;;;;;;;;;;;;

Kavon Farvardin via llvm-dev

Apr 19, 2017, 7:22:26 AM
to Reid Kleckner, llvm-dev, Gor Nishanov
> What do you think is more important for GHC: getting good or highly tunable low-level code, or enabling mid-level optimizations such as GVN across calls? Is GHC using LLVM more as an optimizer or as a code generator?

GHC currently uses LLVM more as a code generator; it's an alternative backend.

While some optimizations may not be effective on the proposed construct, we already produce rather horrible-looking code just to use LLVM. The function-splitting transformation that produces this mess is complicated to maintain, so we'd like to remove it.

I see at least one performance benefit of not breaking the program into hundreds of tiny functions: better code placement for I-cache locality.

~kavon

Kavon Farvardin via llvm-dev

Apr 19, 2017, 12:49:24 PM
to Friedman, Eli, llvm-dev, Gor Nishanov
To best address your concerns Eli, it's helpful for me to explain things in the context of a short example of the transformation I have in mind. The example I've created can be found here:

https://gist.github.com/kavon/b80db874cf168bd0625cdd8ef7464f56

I'll note that cc18 is just a custom calling convention that passes struct fields in particular registers.

Okay, so now onto your points:

> If LLVM can prove through interprocedural optimization that a function always returns the value of an argument, it will replace uses of the return value of the call with uses of the argument.

Every function return in CPS looks exactly as written in lines 19-22 of 1_cps_call.ll in the example link. Unless the interprocedural analysis can reason about a tail call to a block address loaded from memory, the replacement you describe would not occur.

Even if %retAddr1 on line 21 were instead, say, the known function @returnFunc, the replacement still shouldn't occur. The analysis would be looking at 'ret' instructions to find out what is eventually returned by all of these tail calls, but the chain of tail calls is endless; we _never_ execute a 'ret' instruction. The program terminates when we effectively tail call the _exit function.


> If your code uses a large integer constant (like 0x123456789) multiple times in the same function, LLVM will load it into a register and reuse it across calls. Same for floating-point constants.

This seems reasonable, but this choice is made by the register allocator, right?

The proposed lowering during isel is in the file 2_isel_idea.ll in the example link. The important part is that the 'cps' call was non-tail in the IR to preserve the control-flow for IR passes, but it will be a tail call when we initially build the SelectionDAG.

Thus, the register allocator should not have a chance to keep a constant in a register... BB#1 doesn't even have a predecessor! Instead, it now has a fixed register convention (the physical register constraints from the cc18 convention), and its block address was taken, so the block won't be deleted.

I have not modified the isel code before, but I imagine this lowering is not too difficult to achieve?


> And even if you could force LLVM not to keep any values live across your calls, there would be no point: LLVM optimizations wouldn't be able to tell which values are equivalent before and after the call. The end result is exactly equivalent to generating multiple functions.


The point of this proposal is that we do not have to split the whole program into hundreds of tiny functions. Here's an example of how ugly this transformation is:

By the time one of the modules of a Haskell benchmark program reaches the backend, it contains 184 functions. To generate LLVM IR as things stand, we had to break the program into 715 functions with a complicated analysis and transformation.

In an attempt to piece the program back together, LLVM's inliner inlines 666 of those 715 functions, but only 51 functions are actually deleted afterwards (their addresses are taken, so most cannot be deleted). The geomean program size increases by 25% when using the LLVM inliner, for only a 4.5% reduction in program run time.

Thus, in terms of improving the code generated by LLVM, I think the proposal will be profitable. As I mentioned in another reply, instruction cache locality is an obvious advantage of the proposal. I don't have a prototype written yet, so I can't tell you what else improves.

~kavon

Kavon Farvardin via llvm-dev

unread,
Apr 19, 2017, 1:38:37 PM4/19/17
to Philip Reames, llvm-dev
> The semantics around inlining alone are problematic enough to warrant serious hesitation.

There are nicer ways to embed CPS call/return into LLVM; I just figured that there would not be much support for adding a new terminator because it would change a lot of code. Ideally we would have a block terminator like:

cps call ghccc @bar (.. args ..) returnsto label %retpt

Where the "returnsto" is optional to cover both CPS calls and returns. Another alternative would be to emulate this terminator using an intrinsic.

So, my question is: would there be more support for a version of this terminator (whether as an intrinsic or not) instead of what I was proposing earlier?

~kavon

Reid Kleckner via llvm-dev

unread,
Apr 19, 2017, 2:34:10 PM4/19/17
to Kavon Farvardin, llvm-dev
I don't think you need a terminator at the IR level, you need one at the MI level. You can lower a CPS-call to some call pseudo-instruction during ISel and then break the MBB in two at some later point.

Gor Nishanov via llvm-dev

unread,
Apr 19, 2017, 2:53:54 PM4/19/17
to Kavon Farvardin, llvm-dev
Here is an example of emulating a CPS call using LLVM Coroutines:


It corresponds to:

void f(int val) {
  partA(val);
  int result = cps_call_bar(41);
  partB(result);
}

int cps_call_bar(int val) {
  partC(val);
  return 42;
}

int main() {
   f(40);
}

Run: opt -enable-coroutines -O2 cps.ll -S

cps_call will look like:

define void @cps_call_bar(i8* %caller, i32* nocapture %retval.addr, i32 %n) local_unnamed_addr {
  tail call void @partC(i32 %n)
  store i32 42, i32* %retval.addr, align 4
  %1 = bitcast i8* %caller to { i8*, i8* }*
  %2 = getelementptr inbounds { i8*, i8* }, { i8*, i8* }* %1, i32 0, i32 0
  %3 = load i8*, i8** %2
  %4 = bitcast i8* %3 to void (i8*)*
  tail call fastcc void %4(i8* %caller)
  ret void
}

main will get optimized to:

define i32 @main() local_unnamed_addr {
entry:
  %alloc.i = tail call i8* @malloc(i32 24)
  tail call void @partA(i32 40)
  tail call void @partC(i32 41)
  tail call void @partB(i32 42)
  tail call void @free(i8* nonnull %alloc.i)
  ret i32 0
}

and resume part of f will be:

define internal fastcc void @f.resume(%f.Frame* %FramePtr) {
entry.resume:
  %vFrame = bitcast %f.Frame* %FramePtr to i8*
  tail call void @partB(i32 42)
  tail call void @free(i8* %vFrame)
  ret void
}

Note that this is using LLVM coroutines as is. To support CPS we would need to add a couple of intrinsics along the lines I mentioned in my earlier post.

Kyle Butt via llvm-dev

unread,
Apr 19, 2017, 6:33:30 PM4/19/17
to Kavon Farvardin, LLVM Developers
On Wed, Apr 19, 2017 at 10:38 AM, Kavon Farvardin via llvm-dev <llvm...@lists.llvm.org> wrote:
>> The semantics around inlining alone are problematic enough to warrant serious hesitation.
>
> There are nicer ways to embed CPS call/return into LLVM; I just figured that there would not be much support for adding a new terminator because it would change a lot of code. Ideally we would have a block terminator like:
>
>     cps call ghccc @bar (.. args ..) returnsto label %retpt
>
> Where the "returnsto" is optional to cover both CPS calls and returns. Another alternative would be to emulate this terminator using an intrinsic.

It seems that what you need to support here is multi-prologue functions. Current coroutines assume a large shared prologue and switch on entry. I was speaking with Reid earlier today, and that is something that should be possible.
You would still need to do something similar to the coroutine support to find out which values are live across a cps call and put them into some frame that gets restored in the prologue.
Prologues for functional languages are likely to be short. If this turns out to be effective, it could be useful for C++ coroutines that have small prologues.

Kavon Farvardin via llvm-dev

unread,
Apr 21, 2017, 8:48:00 AM4/21/17
to Kyle Butt, Reid Kleckner, LLVM Developers
> I don't think you need a terminator at the IR level, you need one at the MI level. You can lower a CPS-call to some call pseudo-instruction during ISel and then break the MBB in two at some later point.

Thanks for this suggestion, Reid! This has turned out to be a lot easier than doing it during isel. I'm currently working on modifying the MBB at the CPS pseudo-instruction during ExpandISelPseudos.

> It seems that what you need to support here is multi-prologue functions.

Basically, yes. I think of it more like needing functions that have multiple entry blocks. If we could attach a physical-register convention to an IR block, that would solve everything.

> Current coroutines assume a large shared prologue and switch on entry. I was speaking with Reid earlier today, and that is something that should be possible.

Yes, we could also manually generate, in each function, that large switch that dispatches to these return points, but of course that's a lot of overhead to pay for every call/return.

> You would still need to do something similar to the coroutine support to find out which values are live across a cps call and put them into some frame that gets restored in the prologue.

Luckily, we've already done this in order to lay out our stack (this is part of the CPS transformation in GHC), so we don't have anything live across the call once we reach the point where we generate LLVM IR. We'd also want to avoid putting anything in LLVM coroutine frames anyway, since the garbage collector needs to know the layout of each frame to know which slots contain pointers.

~kavon

Reid Kleckner via llvm-dev

unread,
Apr 24, 2017, 7:47:35 PM4/24/17
to Kavon Farvardin, LLVM Developers, Kyle Butt
On Fri, Apr 21, 2017 at 5:47 AM, Kavon Farvardin <ka...@farvard.in> wrote:
>> I don't think you need a terminator at the IR level, you need one at the MI level. You can lower a CPS-call to some call pseudo-instruction during ISel and then break the MBB in two at some later point.
>
> Thanks for this suggestion, Reid! This has turned out to be a lot easier than doing it during isel. I'm currently working on modifying the MBB at the CPS pseudo-instruction during ExpandISelPseudos.

Glad to hear it! :) 
>> It seems that what you need to support here is multi-prologue functions.
>
> Basically, yes. I think of it more like needing functions that have multiple entry blocks. If we could attach a physical-register convention to an IR block, that would solve everything.

I avoid using the phrase "multi-entrypoint" functions, because there is still only one entry block that dominates the rest.

You should be able to add new live-in physical registers like we do for landingpads, but that may be a special case. I could imagine adding a new MBB flag that allows new live-in physical registers.
>> Current coroutines assume a large shared prologue and switch on entry. I was speaking with Reid earlier today, and that is something that should be possible.
>
> Yes, we could also manually generate, in each function, that large switch that dispatches to these return points, but of course that's a lot of overhead to pay for call/return.

Large is relative, though. What's one indirect branch in the prologue compared to the cost of duplicating the prologue/epilogue pair at every CPS call point when you use all 5 GPR callee-saved registers and some XMM callee-saved registers? Especially in the context of C++, where coroutines are vanishingly rare, and usually occur around I/O points or other expensive blocking operations.

Anyway, that's why, for now, we haven't seriously considered duplicating the prologue in LLVM. It just didn't seem worth the complexity (we'd also need to invent call frame pseudos to unwind the stack...)