[llvm-dev] Fwd: [RFC] LLVM Coroutines


Gor Nishanov via llvm-dev

Jun 9, 2016, 1:57:35 AM
to llvm...@lists.llvm.org
Hi all:

Below is a proposal to add experimental coroutine support to LLVM. Though this
proposal is motivated primarily by the desire to support C++ Coroutines [1],
the llvm representation is language neutral and can be used to support
coroutines in other languages as well.

Clang + llvm coroutines allow you to take this code:

generator<int> range(int from, int to) {
  for (int i = from; i < to; ++i)
    co_yield i;
}
int main() {
  int sum = 0;
  for (auto v : range(1, 100))
    sum += v;
  return sum;
}

And translate it down to this:

define i32 @main() #5 {
entry:
  ret i32 4950
}
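For reference, the constant is just the sum of the integers 1 through 99. A plain-function equivalent of the generator (no coroutine machinery; `sum_range` is a made-up name for illustration) confirms the value the optimizer produces:

```cpp
#include <cassert>

// Plain-function equivalent of the range()/main() pair above, with no
// coroutine machinery, just to confirm the constant the optimizer produces:
// the integers 1 through 99 sum to 4950.
int sum_range(int from, int to) {
    int sum = 0;
    for (int i = from; i < to; ++i)
        sum += i;
    return sum;
}
```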

I prototyped the llvm changes to support this proposal and extended the clang
coroutine implementation [2] to emit the proposed intrinsics, in order to smoke
test the proposed design and to get simple generator and async task style
coroutines working. See "More Attention Required" in doc/Coroutines.rst for
details on what additional work is required beyond cleanup and bug fixes.

I would like to start the discussion on the overall design, representation and
a roadmap to start upstreaming the changes with the intention to continue the
development in-tree incrementally.

Roadmap:
--------
1) Get agreement on coroutine representation and overall direction (this mail).
.. repeat 1) until happy
2) Get agreement on how coroutine transformation passes integrate into the
optimizer pipeline.
.. repeat 2) until happy
3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/index.rst
4) trickle in coroutine transformation passes + tests in digestible chunks.
5) get clang changes in (sometime after step 3).
6) fix bugs / remove limitations / add functionality to coroutine passes
.. repeat 6) until happy

Background
==========
The coroutine representation is the result of several offline discussions with
Richard Smith, Reid Kleckner, and David Majnemer. All of the good came from the
collective effort. All the yuckiness was sneaked in by me after those
discussions.

I'll start with a quick overview.

LLVM coroutines are functions that have one or more `suspend points`.
When a suspend point is reached, the execution of a coroutine is suspended and
control is returned back to its caller. A suspended coroutine can be resumed
to continue execution from the last suspend point or be destroyed.

In the following example, function `f` (which may or may not be a coroutine
itself) returns a handle to a suspended coroutine (**coroutine handle**) that
is used by `main` to resume the coroutine twice and then destroy it:

.. code-block:: llvm

  define i32 @main() {
  entry:
    %hdl = call i8* @f(i32 4)
    call void @llvm.experimental.coro.resume(i8* %hdl)
    call void @llvm.experimental.coro.resume(i8* %hdl)
    call void @llvm.experimental.coro.destroy(i8* %hdl)
    ret i32 0
  }

In addition to the function stack frame, which exists while a coroutine is
executing, there is an additional region of storage that holds the values that
keep the coroutine state while it is suspended. This region of storage is
called the **coroutine frame**. It is created when a coroutine is called and
destroyed when the coroutine runs to completion or is destroyed by a call to
the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3], which maintain a
stack per coroutine, an llvm coroutine frame contains only the values that need
to persist across suspend points (values for which a path from the definition
to a use crosses a suspend point).

The overall idea is that a coroutine is presented to llvm as an ordinary
function with suspension points marked up with intrinsics. We let the optimizer
party on the coroutine for a while. Shortly before the coroutine is eligible to
be inlined into its callers, we outline the parts of the coroutine that
correspond to the code that needs to be executed after the coroutine is resumed
or destroyed. The coroutine resumption intrinsics get replaced with direct
calls to those outlined functions where possible. The inliner can then inline
the much leaner and nicer coroutine, or any of the outlined parts, as desired.
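As a rough, hand-written illustration of the shape of that split (this is plain C++, not the pass's actual output; `Frame`, `f_start`, `f_resume`, and `f_destroy` are all invented names):

```cpp
#include <cassert>

// Hand-written sketch of a split coroutine: the frame carries pointers to
// the outlined resume/destroy parts plus the only state that must live
// across suspend points.
struct Frame {
    void (*resume_fn)(Frame*);   // outlined "code after resume" part
    void (*destroy_fn)(Frame*);  // outlined "destroy" part
    int i, to;                   // values live across suspend points
};

static void f_resume(Frame* s)  { ++s->i; }   // runs when the coroutine is resumed
static void f_destroy(Frame* s) { delete s; } // frees the coroutine frame

// The "start" part: create the frame, run up to the first suspend point,
// and hand the caller a handle to the suspended coroutine.
Frame* f_start(int from, int to) {
    return new Frame{f_resume, f_destroy, from, to};
}
```

Resuming twice and then destroying, as in the IR example earlier, is then just three indirect calls through the frame.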

If we discover that the lifetime of a coroutine is fully enclosed in the
lifetime of its caller, we remove the dynamic allocation of the coroutine frame
and replace it with an `alloca` in the caller's frame.
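A minimal sketch of what that elision amounts to, assuming invented names (`ElidedFrame`, `g_start`, `g_resume` are hypothetical, not pass output):

```cpp
#include <cassert>

// Sketch of the heap-elision result: when the coroutine provably cannot
// outlive its caller, the frame becomes a stack object in the caller
// instead of a heap allocation.
struct ElidedFrame { int i; };

void g_start(ElidedFrame* s, int from) { s->i = from; }  // was: malloc + init
void g_resume(ElidedFrame* s) { ++s->i; }

int caller() {
    ElidedFrame frame;       // effectively an `alloca` on the caller's frame
    g_start(&frame, 10);
    g_resume(&frame);
    return frame.i;          // no free/destroy needed; dies with the caller
}
```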

Please see doc/Coroutines.rst for more details:

doc/Coroutines.rst: http://reviews.llvm.org/D21170
IR/Intrinsics.td: http://reviews.llvm.org/D21169

Concerns:
=========
(This section assumes that you have looked at doc/Coroutines.rst and IR/Intrinsics.td)

Invoke-like intrinsics via call + conditional branch:
-----------------------------------------------------

The `llvm.experimental.coro.fork` and `llvm.experimental.coro.suspend`
intrinsics model behavior somewhat similar to the invoke instruction. Namely,
they are used to show a normal continuation branch and an alternative branch.

They are used in the following pattern:

  %where.to = call i1 @llvm.experimental.coro.XXXX()
  br i1 %where.to, label %normal, label %alternative

My concern is that this is somewhat fragile. After optimizations some code can
sneak in between the intrinsic and the branch and cause trouble. I am looking
for suggestions on how to make it more robust. One thought I had was to put some
intrinsics at the beginning of the %normal and %alternative blocks so as to
prevent code motion above them. Something like:

  ...
  %where.to = call i1 @llvm.experimental.coro.XXXX()
  br i1 %where.to, label %normal, label %alternative
normal:
  <phi>
  call void @llvm.experimental.coro.normal()
  ...
alternative:
  <phi>
  call void @llvm.experimental.coro.alt()
  ...

After coro.XXXX is lowered, coro.normal and coro.alt are removed.

Looking up outlined coroutine parts
-----------------------------------

The last parameter of the `llvm.experimental.coro.init` intrinsic is a pointer
to the coroutine function. To get to the outlined parts, I take the name of
that function, glue the ".resume", ".destroy", or ".cleanup" suffix onto it,
and then look them up by name.

Probably a better alternative is for a coroutine f to create a constant
containing an array of pointers to f.resume, f.destroy and other parts of the
coroutine and use the last parameter to refer to that constant.
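The alternative can be sketched in C++ as a constant table of part pointers (all names here are illustrative stand-ins, not what clang would actually emit):

```cpp
#include <cassert>

// Sketch of the suggested alternative: instead of deriving ".resume"/
// ".destroy" symbols by name, the front end emits a constant array of
// pointers to the outlined parts, and coro.init's last parameter refers
// to that constant.
struct Frame;                     // opaque coroutine frame
using PartFn = void (*)(Frame*);

static void f_resume(Frame*)  {}  // stand-ins for the outlined parts
static void f_destroy(Frame*) {}
static void f_cleanup(Frame*) {}

// The constant that would accompany coroutine f.
const PartFn f_parts[3] = { f_resume, f_destroy, f_cleanup };
```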

Other questions
---------------

I have more questions about the best way of plugging coroutine optimization
passes into the PassManager, but I will defer those for later discussion.

All feedback and comments are greatly appreciated!!!

Thank you,
Gor

References:
===========

[1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf
[2] http://llvmweekly.org/issue/95 (Coroutine support added to clang)
[3] http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html

Review Links:
=============

doc/Coroutines.rst: http://reviews.llvm.org/D21170
IR/Intrinsics.td: http://reviews.llvm.org/D21169
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

David Chisnall via llvm-dev

Jun 9, 2016, 4:25:59 AM
to Gor Nishanov, llvm...@lists.llvm.org
On 9 Jun 2016, at 06:57, Gor Nishanov via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> Though this
> proposal is motivated primarily by the desire to support C++ Coroutines [1],
> the llvm representation is language neutral and can be used to support
> coroutines in other languages as well.

I’m sorry if this is off-topic, but has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage? In LLVM-compiled code, it is possible that the address of a thread-local variable will be cached to avoid repeated lookups. In an environment that contains both coroutines and TLS, it is possible that the coroutine will be migrated to a different thread between suspension and resumption. Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables? Do we have to treat every function call as a suspension point, as no function can know whether it is executing in a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if it would cause noticeable performance regressions in code that makes heavy use of thread-local variables.

David

Eli Friedman via llvm-dev

Jun 9, 2016, 4:33:29 AM
to Gor Nishanov, llvm...@lists.llvm.org
On Wed, Jun 8, 2016 at 10:57 PM, Gor Nishanov via llvm-dev <llvm...@lists.llvm.org> wrote:
Hi all:

Below is a proposal to add experimental coroutine support to LLVM. Though this
proposal is motivated primarily by the desire to support C++ Coroutines [1],
the llvm representation is language neutral and can be used to support
coroutines in other languages as well.

I looked over the proposal... a few comments:

1. This adds a lot of intrinsics, and they have weird semantics.  This is bad in general; the more weirdness you add to the IR, the more you'll struggle to make the optimizer actually understand what you're doing.  And magic pattern matching *will* break.  You've already pointed out the issue with branches.  (It's actually worse than you think: there isn't any guarantee there will even be a branch when you've thrown the optimizer at it.)  The save intrinsic says "Its return value should be consumed by exactly one `coro.suspend` intrinsic.", but it might have zero uses by the time your lowering pass runs.  You could end up with code getting inserted before llvm.experimental.coro.fork.
2. If you need some sort of unusual control flow construct, make it a proper terminator instruction; don't try to glue your intrinsic to a normal branch instruction.  A new kind of terminator instruction might be appropriate.  Or maybe you can make "invoke" work for you.
3. Maybe you could reduce the amount of magic involved by splitting the initialization into a separate function?  So the initialization is always just something like "f(int x) { frame = init(); f_inner(x, frame); return frame; }", and you don't have to deal with the weirdness of fork().

-Eli

Gor Nishanov via llvm-dev

Jun 9, 2016, 10:41:59 AM
to David Chisnall, llvm...@lists.llvm.org
Hi David:

>> has anyone in WG21 yet come up with a sane proposal about how coroutines interact with thread-local storage?

There is a discussion in the SG1 subgroup (Concurrency and Parallelism) on the
meaning of TLS in lightweight threads of execution. I don't think there is a
resolution yet. Whether to think of a coroutine of this kind as a thread of
execution or just as an "unusual" function is not decided either.

>> Does this mean that, for correctness, we must regard these suspension points as barriers that prevent loads or stores of thread-local variables and prevent caching of loads of addresses of thread-local variables?

You are right! P0057 implies (I probably need to make a note about it) that
when a coroutine is executing, you get the thread_local data for the current
thread. For your first point (caching TLS loads), I think I am getting it for
free. The coro.suspend intrinsic, from llvm's standpoint, may read and write
any memory it can get access to; therefore, if a value was loaded from TLS
before coro.suspend, it has to be reloaded after the call to coro.suspend.

With respect to caching of addresses of TLS variables:

1) if a user does the caching, we should let them:

thread_local int a;
task f() {
  auto x = &a;   // captures the address of a in Thread 1
  co_await some_async_call();
  *x = *x + 1;   // updates the value in Thread 1, whatever the current thread is
}

2) If llvm does it, then, absolutely, we should prevent caching of loads of
addresses across suspend points. I poked around a little and was not able to
come up with a case where it would, but I'll keep looking. If you know of such
a case, let me know. Coroutines are split into normal functions in the middle
optimizer, so if there is any caching done during lowering, coroutines won't
be affected.
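For reference, the hazard under discussion is easy to see in plain C++ (no coroutines involved): each thread has its own instance of a thread_local, so an address captured before a suspend point would not name the variable the code sees after resuming on a different thread.

```cpp
#include <cassert>
#include <thread>

// Demonstrates that a thread_local has a distinct address on each thread,
// which is why caching its address across a potential thread migration is
// observable. Plain C++; tls_var is a made-up example variable.
thread_local int tls_var = 0;

bool addresses_differ_across_threads() {
    int* here = &tls_var;                       // "cached" address on this thread
    int* there = nullptr;
    std::thread t([&] { there = &tls_var; });   // the other thread's instance
    t.join();
    return here != there;
}
```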

>> Do we have to treat every function call as a suspension point, as no function can know whether it is executing in
>> a coroutine or whether a function that it calls may trigger a thread migration? This sounds as if
>> it would cause noticeable performance regressions in code that makes heavy use of thread-local variables.

That should not be necessary. The proposed coroutines are "stackless": they
can only get suspended when control is in the body of the coroutine, at
explicitly marked suspend points. Thus thread migration can only happen at
suspend point boundaries.

With stackful coroutines (aka fibers, green threads, user-mode threads, etc.),
it is a big concern, since none of the functions that run on a fiber is aware
that it runs on a fiber, and a thread switch can potentially happen at every
call. Yep. TLS is pretty much busted there.

Thank you very much for your comments!!!

Gor Nishanov via llvm-dev

Jun 9, 2016, 1:16:58 PM
to Eli Friedman, llvm...@lists.llvm.org
Hi Eli:

Thank you very much for your comments!

>> If you need some sort of unusual control flow construct, make it a
>> proper terminator instruction

I would love to. I was going by the advice in "docs/ExtendingLLVM.rst":

"WARNING: Adding instructions changes the bitcode format, and it will
take some effort to maintain compatibility with the previous version.
Only add an instruction if it is absolutely necessary."

Having coro.fork and coro.suspend be proper terminators would be much cleaner.
Something like:

corobegin to label %start
          suspend label %retblock

corosuspend [final] [save %token]
            resume label %resume
            cleanup label %cleanup

I did consider "repurposing" 'invoke', but felt it would be too much of an
abuse. If there is support for making corobegin/corosuspend instructions, I
would love to do it. Also, if there is support for invoke abuse :-), I will be
glad to try it out and come back with a report on how it went.

>> Maybe you could reduce the amount of magic involved by splitting the
>> initialization into a separate function? So the initialization is always
>> just something like "f(int x) { frame = init(); f_inner(x, frame); return
>> frame; }", and you don't have to deal with the weirdness of fork().

I think outlining the "start" part will be sufficient (starting at the
coro.fork and cutting all of the branches ending in coro.end). In this case
llvm won't see coro.fork and coro.end at all, since clang will do the
outlining work. Something like:

T f(int x, A y) {
  auto mem = coro.elide();
  if (!mem) mem = malloc(llvm.frame.size(&f_inner));
  auto hdl = coro.init(mem);
  T result = <whatever>;

  f_inner(hdl, x, &y);

  return result;
}

f will be a normal function, and f_inner will be a magic function that needs
to be split at suspend points. It will be processed in the IPO order, so by
the time f_inner is considered for inlining into f, it will be a normal
function.

I think this can work, but we are trading off one complexity for another.

As presented:

  a coroutine is defined as a magic function (that has weird intrinsics) and
  requires splitting during IPO

With coro.fork / coro.end removed:

  a coroutine is defined as a combination of a normal function and a magic
  function that requires splitting during IPO. Other languages that acquire
  coroutines would need the same outlining pass as clang.

I need to think about it more, but I am currently leaning toward the first
option (with coro.fork and coro.end). The "f_inner" is still part of f, but
nicely delineated with corobegin and coro.end.

corobegin to label %coro.start
          suspend label %retblock

corosuspend [final] [save %token] resume label %resume
            cleanup label %cleanup

call void @llvm.coro.end()

Does it look better?

Gor

Eli Friedman via llvm-dev

Jun 9, 2016, 2:30:03 PM
to Gor Nishanov, llvm...@lists.llvm.org
On Thu, Jun 9, 2016 at 10:16 AM, Gor Nishanov <gorni...@gmail.com> wrote:
Hi Eli:

Thank you very much for your comments!

>> If you need some sort of unusual control flow construct, make it a
>> proper terminator instruction

I would love to. I was going by the advice in "docs/ExtendingLLVM.rst":

       "WARNING: Adding instructions changes the bitcode format, and it will
        take some effort to maintain compatibility with the previous version.
        Only add an instruction if it is absolutely necessary.

The point of this is that we want to strongly encourage people to look for other solutions first... but on the other hand this isn't a blanket ban on new instructions; pushing intrinsics too far consistently caused trouble for exception handling in LLVM for years.

I need to think about it more, but, I am currently leaning toward the first
option (with coro.fork and coro.end). The "f_inner" is still part of the f,
but nicely delineated with corobegin and coro.end.

  corobegin to label %coro.start
         suspend label %retblock

   corosuspend [final] [save %token] resume label %resume
            cleanup label %cleanup

    call void @llvm.coro.end();

Does it look better?

Yes, that seems fine.  There's still the potential for non-initialization instructions sneaking into the initialization block, but you can probably handle it somehow.

That said, thinking about it a bit more, I'm not really sure why you need to tie the suspend call to the branch in the first place.  What exactly is your algorithm doing that requires it?  I mean, a naive implementation of CoroSplit based on your llvm.experimental.coro.suspend intrinsic clones the whole function, replaces the result of suspend calls with "true" in one version and "false" in the other, and runs SimplifyCFG to kill the dead code.  And even with an algorithm that tries to be more clever, you probably need to do some cloning anyway: presumably frontends will want to share code between the destroy and unwinding codepaths.

-Eli

Gor Nishanov via llvm-dev

Jun 9, 2016, 4:17:43 PM
to Eli Friedman, llvm...@lists.llvm.org
Hi Eli:

>> corobegin to label %coro.start
>> suspend label %retblock
>>
>> corosuspend [final] [save %token]
>> resume label %resume
>> cleanup label %cleanup

> Yes, that seems fine. There's still the potential for non-initialization


> instructions sneaking into the initialization block, but you can probably
> handle it somehow.

I don't mind non-initialization instructions sneaking in at all. I want them
to. All of the code that runs until it hits a suspend can stay in `f`. Only
post-suspend code needs to go to `f.resume` and `f.destroy`.

For example, consider a fire-and-forget coroutine:

void f(int n) {
  while (n-- > 0) {
    do1();
    <suspend> // will subscribe to some async continuation
    do2();
  }
}

Post-split, it will look something like this:

struct f.frame { Fn* ResumeFn; Fn* DestroyFn; int n; };

void f(int n) {
  f.frame* state = <init-stuff>;
  state->n = n;
  if (state->n-- > 0)
    do1();
  else
    <destroy-state>
}

void f.resume(f.frame* state) {
  do2();
  if (state->n-- > 0)
    do1();
  else
    <destroy-state>
}

void f.destroy(f.frame* state) { <destroy-state> }

> That said, thinking about it a bit more, I'm not really sure why you need to
> tie the suspend call to the branch in the first place.

I am not sure I understand the question. The suspend intrinsic is replaced
with a branch to a return block in 'f' and with 'ret void' in the resume
clone.

> What exactly is your algorithm doing that requires it? I mean, a naive
> implementation of CoroSplit based on your llvm.experimental.coro.suspend
> intrinsic clones the whole function, replaces the result of suspend calls
> with "true" in one version and "false" in the other, and runs SimplifyCFG
> to kill the dead code.

Very close. We do clone the function body twice, once for resume and once for
destroy. We make a new entry block with a switch instruction that branches to
all resume points in `f.resume` and to all cleanup points in `f.destroy`, and
then let SimplifyCFG remove the rest.
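(A compile-time analogy for that cloning step: each clone is the same body with the suspend result pinned to a constant, after which the dead arm disappears, much as RAUW + SimplifyCFG would remove it in IR. Purely illustrative; `Kind` stands in for the value a suspend produces in each clone.)

```cpp
#include <cassert>
#include <cstring>

// Analogy only: instantiating the body with the "suspend result" fixed to a
// constant leaves exactly one reachable arm, like a clone after RAUW + CFG
// simplification.
enum Kind { Resume = 0, Destroy = 1 };

template <Kind K>
const char* clone_of_body() {
    if (K == Resume)          // folds to true in the "resume" clone
        return "resume path";
    return "destroy path";    // the only reachable arm in the "destroy" clone
}
```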

Changing the subject a little bit. I was thinking we can get rid of the
coro.fork altogether if we add a third label to the corosuspend.

Namely:

corosuspend [final] [save %token] to label %return.block
            resume label %resume
            cleanup label %cleanup

corosuspend is lowered as follows:

  in 'f': corosuspend is replaced with `br %return.block`

  in 'f.resume':
    add a new entry block with a switch jumping to all resume blocks
    corosuspend is replaced with `ret void`

  in 'f.destroy':
    add a new entry block with a switch jumping to all cleanup blocks
    corosuspend is replaced with `ret void`

I think this makes the model clearer to understand. The only negative side to
a corosuspend with three branching targets is that the `to` label block will
very likely be the same in all corosuspends, thus wasting valuable bits. :-)

Gor

P.S.

The return block in 'f' may contain a lot of stuff: destructors for parameters
passed by value (if the ABI requires it), and possibly some code related to
the return value.

`f.resume` does not have any of that, so the return block is just:

return.block:
ret void

P.P.S

I did consider making the resume part return something other than void. One
scheme would be:

  i1 false - at the final suspend point (or destroyed)
  i1 true  - can be resumed again

In that case:

  corosuspend       => ret i1 true
  corosuspend final => ret i1 false
  control flow reaching the end of `f.resume` => ret i1 false

I wasn't sure it is a clear win, so I went with 'void' as the return value.

Eli Friedman via llvm-dev

Jun 9, 2016, 4:49:28 PM
to Gor Nishanov, llvm...@lists.llvm.org
On Thu, Jun 9, 2016 at 1:17 PM, Gor Nishanov <gorni...@gmail.com> wrote:
Hi Eli:

> That said, thinking about it a bit more, I'm not really sure why you need to
> tie the suspend call to the branch in the first place.

I am not sure I understand the question. Suspend intrinsic is replaced with
a branch to a return block in the 'f' and with 'ret void' in resume.

Right... but that doesn't mean the call to the suspend intrinsic has to be the last non-terminator instruction in the basic block before you run CoroSplit.  You can split the basic block in CoroSplit so any instructions after the suspend call are part of a different basic block.  Then resume and destroy both branch to the continuation of the basic block (and run different codepaths based on the boolean).

> What exactly is your algorithm doing that requires it?  I mean, a naive
> implementation of CoroSplit based on your llvm.experimental.coro.suspend
> intrinsic clones the whole function, replaces the result of suspend calls
> with "true" in one version and "false" in the other, and runs SimplifyCFG
> to kill the dead code.

Very close. We do clone function body twice, once for resume and once for
destroy. We make a new entry block with a switch instruction that will branch to
all resume branches in `f.resume` and to all cleanup branches in `f.destroy` and
then let SimplifyCFG to remove the rest.

Okay, that helps me visualize it.

Changing the subject a little bit. I was thinking we can get rid of the
coro.fork altogether if we add a third label to the corosuspend.

Namely:

  corosuspend [final] [save %token] to label %return.block
    resume label %resume
    cleanup label %cleanup

corosuspend is lowered as follows:
  in 'f': corosuspend is replaced with `br %return.block`

  in 'f.resume':
    add a new entry block with a switch jumping to all resume blocks
    corosuspend is replaced with `ret void`

  in 'f.destroy':
    add a new entry block with a switch jumping to all cleanup blocks
    corosuspend is replaced with `ret void`

I think this makes understanding of the model clearer. The only negative side
to a corosuspend with three branching targets is that it is very likely that
to label block will be the same in all corosuspend's thus wasting valuable
bits. :-)

Yes, this makes sense.

-Eli

Gor Nishanov via llvm-dev

Jun 9, 2016, 5:33:50 PM
to Eli Friedman, llvm...@lists.llvm.org
On Thu, Jun 9, 2016 at 1:49 PM, Eli Friedman <eli.fr...@gmail.com> wrote:
>> Right... but that doesn't mean the call to the suspend intrinsic has to be
>> the last non-terminator instruction in the basic block before you run
>> CoroSplit. You can split the basic block in CoroSplit so any instructions
>> after the suspend call are part of a different basic block. Then resume and
>> destroy both branch to the continuation of the basic block (and run different
>> codepaths based on the boolean).

I love it!!! It is so much simpler. Indeed, we absolutely don't need to care
about the resume and cleanup branches. Let me restate the model as I
understand it.

Lowering of coro.suspend number X:

1) split block at coro.suspend, new block becomes jump point for resumption X
2) RAUW coro.suspend with i1 true in resume clone i1 false in destroy clone
3) replace coro.suspend with br %return.block in 'f', 'ret void' in clones

Scratch the proposed corosuspend instruction. I think the intrinsic will work
just fine!

Gor

Eli Friedman via llvm-dev

Jun 9, 2016, 7:51:03 PM
to Gor Nishanov, llvm...@lists.llvm.org
On Thu, Jun 9, 2016 at 2:33 PM, Gor Nishanov <gorni...@gmail.com> wrote:
Lowering of coro.suspend number X:

1) split block at coro.suspend, new block becomes jump point for resumption X
2) RAUW coro.suspend with i1 true in resume clone i1 false in destroy clone
3) replace coro.suspend with br %return.block in 'f', 'ret void' in clones

Scratch the proposed corosuspend instruction. I think the intrinsic will work
just fine!

That sounds right.


If you're going down that route, that still leaves the question of the semantics of the fork intrinsic... thinking about it a bit more, I think you're going to run into problems with trying to keep around a return block through optimizations:

  [...]
  %first.return = call i1 @llvm.experimental.coro.fork()
  %cmp = icmp eq i32 %x, 0
  br i1 %cmp, label %block1, label %block2

block1:
  [...]
  br i1 %first.return, label %coro.return, label %coro.start

block2:
  [...]
  br i1 %first.return, label %coro.return, label %coro.start

coro.return:
  %xxx = phi i32 [ %a, %block1 ], [ %b, %block2 ]
  call void @f(i32 %xxx)
  ret i8* %frame

(Now you can't trivially insert branches to the return block because of the PHI node.)

-Eli

Gor Nishanov via llvm-dev

Jun 9, 2016, 8:31:49 PM
to Eli Friedman, llvm...@lists.llvm.org
Hi Eli:

>> semantics of the fork intrinsic... thinking about it a bit more, I think
>> you're going to run into problems with trying to keep around a return block
>> through optimizations:

How about this? Make all control flow explicit (duh).

declare i8 coro.suspend([...])

returns:
  0 - resume
  1 - cleanup
  anything else - suspend

Now we can get rid of the coro.fork altogether.

  [...]
  %0 = call i8 @llvm.coro.suspend([...])
  switch i8 %0, label %suspend [ i8 0, label %resume
                                 i8 1, label %cleanup ]
suspend:
  call void @llvm.coro.end()
  br label %return.block

We no longer have to do any special handling of return block as coro.end takes
care of it.

Lowering of coro.suspend X:

1. Split the block at coro.suspend; the new block becomes resume label X
2. RAUW coro.suspend with 0 in f.resume and 1 in f.destroy
3. Replace coro.suspend with `br %suspend`

coro.end expands as before:

  in f => no-op
  in f.resume / f.destroy => ret void (+ fancy things in landing pads)
Gor

Gor Nishanov via llvm-dev

Jun 10, 2016, 9:36:16 AM
to Eli Friedman, llvm...@lists.llvm.org
>> If you're going down that route, that still leaves the question of the
>> semantics of the fork intrinsic... thinking about it a bit more, I think
>> you're going to run into problems with trying to keep around a return block
>> through optimizations:

Thinking about it a bit more, it is even simpler: I don't have to involve
coro.end in this case at all, and I should not worry about branches; RAUW-ing
coro.suspend with an appropriate constant takes care of it all.

Here is an updated version.
1) coro.fork is gone, gone gone!!!!
2) coro.suspend is changed as follows:

declare i8 coro.suspend([...])

returns:
   0 - in the resume clone
   1 - in the destroy clone
  -1 - in the original function

Usage:

  [...]
  %0 = call i8 @llvm.coro.suspend([...])
  switch i8 %0, label %return [ i8 0, label %resume
                                i8 1, label %cleanup ]

Lowering of coro.suspend X:

In the original function:
  * RAUW coro.suspend with -1
  * Remove coro.suspend

In both clones:
  * RAUW coro.suspend with 0 in `f.resume` and 1 in `f.destroy`
  * Split the block after coro.suspend; the new block becomes resume label X
  * Replace coro.suspend with
      - `ret void` in the resume clone
      - @llvm.trap() in the destroy clone
        (in the destroy clone none of the suspend points should be reachable;
        if they are, it is a front-end bug)

Much prettier than before :-)

Gor

Eli Friedman via llvm-dev

Jun 10, 2016, 7:08:20 PM
to Gor Nishanov, llvm...@lists.llvm.org
I'm not sure this quite works the way you want it to in terms of the optimizer.  For example:

block1:
  suspend
  switch (resume, destroy, return)

resume:
  store zero to global @g
  [...]

destroy:
  store zero to global @g
  [...]

return:
  store zero to global @g
  [...]


Naively, you would expect that it would be legal to hoist the store... but that breaks your coroutine semantics because the global could be mutated between the first return and the resume.

Probably the most natural way to solve this is the two-function approach I suggested earlier... that way, the first call to suspend isn't special, so everything just works.

There might be some other way to solve it, but I think it just becomes more awkward. For example, you could add a boolean that tracks whether the function has been suspended yet, and explicitly reroute the control flow to conditionally perform one-time operations before each call to suspend.  But then you need a giant switch statement or a separate one-time cleanup function to actually route the control flow correctly.

-Eli

Josh Klontz via llvm-dev

Jun 10, 2016, 8:06:00 PM
to Gor Nishanov, llvm...@lists.llvm.org
Sorry for the novice question, but how are coroutines lowered in the backend? It requires making operating system calls to something like pthreads, right?

Gor Nishanov via llvm-dev

Jun 10, 2016, 8:25:30 PM
to Eli Friedman, llvm...@lists.llvm.org
Hi Eli:

>> Naively, you would expect that it would be legal to hoist the store...
>> but that breaks your coroutine semantics because the global could be mutated
>> between the first return and the resume.

Hmmm... I don't see the problem. I think hoisting the store is perfectly legal
transformation with all semantics preserved.

Let's look at your example:

>> block1:
>>   suspend
>>   switch (resume, destroy, return)
>>
>> resume:
>>   store zero to global @g
>>   doA()
>>   [...]
>>
>> destroy:
>>   store zero to global @g
>>   doB()
>>   [...]
>>
>> return:
>>   store zero to global @g
>>   doC()
>>   [...]

As written the behavior is:

1) when we encounter a suspend during the first pass through the function,
store 0 to @g and doC()

2) when we encounter a suspend after coroutine was resumed
ret void

3) When coroutine is resumed:
store 0 to @g and doA()

4) When coroutine is destroyed:
store 0 to @g and doB()

If this is a coroutine that can be resumed asynchronously from a different
thread there is a data race. For example, if resume happens before 'f' returns,
doA() can write something useful to @g, and 'f' will clobber it with zero.
But, this race is already present in the code and is not introduced by LLVM.

Let's hoist the store and see what happens:

>> block1:
>>   suspend
>>   store zero to global @g
>>   switch (resume, destroy, return)
>>
>> resume:
>>   doA()
>>   [...]
>>
>> destroy:
>>   doB()
>>   [...]
>>
>> return:
>>   doC()
>>   [...]

Now, let's split it up:

1. RAUW coro.suspend with -1, 0, and 1 in f, f.resume, and f.destroy
2. Remove coro.suspend in f; replace it with ret void in f.resume

void f() {
  [...]
  store zero to global @g
  doC();
  [...]
}

void @f.resume() {
entry:
  store zero to global @g
  doA();
  [...]
}

void @f.destroy() {
entry:
  store zero to global @g
  doB();
  [...]
}

Behavior looks exactly as before. What am I missing?

Thank you,
Gor

P.S.

Eli, thank you very much for your input. I struggled with the coro.fork for
months and in just a few days here it seems to be getting so much better!

Gor Nishanov via llvm-dev

unread,
Jun 10, 2016, 8:42:55 PM6/10/16
to Josh Klontz, llvm...@lists.llvm.org
Hi Josh:

> Sorry for the novice question, but how are coroutines lowered in the
> backend? It requires making operating system calls to something like
> pthreads right?

No OS, no target-specific code required: just a simple rewriting of the function
in the mid-level optimizer. One of the first examples in the RFC shows
that a coroutine is optimized down to a constant. Try doing that with
pthreads :-)

>> generator<int> range(int from, int to) {
>>   for (int i = from; i < to; ++i)
>>     co_yield i;
>> }
>> int main() {
>>   int sum = 0;
>>   for (auto v : range(1, 100))
>>     sum += v;
>>   return sum;
>> }
>>
>> And translate it down to this:
>>
>> define i32 @main() #5 {
>> entry:
>> ret i32 4950
>> }


Very briefly:

LLVM coroutines are functions that have one or more suspend points.
When a suspend point is reached, the execution of a coroutine is suspended and
control is returned back to its caller. A suspended coroutine can be resumed
to continue execution from the last suspend point or be destroyed.

Let's say you have a coroutine that looks something like this:

void *f(int n) {
  for (;;) {
    bar(n++);
    <suspend> // magic: returns a coroutine handle on first suspend
  }
}

We will split it up into start, resume and destroy functions and will create
a coroutine frame that will keep values that need to be live across suspension
points.

struct f.frame { int n; };

void* f(int n) {
  auto s = new f.frame{n};
  bar(s->n++);
  return s;
}

void f.resume(f.frame* s) {
  bar(s->n++);
}

void f.destroy(f.frame* s) {
  delete s;
}
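To make the split concrete, here is a hand-written C++ analog of the three functions above (a sketch with made-up names `Frame`, `f_start`, `f_resume`, `f_destroy`; the real pass emits LLVM IR, not C++):

```cpp
// Hand-written analog of the split shown above (illustrative names only).
struct Frame { int n; };                 // the coroutine frame: keeps 'n' alive

void bar(int) { /* side effects elided for the sketch */ }

Frame* f_start(int n) {                  // original entry: runs to first suspend
  Frame* s = new Frame{n};
  bar(s->n++);
  return s;                              // the "handle" handed to the caller
}

void f_resume(Frame* s) {                // continues from the suspend point
  bar(s->n++);
}

void f_destroy(Frame* s) {               // frees the coroutine frame
  delete s;
}
```

Each call to f_resume advances n by one, mirroring one trip around the loop in f.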

You can read more details in the doc/Coroutines.rst up for review at

http://reviews.llvm.org/D21170

or in the RFC that started this thread.

Cheers
Gor

Eli Friedman via llvm-dev

unread,
Jun 10, 2016, 10:40:53 PM6/10/16
to Gor Nishanov, llvm...@lists.llvm.org
Err... I guess nothing; sorry, I got myself confused.  Essentially executing a return statement, then jumping back, seems wrong, but I'm having trouble coming up with a good example of how it would actually break something.  I guess there aren't really any issues on a pure control-flow basis: you can't execute a global side-effect a different number of times than it would otherwise happen.

You might run into problems with allocas: LLVM's optimizer will assume the lifetime of any alloca in the current function ends when you hit a "ret" instruction, so jumping back from the ret to the middle of the function could lead to wrong results.  Silly example:

x = alloca...

block1:
  suspend
  ; increment could get moved here.
  switch (resume, destroy, return)

resume:
  x += 1
   [...]
 
destroy:
  x += 1
   [...]

return:
  (don't touch x)
  [...]

The increment is only supposed to execute once, but instead executes twice.

This isn't a great example... and I'm not sure issues are limited to allocas... but that's the idea, anyway.

-Eli

Gor Nishanov via llvm-dev

unread,
Jun 11, 2016, 1:32:44 AM6/11/16
to Eli Friedman, llvm...@lists.llvm.org
Hi Eli:

>> Err... I guess nothing; sorry, I got myself confused.

That is my fault :-). I confused myself in the beginning where I was trying to
treat coro.suspend + cond.branch pair as a terminator and was looking for
branches. You showed me the way... Just ignore the branches and represent
the control flow cleanly and SimplifyCFG will take care of the rest :-).

>> You might run into problems with allocas:

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; increment could get moved here.
>>   switch (resume, destroy, return)
>>
>> resume:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   (don't touch x)
>>   [...]

Yep. You are right. Need some kind of barrier to prevent the motion.

Okay. New intrinsics:

declare void coro.barrier(i64); // will get a unique number for every insertion

>> x = alloca...
>>
>> block1:
>>   suspend
>>   ; nothing can sneak in here??? Right?
>>   switch (resume, destroy, return)
>>
>> resume:
>>   coro.barrier(0)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   coro.barrier(1)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   coro.barrier(2)
>>   (don't touch x)
>>   [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

Thanks a lot, Eli, for your help with the design!
Gor

P.S

There is always a backup plan to go to your "two function" idea with a slight
twist.


Almost two functions (coro.fork(), coro.suspend() same as in original proposal)
================================================================================

CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start using
coro.fork as a guide (since false branch bypasses the coroutine body).

The coroutine will now travel as two functions until CoroSplit gets to f.start
(probably using a 'coroutine' attribute on the function to indicate that it
requires CoroSplit to munge on it).

I am still hoping we can find a solution that does not require separating
f.start out. It seems like a huge hammer to avoid undesirable code motion / hoisting.

(I really hope that @coro.barrier(i64) approach can work)

Eli Friedman via llvm-dev

unread,
Jun 11, 2016, 5:27:23 PM6/11/16
to Gor Nishanov, llvm...@lists.llvm.org
On Fri, Jun 10, 2016 at 10:32 PM, Gor Nishanov <gorni...@gmail.com> wrote:
Okay. New intrinsics:

declare void coro.barrier(i64); // will get a unique number for every insertion

>>   x = alloca...
>>
>> block1:
>>   suspend
>>   ; nothing can sneak in here???, Right?
>>   switch (resume, destroy, return)
>>
>> resume:
>>   coro.barrier(0)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> destroy:
>>   coro.barrier(1)
>>   x += 1 ; load + inc + store
>>   [...]
>>
>> return:
>>   coro.barrier(2)
>>   (don't touch x)
>>   [...]

Now we are good, right?
I am still not giving up hope for an intrinsic approach.

coro.barrier() doesn't work: if the address of the alloca doesn't escape, alias analysis will assume the barrier can't read or write the value of the alloca, so the barrier doesn't actually block code movement.

There is always a backup plan to go to your "two function" idea with a slight
twist.


Almost two functions (coro.fork(), coro.suspend() same as in original proposal)
================================================================================

CoroEarly pass which runs at EP_EarlyAsPossible will extract f.start using
coro.fork as a guide (since false branch bypasses the coroutine body).

Coroutine will now travel as two functions until CoroSplit gets to f.start
(Probably will use 'coroutine' attribute on a function to indicate that it
requires CoroSplit to munge on it.)

I am still hoping we can find a solution without requiring to separate f.start
out. Seems like a huge hammer to avoid undesirable code motion / hoisting.

(I really hope that @coro.barrier(i64) approach can work)

 
If you really don't want two functions, you could probably do something like this:

block1:
  %first_time = load...
  br i1 %first_time, label %return, label %suspend1

suspend1:
  suspend
  switch (resume1, destroy1)

block2:
  %first_time = load...
  br i1 %first_time, label %return, label %suspend2

suspend2:
  suspend
  switch (resume2, destroy2)

return:
  %cont = phi i32 [(block1, 1), (block2, 2)]
  ; do stuff
  switch %cont... (1, label %suspend1), (2, label %suspend2)

It's a little awkward to generate the return block, but otherwise it's reasonably clean. Also, if some non-C++ language wants to generate coroutines, it might not have to generate the return block at all.
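The routing above can be mimicked in ordinary C++ to see why it forms a valid CFG (a sketch with invented names: `first_time` plays the role of the loaded flag, `return_block` the one-time "return" block, and `state` the resume target the dead switch routes back to):

```cpp
// Sketch of the first_time-flag routing in plain C++ (names invented).
struct Coro {
  bool first_time = true;  // the flag %first_time loads
  int  one_time_runs = 0;  // how many times the "return" block executed
  int  state = 0;          // which suspend point to resume at

  // The one-time "return" block: runs once, records where it came from,
  // then routes control back to that suspend point (the dead switch).
  void return_block(int from) {
    ++one_time_runs;
    first_time = false;
    state = from;                      // suspend at point 'from'
  }

  void step() {                        // one resume step of the body
    switch (state) {
      case 0:                          // block1
        if (first_time) { return_block(1); return; }  // divert, then suspend1
        state = 1;
        return;
      case 1:                          // resume1 -> block2
        if (first_time) { return_block(2); return; }
        state = 2;                     // suspend2
        return;
    }
  }
};
```

The point of the sketch: the one-time block executes exactly once no matter which suspend point is reached first, and every path still "returns" through a suspend.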

-Eli

Gor Nishanov via llvm-dev

unread,
Jun 11, 2016, 8:43:50 PM6/11/16
to llvm...@lists.llvm.org, Eli Friedman
(Dropped llvm-dev by accident. Putting it back)


Hi Eli:

>> coro.barrier() doesn't work: if the address of the alloca doesn't escape,
>> alias analysis will assume the barrier can't read or write the value of
>> the alloca, so the barrier doesn't actually block code movement.

Got it. I am new to this and learning a lot over the course
of this thread. Thank you for being patient with me.

Two questions and one clarification:

Q1: Do we have to have a load here?
===================================

>> block1:
>>   %first_time = load... <--- What are we loading here?
>>   br i1 %first_time, label %return, label %suspend1
>>
>> suspend1:
>>   %0 = coro.suspend()
>>   switch %0 (resume1, destroy1)

Can we use a three-way coro.suspend instead?

block1:
  %0 = call i8 @coro.suspend()
  switch i8 %0, label %suspend1 [i8 0, label %return] ; or icmp + br i1
suspend1:
  switch i8 %0, label %resume1 [i8 1, label %destroy1] ; or icmp + br i1

One problem I can see is that someone could write a pass that merges the
two branches / switches into one switch, and we are back where we were.
I guess what you meant by the load is a call to some coro.is.first.time()
intrinsic, so it looks like:

>> block1:
>>   %first_time = call i1 coro.is.first.time()
>>   br i1 %first_time, label %return, label %suspend1
>>
>> suspend1:
>>   %0 = coro.suspend()
>>   switch %0 (resume1, destroy1)

This looks fine, there may be more uses for this intrinsic in the frontend.
Killing two birds with one stone. Good.

Question 2: Why the switch in the return block?
===============================================

I would think that the **pre-split** return block would simply be:

return:
  <run dtors for parameters, if required>
  <conversion ops for ret value, if required>
  <ret void> or <ret whatever>

Where and why should I put the switch that you mentioned in this return block?

BTW, I am speaking of the return block as if it is one block,
but, it could be a dominating block over all the blocks that together
run the destructors, do return value conversion, etc.

Clarification:
==============

>> Also, if some non-C++ language wants to generate coroutines,
>> it might not have to generate the return block at all.

C++ coroutines are flexible. The semantics of a coroutine are defined via
traits, so you may define a coroutine that returns void. It does not have
to return a coroutine handle or some struct that wraps the coroutine handle.

For example, a developer may have a queue of suspended coroutines that is
processed by a scheduler, so a coroutine can queue its handle on suspension
either into a global queue or thread_local queue. If there are no destructors
to run for parameters, the return block would be simply:

return:
ret void


Cheers,
Gor

Eli Friedman via llvm-dev

unread,
Jun 11, 2016, 9:08:52 PM6/11/16
to Gor Nishanov, llvm-dev
On Sat, Jun 11, 2016 at 5:09 PM, Gor Nishanov <gorni...@gmail.com> wrote:
HI Eli:


>> coro.barrier() doesn't work: if the address of the alloca doesn't escape,
>> alias analysis will assume the barrier can't read or write the value of
>> the alloca, so the barrier doesn't actually block code movement.

Got it. I am new to this and learning a lot over the course
of this thread. Thank you for being patient with me.

Two questions and one clarification:

Q1: Do we have to have a load here?
===================================

>> block1:
>>    %first_time = load... <--- What are we loading here?

Just a local alloca, initialized to false and changed to true in the return block.
 
>>    br i1 %first_time, label return, label suspend1
>>
>> supend1:
>>    %0 = coro.suspend()
>>    switch %0 (resume1, destroy1)

Can we use three way coro.suspend instead?

  Block1:
    %0 = call i8 coro.suspend()
    switch i8 %0, label suspend1 [i8 0 %return] ; or icmp + br i1
  Suspend1:
    switch i8 %0, label %resume1 [i8 1 %destroy1] ; or icmp + br i1

This doesn't look right: intuitively the suspend happens after the return block runs.
 
One problem I can see is that someone can write a pass that might merge
two branches / switches into one switch and we are back where we were.
I guess what you meant by load, is to call some coro.is.first.time() intrinsic.
So it looks like:

>> block1:
>>    %first_time = call i1 coro.is.first.time()
>>    br i1 %first_time, label return, label suspend1
>>
>> supend1:
>>    %0 = coro.suspend()
>>    switch %0 (resume1, destroy1)

This looks fine, there may be more uses for this intrinsic in the frontend.
Killing two birds with one stone. Good.

It doesn't really matter whether the bit gets tracked in an alloca or through intrinsics.
 

Question 2: Why the switch in the return block?
===============================================

I would think that **pre-split** return block would be simply:

return:
   <run dtors for parameters, if required>
   <conversion ops for ret value, if required>
   <ret void> or <ret whatever>

Where and why I should put a switch that you mentioned in this return block?

BTW, I am speaking of the return block as if it is one block,
but, it could be a dominating block over all the blocks that together
run the destructors, do return value conversion, etc.

The best way to be sure the compiler will understand the control flow is if the coroutine acts like a normal function.  Another way to put it is that it should be possible to lower a coroutine to a thread rather than performing the state machine transformation. 

The switch answers the question of where the control flow actually goes after the return block runs.  Under normal function semantics, the "return" block doesn't actually return: it just performs the one-time operations, then jumps back to the suspend call.  Therefore, you can't use "ret" there; you have to connect the control flow back to the correct suspend call.  The switch makes that connection.  So the return block looks like this:


   <run dtors for parameters, if required>
   <conversion ops for ret value, if required>
   call coro.first_time_ret_value(value) ; CoroSplit replaces this with a ret
   switch ... ; jump to suspend; this is always dead in the lowered version

The dead switch is there so the optimizer will understand the control flow.

And yes, this would be much more straightforward with a two-function approach.

Clarification:
==============


>> Also, if some non-C++ language wants to generate coroutines,
>> it might not have to generate the return block at all.

C++ coroutines are flexible. The semantic of a coroutine is defined via
traits, so you may define a coroutine that returns void. It does not have
to return coroutine handle or some struct that wraps the coroutine handle.

Oh, okay.  I haven't actually looked at the spec; I'm mostly just going off your description of what it does.

-Eli

Sanjoy Das via llvm-dev

unread,
Jun 11, 2016, 9:09:49 PM6/11/16
to Gor Nishanov, llvm...@lists.llvm.org
Hi Gor,

How will you handle (potentially variably sized) alloca instructions
that cannot be elided by promotion to SSA? E.g.

void cor() {
  int a = 0;
  escape(&a);
  for (;;) yield(a++);
}

-- Sanjoy

--
Sanjoy Das
http://playingwithpointers.com

Gor Nishanov via llvm-dev

unread,
Jun 11, 2016, 10:07:01 PM6/11/16
to Sanjoy Das, llvm...@lists.llvm.org
Hi Sanjoy:

>> How will you handle (potentially variably sized) alloca instructions
>> that cannot be elided by promotion to SSA? E.g.
>>
>> void cor() {
>> int a = 0;
>> escape(&a);
>> for (;;) yield(a++);
>> }

Any escaped allocas go to the coroutine frame. It is an important
use case with async coroutines:

task<void> f() {
  char buf[N];
  ... fill buf with stuff
  await async_send(buf); // <suspend-point>
  ... do more
}

Even though the alloca is not mentioned after the suspend point, it still
needs to go into the coroutine frame, as async_send will be referencing it
while the coroutine is suspended (and its stack frame is gone).
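A C++ sketch of why the escaped buffer must live in the heap-allocated frame (hypothetical names; `pending` stands in for whatever pointer async_send retains across the suspend):

```cpp
#include <cstring>

// Sketch: buf escapes into async_send, so it must live in the coroutine
// frame on the heap, not on f's stack (names are illustrative only).
struct FFrame {
  char buf[16];                    // the escaped alloca, moved into the frame
  const char* pending = nullptr;   // pointer async_send keeps while suspended
};

FFrame* f_start() {
  FFrame* s = new FFrame;
  std::strcpy(s->buf, "stuff");    // "... fill buf with stuff"
  s->pending = s->buf;             // async_send(buf): retains a pointer
  return s;                        // suspend: f's own stack frame is gone,
}                                  // but s->buf is still valid on the heap
```

If buf had stayed on f's stack, `pending` would dangle the moment f returned at the first suspend.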

The CoroSplit pass does two interesting things:
1) creates the coroutine frame;
2) chops up the coroutine into pieces.

Here is a brief description:

========================
Coroutine Frame Building
========================

Coroutine frame building algorithm has two parts:
* Deal with allocas
* Deal with virtual registers.

Alloca part
-----------

Step 1: Compute the blocks that will end up in resume and destroy
Step 2: Classify alloca into three buckets
Start Only: Only used in start
Resume Only: Only used in resume/destroy
Shared: Used in both (or escaped)
Step 3: Put all shared values in the coroutine frame, RAUW shared allocas with
GEPs from coroutine frame
Step 4: Move all resume only allocas to the new entry block in resume/destroy

(Future Work)

Step X-1:
insert dbg.values properly
Step X:
Utilize lifetime information for shared allocas to decide how to pack them
tightly in coroutine frame. At the moment I throw away lifetime intrinsics.
Step X+1:
Deal with variable-size allocas. One way to deal with them is to chain them
to the coroutine frame (using the same allocation and deallocation routines
used for the coroutine frame itself). Until step X+1, no variable-size alloca
can go into a coroutine frame. Start-only or resume-only allocas don't
cause problems.
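Step 2 of the alloca part can be sketched as a tiny classifier (illustrative only; the real pass works on LLVM use lists and the block partition computed in Step 1):

```cpp
// Toy classifier for Step 2: bucket an alloca by where it is used.
enum class Bucket { StartOnly, ResumeOnly, Shared };

Bucket classify(bool usedInStart, bool usedInResume, bool escaped) {
  // Escaped allocas are conservatively shared: someone else may touch them
  // while the coroutine is suspended.
  if (escaped || (usedInStart && usedInResume))
    return Bucket::Shared;          // goes into the coroutine frame (Step 3)
  return usedInStart ? Bucket::StartOnly : Bucket::ResumeOnly;
}
```

Only the Shared bucket costs frame space; start-only and resume-only allocas stay ordinary stack slots in their respective functions.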

Virtual Register part
---------------------

Step 1: Precompute block relationship X(B1,B2), when B1 dominates B2 and there
is a path from B1 to B2 that crosses a suspend point.
Step 2:
Check if a path from a use of a value in a resume/destroy block to its
definition crosses a suspend point, i.e. X(Bd, Bu) == true, where
Bu is the block with the use and
Bd is the block with the definition (or the entry block if d is a function argument)
Step 3:
Check if we already have value 'd' in the coroutine frame, if so, replace
the use with GEP from the coroutine frame. Otherwise, spill 'd' into the
coroutine frame and replace the use with GEP.
(future)
Step X:
Make step 3 spilling and reloading smarter.
Step X + 1:
Do live-ranges, coloring, etc to make coroutine frame
really really tiny.


Cheers,
Gor

Gor Nishanov via llvm-dev

unread,
Jun 12, 2016, 12:11:22 AM6/12/16
to Eli Friedman, llvm-dev
Hi Eli:

>> Block1:
>> %0 = call i8 coro.suspend()
>> switch i8 %0, label suspend1 [i8 0 %return] ; or icmp + br i1
>> Suspend1:
>> switch i8 %0, label %resume1 [i8 1 %destroy1] ; or icmp + br i1
>>
>> This doesn't look right: intuitively the suspend happens after the return
>> block runs.

Perhaps, but that is not the intended semantics.

The following two things are distinct concepts:

* Coroutine is considered suspended

* Coroutine returns control back to its caller
(initial caller, resumer, or destroyer)

"Coroutine is considered suspended" means that it is legal to use the coro.done,
coro.resume, and coro.destroy intrinsics (either in this thread or in a different
one). It may seem strange, but consider this example:

task<void> f() {
  char buf[N];
  ... fill buf with stuff
  await async_send(buf); // <suspend-point>
  ... do more
}

IR for await expression will be (conceptually):

%0 = coro.save()
async_send(%buf)
coro.suspend(%0)

coro.save is the point at which the coroutine is considered suspended.
This allows async_send to resume the coroutine even before async_send
returns control back to f.

To make it safe, absolutely nothing should sneak in between coro.save
and coro.suspend beyond what the frontend/user put there.

Aside:
-----
Based on what we discussed earlier, I suspect that the optimizer may try
to sneak stuff in between coro.save and coro.suspend. If that is the case,
I have a solution :-). Richard Smith suggested it to me earlier:
use a distinct function (just like you said with coro.fork :-)).

There will be no distinct coro.save. Instead, coro.suspend will take an
optional function pointer parameter and frontend will generate a little
function for all the code that needs to go in between coro.save and
coro.suspend.

f.await(%frame) {
  %buf = get from %frame
  async_send(%buf)
}

in f():

  coro.suspend([...], &f.await)

CoroEarly will do the extraction. Frontend will keep emitting
coro.save/coro.suspend pairs. After CoroEarly, coro.save's will be removed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P A U S E
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Okay getting back to the switch in the return block.

>> The switch answers the question of where the control flow actually goes
>> after the return block runs.

Since control can reenter the coroutine at any point after coro.save
and before, I don't think there is any point where putting the switch can
help.

We already told the optimizer that we have three possible paths out of suspend:
return (first time), resume, and destroy. I think it is as close to the
truth as we can get.

>> Another way to put it is that it should be possible to lower a coroutine
>> to a thread rather than performing the state machine transformation.

That is very good point! It is a great way to think about the semantics.

Unfortunately, I cannot avoid doing the state machine transformation in the
general case.

For coroutines with an "initial_suspend" point, i.e. those that suspend before
entering the user-authored body, it is very easy.

future<void> f(Params) { // <initial suspend point> at open curly
  char buf[N];
  ... fill buf with stuff
  await async_send(buf); // <suspend-point>
  ... do more
}

Will be transformed into equivalent of:

future<void> f(Params) {
  promise<void> prom;
  future<void> result = prom.get_future();

  std::thread([prom = std::move(prom), Params = std::move(params)] {
    char buf[N];
    ... fill buf with stuff
    BLOCK_WAIT async_send(buf); // block on event instead of suspend
    ... do more
  }).detach();

  return result;
}

No state machine transformation whatsoever, just capture the parameters and
promise and good to go. [No need to involve llvm either :-) it is a simple
lambda like transformation that can be done easily in frontend]
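A compilable sketch of that thread-based lowering (a made-up example: the "body" just computes the sum from the RFC's generator example instead of doing async I/O, and the function name `f` is illustrative):

```cpp
#include <future>
#include <thread>
#include <utility>

// Thread-based lowering sketch: capture the parameters and a promise, run
// the body to completion on a detached thread, and hand the future back to
// the caller immediately.
std::future<int> f(int from, int to) {
  std::promise<int> prom;
  std::future<int> result = prom.get_future();

  std::thread([prom = std::move(prom), from, to]() mutable {
    int sum = 0;                     // the coroutine "body";
    for (int i = from; i < to; ++i)  // a blocking wait would replace each
      sum += i;                      // suspend point
    prom.set_value(sum);
  }).detach();

  return result;                     // caller gets the future right away
}
```

Note how no state machine is needed: the "suspends" become blocking waits inside the thread, exactly as described above.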

What if we don't have an initial suspend point? That means the coroutine
executes on the caller's thread until it hits a suspend point.
Well, we need to build up a coroutine frame and chop up the function at
coro.suspends. We need to put an index in the coroutine frame and a switch
in the entry block of resume and destroy (if there is more than one suspend point).

The only difference is what coro.suspend is replaced with:

In start: coro.suspend launches a thread and gives f.resume as a start
routine and a frame pointer as a context. So, f.resume will be executing
concurrently with f slowly finishing up and returning back to the caller.
(Btw, that is exactly the semantics we are trying to capture)

In resume, coro.suspend is replaced with a blocking call waiting on
whatever will signal that the thread can resume.

=====================
Two function approach
=====================

Even though I am pushing back against it, I might prototype it and see how it
looks anyway.

I will outline in CoroEarly and during CoroSplit inline it back so that
I can build the coroutine frame properly.

But even as I am typing it, it just feels too much.
Look what we are trying to solve is this.

%0 = coro.suspend()
; A thing should not be hoisted here
%first = icmp %0, 0
; or here
br %first, %return, %suspend ; <=== please don't move stuff above me

suspend:
%1 = icmp %0, 1
br %1, label %resume, label %destroy

resume:
A Thing That Can Be Hoisted (same as in destroy)
[...]

destroy:
A Thing That Can Be Hoisted (same as in resume)
[...]

return:
Something Else

Or maybe

%0 = coro.suspend()
; A thing should not be hoisted here
%first = coro.is.first(%0)
; or here
br %first, %return, %suspend ; <=== please don't move stuff above me

suspend:
br %0, label %resume, label %destroy

resume:
A Thing That Can Be Hoisted (same as in destroy)
[...]

destroy:
A Thing That Can Be Hoisted (same as in resume)
[...]

return:
Something Else

Would the optimizer really try to stick 'A thing' above br %first?

As I mentioned before, I am very new to this and probably missing a lot,
but (apart from outlining a portion of f), is there no simple way of
preventing instructions with side effects in the resume or destroy blocks
from being moved above the 'br %first' branch?

(and if there is no good answer, two function it is. I will stop
torturing you with this :-))

Gor

Gor Nishanov via llvm-dev

unread,
Jun 12, 2016, 12:16:08 AM6/12/16
to Eli Friedman, llvm-dev
Typo: "and before" should not be in this sentence:

"Since control can reenter the coroutine at any point after coro.save
>>> and before <<<, I don't think there is any point where putting the
switch can help."

The sentence should read:

"Since control can reenter the coroutine at any point after coro.save,
I don't think there is any point where putting the switch can help."

Gor Nishanov via llvm-dev

unread,
Jun 12, 2016, 12:31:43 PM6/12/16
to Eli Friedman, llvm-dev
I think I got it. The original model (with coro.fork and two-way coro.suspend)
will work with a tiny tweak.

In the original model, I was replacing coro.suspend with br %return in the
original function. The problem was coming from potential phi-nodes introduced
into the return block during optimizations.

Let's make sure that there is only one entry into the return block.
In the original model, ReturnBB had two entries: one from coro.fork,
another from the fall-through from the delete block.

T coro() {
  %mem = malloc(...);
  if (!coro.fork()) {
    Body Of The Coroutine
    With Suspends and All The Fun
  DeleteBB:
    free(%mem)
    coro.end()
  }
ReturnBB:
  return some_code_creating_T();
}

In the original model,
coro.end is replaced with `no-op` in start, 'ret void' in clones (*)
suspend is replaced with `br %return` in start, 'ret void' in clones

Let's tweak it. Add an i1 %fallthru parameter to coro.end. Now the behavior is:
in clones coro.end behaves as before: replaced with 'ret void' (*)
in original:
coro.end(false) => no-op (as before). (*)
coro.end(true) => 'br %return'

Now the body of the coroutine will look like this:

T coro() {
  %mem = malloc(...);
  if (coro.fork()) {
  ReturnBB:
    return some_code_creating_T();
  }
  Body Of The Coroutine
  With Suspends and All The Fun
DeleteBB:
  free(%mem)
  coro.end(true)
  UNREACHABLE
}

Now I don't think any code that has side effects can sneak up from the
"Body of The Coroutine" into ReturnBB. So we are good, right?

Gor

-----------------
(*) There are two uses for coro.end. One is for the fallthrough point in
the delete block. Another, in unwind code responsible for cleanup in
the original function if there is an exception **BEFORE** coroutine had a
chance to suspend.

coro.end in unwind blocks is replaced with appropriate 'unwind to caller'
instruction, cutting the rest out.

Sanjoy Das via llvm-dev

unread,
Jun 12, 2016, 2:39:14 PM6/12/16
to Gor Nishanov, llvm-dev
On Sun, Jun 12, 2016 at 9:31 AM, Gor Nishanov via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Let's tweak it. Add an i1 %fallthru parameter to coro.end. Now the behavior is:
> in clones coro.end behaves as before: replaced with 'ret void' (*)
> in original:
> coro.end(false) => no-op (as before). (*)
> coro.end(true) => 'br %return'
>
> Now the body of the coroutine will look like this:
>
> T coro() {
> %mem = malloc(...);
> if (coro.fork()) {
> ReturnBB:
> return some_code_creating_T();
> }
> Body Of The Coroutine
> With Suspends and All The Fun
> DeleteBB:
> free(%mem)
> coro.end(true)
> UNREACHABLE
> }

I may have missed something obvious here, but in this case doesn't the
transformed function have a path (DeleteBB -> ReturnBB) that is not
present in the original program? This is problematic since it will
allow the optimizer to do things to the program that will be invalid
once you transform the program. One example is that SSA construction
will be wrong:

T coro() {
  %t = alloca
  %mem = malloc(...);
  (*%t) = 10
  if (coro.fork()) {
  ReturnBB:
    %val = *%t
    return some_code_creating_T(%val);
  }
  Body Of The Coroutine
  With Suspends and All The Fun
DeleteBB:
  free(%mem)
  (*%t) = 20
  coro.end(true)
  UNREACHABLE
}


Now in the above CFG %val can be folded to 10, but in reality you need
a PHI of 10 and 20.

Generally (if I understand your idea), you have a program above whose
ultimate execution cannot be "explained" by these intrinsics executing
like normal function calls; i.e. there is a "magic" edge from DeleteBB
to ReturnBB that isn't present in the CFG.

The thought experiment relevant here is that **could** you implement
your semantics with reasonable bodies for the llvm.coro intrinsics?
It does not ultimately matter whether you do that or you do some fancy
program transformation, but assuming the program **could** have had
the semantics you wanted from the very beginning will (by principle)
prevent the optimizer from breaking you.

-- Sanjoy

Gor Nishanov via llvm-dev

unread,
Jun 12, 2016, 9:40:25 PM6/12/16
to Sanjoy Das, llvm-dev
Hi Sanjoy:

>> Now in the above CFG %val can be folded to 10, but in reality you need
>> a PHI of 10 and 20

Quick answer: folding %val to 10 is OK, but, I would prefer it to be 'undef'
or even catch it in the verifier as it is a programmer/frontend error.

Details are in the second half of my reply. I would like to start
with the thought experiment you suggested, as it might help to explain
the behavior a little better.

>> The thought experiment relevant here is that **could** you implement
>> your semantics with reasonable bodies for the llvm.coro intrinsics?

I can emulate **control flow** very closely with user mode thread switching
APIs like make_context, jump_context. Memory behavior not so much.

Library emulation
=================

Let's build a fiber abstraction out of make_context, jump_context.

i1 fiber_fork(i8* %mem) - clones this function into a fiber
  returns 0 - in a fiber
  returns 1 - in a thread
  uses %mem to store fiber context + fiber stack

i1 fiber_suspend() - suspends the current fiber, switches back to the thread
  stores the fiber context to be able to resume later

void fiber_join(i8* %mem, i1 %val) - switches to the fiber stack identified
  by %mem; fiber_suspend will return %val when resumed

void fiber_end() noreturn
  - switches back to a thread, does not remember the fiber
    context and does not touch %mem.
    (Thus you can use this after %mem is freed)


Detail: fiber_fork(i8* %mem)
----------------------------
* Prepares fcontext in %mem utilizing the rest of the memory for stack.
* Saves the current thread context on the current thread stack.
  (We should not store it in %mem, otherwise we won't be
  able to switch back after the fiber memory is freed)
* Sets up **current thread** context, so that when resumed it will proceed
as if fiber_fork returned true
* Sets up **fiber** context so that when resumed, the fiber will proceed
as if fiber_fork returned false.
* Clones the locals on the thread stack to a fiber stack
(This is very approximate behavior, and won't work if we took address of
alloca and stashed it somewhere, but let's ignore it for now as we
are focusing on control flow)
* Switches to fiber

Okay, let's map it to our intrinsics:

coro.fork == fiber_fork
coro.suspend == fiber_suspend
coro.end(true) == fiber_end
coro.resume = fiber_join(%mem, i1 false)
coro.destroy = fiber_join(%mem, i1 true)
coro.size == 1,000,000 (or some other big number)

Now let's walk through this example in the fiber model:

T coro() {
  %t = alloca
  %mem = malloc(...);
  (*%t) = 10

  if (coro.fork()) { // switches to fiber at label Start
  ReturnBB:
    %val = *%t // will be 10, as fiber has its own copy
    return some_code_creating_T(%val);
  }
Start:
  Body Of The Coroutine
  With Suspends and All The Fun
DeleteBB:
  (*%t) = 15 ; // fine, copy of %t on the fiber stack updated
  free(%mem)
  (*%t) = 20 ; // undef: the fiber memory is freed already
  coro.end(true) // switches back to the thread at ReturnBB
  UNREACHABLE // <== never reached, so we are honest with the optimizer
}


Back to the LLVM based implementation
=====================================

In your example, there will be a path from the block where %t is used to
where %t is defined that crosses a suspend point. Therefore %t will be part of
the coroutine frame. So before the split, the program will be transformed into:

coro.frame = type { int %t }

T coro() {
  %t = alloca
  %mem = malloc(...);

  %s = (coro.frame*)coro.init(%mem)
  s->%t = 10
  if (coro.fork()) {
  ReturnBB:
    %val = 10 // const-propagated as 10; OK, but really undef
    return some_code_creating_T(%val);
  }
  Body Of The Coroutine
  With Suspends and All The Fun
DeleteBB:
  free(%mem)
  s->%t = 20 // BOOM: writing to freed memory
  coro.end(true)
  UNREACHABLE
}

Accessing anything in the coroutine frame after coro.fork returns true is
undefined, as by the time you get to ReturnBB, the coroutine frame could
already be destroyed.

Both problems can be statically detected in the Verifier.

Alternatively, in CoroEarly pass that runs at EP_EarlyAsPossible extension
point, we can replace:

%val = *%t
with
%val = undef (used after coro.fork() returned true)

and

free(%mem)
(*%t) = 20
with
@llvm.trap() ; used after frame is destroyed
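A rough sketch of that rewrite in IR terms (the intrinsic names are the ones
proposed in this RFC; the types and surrounding code are illustrative):

```llvm
; Before CoroEarly (sketch):
ReturnBB:
  %val = load i32, i32* %t          ; %t is part of the coroutine frame
  ...
DeleteBB:
  call void @free(i8* %mem)
  store i32 20, i32* %t             ; frame is already freed here

; After CoroEarly (sketch): uses of %val are rewritten to undef,
; and the post-free store becomes a trap:
DeleteBB:
  call void @free(i8* %mem)
  call void @llvm.trap()
  unreachable
```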


--Gor

Gor Nishanov via llvm-dev

Jun 12, 2016, 11:29:51 PM6/12/16
to Sanjoy Das, llvm-dev
Small clarification. In coro.* emulation via fibers, the comment on
coro.end(true)

coro.end(true) // switches back to the thread at ReturnBB <-- this one


UNREACHABLE // <== never reaches here. So we are honest with optimizer
}

should read something like:

coro.end(true) // switches back to the thread that resumed this fiber.

Which will be ReturnBB if we did not hit any suspends on the way to coro.end.
If we did hit a suspend, it is the suspend that will switch back to ReturnBB.

With the LLVM implementation, this is modeled by replacing coro.suspend and
coro.end(true) with `br ReturnBB` in the initial function, which represents
the first execution of the coroutine before it reaches any suspends.

In the resume function, coro.suspend and coro.end are replaced with `ret
void`, thus modeling a return to whomever resumed us.

To me it looks like this matches the fiber behavior exactly.

-- Gor

Sanjoy Das via llvm-dev

Jun 14, 2016, 4:25:31 PM6/14/16
to Gor Nishanov, llvm-dev
Hi Gor,

On Sun, Jun 12, 2016 at 6:40 PM, Gor Nishanov <gorni...@gmail.com> wrote:
> Quick answer: folding %val to 10 is OK, but, I would prefer it to be 'undef'
> or even catch it in the verifier as it is a programmer/frontend error.

Ok.

>>> The thought experiment relevant here is that **could** you implement
>>> your semantics with reasonable bodies for the llvm.coro intrinsics?
>
> I can emulate **control flow** very closely with user mode thread switching
> APIs like make_context, jump_context. Memory behavior not so much.
>
> Library emulation
> =================
>
> Let's build a fiber abstraction out of make_context, jump_context.
>
> i1 fiber_fork(i8* %mem) - clones this function into a fiber
> returns 0 - in a fiber
> returns 1 - in a thread
> uses %mem to store fiber context + fiber stack

I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
setjmp, in that it can "return twice"? If so, I'd urge you to try to
not rely on that kind of behavior. LLVM has unresolved issues around
modeling setjmp like behavior, and if possible, it is preferable to
make all of the control flow explicit from the very beginning.

> i1 fiber_suspend() - suspends current fiber, switches back to thread
> stores the fiber context to be able to resume later

This looks like a normal function call, except that the "return" may
be arbitrarily spaced out from the call?

> void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified by %mem,
> fiber_suspend will return %val when resumed
>
> void fiber_end() noreturn
> - switches back to a thread, does not remember fiber
> context and does not touch %mem.
> (Thus you can use this after %mem is freed)
>
>
> Detail: fiber_fork(i8* %mem)
> ----------------------------
> * Prepares fcontext in %mem utilizing the rest of the memory for stack.
> * Saves current thread context on a current thread stack.
> (We should not store if in %mem, otherwise we won't be
> able to switch back after fiber memory is freed)

As I said above, these following two points can be problematic:

> * Sets up **current thread** context, so that when resumed it will proceed
> as if fiber_fork returned true
> * Sets up **fiber** context so that when resumed, the fiber will proceed
> as if fiber_fork returned false.

>


> Okay, let's map it to our intrinsics:
>
> coro.fork == fiber_fork
> coro.suspend == fiber_suspend
> coro.end(true) == fiber_end
> coro.resume = fiber_join(%mem, i1 false)
> coro.destroy = fiber_join(%mem, i1 true)
> coro.size == 1,000,000 (or some other big number)
>
> Now let's walk through this example in the fiber model:
>
> T coro() {
> %t = alloca
> %mem = malloc(...);
> (*%t) = 10
> if (coro.fork()) { // switches to fiber at label Start
> ReturnBB:
> %val = *%t // will be 10, as fiber has its own copy
> return some_code_creating_T(%val);
> }
> Start:
> Body Of The Coroutine
> With Suspends and All The Fun
> DeleteBB:
> (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
> free(%mem)
> (*%t) = 20 ; // undef: as the fiber memory is freed already

Say there was no (*%t) = 20. There is nothing so far that prevents
sinking "(*%t) = 15" from before the free (where it is legal) to after
the free (where it is not). %t is an unescaped alloca, and LLVM can
move around loads from and stores to these (as a first approximation)
as it pleases.

So when you say "some_code_creating_T()" are there restrictions on
what kind of code there could be as part of the creation logic (I've
so far assumed that it can be arbitrary code). If it can't
(semantically) touch anything in the stack frame then why have it as
part of @coro() ?

-- Sanjoy

Gor Nishanov via llvm-dev

Jun 14, 2016, 9:22:32 PM6/14/16
to Sanjoy Das, llvm-dev
Hi Sanjoy,

>> I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
>> setjmp, in that it can "return twice"?

Yes, user-mode stack switching APIs are somewhat similar to setjmp. Here are
links to a doc page and implementation, just in case you are curious:

http://www.boost.org/doc/libs/1_59_0/libs/context/doc/html/context/context.html
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_ms_pe_masm.asm
http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_sysv_elf_gas.S

>> If so, I'd urge you to try to not rely on that kind of behavior. LLVM has
>> unresolved issues around modeling setjmp like behavior

Absolutely! This was just a thought experiment how one could implement the
intrinsics in a library. One of the benefits of compiler-based coroutines is
that they allow you to avoid all of that mess. Suspend is just a 'ret void',
resume is simply a direct (or indirect) call and a coroutine state is tiny
(assuming you don't use 64K arrays as local variables :-) )

>> DeleteBB:
>> (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>> free(%mem)
>> (*%t) = 20 ; // undef: as the fiber memory is freed already

>> coro.end(true)
>> UNREACHABLE


>>
>> Say there was no (*%t) = 20. There is nothing so far that prevents
>> sinking "(*%t) = 15" from before the free (where it is legal) to after
>> the free (where it is not). %t is an unescaped alloca, and LLVM can
>> move around loads from and stores to these (as a first approximation)
>> as it pleases.

Good point. During CoroSplit I should sink CoroutineFrame deallocation
all the way down to coro.end().

>> So when you say "some_code_creating_T()" are there restrictions on
>> what kind of code there could be as part of the creation logic (I've
>> so far assumed that it can be arbitrary code).

You are right. It *is* arbitrary code. In my reply to you earlier I had a
particular use case in mind, where you should not touch anything in the
coroutine frame in the return block, but that is higher level semantics that
frontend or library imbues coroutine with, not something that LLVM enforces.

The case I had in mind was that a frontend generated an asynchronous coroutine
without RAII semantics. Once launched it will run driven by asynchronous
completions and when it runs to the end it will destroy itself.

Such a coroutine can get suspended, get resumed by a different thread, run to
completion and free the coroutine frame **before** the thread that created
the coroutine manages to get to its return block. In such a coroutine, reading
from or writing to the coroutine frame in the return block is undefined behavior.

On the other hand, if it is a synchronous generator that always starts
suspended, then any writes or reads from a coroutine frame in the
return block are perfectly fine.

Since LLVM does not know what kind of coroutine the user is trying to build, it
should take a conservative approach. LLVM should not introduce writes or reads
to a coroutine frame in the return block if they weren't there, but, if
frontend put them there, LLVM should not replace them with
`@llvm.traps` and `undef`s either.

Gor

Gor Nishanov via llvm-dev

Jun 15, 2016, 10:10:43 AM6/15/16
to Sanjoy Das, llvm-dev
Hi Sanjoy, Eli:

I was thinking a little bit more about your comments related to optimizer
moving code up and down and I think I can generalize the problem.

Background
==========

A coroutine is transformed into a state machine just before it becomes eligible
to be inlined into its caller, so that the body of the coroutine is simplified
as much as possible before we decide what needs to go into a coroutine frame
and before we separate the coroutine into start, resume and destroy parts in
the CoroSplit pass.

SCC_Inliner - FuncPass - FuncPass - ... - SCC_CoroSplit

The Problem
===========

After we form the coroutine frame, we replace 'uses' with loads from the
coroutine frame and put spills at the 'defs' for those values that must be
on the coroutine frame. Before we do this, the optimizer has no idea
that 'defs' and 'uses' have any relationship with the code that allocates and
frees the coroutine frame and this unexpressed dependency is the root
of the problem.

IR emitted from the frontend can be thought of as having several parts
separated by the coro.init, coro.fork and coro.delete intrinsics.


Allocate Frame Code
------ coro.init --------------------- [1]
Init Code
------ coro.fork --------------------- [2]
| \
| \
| Body Code
| ---- coro.delete ------- [3]
| Deallocate Frame Code
| ------ coro.end --------
|
|
Return Block Code

Before CoroSplit runs, we need to make sure that:

1. No instruction from below coro.init can move above it, as the coroutine
frame is not formed yet.

2. No instructions from the "Init Code" can move into the return block code,
since the Init Code may access the coroutine frame, which may already be
deleted by the time we get to the return block.

3. No instructions from above coro.delete can move below it, since after
coro.delete the coroutine frame is deallocated.


The Solution
============

Eli suggested it earlier and now I am dialing it up to 11.
Code as emitted by the frontend is in the correct order. We will outline parts
of the coroutine in CoroEarly pass that runs at EP_EarlyAsPossible to prevent
any code motion across the barriers described earlier.

Probably the parts need to be marked as noinline, so that the inliner
won't put them back together before CoroSplit does the transformation.
In CoroSplit, I may inline them back as appropriate.

I will think more and experiment with this approach.
Any thoughts and suggestions are appreciated.

Gor

Gor Nishanov via llvm-dev

Jun 15, 2016, 11:44:46 PM6/15/16
to Sanjoy Das, llvm-dev
Hi Eli, Sanjoy:

Quick update. I took your advice to make control flow explicit and
extract parts of the coroutine into subfunctions for protection against
undesirable code motion.

It is working out quite nicely and looks prettier too.

* all control flow is explicit (or very close to it :-))
* coro.start now marks the point where all initialization is done and
coroutine can be suspended;
* coro.suspend is now three way showing all interesting control flow;
* things like coro.fork or unreachable after coro.end are no more!

Out of the frontend, IR is structured like this:

Entry:
Allocation Code
----------- coro.init -----------
Initialization Code
----------- coro.start ----------
whatever...
| coro.suspend
V (R) = resume control flow
coro.suspend X (S) = suspend control flow
/ |(R) \(D) (D) = destroy control flow
/ V \
(S)/ whatever2 |
/ | |
/ V V
----- | --- coro.delete ---
\ Deallocation Code
\____ |
| |
V V
----------- coro.end ----
whatever 4
ret Whatever

At EP_EarlyAsPossible, init, delete and return parts are extracted
to protect against any code movement:

Entry:
call f.init_part(inputs1, outputs1)
...
... everything between coro.start and coro.delete
... stays put
DeleteBB:
f.delete_part(inputs2, outputs2)
RetBB: %result = f.ret_part(inputs3)
ret %result


Thank you for all your help!

Gor
