Hello everyone, I do not mean to start a flame war here, but I am still wondering why the coroutines from P0057R0 are still being considered. For what it is worth, I find the paper from Christopher Kohlhoff very clarifying and very well reasoned, and it provides alternatives for all the important use cases from P0057R0 with superior implementations.
P0114 seems more... experimental. It sounds like something that has been discussed to some degree, but is as of yet lacking a proof-of-concept implementation. A lot is said about how it would be "possible" to implement some particular facet under their new rules. But the paper never claims that they've taken Clang or GCC or whatever and actually implemented it.
That's not to say that P0114 is dead and all effort should be focused on P0057. But however much you may find P0114 to be technically superior, P0057 has earned the right to be considered.
I see a bit of a mistake, again, in my opinion, to embed a scheduler into the language, when you could do it in a library, as Christopher's paper shows.
> I see a bit of a mistake, again, in my opinion, to embed a scheduler into the language, when you could do it in a library, as Christopher's paper shows.

There is absolutely no embedded scheduler in P0057 and never was.
P0057 and its predecessors provide syntactic sugar for common async and sync patterns and it is up to the library to decide what meaning to imbue the coroutine with.
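To make "the library decides what meaning to imbue the coroutine with" concrete: the customization point is the awaitable protocol, a trio of functions the compiler calls when it expands `await e`. Below is a minimal sketch of an always-ready awaitable; `ready_int` is an illustrative name, and the `Handle` template parameter stands in for the proposal's `coroutine_handle<>` so the snippet compiles without any coroutine support:

```cpp
#include <cassert>

// Sketch of P0057's awaitable protocol. The compiler rewrites `await e`
// into calls to these three member functions; their implementations,
// supplied by the library, decide what awaiting actually means.
struct ready_int {
    int v;
    // Result already available: tell the compiler not to suspend.
    bool await_ready() const { return true; }
    // Called only when await_ready() returned false; a real awaitable
    // might post `h` to an io_service or a thread pool here.
    template <class Handle>
    void await_suspend(Handle) const {}
    // Produces the value of the `await` expression.
    int await_resume() const { return v; }
};
```

A scheduler, if any, lives entirely inside `await_suspend`, which is library code, not language.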
I suggest to look at this presentation:
Another thing that the presentation above highlights is that the proposed abstraction is unique in that it is not just zero-overhead, it is negative overhead :-) . Meaning that for some problems, taking well-written code that uses functions / callbacks and rewriting it using the higher-level abstraction, namely the coroutines proposed by P0057, will result in a simpler implementation, smaller object size and faster execution.
That said, no proposal has yet discussed why a compliant compiler will never be able to deduce such scenarios without extra keywords and rearranging of code. Similar things have been done to achieve tail-call optimization, so why can this not be done with coroutines?
On Sunday, October 4, 2015 at 11:44:48 AM UTC+7, Gor Nishanov wrote:

> > I see a bit of a mistake, again, in my opinion, to embed a scheduler into the language, when you could do it in a library, as Christopher's paper shows.
>
> There is absolutely no embedded scheduler in P0057 and never was.

Hello Gor. If there is no scheduler, I do not understand how await can work. Forgive my ignorance; as I said above, I do not know the details. But my understanding is that if you have a call to await, the state for the suspended coroutine must be kept somewhere. Where is that state held?
Sorry if I make any mistakes in my explanation; I am not an expert on these papers. I just happen to understand quite well Christopher's metaphor of function objects, and I find it very difficult to imagine something more performant than non-type-erased coroutines that take only the space strictly required.
When a resumable function is used in a resumable expression, the definition of the function must appear before the end of the translation unit.
How much more performant? Is it enough to be worth arguing about? After all, most things you'll be using await for won't be cheap operations. Will you actually notice any such performance loss?
When a resumable function is used in a resumable expression, the definition of the function must appear before the end of the translation unit.
How much more performant? Is it enough to be worth arguing about? After all, most things you'll be using await for won't be cheap operations. Will you actually notice any such performance loss?
If it's a choice between forbidding inlining and forcing inlining, I'll accept the overhead of forbidding inlining.
auto hello(char const* p) {
    while (*p) yield *p++;
}

int main() {
    for (auto c : hello("Hello, world"))
        putchar(c);
}
int main() {
    auto p = "Hello, world";
    while (*p) putchar(*p++);
}
How much more performant? Is it enough to be worth arguing about? After all, most things you'll be using await for won't be cheap operations. Will you actually notice any such performance loss?
Well, this is the usual argument for using "productivity languages". As far as I know, the performance goal of C++ is that between C++ and machine code there should be no room for another language except assembly.
So far, it has been good to me. Boxing is bad, bad, bad.
I do not think it is a good idea in a language abstraction. About the scheduling, I am not sure, but I will believe what you say for now :).
When a resumable function is used in a resumable expression, the definition of the function must appear before the end of the translation unit.
There is an example of a boxed generator with separate compilation in the paper. Doesn't that contradict what you are claiming?
> If it's a choice between forbidding inlining and forcing inlining, I'll accept the overhead of forbidding inlining.

If the coroutine's lifetime is fully enclosed in the lifetime of the calling function, then we can:
1) elide the allocation and use a temporary on the stack of the caller
2) replace indirect calls with direct calls and inline as appropriate:
How exactly does std::experimental::generator<T> accomplish that? How can the object know that it is contained entirely in this way? After all, the promise type is what holds the state, and therefore the promise has to decide whether to statically or dynamically allocate memory, as well as how to handle the forwarding to the function to be resumed.
void coroutine_handle::resume() { _coro_resume(_Ptr); }
void coroutine_handle::destroy() { _coro_destroy(_Ptr); }
P0057/[dcl.fct.def.coroutine]/8 A coroutine may need to allocate memory to store objects with automatic storage duration local
to the coroutine. If so, it shall obtain the storage by calling an allocation function (3.7.4.1).
The allocation function’s name is looked up in the scope of the promise type of the coroutine.
If this lookup fails to find the name, the allocation function’s name is looked up in the global
scope...
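The lookup rule quoted above is just ordinary class-scope allocation-function lookup. As a rough analogy (a sketch only, not the coroutine transformation itself), a promise-like class can intercept its own allocation the same way; `counting_promise` and its `allocs` counter are hypothetical names for illustration:

```cpp
#include <cstddef>
#include <new>

// Analogy: class-scope operator new/delete shadow the global ones, just
// as the promise type's allocation function is found before the global
// scope under [dcl.fct.def.coroutine]/8.
struct counting_promise {
    static int allocs;  // hypothetical counter, for illustration only
    static void* operator new(std::size_t n) {
        ++allocs;                  // observe the frame-like allocation
        return ::operator new(n);  // delegate to the global allocator
    }
    static void operator delete(void* p) { ::operator delete(p); }
};
int counting_promise::allocs = 0;
```

This is the hook through which a library could route coroutine frames into a pool or arena, again without any language-level scheduler.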
Why not make `resumable` do this "boxing" work for you, and have `inline resumable` do what the current thing suggests?
Users should not have to do this nonsense manually, especially considering how common such code will be.
> Why not make `resumable` do this "boxing" work for you, and have `inline resumable` do what the current thing suggests?

Because you can provide library solutions for the boxing in the proposals, without embedding mandatory boxing into the feature.
> > Why not make `resumable` do this "boxing" work for you, and have `inline resumable` do what the current thing suggests?
>
> Because you can provide library solutions for the boxing in the proposals, without embedding mandatory boxing into the feature.

Just provide a generator<int> and you are done. Nothing prevents you from providing these library types that can box, but it does not *force* boxing from the beginning.
I cannot see how resumable expressions are worse than await when:
3. no viral await when refactoring.
2. it can emulate async from a library.
On Tuesday, October 6, 2015 at 10:32:31 AM UTC-7, Germán Diago wrote:

> > Why not make `resumable` do this "boxing" work for you, and have `inline resumable` do what the current thing suggests?
>
> Because you can provide library solutions for the boxing in the proposals, without embedding mandatory boxing into the feature.

Germán:

On Boilerplate:
Chris's proposal suffers from the same problem as lambda*. It requires you to write more code as a user without providing a tangible benefit.
3. no viral await when refactoring.
You say that as though resumable expressions don't have their viral aspects too. You can only call a resumable function as part of a resumable context: either the calling function is resumable or the expression making that call is resumable. Both of these require explicit annotation (except in those cases where the compiler magically works it out for you... somehow).
Oh sure, you won't be using `resumable` like you would `await`, or quite as much. But the fact is, you can't call a `resumable` function unless you've typed `resumable` somewhere nearby. So you still have to annotate up your call graph. And you still can't call coroutines of either kind without some kind of annotation.
So I'm not seeing how that's a point in resumable expression's favor.
Oh, and I have to agree with P0054: I think I'd rather see awaiting and yielding happen than for them to be implicit.
Await or Not

If you look at http://wg21.link/P0054, you will find a section "Exploring design space" which sketches out how you can evolve P0057 to add the "magic" so that you don't have to write awaits. However, I am not sure that the absence of an explicit indication of suspend points is a good thing, though I may be convinced otherwise in the future.
3. no viral await when refactoring.
You say that as though resumable expressions don't have their viral aspects too. You can only call a resumable function as part of a resumable context: either the calling function is resumable or the expression making that call is resumable. Both of these require explicit annotation (except in those cases where the compiler magically works it out for you... somehow).
Oh sure, you won't be using `resumable` like you would `await`, or quite as much. But the fact is, you can't call a `resumable` function unless you've typed `resumable` somewhere nearby. So you still have to annotate up your call graph. And you still can't call coroutines of either kind without some kind of annotation.
I think that comparison is simply not honest: if you put await 7 levels down the stack, you need to decorate all the way up with await.
For resumable, you would need to do it in the 7th level only, and you
would not need to refactor the rest of the code.
resumable int level3()
{
    return 5;
}

int level2()
{
    return level3(); // Cannot call a resumable function here.
}
That is 7 vs 1 refactoring. Not to mention the reusability problem that Chris exposes in the paper: you cannot reuse algorithms, for example, with await.
You cannot have member variables with await either, right?
These are all tangible benefits from having a function object as a representation.
So I'm not seeing how that's a point in resumable expression's favor.
You can see it: imagine a deep stack of calls. How much refactoring do you need in each of the proposals? Imagine code reuse: resumable expressions can reuse code. We cannot say the same about await, *unless* I missed something. In the C++ style, I think resumable functions are better behaved than await, in the sense that each is just a function object: you know what it is doing, you could make it (maybe in future proposals) copyable and movable, and you know the representation: jump point + strictly needed data. I think the resumable expressions proposal sets the bar very high for the rest of the proposals because, besides its benefits, you can also implement what the other proposals are proposing.
//Some header file.
inline int a_func()
{
    return 5;
}

inline int b_func()
{
    break resumable;
    return 5;
}

//Some cpp file.
static void func()
{
    auto x = a_func() + b_func();
    internal_global_var += x;
}

void caller()
{
    func();
}
Therefore, the only reason you would need to "put await 7 levels down the stack" is if you want every function in that call graph to halt when the top-most function does, and thus return control to the caller "7 levels down". Correct?
If so:
For resumable, you would need to do it in the 7th level only, and you
would not need to refactor the rest of the code.
That's not true.
A function marked `resumable` can only be called from a resumable context. This is either another function marked `resumable` or from an expression marked `resumable`.
So this is illegal:
resumable int level3()
{
    return 5;
}

int level2()
{
    return level3(); // Cannot call a resumable function here.
}
Therefore, every one of those 7 levels is going to have to be a `resumable` function. So every one of those levels will have to mark their signature with `resumable`. Any code that calls into any one of those 7 levels will have to mark each use of them with `resumable`, or will themselves have to be coroutines.
So yes, it's just as viral. Only it's worse, because not only do you have to mark them `resumable`, they must be inline.
The only saving grace you get is that the proposal allows automatic deduction of resumable functions. But not everywhere; only in template code and lambdas. So normal functions don't provide this feature.
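Spelled out under the rules just described (a sketch in P0114's proposed syntax; no shipping compiler accepts this today), the annotation climbs the call chain:

```cpp
resumable int level3() { return 5; }

resumable int level2() { return level3(); }  // must itself be resumable

int level1()
{
    return resumable level2();  // or mark the call expression instead
}
```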
That is 7 vs 1 refactoring. Not to mention the reusability problem that Chris exposes in the paper: you cannot reuse algorithms, for example, with await.
That helps demonstrate the viral nature of resumable expressions. If I call an algorithm that internally does an implicitly resumable operation, then that algorithm internally becomes a coroutine. It becomes resumable.
Which means... I now must call that algorithm instantiation from a resumable context. So either my function itself is `resumable`, or I have to say `resumable for_each(...)`.
This doesn't invalidate your point, namely that algorithms will deduce how to properly be `resumable` for their contents. The exact suspend/resume points will not be defined by the writer of the algorithm, but by the functions the algorithm actually calls. And there is value to that.
At this point, that is basically the only advantage of resumable expressions. It's a non-trivial thing to be sure, but I don't see anything about resumable functions that would prevent you from addressing these concerns there.
What resumable functions lack relative to resumable expressions are two things:
1) A way for a function to effectively force the caller to become a coroutine (implicit await).
2) A way for the caller of a function to reverse the implicit `await` of a function call (that's what `resumable` applied to expressions does).
These features are all it takes to allow for the kind of template code reuse you're talking about. Though it does make it slightly more inconvenient than the resumable expressions model, since RE coroutines don't have return type requirements.
But neither of these is impossible with the resumable functions model. It's simply a matter of finding the best way to add those features in.
And of deciding if we want them at all (that's not necessarily a given).

> You cannot have member variables with await either, right?
I don't know what you mean by this.
These are all tangible benefits from having a function object as a representation.
... huh? Those benefits have nothing to do with "having a function object as a representation." Those benefits come from having to declare whether a function is a coroutine at the function level, rather than in the function's implementation.
Code reuse in template functions, perhaps. Code reuse elsewhere? Not so much.
Yes, you could implement resumable functions on top of resumable expressions. But that doesn't prove resumable expressions are better. Nor does it prove that anything you could build atop resumable expressions cannot also be built atop resumable functions.
Can you tell me that resumable expressions have anywhere near the field experience?
Germán:

If you noticed, the theme of my answers to your questions was to move you away from a feature-list-style comparison and take a look at how each proposal is reduced to practice.

Alex Stepanov said many insightful things, and one of them was: "I still believe in abstraction, but now I know that one ends with abstraction, not starts with it. I learned that one has to adapt abstractions to reality and not the other way around." (http://web.archive.org/web/20071120015600/http://www.research.att.com/~bs/hopl-almost-final.pdf page 18).
On hidden magic:

The coroutine proposal is similar to range-based for. The compiler does the syntactic sugar; the magic is the notion of an iterable, which allows the compiler to communicate with the library.
Similarly, with the coroutine proposal, the magic that gets you zero/negative overhead is in the notion of an awaitable. You do the magic yourself. You can find samples of "negative-overhead" awaitables in the slides http://wg21.link/N4287. Also, http://wg21.link/P0055 shows how this magic can be extended via the CompletionToken technique to any template library that models its API after the networking library.
The transformation that the compiler does is specified in http://wg21.link/P0057.
1. Resumable expressions do not need to be type-erased, but can be.
2. Resumable expression objects can be held as objects (even non-type-erased? I am not sure).
Well, I agree that you accumulated a good deal of experience during the implementation. No one can deny that. And the evidence shows that, for your test cases, the negative overhead seems impressive.
Also, I see the yield keyword. I am not sure how it works. That is on the library side in Chris' proposal, represented as an object.
What are the chances that we could capture the coroutines themselves in variables, and make them copyable and movable? I tend to see it as a standard idiom to have objects that can be copied/moved/compared, etc.; that is the trend lately, I think. I do not mean the rest is not good, but why should we prevent these semantics in coroutines in the first place?
Also, I see the yield keyword. I am not sure how it works.
On Friday, October 9, 2015 at 1:49:45 AM UTC-4, germa...@gmail.com wrote:

> What are the chances that we could capture the coroutines themselves in variables, and make them copyable and movable? I tend to see it as a standard idiom to have objects that can be copied/moved/compared, etc.; that is the trend lately, I think. I do not mean the rest is not good, but why should we prevent these semantics in coroutines in the first place?
If you're talking about value semantics, that doesn't make sense for coroutines. Remember that part of a coroutine's state is the stack. And stack variables are often references or pointers to other stack variables. You cannot effectively copy such a construct. And it's silly for the user to have to define a "copy constructor" for their call stack.
That's why `coroutine_handle` has reference semantics. It just makes more sense for coroutines. That doesn't prevent you from being able to pass them (and any containing object) around. You can even have a `std::vector<generator<int>>` and resume each one in turn.
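To make the container point concrete without the TS types: below, `count_to` is a hypothetical, hand-rolled stand-in for a suspended computation (a real `generator<int>` would hold a `coroutine_handle` with reference semantics instead), and a vector of them is resumed round-robin:

```cpp
#include <vector>

// Hypothetical stand-in for generator<int>: a tiny resumable object
// whose "frame" is just two ints. Real generators wrap a coroutine
// handle, but storing and resuming them from a container looks alike.
struct count_to {
    int i;  // next value to produce
    int n;  // exclusive upper bound
    bool resume(int& out) {
        if (i < n) { out = i++; return true; }
        return false;  // exhausted
    }
};

// Resume each element in turn until all are exhausted.
inline std::vector<int> round_robin(std::vector<count_to> gens) {
    std::vector<int> out;
    bool progress = true;
    while (progress) {
        progress = false;
        for (auto& g : gens) {
            int v;
            if (g.resume(v)) { out.push_back(v); progress = true; }
        }
    }
    return out;
}
```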
Or to put it another way, just because you see `await` used to catch the coroutine promise returned by a coroutine does not mean you have to use it that way.
Also, I see the yield keyword. I am not sure how it works.
It returns a value and suspends the coroutine's execution at that point.
If you're wondering about the details of how the value is passed to the coroutine promise and all, that's part of the proposal.
On Friday, October 9, 2015 at 22:21:56 (UTC+7), Nicol Bolas wrote:

> Or to put it another way, just because you see `await` used to catch the coroutine promise returned by a coroutine does not mean you have to use it that way.

I am not sure about this. I have to take a more serious look at both proposals and do a comparison. I think a good starting point would be to convert Gor's code to resumable expressions, which I think is the lower-level proposal, and see how the code looks.
Yesterday I was taking a look and I still have the impression that Gor's proposal is not as minimal as it could be. At least it does not embed any scheduler, that is true, but I see that it "hardcodes" into the language a protocol that is bigger than Chris's. But as I said, I need to take a more serious look at this to make a really fair and accurate comparison.
Also, I see the yield keyword. I am not sure how it works.
> It returns a value and suspends the coroutine's execution at that point.

In resumable expressions, that can be done on top of a library abstraction. Why should putting this into the language be better? Remember that when you put something into the language, there is no way back.
Regarding the recent `operator await` syntax in P0057: is it possible for a user to call this operator themselves? And if so, will it work correctly for types that don't provide one (that is, resolving to the original type or issuing a compiler error)? If not, it would probably be useful if the user could invoke it themselves.
int main() {
operator await(1ms);
operator await(awaiter{ 1ms });
not
operator await(1ms);
Yes, I think the wording is written to make it possible; however, I just checked, and the implementation we ship in VS Update 1 does not do that. I filed a bug against myself.
04.10.2015 18:44, Vicente J. Botet Escriba:
Hmm, await cannot work with a list as a monad, can it?
I suggest to look at this presentation:

http://open-std.org/JTC1/SC22/WG21/docs/papers/2014/n4287.pdf

which walks through some of the aspects of the P0057 proposal. Note that the await syntax is actually quite old. It first appeared as do-notation in Haskell in 1998, and you may notice that P0057 can be used to perform more general "monadic" transformations, not only limited to coroutines.

But no proposal is attempting to take care of this case.
If we use the same underlying technique as the macro-based stackless coroutines of Boost.Asio, then it can work with the list monad, because such a coroutine is just a value type which can be copied/moved.
template <class T>
auto operator await(list<T> const& l) {
    struct awaiter {
        list<T> const* list_;
        static thread_local T const* result_;
        bool await_ready() { return false; }
        void await_suspend(coroutine_handle<> h) {
            for (auto&& item : *list_) { result_ = &item; h.resume(); }
            // add code to extract the result from the promise and do something with it.
        }
        // for every element of the list, return the value that we stashed in the thread_local
        T const& await_resume() { return *result_; }
    };
    return awaiter{ &l };
}
12.10.2015 21:46, Richard Smith:
If we use the same underlying technique as the macro-based stackless coroutines of Boost.Asio, then it can work with the list monad, because such a coroutine is just a value type which can be copied/moved.
It doesn't really work; you can't support local variables with such a
model, because their lifetimes could be reentered after they end.
Local variables do work with the technique used by the stackless coroutines of Boost.Asio (and proposals like N4244).

With this approach, the coroutine is transformed into a class. Local variables are transformed into fields of the class (more precisely, into nested unions corresponding to scopes, as described in N4244), and the coroutine body is transformed into a state-machine method whose states correspond to the yield points.

This can already be implemented via macros to some extent.
Moreover, C#'s await is implemented based on similar approach: http://www.codeproject.com/Articles/535635/Async-Await-and-the-Generated-StateMachine
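That transformation can be written out by hand in portable C++ today, which is essentially what Boost.Asio's coroutine macros mechanize. A sketch for a Fibonacci generator (`fib_machine` is an illustrative name): the locals `a` and `b` become fields, and the body becomes a `switch` that jumps back to the recorded resume point, the switch-into-loop trick those macros use:

```cpp
// Hand-written N4244/Boost.Asio-style transformation of, roughly:
//   for (;;) { yield a; t = a + b; a = b; b = t; }
// Locals are hoisted into fields; `state` records the resume point.
// Because this is a plain value type, it is freely copyable/movable,
// which is exactly the property being argued for in this thread.
struct fib_machine {
    int state = 0;
    int a = 0, b = 1;  // hoisted locals
    bool resume(int& out) {
        switch (state) {
        case 0:
            for (;;) {
                out = a; state = 1; return true;  // "yield a"
        case 1:
                { int t = a + b; a = b; b = t; }
            }
        }
        return false;
    }
};
```

Copying a `fib_machine` mid-run forks the "coroutine", something a heap-allocated, type-erased frame cannot do without user-supplied cloning.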
Evgeny:

Note the "purity" word in Richard's answer. If you write the body of the coroutine in a pure manner, you can hack P0057 and, in your await_suspend for the list monad, resume the coroutine multiple times. You need to provide a proper final_suspend and return_value to make it work. But it will work ONLY if the body of your coroutine is pure :-). And you cannot save any state in the awaiter, since it is torn down at the end of the full expression; hence, I am using a thread_local to ferry a value from await_suspend to await_resume.

Here is a "do not try this at home" awaiter for list<T>. Untested. Just an idea of how it can look.
auto operator await(list<T> const& l) {
struct awaiter {
list<T> const * list_;
static thread_local T* result_;
bool await_ready() { return false; }
void await_suspend(coroutine_handle<> h) {
auto l = list_;
for (auto && item : *l) { result_ = &item; h.resume(); }
... you'd need to write:

list_monad<int> foo(list_monad<int> v) {
loop:
    int x = await v;
    // somehow return x * x then conditionally goto loop.
}

... because you don't have any kind of call/cc primitive.
void await_suspend(coroutine_handle<> h) {
auto l = list_;
auto checkpoint = h.checkpoint();
for (auto && item : *l) { result_ = &item; h.resume(); h.load(checkpoint); }
// add code to extract the result from the promise and do something with it.
}
The same applies here - I want to copy/move/fork/serialize/etc. coroutines, and I agree to pay for the possibility of problems with locals' lifetime issues.
My main concern about P0057R0 is type-erasure - it is far from being
zero-overhead.
Evgeny: Note the "purity" word in Richard's answer. If you write the body of the coroutine in a pure manner, you can hack P0057 and, in your await_suspend for the list monad, resume the coroutine multiple times.
COROUTINE(vector<int>, list_demo, (int, param),
(int, local_x)
(int, local_y))
{
AWAIT(local_x =) vector<int>{1,2,3};
AWAIT(local_y =) vector<int>{10, 20, 30};
RETURN(local_x + local_y + param);
}
COROUTINE_END;
int main()
{
auto xs = list_demo{1000}();
for(auto x : xs)
cout << x << " ";
}
// Prints: 1011 1021 1031 1012 1022 1032 1013 1023 1033
vector<int> list_demo(int param)
{
int local_x = await vector<int>{1,2,3};
int local_y = await vector<int>{10, 20, 30};
return local_x + local_y + param;
}
It is intrinsically non-zero overhead, due to type erasure/allocations. This fact alone is a strong argument against it. Whereas with the method-state-machine approach we can get both: generality and performance.
This is my main concern with the proposal: type erasure and allocations. It seems that some fancy optimizations are possible, but I am not sure why we should rely on those when there are other alternatives.
13.10.2015 2:03, Gor Nishanov:
>
> My main concern about P0057R0 is type-erasure - it is far from being
> zero-overhead.
>
>
> I have had an outstanding challenge for a year already to anyone who
> thinks that way to come up with a real world problem, reduce it to
> manageable size (say async_tcp_reader), write it up both ways using
> P0057 and whatever you consider zero overhead and evaluate on three
> criteria:
An example of a real-world problem is generator/yield. An extra allocation here results in significant overhead. Even if some kind of "small object optimization" scheme is used, it is still not zero overhead.
vector<coroutine> x(N);
What about the "vector<coroutine>(N)" use case? As far as I can see, the overhead of N allocations cannot be elided automatically.
14.10.2015 1:22, Nicol Bolas:

On Tuesday, October 13, 2015 at 5:37:50 PM UTC-4, Evgeny Panasyuk wrote:

13.10.2015 2:03, Gor Nishanov:
>
> My main concern about P0057R0 is type-erasure - it is far from being
> zero-overhead.
>
>
> I have had an outstanding challenge for a year already to anyone who
> thinks that way to come up with a real world problem, reduce it to
> manageable size (say async_tcp_reader), write it up both ways using
> P0057 and whatever you consider zero overhead and evaluate on three
> criteria:
An example of a real-world problem is generator/yield. An extra allocation here results in significant overhead. Even if some kind of "small object optimization" scheme is used, it is still not zero overhead.
Except that he's already proven (in this thread no less) that a good optimizer can elide the allocation. If the compiler can reasonably make it zero overhead, then it is zero overhead.
1. It is practically impossible in the general case. For instance, when we put coroutines in a container, like:

vector<coroutine> x(N);

In the case of coroutines with concrete types whose sizeof is known at compile time, this can be done with a single allocation. But if the coroutine type is erased, then we will have N+1 allocations in the general case - that cannot practically be elided.
2. Even if we consider only function scopes, escape analysis would not give a 100% guarantee of elision in every case. First of all, I think it would hit the halting problem; second, some of the functions in the call tree may not be inlined, for adequate reasons, and this would blind the analysis.
3. This would put an additional burden on implementers, and I don't see reasonable benefits that we get for such a burden.
4. C++11 has lambdas with concrete types - this ensures zero overhead and fits naturally into the language. We don't have type-erased closures; we can use external type erasure like std::function when needed. Why should we have type erasure for stackless coroutines?
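The lambda analogy can be made measurable. The sketch below (illustrative only; the function name is made up) replaces the global allocator to count heap allocations: calling a concrete closure costs none, while boxing the same closure in std::function forces at least one once the capture outgrows any small-object buffer:

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <new>

// Illustration only: count every heap allocation in the program.
static int heap_allocs = 0;

void* operator new(std::size_t n) {
    ++heap_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }

// A closure with a deliberately large capture, so std::function's
// small-object optimization (if any) cannot store it inline.
inline bool concrete_is_free_but_boxing_allocates() {
    char big[256] = {};
    auto concrete = [big] { return (int)sizeof(big); };
    int before = heap_allocs;
    concrete();  // concrete closure type: no heap traffic
    bool concrete_free = (heap_allocs == before);
    std::function<int()> boxed = concrete;  // type erasure boxes the capture
    bool boxing_allocated = (heap_allocs > before);
    return concrete_free && boxing_allocated;
}
```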
It does not erase the type:
But even if there were no copyability/movability, and as a result we could not use std::vector, we could still place N coroutines into an array with a single allocation: make_unique<coroutine[]>(N)
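A sketch of that single-allocation claim, with a hypothetical `concrete_gen` standing in for a coroutine whose concrete type and size are known at compile time:

```cpp
#include <memory>

// Hypothetical concrete coroutine-like type: size known at compile time.
struct concrete_gen {
    int i;
    int resume() { return i++; }  // produce the next value
};

// N "coroutines" in one heap block; a type-erased design would need a
// separate allocation per element on top of the array itself.
inline int single_allocation_demo() {
    std::unique_ptr<concrete_gen[]> arr(new concrete_gen[100]());  // one allocation
    arr[7].resume();         // produces 0
    return arr[7].resume();  // produces 1
}
```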
I am not talking specifically about P0114. I am talking about stackless
coroutines with concrete non-type-erased types.
On Tuesday, October 13, 2015 at 8:17:53 PM UTC-7, Evgeny Panasyuk wrote:

> I am not talking specifically about P0114. I am talking about stackless coroutines with concrete non-type-erased types.

The starting point of my design was a lambda* with the properties you describe. When applied to the problems I needed solving, I found it unsatisfactory and therefore went with the N4134 proposal. That does not mean that at some point somebody won't be able to invent a better lambda* and get it standardized.
P0114, P0057 and lambda* are all powered by the same underlying machinery: a transformation of a state machine written in imperative fashion into an actual state machine. The difference is in the public face of the state machine. You just need to figure out compelling use cases and sane semantics that are not already covered efficiently by P0057, and write a proposal. I cannot do it for you, as the problems I need solving, namely async I/O and async programming in general, are addressed by P0057 succinctly and efficiently.
14.10.2015 4:57, Nicol Bolas:
> Except that he's already proven (in this thread no less) that a
> good optimizer can elide the allocation. If the compiler can
> reasonably /make/ it zero overhead, then it /is/ zero overhead.
>
>
>
> 1. It is impossible (practically) in general case.
> For instance in case when we put coroutines in container, like:
> |
> vector<coroutine>x(N);
> |
> In case of coroutines with concrete types and sizeof known at
> compile - this can be done within single allocation.
> But if coroutine type is erased the we will have N+1 allocations in
> general case - it can't be practically elided.
>
>
> Ignoring the rest of the discussion on this point, I never claimed that
> P0057 could guarantee elision in the case you present here. Before, you
> asked about a /specific/ problem, and I answered with a specific example
> showing that it was elidable. What you've shown here hardly disproves my
> point.
It is not zero overhead even with a good optimizer/compiler, because they cannot elide every allocation, and I am not talking about exotic cases.
>
> Also... how does `vector<coroutine>` make any kind of sense with regard
> to P0114? The type isn't type erased, so each coroutine has its own
> type. Therefore, in order to put them in a homogeneous container like
> `vector`, you'll have to type-erase them. Which requires memory allocation.
> At which point, your version gains /nothing/ over P0057.
Same coroutines have the same concrete types. For instance, with P0114 it may be:

struct concrete_coroutine
{
    resumable auto r = expression;
    // ...
};
...
make_unique<concrete_coroutine[]>(N);
Again, I am not talking specifically about P0114. Even P0057 can be changed to have a concrete coroutine type.
If you need only async I/O - yes, I can imagine that the extra allocation is tolerable in such a context. But P0057 describes not only async I/O but also, for instance, generators. And for generators (like transform iterators) an extra allocation is a huge price.
Okay, a slightly less cryptic reply: the coroutine frame must be stationary once the coroutine starts running.

Resumable Expressions abandoned the movability/copyability of the lambda* that was present in the earlier resumable lambda proposal.
The difference is that in Resumable Expressions you must do the type erasure by hand, which is difficult to eliminate, whereas in P0057 the compiler decides whether it needs to do type erasure or not, thus allowing it to be optimized out when unnecessary.
P0114 requires that all resumable functions you call are inlined. If they're not inlined, you have to manually box them (and the boxing function is no longer resumable). Boxing involves type erasure. And as previously stated, memory allocation.
I do not claim that the feature lists are identical for P0057 and P0114. I
claim that for a complicated problem, like async programming, for example,
the P0057 solution will result in:
1) less user-written code
2) less library support code
3) less abstraction overhead (see TcpReader, for example)
than a solution to the same problem in P0114.
If you want to argue the superiority of P0114, pick a problem (hint, hint:
async programming), write a solution, and compare it with the P0057 equivalent.
And I think ISO C++ needs both stackless and stackful coroutines (though I would prefer to get stackless into the standard first, because stackful can be implemented entirely as a library, like Boost.Context/Coroutine).
I have had an outstanding challenge for a year already to anyone who thinks that way: come up with a real-world problem, reduce it to manageable size (say, async_tcp_reader), write it both ways using P0057 and whatever you consider zero overhead, and evaluate on three criteria:
1) How much code the end user has to write
2) How much library support is required
3) What the abstraction penalty is: how many instructions need to be executed to get from, say, await Read(buf, len) to a low-level API/hardware call, say WSARecv
My statement is that P0057 is as good or better on all 3 criteria than any other proposal I've seen. If you want to accept the challenge, write up an equivalent to the TcpReader described in one of these two presentations:
That is a good way forward. I think the abstraction penalty should be the same,
ResultType r = await async_xyz(p);
becomes
async_xyz$Awaiter __tmp{p};
$promise.resume_addr = &__resume_label; // save the resumption point of the coroutine
__tmp.resume = $RBP; // inlined await_suspend
os_xyz(p,&OsContextBase::Invoke, &__tmp); // inlined await_suspend
jmp Epilogue; // suspends the coroutine
__resume_label: // will be resumed at this point once the operation is finished
R r = move(__tmp.result); // inlined await_resume
resumable void echo(tcp::socket socket);
resumable void listen(tcp::acceptor acceptor) {
...
spawn([s = std::move(socket)]() mutable { echo(std::move(s)); });
echo.h:
void echo(tcp::socket socket);
echo.cpp:
resumable echo_impl(tcp::socket socket) { ... }
void echo(tcp::socket socket) {
spawn([s = std::move(socket)]() mutable { echo_impl(std::move(s)); });
}
14.10.2015 7:51, Nicol Bolas:
> If you need only async I/O - yes, I could imagine that extra
> allocation is tolerable in such context. But P0057 describes not
> only async I/O - but also for instance generators. And for
> generators (like transform iterators) an extra allocation is huge price.
>
>
> A price you will never pay because it will be elided.
>
> Please stop repeating statements that have been disproven; it's not
> helping your case. You have yet to post an example of a generator that
> would not be elided.
I already described it several times: just put the generator into some
structure/array, or return it from somewhere.
In similar situation:
http://coliru.stacked-crooked.com/a/0c09744abd5e57ae
- the allocation of a P0057 generator will not be elided; there will be N
allocations, i.e. one for each coroutine.
Coroutine<resumable {expr}>.
using coroutine_type = decltype(resumable {expr});
You can read what Gor said previously in this topic:
"If coroutine lifetime is fully enclosed in the lifetime of the calling
function, then we can
1) elide allocation and use a temporary on the stack of the caller
2) replace indirect calls with direct calls and inline as appropriate:"
There is "if" condition. Even if we assume that optimizers always do
elision when condition is true, there is still no elision for cases with
false condition.
That does not mean that for some problems, some incarnation of lambda* might not be better than P0057. That is wonderful. When that happens, let's add it in, in addition to P0057 and P0099 (modestly called "A low-level API for stackful context switching").
On Wednesday, October 14, 2015 at 9:37:50 AM UTC-4, Gor Nishanov wrote:
It does not mean that for some problems, some incarnation of lambda* might be better than P0057. That is wonderful. When this happen, let's add it in, in addition to P0057 and P0099 (modestly called "A low-level API for stackful context switching").
There are two problems with the "let's add it in" approach.
First, how do you teach when to use which?
The second problem is interoperation. How would P0114 interoperate with P0057?
On Wednesday, October 14, 2015 at 1:13:56 AM UTC-4, Evgeny Panasyuk wrote:
14.10.2015 7:51, Nicol Bolas:
> If you need only async I/O - yes, I could imagine that extra
> allocation is tolerable in such context. But P0057 describes not
> only async I/O - but also for instance generators. And for
> generators (like transform iterators) an extra allocation is huge price.
>
>
> A price you will never pay because it will be elided.
>
> Please stop repeating statements that have been disproven; it's not
> helping your case. You have yet to post an example of a generator that
> would not be elided.
I already described it several times, just put generator into some
structure/array or return somewhere.
In similar situation:
http://coliru.stacked-crooked.com/a/0c09744abd5e57ae
- allocation of P0057 generator will not be elided, there will be N
allocations, i.e. for each coroutine.
And in your case, the `Coroutine` will have to type-erase them too. Thus performing N allocations.
Put it another way. In order to make something a member of a struct, you must first be able to name it. In C++ as it currently stands, it is impossible to store an unnamable type in a non-static data member. Whether it's a lambda or the result of a resumable expression or anything else, it simply cannot happen.
const int numCoroutines = 10000;
std::vector<What> v;
v.reserve(numCoroutines);
for(int i : range(0, numCoroutines))
    v.emplace_back(createACoroutine());
auto createCoroutineVector(const int numCoroutines)
{
    std::vector<decltype(createACoroutine())> v;
    v.reserve(numCoroutines);
    for(int i : range(0, numCoroutines))
        v.emplace_back(createACoroutine());
    return v;
}
On Wednesday, October 14, 2015 at 11:16:35 AM UTC-4, Giovanni Piero Deretta wrote:
On Wednesday, October 14, 2015 at 2:47:09 PM UTC+1, Nicol Bolas wrote:
On Wednesday, October 14, 2015 at 1:13:56 AM UTC-4, Evgeny Panasyuk wrote:
14.10.2015 7:51, Nicol Bolas:
> If you need only async I/O - yes, I could imagine that extra
> allocation is tolerable in such context. But P0057 describes not
> only async I/O - but also for instance generators. And for
> generators (like transform iterators) an extra allocation is huge price.
>
>
> A price you will never pay because it will be elided.
>
> Please stop repeating statements that have been disproven; it's not
> helping your case. You have yet to post an example of a generator that
> would not be elided.
I already described it several times, just put generator into some
structure/array or return somewhere.
In similar situation:
http://coliru.stacked-crooked.com/a/0c09744abd5e57ae
- allocation of P0057 generator will not be elided, there will be N
allocations, i.e. for each coroutine.
And in your case, the `Coroutine` will have to type-erase them too. Thus performing N allocations.
Put it another way. In order to make something a member of a struct, you must first be able to name it. In C++ as it currently stands, it is impossible to store an unnamable type in a non-static data member. Whether it's a lambda or the result of a resumable expression or anything else, it simply cannot happen.
It is trivially possible. The name is unimportant, only the type is. In this example a lambda stands for an unnamed type. I could have used other unnamed types.
[...]
OK, yes you can do that. My mistake.
However, that code is not a complete example. It doesn't match your sample code (which currently uses type erasure/chicanery). So what does the non-trick version look like?
Thus far, including in this post, you haven't mentioned an example that would actually compile.
You linked to some macro code, but macros are, basically, cheating. They get to break all kinds of C++ rules, which an actual language feature would not.
COROUTINE(vector<int>, list_demo, (int, param),
(int, local_x)
(int, local_y))
{
AWAIT(local_x =) vector<int>{1,2,3};
AWAIT(local_y =) vector<int>{10, 20, 30};
RETURN(local_x + local_y + param);
}
COROUTINE_END;
vector<int> list_demo(int param)
{
int local_x = await vector<int>{1,2,3};
int local_y = await vector<int>{10, 20, 30};
return local_x + local_y + param;
}
>
> Also... how does `vector<coroutine>` make any kind of sense with regard
> to P0114? The type isn't type erased, so each coroutine has its own
> type. Therefore, in order to put them in a homogeneous container like
> `vector`, you'll have to type-erase them. Which requires memory allocation.
> At which point, your version gains /nothing/ over P0057.
Same coroutines have same concrete types. For instance, with P0114 it
may be:
|
struct concrete_coroutine
{
resumable auto r = expression;
// ...
};
...
make_unique<concrete_coroutine[]>(N);
|
`auto` doesn't work that way. Non-static data members cannot be `auto`. Normally I wouldn't care about a small issue like that, but it basically makes your code impossible.
Without `auto` NSDMI (and I wouldn't hold my breath on seeing it), you can't store a resumable expression. So you can't make containers of them.
Unless you erase their types. So again, you've gained nothing.
Your macro solution gets around this because it uses macros.
Again, I am not talking specifically about P0114. Even P0057 can be
changed to have concrete coroutine type.
I'd be curious to see how, exactly.
struct generator
{
...
coroutine_handle<promise_type> coro;
};
generator example_generator()
{
yield 1;
}
int main()
{
generator x = example_generator();
x.move_next();
x.current_value();
}
template<template<typename> class coroutine_value>
struct generator
{
...
coroutine_value<promise_type> coro;
};
generator example_generator()
{
yield 1;
}
// example_generator is transformed to:
using example_generator = generator< synthesized_coroutine >;
int main()
{
example_generator x{};
x.move_next();
x.current_value();
}
If you want this done, then you're going to need to go through the effort of designing the feature to work without type erasure. Then you have to get someone to implement it.
Then, you can know whether it works just as well as P0057, whether it's equally easy to use, and how much of a performance advantage it gets.
If any.
On 14 October 2015, 7:46:51 UTC+3, Nicol Bolas wrote:
Thus far, including in this post, you haven't mentioned an example that would actually compile.
Actually, the examples do compile and run.
You linked to some macro code, but macros are, basically, cheating. They get to break all kinds of C++ rules, which an actual language feature would not.
Macros allow us to emulate a language feature, to test it now, with current compilers. Even Stroustrup uses macros in the Mach7 library to emulate a language feature.
I think it is obvious that the following macro-based code:
COROUTINE(vector<int>, list_demo, (int, param),
(int, local_x)
(int, local_y))
{
AWAIT(local_x =) vector<int>{1,2,3};
AWAIT(local_y =) vector<int>{10, 20, 30};
RETURN(local_x + local_y + param);
}
COROUTINE_END;
is equivalent to the following code with language support:
vector<int> list_demo(int param)
{
int local_x = await vector<int>{1,2,3};
int local_y = await vector<int>{10, 20, 30};
return local_x + local_y + param;
}
And if the macro-based version works, then this one will work without problems.
Same coroutines have same concrete types. For instance, with P0114 it
may be:
|
struct concrete_coroutine
{
resumable auto r = expression;
// ...
};
...
make_unique<concrete_coroutine[]>(N);
|
`auto` doesn't work that way. Non-static data members cannot be `auto`. Normally I wouldn't care about a small issue like that, but it basically makes your code impossible.
Such a usage of auto appears in P0114R0, page 11.
Without `auto` NSDMI (and I wouldn't hold my breath on seeing it), you can't store a resumable expression. So you can't make containers of them.
Unless you erase their types. So again, you've gained nothing.
Your macro solution gets around this because it uses macros.
I showed two versions above - a macro-based version and a possible syntax with language support.
Do you have any concrete reasoning why it would be impossible without macros?
And if macro-based version does work, then this one will work without problems.
I'll talk about this more later, but a good language feature should be minimal, not do whatever it takes. That's why a macro approach is a bad idea for a proposal. It's fine for a general sketch. But macros make you brave; you can do anything with them.
When it comes to a language feature, you shouldn't do anything. You should do just enough, and no more.
Without `auto` NSDMI (and I wouldn't hold my breath on seeing it), you can't store a resumable expression. So you can't make containers of them.
Unless you erase their types. So again, you've gained nothing.
Your macro solution gets around this because it uses macros.
I showed two versions above - macro based version and possible syntax with language support.
Do you have any concrete reasoning why it would be impossible without macros?
Because you'd have to get language support for `auto` in NSDMI's. And I just linked to you a discussion about precisely that and how it's not gonna happen. So your "possible syntax with language support" doesn't hold water.
With concrete coroutine type it could be something like this:
template<template<typename> class coroutine_value>
struct generator
{
...
coroutine_value<promise_type> coro;
};
generator example_generator()
{
yield 1;
}
// example_generator is transformed to:
using example_generator = generator< synthesized_coroutine >;
int main()
{
example_generator x{};
x.move_next();
x.current_value();
}
Um, what does that code mean? Where does `synthesized_coroutine` come from?
How does `example_generator` get defined twice?
And how does `example_generator` return a template that has no template arguments?
struct generator
{
template<template<typename> class coroutine_value>
struct apply { ... };
};
A nice thing about resumable functions is that it doesn't take a sledgehammer to basic elements of the language. If a coroutine function returns a type, it returns that type, and the return value has all the rights and behaviors of a return value from a regular function.
What you're suggesting requires a bunch of different changes to lots of elements of C++. You have to be able to return a template with no arguments, whose arguments are provided by that `using` declaration, I guess.
After all, there's no proposal even remotely like this at present. Even your idea above is incomplete, as it's not clear what all of those pieces actually mean or do (P0057 makes `await` mean one thing. What does it mean in your idea?). You have one general notion: coroutines having a firm type. And you're ready and willing to invent a plethora of subsidiary C++ language features that exist for the sole purpose of making that work.
That's not a good way to make a solid proposal. If that one thing requires so many subsidiary language features... maybe that one thing is not worth it.
Even if we accept that this is a good way to make a proposal... it's not a proposal yet. It's just some ideas being batted around on a forum. None of the various coroutine proposals do anything like what you've suggested. Why should we halt or delay progress on P0057 because you think you might be able to do better?
As far as having the concrete type goes, that sounds like it requires even
more inlining and across-call-stack transparency.
future<int> concrete_coroutine()
{
int local = await async_operation();
return local;
}
struct concrete_coroutine
{
state_value_type current_state;
int local;
future<int> method_state_machine(); // or operator()()
};
In a stackless coroutine,
the erased type combined with elision of the erasure and allocations avoids
having all coroutines have a different type.
On 15 October 2015 at 00:09, Evgeny Panasyuk <evgeny....@gmail.com> wrote:
> On 14 October 2015, 10:42:53 UTC+3, Ville Voutilainen wrote:
>>
>>
>> As far as having the concrete type goes, that sounds like it requires even
>> more inlining and across-call-stack transparency.
>
>
> It actually requires less inlining and transparency. For instance, this
> code:
> future<int> concrete_coroutine()
> {
> int local = await async_operation();
> return local;
> }
> Can be straightforwardly transformed to something like:
> struct concrete_coroutine
> {
> state_value_type current_state;
> int local;
>
> future<int> method_state_machine(); // or operator()()
> };
> Where method_state_machine can be compiled separately, in another
> translation unit.
> And actually this approach is already implementable with macros, to some
> extent (it works, but compiler-side transformation will give better result).
Where does this transformation happen translation-unit-wise, and how
would the method_state_machine get compiled in a different translation
unit?
What type does the caller of the previous concrete_coroutine() see?
>> In a stackless coroutine,
>> the erased type combined with elision of the erasure and allocations
>> avoids
>> having all coroutines have a different type.
> Yes, but it is easy to get erased type from concrete when needed.
> For instance different lambdas have different types, but can be easily
> placed into std::function (if has appropriate signature).
For some values of "easily". For the many users who don't care about the
underlying type of the coroutine, it's not so easy when they have to wrap every
time they use a coroutine.
concrete_generator<int> cg1()
{
yield 1;
}
concrete_generator<int> cg2()
{
yield 2;
}
type_erased_generator<int> teg1()
{
yield 1;
}
type_erased_generator<int> teg2()
{
yield 2;
}
void test()
{
concrete_coroutine coro{};
future<int> f = coro.resume(); // or coro()
}
concrete_low_level_generator<int> positive_numbers(int N)
{
for(int x=1; x<=N; ++x)
yield x;
}
void test()
{
positive_numbers xs{100};
while(xs.resume())
print(xs.current_value());
}
concrete_high_level_generator<int> positive_numbers(unsigned N)
{
for(int x=1; x<=N; ++x)
yield x;
}
void test()
{
for(auto x : positive_numbers{100})
print(x);
}
First of all, I would like to have copy and move semantics, or at least
some explicit control over them.
14.10.2015 16:47, Nicol Bolas:
> I already described it several times, just put generator into some
> structure/array or return somewhere.
> In similar situation:
> http://coliru.stacked-crooked.com/a/0c09744abd5e57ae
> <http://coliru.stacked-crooked.com/a/0c09744abd5e57ae>
> - allocation of P0057 generator will not be elided, there will be N
> allocations, i.e. for each coroutine.
>
>
> And in your case, the `Coroutine` will have to type-erase them too. Thus
> performing N allocations.
No need for type-erasure - no need for N allocations.
>
> Put it another way. In order to make something a member of a struct, you
> must first be able to name it. In C++ as it currently stands, it is
> /impossible/ to store an unnamable type in a non-static data member.
> Whether it's a lambda or the result of a resumable expression or
> anything else, it simply /cannot happen/.
Again, I am not talking specifically about P0114R0. I am talking about
at least adding the possibility to have concrete types in a P0057R0-like proposal.
generator<int> numbers()
{
yield 1;
}
we can use "numbers" as a name for synthesized class (which represents
concrete coroutine), instead of name for synthesized function.
generator<int> numbers();
static_assert(std::is_function_v<decltype(numbers)>);
[]numbers() -> generator<int>;
On 14 October 2015, 22:35:49 UTC+3, Nicol Bolas wrote:
With concrete coroutine type it could be something like this:
template<template<typename> class coroutine_value>
struct generator
{
...
coroutine_value<promise_type> coro;
};
generator example_generator()
{
yield 1;
}
// example_generator is transformed to:
using example_generator = generator< synthesized_coroutine >;
int main()
{
example_generator x{};
x.move_next();
x.current_value();
}
Um, what does that code mean? Where does `synthesized_coroutine` come from?
The "using" part is done by the compiler; synthesized_coroutine comes from the compiler.
How does `example_generator` get defined twice?
It is not defined twice. The first one is what the user writes; the second one (the "using" part) is what the compiler does with this code. In essence, the user code is transformed into a type with the name example_generator.
And how does `example_generator` return a template that has no template arguments?
If you don't like it,
it is possible to return a type with a template inside. For instance:
struct generator
{
template<template<typename> class coroutine_value>
struct apply { ... };
};
A nice thing about resumable functions is that it doesn't take a sledgehammer to basic elements of the language. If a coroutine function returns a type, it returns that type, and the return value has all the rights and behaviors of a return value from a regular function.
It is not truly a return type. Even P0057 does not have a true return type: you can't return a value of that type from the body. It just mimics normal function syntax, but it is not a normal function at all; it is a synthetic language construct.
Even if we accept that this is a good way to make a proposal... it's not a proposal yet. It's just some ideas being batted around on a forum. None of the various coroutine proposals do anything like what you've suggested. Why should we halt or delay progress on P0057 because you think you might be able to do better?
If the authors of P0057 still insist on a design with intrinsic overhead and a high burden on optimizers, then you are right - probably the viable path is to make another proposal.
Personally, I would prefer to get fast coroutines in, say, 2020 than to get some coroutines in 2017.
> Passive-aggressiveness does not prove your point. For example, you have yet to prove that the "burden on optimizers" is "high" by some definition of that word. It's merely "non-zero".
I think there is no need to insult anyone for having a different view by insinuating that he is being too aggressive. I think it is good to have the discussion.
That said, one of the principles of C++ is the zero-overhead principle. Not the "little overhead" principle. Why? Because C++ is for maximum performance, and if you do something suboptimal by design, people are going to invent another solution.
Herb Sutter defined this zero overhead as leaving no room between C++ and assembly. I agree with that.
The only library that has a design I don't like, which you mentioned before, is iostreams. No library has followed the iostreams path since then. We mostly have generic, templated, non-inheritance-based components.
Erasure does have costs, and there are alternatives. Chris's paper mentions the inherent erasure overhead - why mention it if it is not that important? Erasure cannot be controlled in all scenarios once it is embedded into the design. That is simply true. So the question here should be whether we can have a design with inherently minimal overhead, not merely whether you sympathize with one solution or another.
Again, I am not talking specifically about P0114r0. I am talking about
at least adding possibility to have concrete types in P0057R0-like proposal.
Well, it's hard to gauge how reasonable a proposal is when said proposal doesn't actually exist. You don't have a proposal; you just have some general notions of how you think it ought to act, with no demonstrated knowledge of how feasible that will be to implement.
And no, Boost.Asio's macro hacks are not a feasibility study.
That's breaking basic rules of C++: a function declaration should be a function declaration, not a struct declaration. Even lambdas don't look like non-lambda functions.
OK, so the compiler sees this and knows that `numbers` is a struct.
How big is it?
The compiler doesn't know. The compiler cannot know. Not from the information presented here. What `numbers` is here is an incomplete type.
The only way to generate a complete type is to complete the function definition of `numbers`. That way, the alignment and storage of the stack data is available.
And that means that the function must be inline.
So all you've done is re-invent P0114 with slightly different syntax. For someone who keeps claiming that their idea isn't P0114, it seems to have a lot of P0114's restrictions.
The beauty of P0057's design is that it works with C++ as it currently exists. It changes the bare minimum needed to make the feature work. It doesn't require that resumable functions are inlined or anything like that. It doesn't make function declarations automatically become struct declarations.
// Type erasure version:
generator<int> numbers(int x)
{
yield x;
...
}
// Compiler uses std::coroutine_traits<generator<int>, int> (as proposed in P0057)
/**************************/
// Version with concrete coroutine type:
auto numbers(int x, concrete_generator_tag = 0)
{
yield x;
...
}
// Compiler uses coroutine_traits<auto_result_tag, int, concrete_generator_tag>,
// and based on this specialization it will generate the return type and value.
// The type of the coroutine is decltype(numbers(1)), i.e.:
decltype(numbers(1)) coro = numbers(1);
What, did you think P0057 made the decision to type-erase coroutines on a whim? It is there specifically to avoid all of these elements. That design decision is what allows P0057 to work generally. No forced inlining. Virtual calls are allowed. Resumable functions look and behave just like any other functions.
It's not a question of what I like. It is simply not possible in C++. You cannot return a template; you can only return a concrete type. This may be a specific instantiation of a template, but you cannot return a template itself.
What you wrote is syntactic nonsense. Therefore, if you want it to stop being syntactic nonsense, your proposal will need to define what it means.
Your proposal seems to require a lot of special-case handling at the type level.
it is possible to return type with template inside. For instance
struct generator
{
template<template<typename> class coroutine_value>
struct apply { ... };
};
OK, so... what does that do? Does `generator` store the coroutine?
OK, internally it may only "mimic normal function syntax", but by design, the "mimicry" is complete. It looks and behaves no different from any other function. There is absolutely no way to tell the difference between it and non-coroutine functions.
Your proposal exposes everyone to the deep guts of working with a coroutine. All just for some minor performance gain that in most situations compilers can optimize out.
And even when they can't, you can optimize them out with a decent allocator.
Even if we accept that this is a good way to make a proposal... it's not a proposal yet. It's just some ideas being batted around on a forum. None of the various coroutine proposals do anything like what you've suggested. Why should we halt or delay progress on P0057 because you think you might be able to do better?
If authors of P0057 still would insist on design with intrinsic overhead and high burden on optimizers, then you are right - probably viable path is to make another proposal.
Passive-aggressiveness does not prove your point.
For example, you have yet to prove that the "burden on optimizers" is "high" by some definition of that word. It's merely "non-zero".
Just like the burden on optimizers for dealing with template code, inlining, and the like.
Personally I would prefer to get fast coroutines in like 2020, then to get some coroutines in 2017.
Considering that you haven't proven that P0057 is particularly slow, I remain unconvinced that the performance gains you claim are needed for "fast" coroutines are actually worth those 3 years.