>>>>> On Wed, 16 Dec 2020 08:47:11 -0800, Sean Halle <sean...@gmail.com> said:
| Thank you Krste,
| I am honored that you have taken the time to write a personal response,
| that is exceptionally clear and well reasoned. If the TSC decides to ratify
| for logistical reasons, then I understand. I can sense the desire for this
| first 45 day test to be a success, on the part of many. And I understand the
| judgement call of "something imperfect now, make it better later". Obviously
| I have deep respect for you and Greg, and I'm fine with whatever is ultimately
| decided.
| However, with great respect, and at the risk of appearing.. ungrateful?
| Impertinent? Could I just ask a few probing questions?
| 1) Is there any sort of solid idea of the cost of restarting the proposal with
| the better form? Is it.. 10 hours of people's time? 100? Will it delay
| approval by.. 3 months? Longer?
I would guess multiple months.
| 2) Have you seen the phenomenon before, where things get stuck in a local
| minimum? For example, if the fixed, less performant form is adopted, and then
| when it comes time to vote on the better form, the argument is made "we
| already have a form of PAUSE" and as a result the better form doesn't have the
| votes? Do you see a realistic chance that this could happen here? If so, is
| the desire to save that 10 to 100 person hours and shave off that 3 months of
| time worth the risk of the ISA getting stuck with the less performant form?
If the proposal doesn't get support, that means there is no consensus
that the proposal is a worthwhile improvement.
In general, I'd say the forces work in the opposite direction: people
generally want more added to the ISA than can be justified by
implementer ROI.
| 3) Have you seen the phenomenon where two successive forms of the same
| instruction are introduced, and then the one that is first to arrive is used
| by compiler writers, library implementers, kernel developers.. and then when
| the second, better form arrives, it takes a very long time for it to supplant
| the inferior form that came first (especially given that the better form has
| highest impact in OoO designs in large core count chips, which are not yet
| popular for RISC-V)? This would be the same question, is the savings now,
| worth the consequences down the road?
Speaking in generalities, the phenomenon is certainly there, but if
the second form is a real improvement, then it usually gets adopted
over time. If the second form is not a big improvement, then this
phenomenon is entirely rational behavior on the part of the community.
In this specific case, experienced implementers are not convinced that
there is a real improvement with the second form.
Many developers have been working on OoO RISC-V designs for a while,
and many technical members have decades of experience building
previous high-end industry OoO designs, so proposals are reviewed with
this in mind.
| Those are judgement calls, so all that I ask is that those voting do take the
| time to quantify the two costs: on one side is the cost in people time and
| cost in months of delay required to approve the better version, versus on the
| other side is the long term cost of an inferior instruction in the ISA. (A
| useful instruction, but less valuable than the better form).
Your opinion is that this is a better version; the community does not
agree.
| You may not have had time to follow, but there have been very long, detailed,
| responses that lay out the following points, which have not been substantially
| contested:
I was going to follow up on the technical group, but I can follow up
here quickly.
| 1) Out-of-order micro-architectures can choose to speculate past atomic
| instructions (which makes the common case fast, and I believe is typical in
| OoO, but this needs to be verified). In this case, speculation and roll back
| repeats, separated by PAUSE number of cycles in the spin loop. This robs
| performance from the other harts in the core (and wastes energy).
The base PAUSE should have its number of stall cycles set to avoid most
of the overhead in this case. There is no reason to flush the fetch
buffer when PAUSE hits the decode stage. The same holds for a
multithreaded core with a fetch buffer per thread. A highly
multithreaded core might want to throw work away and replay, but again,
the PAUSE can be sized to remove most of the overhead.
| During high
| contention, the consequences can be profound slowdown (we have one measured
| data point -- 4 socket 10 core/socket 2 thread/core Broadwell circa 2012, more
| work is needed to quantify the two PAUSE proposals). The better form of
| PAUSE, in which Rs1 specifies the number of cycles to pause, fixes this (when
| used with adaptive backoff).
As others have noted, a software loop around the base PAUSE can
implement exponential backoff delay in the face of contention. The
base PAUSE delay should be set so that almost all overhead is removed
from executing iterations of this loop, so your proposal to move the
count into hardware is only a minor performance improvement.
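
A minimal sketch of such a software backoff loop, to make the point
concrete. The names (cpu_pause, backoff_wait, spin_lock) and the
BACKOFF_MIN/BACKOFF_MAX limits are illustrative assumptions, not taken
from any spec or existing library. On RISC-V the hint is emitted as the
raw Zihintpause encoding; elsewhere it is only counted, so the loop can
be exercised on any host.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Per-thread count of PAUSE hints issued; instrumentation for this sketch only. */
static _Thread_local unsigned long pause_calls;

/* Issue one base PAUSE hint. Off RISC-V this is just a counted no-op. */
static inline void cpu_pause(void) {
    pause_calls++;
#if defined(__riscv)
    __asm__ volatile(".4byte 0x0100000F" ::: "memory"); /* PAUSE encoding */
#endif
}

/* Assumed tuning values, in units of "one base PAUSE". */
#define BACKOFF_MIN 1u
#define BACKOFF_MAX 1024u

/* Wait for *delay PAUSE hints, then double the delay up to BACKOFF_MAX. */
static void backoff_wait(unsigned *delay) {
    assert(delay != NULL && *delay >= BACKOFF_MIN);
    for (unsigned i = 0; i < *delay; i++)
        cpu_pause();
    if (*delay < BACKOFF_MAX)
        *delay *= 2;
}

typedef struct { atomic_flag held; } spinlock_t;

/* Test-and-set lock with exponential backoff built from the base PAUSE. */
static void spin_lock(spinlock_t *l) {
    unsigned delay = BACKOFF_MIN;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        backoff_wait(&delay);   /* back off only when the lock is contended */
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The uncontended path issues no PAUSE at all; under contention the
per-iteration software cost is one compare, one increment and one
branch around each hint.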
| 2) The superior form has only upside -- no concrete example has been given of
| a technical downside to making Rs1 contain the number of cycles to pause the
| hart (Rs1 is already in the op-code and the value in Rs1 is a hint to the
| number of cycles to pause that hart).
Implementations of the extended form would have to pick an upper bound
on the count supported, if only for purposes of bounding verification
effort. If this bound is small, then software might not see correct
backoff behavior and would be motivated to add a software loop in any
case. This severely limits the general usefulness of the added form.
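
As a rough illustration of that last point, here is a small host-side
model. Everything in it is hypothetical: pause_n stands in for the
proposed counted PAUSE, HW_PAUSE_MAX is an invented implementation
bound, and stalled_cycles merely records what the modelled hardware
would have stalled. No such instruction or bound is defined anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical implementation-chosen upper bound on the hardware count. */
#define HW_PAUSE_MAX 64u

/* Model state: total cycles the modelled hardware has stalled. */
static uint64_t stalled_cycles;

/* Model of a counted PAUSE: the hardware silently clamps the request. */
static void pause_n(uint32_t requested) {
    uint32_t granted = requested < HW_PAUSE_MAX ? requested : HW_PAUSE_MAX;
    stalled_cycles += granted;
}

/*
 * Software that wants a delay longer than the bound still needs a loop.
 * Returns the number of counted-PAUSE instructions it had to issue.
 */
static uint32_t delay_cycles(uint32_t total) {
    uint32_t issued = 0;
    while (total > 0) {
        uint32_t chunk = total < HW_PAUSE_MAX ? total : HW_PAUSE_MAX;
        assert(chunk <= HW_PAUSE_MAX);
        pause_n(chunk);
        total -= chunk;
        issued++;
    }
    return issued;
}
```

With a bound of 64, a 1000-cycle backoff still takes a 16-iteration
software loop, so the loop has not gone away; it has only become
shorter.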
| There may be an organizational
| logistic cost to updating the spec, but no technical downside has been
| identified, while definite performance advantage, as above, has been
| illustrated. The only standing block to the superior form that has been
| stated is the logistics of this approval process.
I think the issue is that others see the perceived benefit as low.
| Again, I understand the need to make executive decisions that prioritize
| logistics above technical merit, so no worries on our end if that is what wins
| out.
| Our desire has been to highlight the cost of the fixed form of PAUSE in
| the hopes that the judgement is informed by quantified, concrete information.
| So, here's my last impertinent question :-) Is there the possibility that
| there's some element of digging in heels, battling to get what is wanted? I'm
| not saying that there is, but to someone outside the TSC process, who hasn't
| experienced the logistics, it is bewildering, the level of resistance to what
| appears to be a simple change that is clearly better in the long run. The
| hope is just for a dispassionate decision, based on quantified
| tradeoffs.
The issue is not about prioritizing logistics over technical merit.
The issue is that your perception of technical merit is not widely
shared. Specifically, you need to address why a software backoff loop
is not sufficient.
| Thank you for writing a personal note, I hope my questions have not given
| offense, and thank you for RISC-V :-)
None taken, and thanks for working with RISC-V!
Krste