Start of 45-day public review period for "PAUSE Hint instruction" extension


Greg Favor

Oct 30, 2020, 12:57:59 AM
to RISC-V ISA Dev
The first small ISA extension to go through the new fast-track extension process is now starting its 45-day public review period, the last major step towards ratification.  This "PAUSE Hint instruction" extension adds a single instruction to the ISA, the PAUSE instruction (encoded as a HINT instruction), mirroring what other architectures have, e.g. x86 PAUSE and ARMv8 WFE/YIELD.  This instruction serves to reduce the power consumption and/or improve the performance of “spin-wait loops”.
Greg
zihintpause-20201029.pdf

David Chisnall

Oct 30, 2020, 8:28:53 AM
to isa...@groups.riscv.org
Hi Greg,

I'm concerned at the conflation of PAUSE / YIELD with WFE / WFI. This
isn't helped by the ARM documentation for WFE, but these two families
have very different uses and semantics.

The x86 PAUSE instruction is intended to prevent superscalar
architectures from unrolling a spin loop and issuing a load of
speculative instructions for no reason. If you enter a spin loop then
the pathological case will fill up the pipeline with loop iterations and
will then have to cancel a load of them. If the hint is present then
the microarchitecture can monitor the relevant cache line for changes.

This is not trapped when run in userspace and simply avoids some
pathological cases in superscalar implementations. In a simple
pipeline, it will typically be a NOP, though with SMT it may be a hint
to prioritise the other thread(s) on the same core for a while.
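A minimal Python model of the pattern being described (a sketch: the names are illustrative, and `pause()` is a stand-in for the architectural hint, which has no direct equivalent in Python):

```python
import threading

def pause():
    # Stand-in for the PAUSE hint. On hardware this tells a superscalar
    # core not to speculate ahead with further loop iterations, and on
    # an SMT core it may deprioritise this thread for a while. In this
    # model it does nothing, just as a NOP implementation would.
    pass

def spin_until_set(flag: threading.Event, max_spins: int = 1_000_000) -> bool:
    # Busy-wait on a single condition, issuing the hint on the waiting
    # path. Returns True once the flag is observed set, False if the
    # spin budget is exhausted.
    for _ in range(max_spins):
        if flag.is_set():
            return True
        pause()
    return False
```

On x86 the body of `pause()` would be the PAUSE instruction (e.g. the `_mm_pause` intrinsic); under the proposed extension it would be the RISC-V PAUSE hint.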

In contrast, WFE is an instruction that explicitly puts the processor in
a low-power state. On ARM, this can be trapped to EL1 or EL2 and will
deschedule the current process / VM until an event is delivered.

Although they use similar mechanisms, their software uses are very
different. You will use a PAUSE instruction in the body of *any*
busy-wait loop, because if you hit the waiting path it is expected to
make your code faster. You will only ever use a WFE in a place where
you are explicitly making a trade of peak throughput versus power (or
efficient sharing in a virtualised environment).

A WFE may be implemented as a NOP, but implementers can assume that
non-NOP implementations can be expensive because the contract with the
programmer is that they are used only on very cold code paths (the OS
literally has nothing else it can usefully do when it issues a WFE
instruction). A PAUSE may be a NOP, but implementers should aim to make
non-NOP implementations cheap because it will be used in userspace code
in places where a system call is considered to have too much overhead
(i.e. you're doing a low-latency spin-lock instead of a futex-like lock,
or you're in the futex fast path).

Conflating these two ends up with an instruction that is not useful to
software.

If an implementer actually does the suggested operation (slows down the
HART for a fixed number of cycles), then you have introduced a simple
denial of service attack. As spec'd, there are no restrictions on the
mode that this instruction can operate in and so a U-mode program can
slow down the HART. S / M modes have no mechanism for turning this NOP
into a trap and so they are unable to prevent this (and there's no
corresponding un-slow hint that they could use on ISRs or system calls).
If you want to exploit a timing vulnerability in S-mode code, being able
to slow down the HART immediately before triggering that code path is a
very useful primitive to the attacker.


My suggestion for this hint would be to explicitly enumerate the
*intent* from a programmer's perspective. What are the characteristics
for code that would use this hint? Implementers are then free to do
whatever they want to make code that has those properties faster. For
example (very rough text):

> The PAUSE instruction indicates that it is part of a busy-wait loop
> that will access a single memory location and branch on some condition.
> Implementers should ensure that, in the common case, when used in this
> context, adding a PAUSE instruction is no slower than adding a NOP in
> the same place.

This establishes a contract with the programmer (they know when to use
the instruction and what the performance guarantees are) and gives
implementers useful freedom to implement this as pausing instructions
until an interrupt is retired or one of the cache lines touched in the
current loop trace is modified, as a superscalar NOP (in MIPS
terminology) to slow down issuing redundant instructions, as a NOP
that's discarded in the decoder, or anything else that makes sense in
their pipeline design.

If you intend a WFI / WFE-like instruction then you need to define a
mechanism for more privileged modes to trap if it's issued and you need
to explicitly define the set of things that are guaranteed to wake it up
(implementations are free to spuriously wake up for any event, including
one cycle having elapsed).

David

Greg Favor

Oct 30, 2020, 12:17:19 PM
to RISC-V ISA Dev, David.C...@cl.cam.ac.uk
Sorry for the confusion.  The intent of the initial comparative comment was simply to mention some of the other instructions on the ISA landscape that are related to varying degrees.  But it may have been better to have only mentioned x86 PAUSE - which is closest to this RISC-V PAUSE instruction.  They both establish similar hardware/software contracts with similar degrees of implementation flexibility (although we do try to provide a bit more implementation guidance, without restricting what any given RISC-V implementation can choose to do if it has atypical needs for its intended applications).

As you note, ARM WFE - even though it is used in spin-wait loops - is a very different animal in its details (as well as being usable in other scenarios).  Plus use of it also involves a few other key bits of architectural functionality besides the WFE instruction itself.


Sean Halle

Oct 30, 2020, 12:28:42 PM
to Greg Favor, RISC-V ISA Dev

Hi Greg,

   Thanks for moving forward on this.  However, there are technical aspects to the previous discussion that I don't feel were addressed.  I would like to make those explicit, so that the vote is based on clear communication.

The previous discussion had four types of arguments:
1) Technical arguments, which were not fully covered (key points were not addressed, see below).  The proposed PAUSE instruction is technically inferior; see the discussion below.
2) Social arguments -- "everyone else has this, so we should too."
3) Emotional argument -- "we want to use this as the first example in this new process."
4) Possibly political arguments. The discussion so far seems to have the feeling that minds are made up, and there is no openness to real discussion at this point.  There just seems to be this nagging feeling that there is a political desire to ram this through, without adequate discussion outside of the TG.  Not sure whether I'm the only person with that feeling..?  Or whether that's an accurate read?

So, I wasn't involved in the TG while this proposal was being put together, sadly, so I didn't have the chance to be part of the discussion.  It may just be too late, and the social, emotional, and political factors will override the technical.  If that is how it is, then that is how it is.  But, the purpose of this phase of the process, correct me if I'm wrong, is for those like myself to bring up technical aspects that are relevant.  I would like to participate in the process by doing so.

Purely Technical Discussion:

The three main technical points are:

1) PAUSE R0 == PAUSE  --  In other words, the pause with timeout, with register 0 as the source, is verbatim equal to the currently proposed pause: the same spec, the same behavior, the current wording unchanged.

2) PAUSE Rx  has superior performance to PAUSE -- as discussed below.

3) No implementation nor software downsides -- there didn't seem to be any downsides uncovered that were voiced, as far as I could tell.  (The previous discussion seemed to focus mainly on social, emotional, and political arguments, as far as I could tell -- again, if those are what decide the outcome, then it is what it is, but I would like to have the chance to focus purely on the technical aspects here, so that the decision is clear, on which grounds it was made).

To expand on PAUSE R0:  The new proposal is for the PAUSE instruction to take a single source register as input.  If that source register is register 0, then the current proposal is the specification.

To expand on PAUSE Rx:  When the register is not R0, the register contents are fetched, and the contents give the hinted number of CPU cycles to pause issuing instructions from the hart.  This allows other harts in the same core to issue instructions.  The number of cycles is a hint, so the implementation is free to use a fixed number of cycles and ignore the register contents.  The implementation is also free to make PAUSE Rx a no-operation.

As Greg explained, the most visible current use of this type of instruction is inside spin locks, especially those in the Linux Kernel.

To illustrate the advantage of PAUSE Rx, here is simple pseudocode for the acquire path of a Linux-style spinlock.  (This code has superior performance under contention.)

R1 = 2
while ( TRUE )
{
   have_lock = get_lock()
   if ( have_lock )  break
   else
   {
       PAUSE R1
       R1 = R1 << 1     (note: in real code, saturate at a max value)
   }
}
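Rendered as a runnable Python model (a sketch: `try_lock` is a hypothetical test-and-set callable, and PAUSE Rx is modeled as a `pause(cycles)` callback; real hardware would take the cycle count from Rs1):

```python
MAX_BACKOFF = 1 << 16  # saturate here, as the pseudocode notes

def acquire_with_backoff(try_lock, pause):
    # Spin until try_lock() succeeds, doubling the hinted pause length
    # after each failed attempt (exponential backoff), saturating so
    # the shift cannot grow without bound.
    backoff = 2
    while True:
        if try_lock():
            return
        pause(backoff)
        backoff = min(backoff << 1, MAX_BACKOFF)
```

The hinted pause durations grow 2, 4, 8, ... until the lock is acquired, which is the self-adjusting behaviour the surrounding discussion refers to.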

The PAUSE time exponentially increases each time the lock is not acquired.  In practice, this has been measured to provide up to 1000x improvement, depending upon the level of contention and the hardware implementation of atomic instructions.  In fact, on one implementation of the Xeon (Broadwell) in a machine with 10 dual-threaded cores and 4 sockets, the machine failed to make progress for more than a day on a highly contended lock.  Please see the attached paper for details of the work during which these measurements were made.  If requested, we can also share our measurements of performance impact on the Linux kernel's spinlocks, in our RISC-V "cluster CPU" core (see upcoming Global RISC-V Summit talk on the core).  Those measurements compare fixed pause time (the current proposal) against pause time specified by a register, as per the pseudocode above.

The best, though, is if the interested readers perform their own experiments, such as with FireSim, with an implementation of the proposed PAUSE Rx instruction inserted into rocket core.  We can share such an implementation if any are interested.

When measuring, you will notice that the degree of gain depends directly on the level of contention for the lock.  The benefit becomes quite large when locks are highly contended.  This happens in several popular benchmarks, especially those for Big Data and Cloud workloads.  A fixed pause time is optimal only at one particular level of contention.  A self-adjusting pause time, as per the above pseudocode, is efficient at all levels of contention.

Beyond the hard evidence of benefit on current hardware with just spin locks in the kernel, there is the matter of expected future trends.  This is less solid footing, but consider that we are entering the era of 100-core chips -- see Ampere's 80-core chip and AMD's upcoming offering.  A platform that has efficient synchronization constructs has a distinct advantage on this new generation, which will shortly be the norm for Enterprise CPUs.

By the way, this opens the question of whether the RISC-V ISA should take into account datacenter and Enterprise considerations, or stay only embedded and IoT.  The main benefit of PAUSE Rx (instead of fixed-time PAUSE) is that it will give RISC-V a measurable advantage there.  It is fair game, of course, if the community decides that Enterprise workloads are not interesting and votes according to what is best for embedded only, but please be explicit.

Thank you for the work you're doing in the TG, and thank you for offering this chance for public discussion.  I understand that the decision will involve aspects outside of what is purely best on a technical level.  But I appreciate the chance to be clear that, on a technical only basis, as far as the evidence presented so far, and the discussion so far, the PAUSE Rx has superior performance, and no identified technical downside.

Best,

Sean Halle
CTO Intensivate



--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/13c42582-ef3d-4cc6-85d3-a975f1888190o%40groups.riscv.org.
0_Holistic_Perf_Tuning_from_Nina_Jan_11_2013.pdf

Sean Halle

Nov 23, 2020, 12:03:54 PM
to Greg Favor, RISC-V ISA Dev

Hi Greg,

Have you had a chance to take a more thorough look at this?

So, some mixed.. feelings?  This.. feels like a delicate thing..  a lot of people involved.. and, literally, global consequences.  We're deciding on something that cannot be undone.  I'm okay with whatever outcome the group decides, but it.. feels..  sad?  If the process doesn't seem to work, you know?

So..  did the previous email communicate effectively?  Did it make sense?  The idea in the email was to just focus on the technical aspects.  

From the exchanges so far, what it looks like to me is that we're here:  Call them proposal default and proposal enhanced.  Enhanced is identical to default, literally in every way: the exact same wording, with the added option to gain higher performance.  So enhanced seems to be free from downside, free from complications, and free from any negatives (unlike the other enhancements that you pointed out), as far as has come up in the discussion.  However, enhanced has been demonstrated to provide superior performance.

It seems like a no brainer?

So.. 

I guess the question is.. how does this work?

Do we just leave it at that, and everyone votes?  They only get the choice of yes or no on the default?

I get the impression that it's too late to modify the proposal.  So, the only choice is to vote the default down?

Is this email list the only way to make the case for the enhanced version?

It would be nice if the process allowed alternative proposals to make a case on the ballot, so that it increases the chance that each vote has full information behind it.  We don't want to get this wrong, especially if it's just because voters didn't see the discussion, so they weren't even aware of the alternatives.

Thanks for all the hours put into the default proposal and this process.  Whatever happens, I have tremendous respect for you, Andrew, and the community driving the instruction set.  Thank you for all you're doing to make RISC-V a globally leading instruction set.

Thanks Greg,

Sean
Intensivate



K. York

Nov 23, 2020, 3:51:44 PM
to Sean Halle, Greg Favor, RISC-V ISA Dev
On reading the thread, it seems like there was a contention between "register is a memory address" and "register is a cycle count". It sounds like the authors weren't convinced which was better, so decided to specify it with neither.

Your pseudocode example is persuasive, but brings up the point that the delay duration needs an upper limit (where simply context switching and letting the OS idle the core is the optimal action), and the software needs to prevent the shift from overflowing.

That's a significant amount of spec work and I'd say it would need to be resent for ratification after those changes.

kr...@berkeley.edu

Dec 15, 2020, 3:13:33 PM
to Sean Halle, Greg Favor, RISC-V ISA Dev

Sean,

I'd like to weigh in on technical and non-technical aspects of this,
but in separate threads.

On the non-technical aspects, the fast track process is a way to get
smaller ideas incorporated that are generally agreed to be useful.
Any member can propose these, but needs to build support with a few
other members and chairs as per policy.

There is a public review, as we've just had, where anyone (even
non-members) can voice concerns, and the summary sent to the TSC must
address these concerns. This does not mean they must modify the
proposal for each objection, but it must include the rationale why they
did not.

The TSC is the only group that votes to officially ratify proposals,
but they should see a summary of all the feedback gathered during
review. That is the final step, however. Along the way there are
discussions, such as these, and ISA committee chairs work with
proposers to make sure proposals are ready for review. Objections must
be addressed in the proposal to the TSC, but not all objections
necessarily result in spec changes. We do not have binding votes below
the TSC level (sense votes are possible to help chairs understand the
views of interested parties at earlier stages), but we do need
consensus among some number of chairs and other members, depending on
the process.

As well as the usual development process, there is always an
escalation path if someone feels something bad is happening. We're
working on clarifying policies and escalation paths, but for now, tell
Mark Himelstein (CTO), he'll route accordingly. If the issue is with
Mark, raise with Calista (CEO), Yunsup (TSC chair) or me (BoD chair).
Also, you're always free to petition other TSC members, though please
use this sparingly.

We have been working to speed up the development/ratification process
at RISC-V International, and we've observed that bundling too many
instructions together in an extension has been one cause for slower
development, and so board and TSC are giving direction to break
proposals into smaller manageable pieces.

In this case, there seem to be no objections raised to the original
proposal.

You have proposed a further extension, which has raised some technical
issues to be addressed. That discussion can continue on the unprivileged list
(I'll chime in separate email) and could result in a fast track
proposal later once there is consensus.

There is no harm to your proposal in allowing the uncontested original
variant to be ratified. It is completely compatible with your
proposal and does not occupy encoding space, assembler forms, or ISA
string names that your proposal would need. However, there is harm to
the community in delaying ratification of uncontested useful
extensions.

Krste

Sean Halle

Dec 16, 2020, 11:47:55 AM
to Krste Asanovic, Greg Favor, RISC-V ISA Dev

Thank you Krste,

   I am honored that you have taken the time to write a personal response, that is exceptionally clear and well reasoned.  If the TSC decides to ratify for logistical reasons, then I understand.  I can sense the desire for this first 45 day test to be a success, on the part of many.  And I understand the judgement call of "something imperfect now, make it better later".  Obviously I have deep respect for you and Greg, and I'm fine with whatever is ultimately decided.  

However, with great respect, and at the risk of appearing.. ungrateful?  Impertinent?  Could I just ask a few probing questions?

1) Is there any sort of solid idea of the cost of restarting the proposal with the better form?  Is it.. 10 hours of people's time?  100?  Will it delay approval by.. 3 months?  Longer?
2) Have you seen the phenomenon before, where things get stuck in a local minimum?  For example, if the fixed, less performant form is adopted, and then when it comes time to vote on the better form, the argument is made "we already have a form of PAUSE" and as a result the better form doesn't have the votes?  Do you see a realistic chance that this could happen here?  If so, is the desire to save those 10 to 100 person-hours and shave off those 3 months of time worth the risk of the ISA getting stuck with the less performant form?
3) Have you seen the phenomenon where two successive forms of the same instruction are introduced, and then the one that is first to arrive is used by compiler writers, library implementers, kernel developers.. and then when the second, better form arrives, it takes a very long time for it to supplant the inferior form that came first (especially given that the better form has highest impact in OoO designs in large core count chips, which are not yet popular for RISC-V)?   This would be the same question, is the savings now, worth the consequences down the road?

Those are judgement calls, so all that I ask is that those voting do take the time to quantify the two costs: on one side is the cost in people time and cost in months of delay required to approve the better version, versus on the other side is the long term cost of an inferior instruction in the ISA.  (A useful instruction, but less valuable than the better form).

You may not have had time to follow, but there have been very long, detailed, responses that lay out the following points, which have not been substantially contested:
1) Out-of-order micro-architectures can choose to speculate past atomic instructions (which makes the common case fast, and I believe is typical in OoO, but this needs to be verified).  In this case, speculation and roll back repeats, separated by PAUSE number of cycles in the spin loop.  This robs performance from the other harts in the core (and wastes energy).  During high contention, the consequences can be profound slowdown (we have one measured data point -- 4 socket 10 core/socket 2 thread/core Broadwell circa 2012, more work is needed to quantify the two PAUSE proposals).  The better form of PAUSE, in which Rs1 specifies the number of cycles to pause, fixes this (when used with adaptive backoff).
2) The superior form has only upside -- no concrete example has been given of a technical downside to making Rs1 contain the number of cycles to pause the hart (Rs1 is already in the op-code and the value in Rs1 is a hint to the number of cycles to pause that hart).  There may be an organizational logistic cost to updating the spec, but no technical downside has been identified, while definite performance advantage, as above, has been illustrated.  The only standing block to the superior form that has been stated is the logistics of this approval process.

Again, I understand the need to make executive decisions that prioritize logistics above technical merit, so no worries on our end if that is what wins out.  Our desire has been to highlight the cost of the fixed form of PAUSE in the hopes that the judgement is informed by quantified, concrete information. 

So, here's my last impertinent question :-)   Is there the possibility that there's some element of digging in heels, battling to get what is wanted?  I'm not saying that there is, but to someone outside the TSC process, who hasn't experienced the logistics, it is bewildering, the level of resistance to what appears to be a simple change that is clearly better in the long run.  The hope is just for a dispassionate decision, based on quantified tradeoffs.

Thank you for writing a personal note, I hope my questions have not given offense, and thank you for RISC-V :-)

Best,

Sean






kr...@berkeley.edu

Dec 16, 2020, 2:33:49 PM
to Sean Halle, Krste Asanovic, Greg Favor, RISC-V ISA Dev

>>>>> On Wed, 16 Dec 2020 08:47:11 -0800, Sean Halle <sean...@gmail.com> said:

| Thank you Krste,
|    I am honored that you have taken the time to write a personal response,
| that is exceptionally clear and well reasoned.  If the TSC decides to ratify
| for logistical reasons, then I understand.  I can sense the desire for this
| first 45 day test to be a success, on the part of many.  And I understand the
| judgement call of "something imperfect now, make it better later".  Obviously
| I have deep respect for you and Greg, and I'm fine with whatever is ultimately
| decided.  

| However, with great respect, and at the risk of appearing.. ungrateful? 
| Impertinent?  Could I just ask a few probing questions?

| 1) Is there any sort of solid idea of the cost of restarting the proposal with
| the better form?  Is it.. 10 hours of people's time?  100?  Will it delay
| approval by.. 3 months?  Longer?

I would guess multiple months.

| 2) Have you seen the phenomenon before, where things get stuck in a local
| minima?  For example, if the fixed, less performant form is adopted, and then
| when it comes time to vote on the better form, the argument is made "we
| already have a form of PAUSE" and as a result the better form doesn't have the
| votes?  Do you see a realistic chance that this could happen here?  If so, is
| the desire to save that 10 to 100 person hours and shave off that 3 months of
| time worth the risk of the ISA getting stuck with the less performant form?

If the proposal doesn't get support, that means there is no consensus
that the proposal is a worthwhile improvement.

In general, I'd say forces work in the opposite direction, that people
generally want more added to ISA than can be justified by implementer
ROI.

| 3) Have you seen the phenomenon where two successive forms of the same
| instruction are introduced, and then the one that is first to arrive is used
| by compiler writers, library implementers, kernel developers.. and then when
| the second, better form arrives, it takes a very long time for it to supplant
| the inferior form that came first (especially given that the better form has
| highest impact in OoO designs in large core count chips, which are not yet
| popular for RISC-V)?   This would be the same question, is the savings now,
| worth the consequences down the road?

Speaking in generalities, the phenomenon is certainly there, but if
the second form is a real improvement, then it usually gets adopted
over time. If the second form is not a big improvement, then this
phenomenon is entirely rational behavior on the part of the community.

In this specific case, experienced implementers are not clear that
there is a real improvement with second form.

Many developers have been working on OoO RISC-V designs for a while,
and many technical members have decades of experience building
previous high-end industry OoO designs, so proposals are reviewed with
this in mind.

| Those are judgement calls, so all that I ask is that those voting do take the
| time to quantify the two costs: on one side is the cost in people time and
| cost in months of delay required to approve the better version, versus on the
| other side is the long term cost of an inferior instruction in the ISA.  (A
| useful instruction, but less valuable than the better form).

Your opinion is that this is a better version, the community does not
agree.

| You may not have had time to follow, but there have been very long, detailed,
| responses that lay out the following points, which have not been substantially
| contested:

I was going to follow up on technical group but can follow up here
quickly.

| 1) Out-of-order micro-architectures can choose to speculate past atomic
| instructions (which makes the common case fast, and I believe is typical in
| OoO, but this needs to be verified).  In this case, speculation and roll back
| repeats, separated by PAUSE number of cycles in the spin loop.  This robs
| performance from the other harts in the core (and wastes energy).

Base PAUSE should have number of stall cycles set to avoid most
overhead in this case. There is no reason to flush fetch buffer when
PAUSE hits decode stage. Same for multithreaded core with fetch
buffer per thread. A highly multithreaded core might want to throw
work away and replay, but again, the PAUSE can be sized to remove most
overhead.

| During high
| contention, the consequences can be profound slowdown (we have one measured
| data point -- 4 socket 10 core/socket 2 thread/core Broadwell circa 2012, more
| work is needed to quantify the two PAUSE proposals).  The better form of
| PAUSE, in which Rs1 specifies the number of cycles to pause, fixes this (when
| used with adaptive backoff).

As others have noted, a software loop around the base PAUSE can
implement an exponential backoff delay in the face of contention. The base
PAUSE delay should be set so that almost all overhead is removed from
executing iterations of this loop, so your proposal to move the count
into hardware is only a minor perf improvement.
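That software loop can be sketched in Python (a model, not hardware: `pause()` stands in for one base PAUSE with its fixed short delay, and `try_lock` is a hypothetical test-and-set callable):

```python
def acquire(try_lock, pause, max_units=1 << 16):
    # Exponential backoff built from the fixed-delay base PAUSE alone:
    # after each failed attempt, issue the hint `units` times in a row,
    # then double `units` (saturating at max_units). No register
    # operand is needed to obtain progressively longer waits.
    units = 1
    while not try_lock():
        for _ in range(units):
            pause()
        units = min(units << 1, max_units)
```

Each failed attempt issues 1, 2, 4, ... consecutive PAUSEs, so the total wait grows geometrically even though each individual PAUSE has a fixed delay.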

| 2) The superior form has only upside -- no concrete example has been given of
| a technical downside to making Rs1 contain the number of cycles to pause the
| hart (Rs1 is already in the op-code and the value in Rs1 is a hint to the
| number of cycles to pause that hart).

Implementations of the extended form would have to pick an upper bound
on count supported, if only for purposes of bounding verification
effort. If this bound is small, then software might not see correct
backoff behavior and would be motivated to add a software loop in any
case. This severely limits the general usefulness of the added form.

| There may be an organizational
| logistic cost to updating the spec, but no technical downside has been
| identified, while definite performance advantage, as above, has been
| illustrated.  The only standing block to the superior form that has been
| stated is the logistics of this approval process.

I think the issue is that others see the perceived benefit as low.

| Again, I understand the need to make executive decisions that prioritize
| logistics above technical merit, so no worries on our end if that is what wins
| out.
| Our desire has been to highlight the cost of the fixed form of PAUSE in
| the hopes that the judgement is informed by quantified, concrete information. 

| So, here's my last impertinent question :-)   Is there the possibility that
| there's some element of digging in heels, battling to get what is wanted?  I'm
| not saying that there is, but to someone outside the TSC process, who hasn't
| experienced the logistics, it is bewildering, the level of resistance to what
| appears to be a simple change that is clearly better in the long run.  The
| hope is just for a dispassionate decision, based on quantified
| tradeoffs.

The issue is not about prioritizing logistics over technical merit.
The issue is that your perception of technical merit is not widely
shared. Specifically, you need to address why a software backoff loop
is not sufficient.

| Thank you for writing a personal note, I hope my questions have not given
| offense, and thank you for RISC-V :-)

None taken, and thanks for working with RISC-V!

Krste

David Horner

Dec 16, 2020, 8:02:19 PM
to RISC-V ISA Dev, krste, Greg Favor, RISC-V ISA Dev, Sean Halle

@krste

| | instructions (which makes the common case fast, and I believe is typical in
| | OoO, but this needs to be verified).  In this case, speculation and roll back
| | repeats, separated by PAUSE number of cycles in the spin loop.  This robs
| | performance from the other harts in the core (and wastes energy).

| Base PAUSE should have number of stall cycles set to avoid most
| overhead in this case. There is no reason to flush fetch buffer when
| PAUSE hits decode stage. Same for multithreaded core with fetch
| buffer per thread. A highly multithreaded core might want to throw
| work away and replay, but again, the PAUSE can be sized to remove most

I believe this mischaracterizes Sean's concern.

Sean was explaining, and rightly so, that the failed atomic itself would cause the rollback and re-speculation.

He did not suggest that PAUSE would flush the fetch buffer.

If all that PAUSE does is slow the thrashing, it performs a valuable operation.

Providing a predictively sized PAUSE delay that avoids the thrashing would be highly effective, likely justifying a variable PAUSE.

However, the current state of the art is unable to efficiently determine this PAUSE value.

kr...@berkeley.edu

unread,
Dec 17, 2020, 4:04:54 AM12/17/20
to David Horner, RISC-V ISA Dev, krste, Greg Favor, Sean Halle

>>>>> On Wed, 16 Dec 2020 17:02:18 -0800 (PST), David Horner <ds2h...@gmail.com> said:

| @krste
| | | instructions (which makes the common case fast, and I believe is typical in
| | | OoO, but this needs to be verified). In this case, speculation and roll back
| | | repeats, separated by PAUSE number of cycles in the spin loop. This robs
| | | performance from the other harts in the core (and wastes energy).

| | Base PAUSE should have number of stall cycles set to avoid most
| | overhead in this case. There is no reason to flush fetch buffer when
| | PAUSE hits decode stage. Same for multithreaded core with fetch
| | buffer per thread. A highly multithreaded core might want to throw
| | work away and replay, but again, the PAUSE can be sized to remove most

| I believe this mischaracterizes Sean's concern.
| Sean was explaining, and rightly so, that the failed atomic itself would cause the roll back and re-speculation.
| He did not suggest the PAUSE would flush the fetch buffer.

Yes, if the branch checking the atomic result was misspeculated (as
expected here, otherwise the hart was experiencing a lot of
first-attempt contention), then that misspeculation causes a flush of
the instructions speculatively fetched along the lock-obtained path.

The lock-unavailable path includes a PAUSE instruction that prevents
fetch from getting too far ahead. Once the PAUSE releases, the system
will execute the atomic and check with a branch again, and could flush
the pipeline again on the next misspeculation. If the PAUSE repeats,
however, there won't be any flushes after the first couple of
iterations, as the branch predictor will train to follow the retry path.

| If all that PAUSE does is slow the thrashing it performs a valuable operation.
| Providing a predictively sized PAUSE delay to avoid the thrashing is extremely performant, likely justifying a Variable PAUSE.
| However, the current state of the art is unable to efficiently determine this PAUSE value.

After a couple of iterations, there shouldn't be additional flushing.
PAUSE is just reducing the duty cycle of instructions executing in
the pipeline for this thread.

Krste

Sean Halle

unread,
Dec 17, 2020, 5:58:02 AM12/17/20
to Krste Asanovic, Greg Favor, RISC-V ISA Dev

Well, this is embarrassing :-)   The lights finally came on.  Thanks for your patience with me.

So, just to sync up, let's see if we have the same understanding..

It will help to have a bit of pseudo code:

while ( ! acquire(lock) )
{  PAUSE
}
do useful stuff

The lock will always be speculatively acquired, and the core will go ahead and execute "do useful stuff"... but then the cache system informs it that the acquire actually failed, and all those instructions are rolled back.  After rollback, the pipeline restarts at the PAUSE instruction and executes it.  After PAUSE number of cycles, it loops back to the acquire.  The pipeline will speculatively acquire the lock again, then "do useful stuff" again, then the cache system informs it of the mis-speculation again; repeat.   On average, if the fixed PAUSE time is roughly equal to the time to resolve a mis-speculated atomic (which can be approximated as the time to access L3, one choice for a fixed PAUSE), then you can expect that approximately 50% of the cycles are spent on mis-speculated instructions that are rolled back, while 50% of the cycles are spent on PAUSE.  This 50-50 ratio continues until the lock is actually acquired.

If such a loop is used, then this will be a source of performance loss --  50% of cycles are spent on mis-speculated instructions, taking away from other harts in the same core.

In contrast, consider this pseudo-code:

<set initial numIterationsToPause in intelligent way>

while ( ! acquire(lock) )
   {  for( i=0; i < numIterationsToPause; i++)
      { PAUSE
      }
      <increase numIterationsToPause in intelligent way>
}
do useful stuff

Each time through the outer loop, the number of iterations of the inner loop increases, so the percentage of cycles spent paused increases and the percentage spent on mis-speculated instructions decreases.  Thus, this achieves nearly the same effect as a variable PAUSE.

Is that what everyone has been trying to tell me?

If so, I agree :-)  Thank you for pointing this out.


Now, not to waste the opportunity, there was just one thing I wanted to follow up on, which was the statement "If the second form is not a big improvement, then this phenomenon is entirely rational behavior on part of the community."  
I realize the point is moot now, but there's an interesting question that I really wanted to hear your thoughts on.  The setup for the question is that the impact, here, depends on workload (what matters is the worst case workload, with heavy contention).  The majority of the community may not encounter worst case workloads, so they won't have direct experience of the issue, and may not care.

So, the question is, if the majority of the community isn't experiencing this performance loss, but a minority experiences it acutely, is it RISC-V foundation policy that majority rules, and so it leaves that minority out in the cold?  What is the calculus on the TSC for making that decision, of whether to adopt an instruction that has significant benefit to a relatively small segment of the community, or reject it and thereby make that minority suffer the realities of modifying tools, modifying kernel, etc, to support the extension on their own?  It's not an easy decision, choosing who does the work of supporting the new instruction (community if adopted, minority if not).  I'm curious about the thinking in the TSC on how to weigh that.

Thanks for the productive discussion, I deeply appreciate it.  I'm on the train now, fully in support of fixed PAUSE.

Best,

Sean


kr...@berkeley.edu

unread,
Dec 17, 2020, 9:40:35 PM12/17/20
to Sean Halle, Krste Asanovic, Greg Favor, RISC-V ISA Dev

>>>>> On Thu, 17 Dec 2020 02:57:18 -0800, Sean Halle <sean...@gmail.com> said:
| Well, this is embarrassing :-)   The lights finally came on.  Thanks for your
| patience with me.

| So, just to sync up, let's see if we have the same understanding..

| It will help to have a bit of pseudo code:

| while ( ! acquire(lock) )
| {  PAUSE
| }
| do useful stuff

| The lock will always be speculatively acquired,

I think it is less confusing to say "the processor will predict the
lock is acquired". The lock is not actually acquired unless the
sequence commits. Speculative lock elision (SLE) is a different
technique that doesn't need to actually acquire the lock to commit the
critical section.

| and the core will go ahead and
| execute "do useful stuff"..  but then the cache system informs it that the
| acquire actually failed, and all those instructions are rolled back.

It is the resolution of the branch checking the value returned by the
acquire that causes rollback, not the cache returning a value.

| After
| rollback, the pipeline restarts at the PAUSE instruction and executes it. 
| After PAUSE number of cycles then it loops back, the acquire.  The pipeline
| will speculatively acquire the lock again, then "do useful stuff" again, then
| the cache system informs of mispeculation again, repeat.   On average, if the
| fixed PAUSE time is equal to roughly the time to resolve a mis-speculated
| atomic (which can be approximated as time to access L3, one choice for fixed
| PAUSE), then you can expect that approximately 50% of the cycles are doing
| mis-speculated instructions that are rolled back, while 50% of the cycles are
| spent on PAUSE.  This 50-50 ratio continues until the lock is actually
| acquired.

| If such a loop is used, then this will be a source of performance loss --  50%
| of cycles are spent on mis-speculated instructions, taking away from other
| harts in the same core.

This is not correct. After one or two fails, the lock will be
predicted to be not acquired. The fetch unit will then speculatively
fetch probably several iterations of the "not acquired" loop into the
fetch buffer. Only one iteration's worth of instructions are
dispatched into the execution pipeline, then the next PAUSE stalls further
dispatch. The one branch that was dispatched will resolve, and it'll
be seen that it had correctly predicted that the lock was not acquired, so
nothing is flushed. Once the PAUSE times out in decode, the next iteration is
dispatched into the pipeline and the next PAUSE causes dispatch
to stall again.

In steady state, when looping without acquiring the lock, only a few
instructions are dispatched into the pipeline per iteration and none
are flushed, as the branch is being predicted correctly. There is no
misspeculation. The duty cycle is probably a few percent at most.

| In contrast, consider this pseudo-code:

| <set initial numIterationsToPause in intelligent way>

| while ( ! acquire(lock) )
|    {  for( i=0; i < numIterationsToPause; i++)
|       { PAUSE
|       }
|       <increase numIterationsToPause in intelligent way>
| }
| do useful stuff

| Each time through the outer loop, the number of iterations of the inner loop
| will increase, so the percent of cycles spent paused increases, and the
| percent cycles spent on mis-speculated instructions decreases.  Thus, this
| achieves nearly the same effect as variable PAUSE.

The advantage of this second form is that you don't generate as much
memory traffic as you back off on polling, but the code runs similarly
to the above.

| Is that what everyone has been trying to tell me?

You're getting warmer, but hopefully explanation above will help get
you all the way there.

| If so, I agree :-)  Thank you for pointing this out.

| Now, not to waste the opportunity, there was just one thing I wanted to follow
| up on, which was the statement "If the second form is not a big improvement,
| then this phenomenon is entirely rational behavior on part of the
| community."  
| I realize the point is moot now, but there's an interesting question that I
| really wanted to hear your thoughts on.  The setup for the question is that
| the impact, here, depends on workload (what matters is the worst case
| workload, with heavy contention).  The majority of the community may not
| encounter worst case workloads, so they won't have direct experience of the
| issue, and may not care.

| So, the question is, if the majority of the community isn't experiencing this
| performance loss, but a minority experiences it acutely, is it RISC-V
| foundation policy that majority rules, and so it leaves that minority out in
| the cold?  What is the calculus on the TSC for making that decision, of
| whether to adopt an instruction that has significant benefit to a relatively
| small segment of the community, or reject it and thereby make that minority
| suffer the realities of modifying tools, modifying kernel, etc, to support the
| extension on their own?  It's not an easy decision, choosing who does the work
| of supporting the new instruction (community if adopted, minority if not). 
| I'm curious about the thinking in the TSC on how to weigh that.

There is no simple answer here.

At this point, there is more work than the community can handle for the
cases that many care about, so the question is somewhat moot. Also, the
Foundation has limited resources, and most members naturally work on
projects important to them.

But when defining new instructions and setting up basic tooling, as
many use cases as possible should be considered, if not prioritized,
and we rely on the breadth of membership to bring those up. Sometimes a
simple change or loosening of unnecessary mandates allows a
specification to be more widely used.

We have been moving to a model where the ISA specifications are made
as general and flexible as possible, but architecture profiles and
hardware platform specifications can constrain the choices in the ISA
specifications to reduce burden on software ecosystem.

The great thing about an open standard is anyone can push forward the
parts they care about, particularly on platform and software side, but
we do want to share the ISA encoding across all projects to reduce
total effort.

Krste

Sean Halle

unread,
Dec 22, 2020, 2:04:41 PM12/22/20
to Krste Asanovic, Greg Favor, RISC-V ISA Dev

Thank you Krste, for the insightful background on how the TSC makes decisions.  That is a difficult position for a TSC member, taking time out from their own work to dive into what must be many proposals coming in, each of which takes a lot of time to evaluate.

There is just one technical thread I'd like to tie up, if possible.  It relates to measurements my team of grad students made on Xeon Broadwell back in 2012.  We observed a dramatic slowdown as the amount of useful work between lock acquisitions decreased.  We ran an experiment to measure the overhead of acquiring a lock (thus we were not using the PAUSE instruction).  We had a critical section that only one hart at a time could be in.  Each hart had to acquire, increment the count, do a dummy payload, and then release for the next.  When the end count was reached, the program terminated.  The total running time should equal the total number of times the critical section was entered, multiplied by the sum of the time to enter the critical section, the cycles of work inside it, and the time to leave.   When the amount of work inside the critical section was large, this equation held.  However, as the amount of work became smaller, a point was reached where the total time deviated and then grew extremely quickly, until it reached what appeared to be livelock -- the program normally finished in seconds.  We left it running for a full day and it didn't finish.

We hypothesized that the compare-exchange instruction was speculatively returning success.  We were not sure whether this was possible, but this explanation fit the data, in the case that overlapping attempts were being rolled back.  There is a Stack Overflow answer that mentions this patent: https://patents.google.com/patent/US7529914  which "describes a mechanism for loosening the locking requirement. In particular, executing the load/lock part of the operation early, basically as a plain load, and speculating that the associated cache won't be "stolen" in the time between when the load executes and the store commits. If such a stolen cache line does occur, the operation needs to be replayed." (quote from the Stack Overflow answer).  Do you have any thoughts on this?  Or anyone? 

If that is, indeed, a correct hypothesis, then for the discussion about the loop with PAUSE, the conclusion of lost performance should hold, no?  
I.e., in the presence of the patent's speculative atomic, the branch predictor will not be able to help us (and could even make things worse):

while ( ! acquire(lock) )
{  PAUSE
}
do useful stuff

The value given to the branch would be the "success" value that (in effect) was speculated, and the code would jump to "do useful stuff" -- then at a later time the atomic would be reported as wrong, and the pipeline drained.  In this case, the branch predictor would be confused :-)  It would see the branch taken, then the same branch not-taken, and then on the next attempt to acquire the lock (the atomic (nearly) always speculates the same way in this case) it would be taken again, and then after rollback the branch goes the other direction again.  It's not clear whether the branch predictor can undo prediction updates upon rollback of the atomic.  Either way, it means that rollback would happen on (nearly) every attempt to get the lock, due to atomic mis-speculation (when the dummy payload becomes small, on an 80-hart, 4-socket machine, the cache line would be "stolen" often -- which explains the livelock we measured).  The effect of the branch predictor can't remove cycles mis-speculated on the atomic, so the best case is roughly 33% (PAUSE plus an extra branch-mis-speculated PAUSE) and the worst case is roughly 66% (two consecutive roll-backs, one PAUSE).

In the end, the following, agreed upon, form of the loop fixes both issues, so the point is moot (just interesting):

<set initial numIterationsToPause in intelligent way>

while ( ! acquire(lock) )
{  for( i=0; i < numIterationsToPause; i++)
      { PAUSE
      }
      <increase numIterationsToPause in intelligent way>
}
do useful stuff


Thanks again for all the time you've put in; I hope it was of at least some value :-)

Best,

Sean


 

kr...@berkeley.edu

unread,
Dec 22, 2020, 4:49:50 PM12/22/20
to Sean Halle, Krste Asanovic, Greg Favor, RISC-V ISA Dev

The patent describes a system that speculates on predicted-uncontended
locks. If a system was using this patent, then the underlying
predictor in the real hardware would be severely broken if it predicted
a lock acquire that repeatedly failed was uncontended.

I don't know what is happening in the scenario you're reporting. I'd
ask Intel directly.

Krste

>>>>> On Tue, 22 Dec 2020 11:04:00 -0800, Sean Halle <sean...@gmail.com> said:

| Thank you Krste, for the insightful background on how the TSC makes decisions.  That is a difficult position for a TSC member, taking
| time out from their own work to dive into what must be many proposals coming in, each of which takes a lot of time to evaluate.

| There is just one technical thread I'd like to tie up, if possible.  It is related to the measurements my team of grad students made
| on Xeon Broadwell back in 2012.  We observed dramatic slowdown as the amount of useful work between lock acquisitions decreased.  We
| ran an experiment to measure the overhead of acquiring a lock (thus we were not using the PAUSE instruction).  We had a critical
| section that only one hart at a time could be in.  Each hart had to acquire, increment the count, do a dummy payload, and then release
| for the next.  When the end count was reached, the program terminated.  The total running time should equal the total number of times
| the critical section was entered, multiplied by the sum of: the time to enter the critical section plus the cycles of work inside the
| critical section, plus time to leave.   When the amount of work inside the critical section was large then this equation held. 
| However, as the amount of work became smaller, a point was reached where the total time deviated and then grew extremely quickly,
| until it reached what appeared to be livelock -- the program normally finished in seconds.  We left it running for a full day and it
| didn't finish.

| We hypothesized that the compare-exchange instruction was speculatively returning success.  We were not sure whether this was
| possible, but this explanation fit the data, in the case that overlapping attempts were being rolled back.  There is a stack
| overflow answer that mentions this patent: https://patents.google.com/patent/US7529914  which "describes a mechanism for loosening the
| locking requirement. In particular, executing the load/lock part of the operation early, basically as a plain load, and speculating
| that the associated cache won't be "stolen" in the time between when the load executes and the store commits. If such a stolen cache
| line does occur, the operation needs to be replayed." (quote from stack overflow answer).  Do you have any thoughts on this?  Or
| anyone? 

| If that is, indeed, a correct hypothesis, then for the discussion about the loop with PAUSE, the conclusion of lost performance should
| hold, no?  
| IE, in the presence of the patent's speculative atomic, the branch predictor will not be able to help us (and could even make things


Paul Campbell

unread,
Dec 23, 2020, 5:06:21 AM12/23/20
to RISC-V ISA Dev
On Wednesday, 23 December 2020 10:49:31 AM NZDT kr...@berkeley.edu wrote:
> The patent describes a system that speculates on predicted-uncontended
> locks. If a system was using this patent, then the underlying
> predictor in the real hardware would be severely broken if it predicted
> a lock acquire that repeatedly failed was uncontended.

Hmm, makes me think that a special branch instruction that never predicts
taken might be a useful thing (and likely avoids the patent).

Paul


Jose Renau

unread,
Dec 23, 2020, 2:36:54 PM12/23/20
to kr...@berkeley.edu, Sean Halle, Greg Favor, RISC-V ISA Dev

From my understanding of the patent https://patents.google.com/patent/US7529914 , this is a way to speed up locks. I mean, if you do nothing special, with a TSO model you need to wait to perform the load until commit time, which is bad for performance. This allows the atomic to be speculatively broken into two uops, issuing the load first. As long as the cache line was not invalidated, it is OK not to execute the swap atomically. If the line was invalidated, the swap needs to be re-executed. (This is why contended vs. non-contended matters. On a contended lock this will trigger an abort/NUKE and slow down the core.)

Issuing a prefetch with intent to modify would have similar results, and be a bit simpler, without the NUKE. (I just went over the abstract; it may be something else, or not in this patent.)

Jose Renau
Professor, Computer Science and Engineering. IEEE TCMM Chair
https://masc.cse.ucsc.edu