Public review of Fast Track extension Zihintntl

Andrew Waterman

unread,
Feb 14, 2023, 6:02:35 PM2/14/23
to RISC-V ISA Dev, tech-unp...@lists.riscv.org, tech-a...@lists.riscv.org
We are delighted to announce the start of the public review period for the proposed Fast-Track extension Zihintntl to the RISC-V ISA.  This extension adds non-temporal locality hints, which affect the performance characteristics of memory-access instructions.
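For reviewers skimming the thread without the PDF: the hints are encoded as otherwise-no-op ADDs that prefix the memory access they modify. A minimal sketch in RISC-V assembly (mnemonics as in the spec; registers and values here are illustrative):

```asm
    # Each NTL hint is an ADD to x0 and applies only to the *next* instruction.
    ntl.p1                 # == add x0,x0,x2: next access is non-temporal
                           #    w.r.t. the innermost cache level
    lw    a0, 0(a1)        # this load carries the NTL.P1 hint

    ntl.all                # == add x0,x0,x5: non-temporal w.r.t. all levels
    sw    a0, 0(a2)        # e.g., a streaming store
```

On implementations that do not support Zihintntl, each prefix executes as a plain no-op ADD, so the sequence remains functionally identical.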

The review period begins today, February 14, 2023, and ends on March 31, 2023.

This extension is part of the Unprivileged Specification.

These extensions are described in the PDF spec available at https://drive.google.com/file/d/1QfGFllFivV1cVM899TCRMfpBNwhhZjWp/view?usp=share_link, which was generated from the source available in the following GitHub repo: https://github.com/riscv/riscv-isa-manual

To respond to the public review, please either email comments to the public isa-dev mailing list or add issues and/or pull requests to the RISC-V ISA Manual GitHub repo, https://github.com/riscv/riscv-isa-manual. We welcome all input and appreciate your time and effort in helping us by reviewing the specification.

During the public review period, corrections, comments, and suggestions will be gathered for review by the Unprivileged Spec ISA Committee. Any minor corrections and/or uncontroversial changes will be incorporated into the specification. Any remaining issues or proposed changes will be addressed in the public review summary report. If there are no issues that require incompatible changes to the public review specification, the Unprivileged ISA Committee will recommend the updated specifications be approved and ratified by the RISC-V Technical Steering Committee and the RISC-V Board of Directors.

Thanks to all the contributors for all their hard work.

Andrew Waterman

Vice-Chair, Privileged ISA Committee

MitchAlsup

unread,
Feb 14, 2023, 6:17:01 PM2/14/23
to RISC-V ISA Dev, Andrew Waterman, tech-unp...@lists.riscv.org, tech-a...@lists.riscv.org
Having read the specification, I am left wondering if there is a way for the compiler or application to query the current memory hierarchy and use the query to control the non-temporal nature of the memory references. Reading the spec would indicate no. However, using the proposed instruction substitute, one could put a bit pattern in x6 and use ADD   x0,x0,x6 as the dynamic NT selection instruction.

Andrew Waterman

unread,
Feb 14, 2023, 6:42:49 PM2/14/23
to MitchAlsup, RISC-V ISA Dev, tech-a...@lists.riscv.org
On Tue, Feb 14, 2023 at 3:17 PM 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Having read the specification, I am left wondering if there is a way for the compiler or application to query the current memory hierarchy and use the query to control the non-temporal nature of the memory references. Reading the spec would indicate no. However, using the proposed instruction substitute, one could put a bit pattern in x6 and use ADD   x0,x0,x6 as the dynamic NT selection instruction.

Microarchitecture discovery mechanisms are outside the scope of this specification.

Also, the spec gives guidelines for how portable software (i.e., software that is likely ignorant of the memory hierarchy on which it's currently executing) should take advantage of these instructions.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/d80acb79-6c97-4480-ad37-8eee45a56b52n%40groups.riscv.org.

Allen Baum

unread,
Feb 15, 2023, 6:49:23 PM2/15/23
to tech-unp...@lists.riscv.org, and...@sifive.com, MitchAlsup, RISC-V ISA Dev
Well, the use as a hint would query nothing - it can't - but it sounds like what he is proposing is
the equivalent of a dynamic rounding mode, so it would select the cache level based on the contents of x6.
I don't see a particular use case for that, really.


Andrew Waterman

unread,
Feb 15, 2023, 7:08:13 PM2/15/23
to Paul Campbell, RISC-V ISA Dev, tech-a...@lists.riscv.org


On Wed, Feb 15, 2023 at 4:00 PM Paul Campbell <pa...@taniwha.com> wrote:
Thinking about this, I suspect there's scope here for a (probably existing)
covert channel that can be made easier (i.e., become higher bandwidth) by
manipulating cache state.

It's not a reason for not having such a facility, but it probably is a reason
for having a standard way to ensure that more privileged modes have a way
to turn it off.

Possibly.  Since these HINTs have no architecturally visible effects, a mechanism to suppress them could be added, without changing any semantics, if this concern comes to fruition.


        Paul


Andrew Waterman

unread,
Feb 15, 2023, 7:21:50 PM2/15/23
to Allen Baum, tech-unp...@lists.riscv.org, MitchAlsup, RISC-V ISA Dev
On Wed, Feb 15, 2023 at 3:49 PM Allen Baum <allen...@esperantotech.com> wrote:
Well, the use as a hint would query nothing - it can't - but it sounds like what he is proposing is
the equivalent of a dynamic rounding mode, so it would select the cache level based on the contents of x6.
I don't see a particular use case for that, really.

I was responding only to the querying part.  The indirect version adds some extra complexity to OOO designs that bind the hint to the following memory access during decode, as the memory access needs to get the x6 operand from somewhere.  (Tricks could be played if x6 were rarely written, but given its role in the calling convention as a temporary register, that isn't the case.)

MitchAlsup

unread,
Feb 15, 2023, 7:23:00 PM2/15/23
to RISC-V ISA Dev, Allen Baum, MitchAlsup, RISC-V ISA Dev, tech-unp...@lists.riscv.org, and...@sifive.com
On Wednesday, February 15, 2023 at 5:49:23 PM UTC-6 Allen Baum wrote:
Well, the use as a hint  would query nothing - it can't - but what it sounds like is that what he is proposing is 
the equivalent of a dynamic rounding mode, so it would select the cache level based on the contents of x6.
I don't see a particular use case for that, really. 

What about an implementation where there is a set of BIG cores and a set of LITTLE cores ?
The BIG cores have different L1 and L2 sizes than the little cores. So, the cache hinting
strategy would be different depending on which core you are running on.

When running on the BIG core you would not want to be hinting::L2 when the data fits in L1,
and vice versa. Say the BIG core had 64KB L1 caches 4-way with a 1MB L2 while the LITTLE 
core has 16KB 2-way L1 caches and 128KB L2 and both share a 8MB L3. An application
would not want to use BIG hints on a LITTLE core, nor vice versa.

On the other hand, if you have just hinted for BIG core and then get context switched to the LITTLE
core, ..... So, the proper thing to hint changes dynamically. {which may lead to hint thrashing.}

But back to my original statement:: I was assuming the application had some way of asking
the GuestOS (or kernel) what the cache configuration was, and that the response could be put
in an addressable register, which would then drive the hint based on the current configuration.
{An alternative to asking the GuestOS would be to query a CPUID equivalent.}

Allen Baum

unread,
Feb 16, 2023, 1:05:51 AM2/16/23
to MitchAlsup, RISC-V ISA Dev, tech-unp...@lists.riscv.org, and...@sifive.com
I can see a case for dynamic selection of cache parameters, for exactly the reason you mention
(though I don't know how common it is, or how easily avoided).
But that doesn't work if the parameter is in a general register.
It would work if it were in a CSR (like rounding mode),
but when GPRs get spilled and restored into a different core,
the old value no longer works in the new caching structure, so it's still impractical.


Derek Hower

unread,
Feb 16, 2023, 10:41:59 AM2/16/23
to isa...@groups.riscv.org

 

We would prefer to see non-temporal accesses as their own instructions, rather than a hint in a two-instruction sequence.

 

  • Instruction fusion won’t be perfect, especially in high-perf designs, so the two-instruction hint will be missed some percentage of the time. In some cases (e.g., when the hint and access straddle a cache line), it may be missed 100% of the time.+
    • This reduces the utility of the hint
    • This adds challenges to DV performance verification
    • This creates scenarios that are likely to be non-ideal. When the hint is missed, a line may be pulled into the cache. When the hint is found, the hardware then must decide what to do with the line already in the cache. Does it push it out to the specified level or keep it where it is? Both options have downsides.

 

This would apply equally to stores, loads, and prefetches.

 

I also wonder if it is wise to provide guidelines on portable usage. Given the breadth of RISC-V designs from microcontrollers to high-perf, those guidelines don’t seem generally applicable. Perhaps they belong in a RISC-V profile, but I would prefer omitting them entirely since performance portability is difficult to achieve in practice.

 

As an alternative, we could consider specifying a data set size rather than a target cache level. That would resolve issues with implementations having vastly different memory hierarchies and leave it up to a design to decide what to do. For example, a non-temporal access could encode a data set size as a power of two between, say, 16KB-1G.
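To make the alternative concrete, here is a purely hypothetical syntax - nothing like this mnemonic or operand exists in Zihintntl or any ratified extension:

```asm
    # HYPOTHETICAL: the hint names a working-set size class instead of a
    # cache level; each uarch maps the size onto its own hierarchy.
    ntl.size 256K          # hypothetical mnemonic, power-of-two size class
    lw    a0, 0(a1)        # uarch decides which level(s) a ~256KB set fits in
```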

 

-Derek Hower

Qualcomm

 

+ Of course, these problems can be solved with micro-architecture complexity, but why add complexity when the problem is simple to solve in the ISA?

 

Anthony Coulter

unread,
Feb 16, 2023, 12:00:58 PM2/16/23
to dho...@qti.qualcomm.com, isa...@groups.riscv.org
The current non-temporal proposal has the virtue of using very little
encoding space (four r-type ADDs and their compressed variants) while
potentially applying to *all* loads and stores, including vector and
hypervisor operations. It also has the virtue of being a true
hint---the hints are simply ignored by existing systems.

Creating a new set of load/store instructions which behave like the
existing ones except that they are nontemporal would have neither of
these advantages. Existing RISC-V systems would have to throw illegal
instruction exceptions and attempting to duplicate *all* load/store
instructions would be prohibitively expensive.

On the third note, about not providing guidelines for portable usage:
if they can't be used portably, then why define them at all? I
recognize that there are some things, e.g. performance counters, where
it is advantageous to define an interface by allocating precious CSR
numbers without actually specifying what the counters are. But the
whole point of hints is that they can be ignored. These are
instructions that can be used in "shrink-wrapped" software that's
compiled once and distributed to everyone. Surely there are places
in big bloated applications like, say, web browsers where the CPU is
chasing pointers but doesn't want to spoil its cache. A web browser
will have different performance characteristics on different hardware,
but that doesn't mean that people can't experiment with compiler flags
to find something that works pretty well most of the time.

If you don't care about portability, the easiest way to address all of
these concerns is to put some XLEN-sized load/store instructions in the
CUSTOM opcode space, and to make them fit by truncating the immediate
field (perhaps even to length zero). Then in your documentation, you can specify exactly what these
zero). Then in your documentation, you can specify exactly what these
instructions do and when they should be used. Software using your
CUSTOM instructions will crash when run on CPUs that don't support
them, but that's unavoidable with *any* attempt to create new
load/store instructions that aren't just regular instructions preceded
by hints.

Anthony Coulter

Derek Hower

unread,
Feb 16, 2023, 12:40:22 PM2/16/23
to RISC-V ISA Dev, Anthony Coulter, dho...@qti.qualcomm.com
I get the point about opcode space. There is very little left, and these instructions aren't important enough to dedicate a large chunk to.

That said, specifying these as hints in a sequence risks rendering the extension useless in most markets. You could probably get a benefit on hand-tuned code for a specific microarchitecture, but there is a low probability of that benefit translating broadly due to the issues I mentioned earlier. 

If you want these instructions as proposed to more reliably translate across implementations, you are essentially saying that you're adding standard 48- and 64-bit instructions to the ISA.

-Derek


kr...@sifive.com

unread,
Feb 20, 2023, 7:31:44 PM2/20/23
to Derek Hower, isa...@groups.riscv.org

>>>>> On Thu, 16 Feb 2023 15:41:53 +0000, Derek Hower <dho...@qti.qualcomm.com> said:

| We would prefer to see non-temporal accesses as their own instructions, rather than a hint in a two-instruction sequence.
| ● Instruction fusion won’t be perfect, especially in high-perf designs, so the two-instruction hint will be missed some percentage of the time. In some cases (e.g.,
| when the hint and access straddle a cache line), it may be
| missed 100% of the time.+

A uarch that supports the hint would wait for the full instruction to
be available to decode and fuse. This would effectively treat it as a
longer >32b instruction, and the purported problem should not occur
(though an implementation is left an out to drop the fusion when
u-architecturally inconvenient; this should be rare, not something as
common as an instruction straddling a cache line or page boundary).

| ○ This reduces the utility of the hint

Not for machines that actually support it. In any case, encoding
space constraints would require a >32b static encoding if it was done
this way, which results in much the same front-end uarch design so
nothing is saved.

| ○ This adds challenges to DV performance verification

If you mean DV for a particular uarch, then the perf behavior should
be part of the spec being verified.

| ○ This creates scenarios that are likely to be non-ideal. When the hint it missed, a line may be pulled into the cache. When the hint is found, the hardware
| then must decide what to do with the line already in the cache. Does it push it out to the specified level or keep it where it is? Both options have
| downsides.

| This would apply equally to stores, loads, and prefetches.

First, hint misses should be very rare in a well-designed uarch.
Second, the hint would do nothing if the line is present in an inner
level - there's no reason to flush it out, though some might look at
proactively cleaning a line that was dirty.

| I also wonder if it is wise to provide guidelines on portable usage. Given the breadth of RISC-V designs from microcontrollers to high-perf, those guidelines don’t
| seem generally applicable. Perhaps they belong in a RISC-V profile, but I would prefer omitting them entirely since performance portability is difficult to achieve in
| practice.

Agreed, but folks would ask if something was not written down, and so
the spec tries to capture the space.

| As an alternative, we could consider specifying a data set size rather than a target cache level. That would resolve issues with implementations having vastly
| different memory hierarchies and leave it up to a design to decide what to do. For example, a non-temporal access could encode a data set size as a power of two
| between, say, 16KB-1G.

This was extensively discussed.

Portable use in general code gets a lot of attention, because people
want it to work, but as you say, it is difficult to get right outside
of a few obvious though useful cases ("won't fit in any cache" or
"keep these highly contended sync variables out of private caches").
The differences in other parts of the uarch mean that capacity is only
one parameter in tuning, along with e.g., prefetch bandwidth/policy
and replacement policy. Capacity encoding would require a lot of bits
to encode and even then doesn't cover private/shared distinction.

A very important case, that is widely used, is specifically
controlling the hierarchy for highly tuned code for a known machine
(including in high-perf libraries that are linked in to a portable
binary via discovery, e.g., BLAS). Having the hint in terms of
capacities would not help this use case, which is why guidance is
given on how to map the hints to specific levels in a given uarch.

Krste

| -Derek Hower

| Qualcomm

| + Of course, these problems can be solved with micro-architecture complexity, but why add complexity when the problem is simple to solve in the ISA?


Derek Hower

unread,
Feb 22, 2023, 1:05:11 PM2/22/23
to RISC-V ISA Dev, kr...@sifive.com, isa...@groups.riscv.org, Derek Hower
On Monday, February 20, 2023 at 7:31:44 PM UTC-5 kr...@sifive.com wrote:

>>>>> On Thu, 16 Feb 2023 15:41:53 +0000, Derek Hower <dho...@qti.qualcomm.com> said:

| We would prefer to see non-temporal accesses as their own instructions, rather than a hint in a two-instruction sequence.
| ● Instruction fusion won’t be perfect, especially in high-perf designs, so the two-instruction hint will be missed some percentage of the time. In some cases (e.g.,
| when the hint and access straddle a cache line), it may be
| missed 100% of the time.+

A uarch that supports the hint, would wait for the full instruction to
be available to decode and fuse. This would effectively treat it as a
longer >32b instruction and the purported problem should not occur
(though leaving an out to drop the fusion when u-architecturally
inconvenient, but this should be rare, not something as common as an
instruction straddling a cache line or page boundary).

Are there other instances already in RISC-V where there is effectively a 64b encoding?
 


| ○ This reduces the utility of the hint

Not for machines that actually support it. In any case, encoding
space constraints would require a >32b static encoding if it was done
this way, which results in much the same front-end uarch design so
nothing is saved.

That's only true if you keep all the same options for a non-temporal access. A smaller immediate, for example, would free space for other options.
 


| ○ This adds challenges to DV performance verification

If you mean DV for a particular uarch, then the perf behavior should
be part of the spec being verified.

| ○ This creates scenarios that are likely to be non-ideal. When the hint it missed, a line may be pulled into the cache. When the hint is found, the hardware
| then must decide what to do with the line already in the cache. Does it push it out to the specified level or keep it where it is? Both options have
| downsides.

| This would apply equally to stores, loads, and prefetches.

First, hint misses should be very rare in a well-designed uarch.

Again, that all depends on how much complexity you want to dedicate to the feature.
 

Second, the hint would do nothing if the line is present in an inner
level - there's no reason to flush it out, though some might look at
proactively cleaning a line that was dirty.

| I also wonder if it is wise to provide guidelines on portable usage. Given the breadth of RISC-V designs from microcontrollers to high-perf, those guidelines don’t
| seem generally applicable. Perhaps they belong in a RISC-V profile, but I would prefer omitting them entirely since performance portability is difficult to achieve in
| practice.

Agreed, but folks would ask if something was not written down, and so
the spec tries to capture the space.

Perhaps it could be added as a note instead of in the main body?
 


| As an alternative, we could consider specifying a data set size rather than a target cache level. That would resolve issues with implementations having vastly
| different memory hierarchies and leave it up to a design to decide what to do. For example, a non-temporal access could encode a data set size as a power of two
| between, say, 16KB-1G.

This was extensively discussed.

Portable use in general code gets a lot of attention, because people
want it to work, but as you say, it is difficult to get right outside
of a few obvious though useful cases ("won't fit in any cache" or
"keep these highly contended sync variables out of private caches").

I would say that even these are hard to get right generally:

* It's hard to say "won't fit in any cache" when you consider code that would run both on a server with 128MB of cache and an IoT device with 32KB.
* Sometimes you want highly contended sync variables in a private cache -- e.g., a system where the first shared cache is very far away. Furthermore, it might not even be possible to do an atomic op in a shared cache on some (most?) uarchs.
 

The differences in other parts of the uarch mean that capacity is only
one parameter in tuning, along with e.g., prefetch bandwidth/policy
and replacement policy. Capacity encoding would require a lot of bits
to encode and even then doesn't cover private/shared distinction.

A very important case, that is widely used, is specifically
controlling the hierarchy for highly tuned code for a known machine
(including in high-perf libraries that are linked in to a portable
binary via discovery, e.g, BLAS). Having the hint in terms of
capacities would not help this use case, which is why the guidance is
given how to map the hints to specific levels in a given uarch.

That makes sense.

Allen Baum

unread,
Feb 23, 2023, 10:35:32 AM2/23/23
to Derek Hower, RISC-V ISA Dev, kr...@sifive.com
Responding to "Are there other instances already in RISC-V where there is effectively a 64b encoding?"
That is a bit tricky to answer.
In one sense, any pair of ops that can be macro-fused falls into that category.
What makes this a bit different is that if the pair is not macro-fused and an interrupt (or a trap, say on a page boundary) occurs, it operates differently.
The result, however, is still architecturally identical (just not in timing).
That might be problematic for data leakage, like so many other timing dependencies in code - but that's an entirely separate problem.

Derek Hower

unread,
Feb 23, 2023, 10:47:21 AM2/23/23
to Allen Baum, RISC-V ISA Dev, kr...@sifive.com

Another difference compared to other documented suggestions for macro fusion is that the Zihintntl fusion is essentially mandated. If you don’t fuse, you aren’t actually implementing the Zihintntl extension.

 

-Derek

 


Allen Baum

unread,
Feb 23, 2023, 8:57:57 PM2/23/23
to Derek Hower, RISC-V ISA Dev, kr...@sifive.com
That is certainly not my reading. It is permitted, obviously, but not mandated.
Macro-fusion is not enough to ensure that the hint isn't treated as a no-op, e.g.
if there is a page fault on a load that follows an NTL hint, then EPC will point to the load, and the hint will not be re-executed upon a ret instruction.
A smart trap handler could check if the previous instruction was an NTL hint, and return to that
- but even that may not be correct if the memory access is the target of a jump or branch that skips over the NTL hint.
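The jump-target point can be shown with a short illustrative fragment (labels and registers here are hypothetical):

```asm
    # The NTL hint prefixes the load at 'access', but the branch below can
    # jump to 'access' directly, reaching the load without its hint.
        beqz    a2, access     # this path skips the hint entirely
        ntl.all                # == add x0,x0,x5: hint the next access
access: lw      a0, 0(a1)      # hinted on the fall-through path only
```

A trap handler that blindly backed EPC up by one instruction would therefore mis-handle the branch path, which is the hazard described above.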

kr...@sifive.com

unread,
Feb 24, 2023, 3:59:15 AM2/24/23
to Derek Hower, Allen Baum, RISC-V ISA Dev, kr...@sifive.com

The implementation of fusion for Zihintntl is very straightforward
compared to other possible macro fusions. The front-end decode
process just has to propagate microarch hint information across to the
following instruction, and the hint can be retired from the pipeline at
that point even if the following instruction is not present yet. The
purported issue with failed hints at runtime doesn't really occur.
The uarch state represented by the hint propagating between
instructions does not add new complexity in the presence of i-cache/TLB
misses, branch mispredicts, synchronous exceptions, or asynch
interrupts, as it is handled the same way as other state in pipeline
flushes. An I-cache/TLB miss just means you have some dangling uarch
state in the fetch buffer, but that is true for 32b instructions
straddling I-cache/TLB lines. There should be no hint loss. A branch
mispredict flushes younger instructions, which will cause the front-end to
re-execute the correct path, building any hint on that path as it
goes, so again there is no hint loss.

A trap from either a synchronous exception or an asynch interrupt
could occur on either the hint or the memory instruction. In the
latter case, yes, the hint will be ignored at restart, but I don't
understand why folks are worried about this case. These are
relatively rare events and the trap and associated context swap will
have caused far more memory hierarchy churn than a single ignored
hint, whose effect will be lost in the noise.

I'll leave further implementation details as an exercise for the
reader, but it is really not complex for either a simple in-order
pipeline or a high-performance front-end. If this is the only fusion
being done, the main added complexity might actually be updating the
instructions-retired counter correctly.

Trying to cram all the useful hint cases into a 32b encoding will take
a bunch of code points. I don't view this as a good use of encoding
space, given that these hints will only be used for a small fraction
of the static instructions in a binary. Given how straightforward
implementing hint fusion is, I also don't see a strong case to add
longer instruction encodings just for hinted memory operations.

We allow implementations to say they support Zihintntl even if they
handle the hints as NOPs. These are hints, and implementations are free to
ignore them. While some folk might believe we should have a stronger
requirement before allowing an implementation to claim they've
implemented Zihintntl, this doesn't make sense given that it is only a
performance tweak and there is a huge diversity of uarchs that vary
widely in performance for many other reasons.

Krste

Derek Hower

unread,
Feb 27, 2023, 2:24:59 PM2/27/23
to RISC-V ISA Dev, kr...@sifive.com, Allen Baum, RISC-V ISA Dev, Derek Hower


We allow implementations to say they support Zihintntl even if they
handle as NOP. These are hints, and implementations are free to
ignore them. While some folk might believe we should have a stronger
requirement before allowing an implementation to claim they've
implemented Zihintntl, this doesn't make sense given that it is only a
performance tweak and there is a huge diversity of uarchs that vary
widely in performance for many other reasons.

If NOP is a valid implementation, then I presume every existing and future RISC-V implementation automatically supports Zihintntl, since the opcode is already defined as a NOP. While that won't break any code from a functional standpoint, it certainly makes it confusing from a performance standpoint.

Given how implementation-dependent this is, why standardize the hint at all? In legacy ISAs, it makes sense to have standard non-temporal operations since custom instructions aren't an option. At the specification level, it's understood that the effects of legacy non-temporal operations will vary widely, but at the practical level there is a specific expectation about what will happen. For example, in the use case you mentioned earlier about an optimized BLAS library, software is written with the knowledge of how a non-temporal operation is implemented even though it is using a standard encoding with vague behavior.

If, like RISC-V, these legacy ISAs could be extended, would we still have standard non-temporal operations? Custom instructions will be a better match to the specific uarch, since they can match the actual memory hierarchy, and are a fine solution for implementation-specific code. Custom instructions can't be used in generic "performance portable" software, but it's arguable whether such a thing even exists.

-Derek

Krste Asanovic

unread,
Feb 27, 2023, 2:46:07 PM2/27/23
to Derek Hower, RISC-V ISA Dev, Allen Baum
On Feb 27, 2023, at 11:24 AM, Derek Hower <dho...@qti.qualcomm.com> wrote:



Standard non-temporal hints are useful in RISC-V for the same reason they are useful in legacy architectures.

Even if you don’t believe in the performance portability of hinted code, the standard allows use of standard language bindings, assembly code, debug/trace tools, autotune code generators, etc., rather than having a custom toolchain for every implementation.

However, I also believe there is some level of performance portability possible for some types of code, especially when targeting a class of implementations with similar if not identical microarchitecture.

The standard hints obviously do not restrict the ability to add custom hints if desired.

Krste

Bruce Hoult

unread,
Feb 27, 2023, 5:05:19 PM2/27/23
to Derek Hower, isa...@groups.riscv.org
I don't see this as instruction fusion as the first instruction (the non-temporal hint) does not have any functionality of its own.

It could be implemented simply by the instruction decoder setting a single-bit "non-temporal" field (and a couple more bits for the type) when the hint instruction is seen, with every other instruction type clearing it. Like the V extension's vtype, it is effectively appended to the decoded instruction AS IF it were an extra field of the instruction in the first place.



Allen Baum

unread,
Feb 28, 2023, 9:40:15 AM2/28/23
to Bruce Hoult, Derek Hower, isa...@groups.riscv.org
The architectural complexity of this has been carefully minimized by the precise semantics of the prefixing.
Bruce pointed out that this extends the normal load/store ops with a non-temporal bit, just as vector ops are extended by the values of their CSRs.
The difference is that the hint sets the bit instead of a CSR op, and that the bit is ephemeral - it can get reset after the next instruction retires.
So, this is designed to be very simple to implement from a pipeline perspective.
The more difficult part is implementing the non-temporal semantics, not the prefix semantics.
