Question on RISC-V SIMD Extension


Jeff Jacobson

unread,
Jun 18, 2021, 2:29:27 PM6/18/21
to isa...@groups.riscv.org
Hey all,

Given David & Andrew’s distaste for SIMD (https://www.sigarch.org/simd-instructions-considered-harmful/) and the explosion in the number of instructions this entails, coupled with the vector-length-agnostic (VLA) approach in the vector extension, I’m a little surprised that:

1. The SIMD extension is not also VLA (this can be done, see ARM’s SVE)
2. RISC-V is starting with an MMX-style (64-bit vector, circa 1997) approach, which has been explored and moved past in the industry for lack of parallelism.

This seems almost guaranteed to cause RISC-V to (at least) double the number of vector instructions to achieve wider vector lengths (which was one of the problems with SIMD that David & Andrew point out): first because the world is already using vectors wider than 64 bits, and second because the vector length has been hard-coded into the ISA. 

The raison d’être for SIMD/vectors is to capture parallelism, after all.

With the writing so clearly on the wall, can somebody explain why 64-bit vectors in GPRs are the best choice for a forward-looking architecture?
Honestly, it kinda feels like we’re repeating history, rather than learning from it.

Also, considering that modern 64-bit CPU hardware often already has a 128-bit data path to memory (to load unaligned 64-bit values without penalty), use of a 64-bit SIMD vector seems unfortunate, since it leaves performance on the table at the high-performance end of the market.

The SIMD extension itself lacks any sort of explanation of the rationale and wisdom behind choosing 64-bit vectors as the best path for SIMD in RISC-V.
I’m hoping some of the people here can help fill in the blanks?

~Jeff

Alex Solomatnikov

unread,
Jun 18, 2021, 3:22:18 PM6/18/21
to Jeff Jacobson, RISC-V ISA Dev
Maybe the SIMD extension is meant only for 64-bit as a low-cost option, and anything above should be RVV?

Alex

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/F06B9E80-CE79-4BBA-897F-BD15DE9DFF83%40gmail.com.

Jeff Jacobson

unread,
Jun 18, 2021, 4:00:27 PM6/18/21
to Alex Solomatnikov, RISC-V ISA Dev, chc...@andestech.com
Well, if you wanted to eschew VLA and architect the vector length, the P-extension could have allowed for pairs or quads of GPRs. 
This would have enabled wider SIMD vectors at the same implementation cost as 64-bit SIMD vectors.

Alternatively, if you embrace the VLA approach to SIMD (where the vector length is not architected), the hardware vector length is immaterial, and a low-cost implementation could keep 64-bit registers without preventing more performant implementations.
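A hypothetical sketch of that VLA pattern in plain C (the `get_vlen` stand-in and the saturating-add kernel are my illustrations, not part of any spec): the loop asks the implementation for its vector length each iteration, so the same code runs unchanged whether the hardware datapath is 64 bits or 512 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a "query the hardware vector length" instruction
 * (in the spirit of RVV's vsetvli or SVE's predicated loops).
 * Eight 8-bit lanes here models a 64-bit datapath. */
static size_t get_vlen(size_t remaining) {
    size_t hw_lanes = 8;                      /* implementation-defined */
    return remaining < hw_lanes ? remaining : hw_lanes;
}

/* Saturating unsigned 8-bit add, strip-mined VLA-style: the
 * architectural code never hard-codes the vector width. */
void saturating_add_u8(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = get_vlen(n - i);          /* tail handled by a shorter vl */
        for (size_t j = 0; j < vl; ++j) {     /* one vector op's worth of work */
            unsigned s = (unsigned)a[i + j] + b[i + j];
            dst[i + j] = s > 255 ? 255 : (uint8_t)s;
        }
        i += vl;
    }
}
```

A wider implementation simply returns a larger value from its length query; the binary does not change.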

Not advocating either, just observing cost is actually orthogonal to the architectural choice.

Which still leaves the question of the rationale behind the architectural choice.

~Jeff



Nick Knight

unread,
Jun 18, 2021, 4:00:33 PM6/18/21
to Alex Solomatnikov, Jeff Jacobson, RISC-V ISA Dev
Hi Jeff,

I assume you are talking about the RISC-V "P-extension"?

Take a look at the introduction of the current draft specification:

Digital Signal Processing (DSP) has emerged as an important technology for modern electronic systems. A wide range of modern applications employ DSP algorithms to solve problems in their particular domains, including sensor fusion, servo motor control, audio decode/encode, speech synthesis and coding, MPEG4 decode, medical imaging, computer vision, embedded control, robotics, human interface, etc.

The proposed P instruction set extension increases the DSP algorithm processing capabilities of the RISC-V CPU IP products. With the addition of the RISC-V P instruction set extension, the RISC-V CPUs can now run these various DSP applications with lower power and higher performance.

It seems that what was proposed as the "Packed SIMD" extension has evolved into an embedded/DSP extension.

Actually, if you look at the Git history, you can see this evolution was abrupt: the task group kicked off with a proposal contributed by Andes Tech, and as you can see from the diff with the current version, most of the changes have been superficial.

This is not meant as a criticism, just an observation. (I haven't participated in the P-extension task group beyond reading their Github.)

Best,
Nick Knight
 

Jeff Jacobson

unread,
Jun 18, 2021, 4:15:25 PM6/18/21
to Nick Knight, Alex Solomatnikov, RISC-V ISA Dev, chc...@andestech.com
Nick,

Yes, the P-extension. 
I saw the intro you mention, but it doesn’t offer any rationale for the architectural choices, it just talks about the application domains.

I’m assuming wider vector sizes and/or SIMD-VLA approaches were considered, and that a fixed 64-bit vector length was somehow judged preferable.

But there is no mention of why fixed 64-bit SIMD vectors are the best choice for the RISC-V architecture, which one would expect since it’s a counter-intuitive choice.
Still hoping to understand the wisdom behind this.

~Jeff



Craig Topper

unread,
Jun 18, 2021, 4:21:38 PM6/18/21
to Jeff Jacobson, Nick Knight, Alex Solomatnikov, RISC-V ISA Dev, chc...@andestech.com
Slide 9 from this presentation talks about using GPRs vs a separate register file. https://riscv.org/wp-content/uploads/2018/05/09.05.2018-9.15-9.30am-RISCV201805-Andes-proposed-P-extension.pdf

"GPR-based SIMD is a more efficient, low power DSP solution for embedded systems running applications in various domains such as audio/speech decoding and processing, IoT sensor data processing, wearable fitness devices, etc.”

I haven’t participated in P-extension myself beyond some reviews of llvm patches.

Nick Knight

unread,
Jun 18, 2021, 4:28:02 PM6/18/21
to Craig Topper, Jeff Jacobson, Alex Solomatnikov, RISC-V ISA Dev, chchang
Craig,

I think Jeff is asking why the P-extension architecture doesn't group GPRs into longer logical vectors (like V extension does with LMUL). I don't know the answer.

I see Jeff has CC-ed Dr. Chuan-Hua Chang, who I believe is the task group chair, and likely the best person to explain their design rationale.

Best,
Nick

Jeff Jacobson

unread,
Jun 18, 2021, 4:38:44 PM6/18/21
to Nick Knight, Craig Topper, Alex Solomatnikov, RISC-V ISA Dev, chchang
Nick,

Right. 

GPR-based SIMD has distinct advantages, but if you can pair GPRs for RV32I (which RISC-V does), you can also pair them for RV64I.
This would have given designs the option of greater parallelism with fewer dynamic instructions (better energy efficiency and better performance).

I’m not understanding why architecturally preventing this is the best choice for the RISC-V architecture.

~Jeff



Jim Wilson

unread,
Jun 18, 2021, 4:40:36 PM6/18/21
to Jeff Jacobson, Nick Knight, Alex Solomatnikov, RISC-V ISA Dev, Chuanhua Chang
On Fri, Jun 18, 2021 at 1:15 PM Jeff Jacobson <jeffjac...@gmail.com> wrote:
Yes, the P-extension. 
I saw the intro you mention, but it doesn’t offer any rationale for the architectural choices, it just talks about the application domains.

I haven't been following the P group either, but this is effectively the Andes NDS32 DSP instruction set ported over to RISC-V.  The architectural choices were likely made years ago inside Andes when they designed this extension.  On the positive side, this is a commercially proven design, as Andes has shipped billions of processors using the NDS32 version of this design, and there is already a code base ported to it.

See the linked page; there is a line near the bottom that mentions the P extension.

Jim

Jeff Jacobson

unread,
Jun 19, 2021, 11:27:12 AM6/19/21
to Jim Wilson, Nick Knight, Alex Solomatnikov, RISC-V ISA Dev, Chuanhua Chang
On the positive side, this is a commercially proven design, as Andes has shipped billions of processors using the NDS32 version of this design, and there is already a code base ported to it.

Fair point, but this raises the question of whether RISC-V is simply a copy/paste of existing architectures without IP encumbrances, or whether it will strive to be better than that. 

“Better” sounds, well, better...

~Jeff



Kevin Cameron

unread,
Jun 19, 2021, 2:55:13 PM6/19/21
to isa...@groups.riscv.org
RISC-V is the commoditization of the RISC idea which was relevant somewhere back in the 1980s, and is mostly surviving because ARM limped into doing phones and SmartPhones this century.

RISC is a bad architecture once it starts missing cache (as is CISC), and fixed register sets make multi-threading expensive; about the only reason it survives beyond the smallest machines is that compiler writers can understand it and build tools for it.

However, I'll be happy if it kills off X86/ARM, and hoping I can help make it do that ;-)

Kev.


Allen Baum

unread,
Jun 20, 2021, 2:04:51 AM6/20/21
to Kevin Cameron, RISC-V ISA Dev
You're saying "RISC-V is simply a copy/paste of existing architectures without IP encumbrances" like it's a bad thing.
Reinventing the wheel is not the best way to make forward progress.
And your idea of "better" may not be the marketplace's idea of "better".
This is a proven design with wide market acceptance - and it's effectively being gifted to RISC-V. 
Don't look a gift horse in the mouth and all that.

The problem with widening SIMD beyond a single register width is multiplying the GPR read and write ports, or adding HW sequencing. 
The former can be done with fixed even/odd register pairs at somewhat reasonable cost, but wider than that is not worthwhile. 
If you're going to sequence the SIMD width, expanding each iteration into a separate instruction has a performance cost only if it expands code beyond the cache capacity. 
That would lead to more Icache misses, and we can measure the effects to see if they are actually significant. 
Until that's done, support for widening isn't "better".

And afterward, if it turns out that a HW sequencer is better? There are still ways we might deal with that.

Jeff Jacobson

unread,
Jun 21, 2021, 3:41:23 PM6/21/21
to Allen Baum, Kevin Cameron, RISC-V ISA Dev
Allen,

It’s not *my* idea of better. 128-bit SIMD isn’t reinventing the wheel. It’s been done before, and judged successful by the marketplace. 64-bit SIMD was attempted in the marketplace and was largely discarded in favor of 128-bit SIMD that includes floating-point. 

Also, 64-bit integer-only SIMD seems a bit out of context with the rest of the RISC-V architecture, which is heading in the direction of supporting modern first-class computing infrastructure, such as virtual memory, nested virtualization, reliability, dynamically-compiled languages, etc. 

You say widening SIMD beyond a single register is a problem, but the P-extension to RV32I does exactly this. The P-extension to RV64I doesn't, and by extension any P-extension for RV128I would appear to be simply wasteful.

Finally, you write: If you're going to sequence the SIMD width, expanding each iteration into a separate instruction has a performance cost only if it expands code beyond the cache capacity. 

This is (1) not true, and (2) drastically over-simplifies the complexities of computer systems.

Firstly, consider that any 128-bit operation that crosses the 64-bit boundary incurs additional costs beyond simply issuing a 64-bit instruction twice. WEXT, BPICK, BITREV, and INSB are some examples. Making these work with 64-bit SIMD vectors requires fiddly software.
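As a concrete illustration (my sketch, not spec code): an instruction like WEXT extracts a word at an arbitrary bit offset, and doing the equivalent across a 128-bit quantity split over two 64-bit registers costs extra shifts, masks, and a special case for the zero offset.

```c
#include <stdint.h>

/* Extract 64 bits starting at bit 'off' (0..63) of a 128-bit value held
 * as two 64-bit halves. This shift/or/special-case dance is the "fiddly
 * software" a wider datapath would replace with one instruction. */
uint64_t extract64(uint64_t hi, uint64_t lo, unsigned off) {
    if (off == 0)              /* (hi << 64) is undefined in C, so special-case */
        return lo;
    return (lo >> off) | (hi << (64 - off));
}
```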

Secondly, even for code where you can simply double the number of instructions, this requires the front and back-end of the CPU to both process twice as many instructions, which is worse for energy efficiency than decoding a single instruction and processing it twice in the back-end. Lower energy efficiency places a limit on CPU performance for a given system configuration, either thermally or electrically. 

Better energy efficiency is objectively better. One of the principal benefits SIMD brings to computer architecture is energy efficiency, by permitting N data items to be processed per instruction decoded. Another benefit is area efficiency, which helps both cost and energy efficiency. On a variable-length instruction machine (such as RISC-V), decoding multiple instructions per cycle has O(N^2) complexity, so the argument can also be made that wider vectors can reduce area (and thus cost) compared with processing more instructions per cycle or doubling the CPU frequency (where a higher CPU frequency increases area [cost] and/or operating voltage).

~Jeff




Craig Topper

unread,
Jun 21, 2021, 3:51:05 PM6/21/21
to Jeff Jacobson, Allen Baum, Kevin Cameron, RISC-V ISA Dev
Correct me if I’m wrong, but my understanding is that the P extension for RV32 uses pairs of registers for a subset of instructions that consume or produce 64-bit integer values. This is the Zpsfoperand sub-extension, which is listed as optional for RV32. The bulk of the P-extension SIMD instructions operate on 32-bit vectors for RV32, right?

Jan O

unread,
Jun 21, 2021, 4:19:29 PM6/21/21
to RISC-V ISA Dev, craig....@sifive.com, Allen Baum, Kevin Cameron, RISC-V ISA Dev, jeffjac...@gmail.com
>1. The SIMD extension is not also VLA (this can be done, see ARM’s SVE)
>2. RISC-V is starting with an MMX-style (64-bit vector, circa 1997) approach, which has been explored and moved past in the industry for lack of parallelism.

>This seems almost guaranteed to cause RISC-V to (at least) double the number of vector instructions to achieve wider vector lengths (which was one of the problems with SIMD David & Andrew point out). Firstly because the world is already using vectors wider than 64-bit, and second because the vector-length has been hard-coded into the ISA. 

>With the writing so clearly on the wall, can somebody explain why 64-bit vectors in GPR’s is the best choice for a forward-looking architecture?
>Honestly, it kinda feels like we’re repeating history, rather than learning from it.

>Also, considering that modern 64-bit CPU hardware often already has a 128-bit data-path to memory (to load unaligned 64-bit values without penalties) use of a 64-bit SIMD vector seems unfortunate, since it leaves performance on the table at the high-performance section of the market.

>The SIMD extension itself lacks any sort of explanation of the rationale and wisdom behind choosing 64-bit vectors as the best path for SIMD in RISC-V.

What you are asking for is RVV, which is going into ratification soon. https://github.com/riscv/riscv-v-spec

The P extension is designed for the microcontroller/embedded niche, where classic SIMD-in-register yields the highest energy/area efficiency. Hence it's fixed-point only.

Jeff Jacobson

unread,
Jun 21, 2021, 4:28:59 PM6/21/21
to Craig Topper, Allen Baum, Kevin Cameron, RISC-V ISA Dev
Craig,

Yes, the RV32 P-extension uses pairs of 32-bit registers and operates on 64-bit vectors of 8/16/32-bit integer values. 
The RV64 P-extension uses individual 64-bit registers and operates on 64-bit vectors of 8/16/32-bit integer values.

This could have alternatively been defined to use pairs of 64-bit registers and operate on 128-bit vectors of 8/16/32-bit integer values, but wasn’t. 
I was curious why this was the case, since it seems counter-intuitive, but perhaps this was just the path of least resistance. 

~Jeff

Craig Topper

unread,
Jun 21, 2021, 4:44:53 PM6/21/21
to Jeff Jacobson, Allen Baum, Kevin Cameron, RISC-V ISA Dev
Jeff,

Most P instructions operate on XLEN-bit vectors of 8/16/32-bit integer values. See for example the definition of ADD16: for RV32 it processes elements 1..0, and for RV64 it processes elements 3..0.

Rd.H[x] = Rs1.H[x] + Rs2.H[x];
for RV32: x=1…0,
for RV64: x=3…0
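A plain-C model of that definition (my sketch; the function name is mine): four 16-bit lanes packed in one 64-bit GPR, each lane added independently so carries never cross lane boundaries.

```c
#include <stdint.h>

/* Model of ADD16 semantics on RV64: Rd.H[x] = Rs1.H[x] + Rs2.H[x]
 * for x = 3..0, with per-lane wraparound and no inter-lane carry. */
uint64_t add16(uint64_t rs1, uint64_t rs2) {
    uint64_t rd = 0;
    for (int x = 0; x < 4; ++x) {
        uint16_t a = (uint16_t)(rs1 >> (16 * x));
        uint16_t b = (uint16_t)(rs2 >> (16 * x));
        rd |= (uint64_t)(uint16_t)(a + b) << (16 * x);
    }
    return rd;
}
```

Note the overflow case: a lane that wraps (0xFFFF + 0x0001) produces 0x0000 in that lane only, which is what a scalar 64-bit add cannot do without extra masking.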



MULR64, on the other hand, uses a pair of GPRs and is documented showing two registers updated for RV32 and one for RV64.

RV32:

Mresult = CONCAT(1`b0,Rs1) u* CONCAT(1`b0,Rs2);
R[Rd(4,1).1(0)][31:0] = Mresult[63:32];
R[Rd(4,1).0(0)][31:0] = Mresult[31:0];

RV64:

Mresult = CONCAT(1`b0,Rs1.W[0]) u* CONCAT(1`b0,Rs2.W[0]);
Rd = Mresult[63:0];
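In plain C, the RV64 behavior amounts to the following (my model, illustrative name): an unsigned 32x32 multiply of the low words, with the full 64-bit product landing in a single destination register.

```c
#include <stdint.h>

/* Model of MULR64 on RV64: multiply the low 32-bit words of rs1 and rs2
 * as unsigned values; the whole 64-bit product fits in one register,
 * whereas RV32 needs an even/odd register pair for the result. */
uint64_t mulr64(uint64_t rs1, uint64_t rs2) {
    return (uint64_t)(uint32_t)rs1 * (uint64_t)(uint32_t)rs2;
}
```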


~Craig

Jeff Jacobson

unread,
Jun 21, 2021, 5:21:32 PM6/21/21
to Jan O, RISC-V ISA Dev, craig....@sifive.com, Allen Baum, Kevin Cameron
Jan,

Yes, the V-extension certainly permits wider vectors and FP, which is all good. 

But I don’t believe the V-extension allows mixing of different-sized vector elements or “horizontal” operations across the vector like most SIMD architectures do. This can be very useful for fields like DSP and Neural Networks, where data size and load/store bandwidth are often constraining factors.

> P extension is designed for microcontroller/embedded niche, where classic simd in register results in highest energy/area efficiency.

The P-extension certainly hails from a microcontroller environment, but SIMD (speaking broadly/generally) has advantages over Cray-style vectors (V-extension) for many applications, particularly where high efficiency and/or performance (relative to system capabilities) are important.

So I guess I’m questioning whether constraining RISC-V to either micro-controller SIMD or Cray-Vectors (best-suited to compiler autovectorization) is really best for the architecture in the long-run.

The V-extension is architected so that further enhancements to support autovectorization (such as predication, scatter/gather) could be added without disrupting the architecture. 

While the current P-extension could easily be extended to add FP, it seems architecturally fixed to a relatively low-performance and/or power-inefficient solution space, by virtue of a short and fixed vector length. This hints at potentially needing a high-performance SIMD at some point in the future, which would be incompatible with the current P-extension. 

Since RISC-V has embraced vector length agnosticism (VLA) in the V-extension, it seems a shame not to allow it in the P-extension. A cost-sensitive implementation could limit the vector length to 64 bits (as it is today), but the architecture as a whole would still allow beefier implementations to support wider vector lengths, which would give the P-extension legs for the long term and allow it to apply to higher-performance and/or higher-energy-efficiency situations.

~Jeff

Andrew Waterman

unread,
Jun 21, 2021, 5:36:48 PM6/21/21
to Jeff Jacobson, Jan O, RISC-V ISA Dev, craig....@sifive.com, Allen Baum, Kevin Cameron
On Mon, Jun 21, 2021 at 2:21 PM Jeff Jacobson <jeffjac...@gmail.com> wrote:
Jan,

Yes, the V-extension certainly permits wider vectors and FP, which is all good. 

But I don’t believe the V-extension allows mixing of different-sized vector elements or “horizontal” operations across the vector like most SIMD architectures do.

The V extension supports both mixed-width operations and a limited set of "horizontal" operations.
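For the record, scalar C models of those two capabilities (my sketches, not RVV code): a widening add in the spirit of RVV's vwaddu.vv, where 8-bit inputs produce 16-bit results so no precision is lost, and a horizontal sum in the spirit of vredsum.vs.

```c
#include <stddef.h>
#include <stdint.h>

/* Mixed-width: 8-bit lanes in, 16-bit lanes out, so a sum like
 * 200 + 100 survives without saturating or wrapping. */
void widening_add_u8(uint16_t *dst, const uint8_t *a,
                     const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint16_t)a[i] + (uint16_t)b[i];
}

/* Horizontal reduction: collapse a vector of 16-bit lanes to one scalar. */
uint32_t reduce_sum_u16(const uint16_t *v, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}
```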

Jeff Jacobson

unread,
Jun 21, 2021, 5:40:50 PM6/21/21
to Andrew Waterman, Jan O, RISC-V ISA Dev, craig....@sifive.com, Allen Baum, Kevin Cameron
Andrew,

Thanks!  Apparently I missed these; I’ll go hunt for them.

~Jeff


