DSP Instruction

239 views
Skip to first unread message

Frohic Food

unread,
Oct 7, 2017, 12:19:51 PM10/7/17
to RISC-V ISA Dev
It's common for embedded processors to take on DSP work.  I wonder what's the suggested way of implementing these.

Specifically I'm thinking of:
  • Address calculation (auto increment/decrement)
    • Is there a recommended order for fusing instructions which modify addresses?
  • Suggested instructions for fusing mutliplies and accumulates?
  • Hardware looping 
    • Again, are there recommendations to reduce the overhead to fuse branch instructions for tight loops?
  • Bit reversed addressing
    • Is there a recommended way of implementing this?
Thanks

John Burke

unread,
Oct 27, 2017, 11:39:35 AM10/27/17
to RISC-V ISA Dev
Hi Frohic,

Not sure if any of this would be recommended, but some of what you are asking is present in pulpino.  As for fused instructions, I know that I've seen them in Rocket and BOOM, so that may answer that end.

-- John

Silviu Chiricescu

unread,
Oct 27, 2017, 4:11:19 PM10/27/17
to Frohic Food, RISC-V ISA Dev
You should also look into the vector extension proposal - it might have all the elements that you are looking for.

-silviu

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5fb64d92-409e-4e30-90c1-87eff817a7f0%40groups.riscv.org.

Clifford Wolf

unread,
Oct 28, 2017, 5:19:41 AM10/28/17
to Frohic Food, RISC-V ISA Dev
On Sat, Oct 07, 2017 at 09:19:50AM -0700, Frohic Food wrote:
> - Bit reversed addressing
> - Is there a recommended way of implementing this?

The B standard extension draft proposal contains a GREV (generalized
bit reverse) instruction that can be used for that.

Frohic Food

unread,
Oct 28, 2017, 7:38:14 AM10/28/17
to Clifford Wolf, RISC-V ISA Dev
Thanks all,

Bit reversed addressing: Where can I access the draft B extension proposal?  That sounds very interesting.

I noted that Pulpino had a loop (although I've not studied it yet), and I expect implementations to fuse so instructions, but unless there is some semi-official list of suggested instructions to fuse, different implementations may choose to implement fuse different instructions.  For instance one implementation may choose to pre-increment an address and another post-increment.  I.e.

Implementation A might fuse post-incrementing addresses
 lw x1,x2
 addi x2, x2, 4

Whereas implementation B might fuse pre-incrementing addresses
 addi x2, x2, 4
 lw x1,x2

Software engineers need to know how to write the code so that if the code is running on an implementation with fused instructions, it can exploit them; hardware engineers need to know whether they should implement pre or post incrementing addresses (for example)

Clifford Wolf

unread,
Oct 28, 2017, 8:00:22 AM10/28/17
to Frohic Food, RISC-V ISA Dev
Hi,

On Sat, Oct 28, 2017 at 12:38:12PM +0100, Frohic Food wrote:
> Bit reversed addressing: Where can I access the draft B extension
> proposal? That sounds very interesting.

You need to be a memeber of the RISC-V Foundation and join the Bit
Manipulation Work Group.

The GREV instruction performs the following operation:

uint32_t grev32(uint32_t x, int k)
{
if (k & 1) x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1);
if (k & 2) x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2);
if (k & 4) x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4);
if (k & 8) x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8);
if (k & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
return x;
}

It will possibly be an immediate-only instruction (with k being the
immediate). So it's possible there will be a GREVI opcode in the B
extension, but no GREV opcode.

regards,
- clifford

Michael Clark

unread,
Oct 28, 2017, 4:19:50 PM10/28/17
to Frohic Food, Clifford Wolf, RISC-V ISA Dev


> On 29/10/2017, at 12:38 AM, Frohic Food <fro...@gmail.com> wrote:
>
> Thanks all,
>
> Bit reversed addressing: Where can I access the draft B extension proposal? That sounds very interesting.
>
> I noted that Pulpino had a loop (although I've not studied it yet), and I expect implementations to fuse so instructions, but unless there is some semi-official list of suggested instructions to fuse, different implementations may choose to implement fuse different instructions. For instance one implementation may choose to pre-increment an address and another post-increment. I.e.
>
> Implementation A might fuse post-incrementing addresses
> lw x1,x2
> addi x2, x2, 4
>
> Whereas implementation B might fuse pre-incrementing addresses
> addi x2, x2, 4
> lw x1,x2
>
> Software engineers need to know how to write the code so that if the code is running on an implementation with fused instructions, it can exploit them; hardware engineers need to know whether they should implement pre or post incrementing addresses (for example)

Yes. I’ve already noticed that arbitrary instruction scheduling can break many macro-op fusion candidates; given the requirement of an adjacency constraint for simplified decode of macro-op fusion pairs.

There will eventually need to be an official list of macro-op fusion pairs so that compilers can schedule them appropriately.

Chandler Carruth, one of the core LLVM developers mentioned at the RISC-V LLVM developers meeting that instruction scheduling was already hard enough and that pseudo instructions expanded in the assembler might be the way to go for macro-op fusion.There are macro-op fusion passes in the compiler but it’s a tough problem to maintain a code order through all of the passes due to the pass ordering problem. Interesting feedback from one of the core LLVM compiler developers…

The current no-brainer is a ZEXT.W pseudo to match SEXT.W. There are already several other assembler pseudo instructions that expand into more than one instruction. Making macro-op fusion pseudo instructions at the assembler level would prevent code motion from interleaving independent instructions (a competing objective function from an optimisation perspective).

> On Sat, Oct 28, 2017 at 10:19 AM, Clifford Wolf <clif...@clifford.at <mailto:clif...@clifford.at>> wrote:
> On Sat, Oct 07, 2017 at 09:19:50AM -0700, Frohic Food wrote:
> > - Bit reversed addressing
> > - Is there a recommended way of implementing this?
>
> The B standard extension draft proposal contains a GREV (generalized
> bit reverse) instruction that can be used for that.
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org <mailto:isa-dev+u...@groups.riscv.org>.
> To post to this group, send email to isa...@groups.riscv.org <mailto:isa...@groups.riscv.org>.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/ <https://groups.google.com/a/groups.riscv.org/group/isa-dev/>.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPQXEtpCvQ74a3XA-dPjZYRcEv9H7cgpGoN9t5%3DwdmBJg53Oqg%40mail.gmail.com <https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAPQXEtpCvQ74a3XA-dPjZYRcEv9H7cgpGoN9t5%3DwdmBJg53Oqg%40mail.gmail.com?utm_medium=email&utm_source=footer>.

Michael Clark

unread,
Oct 28, 2017, 4:20:23 PM10/28/17
to Frohic Food, RISC-V ISA Dev
Saturated arithmetic.

Alex Bradbury

unread,
Oct 28, 2017, 4:52:55 PM10/28/17
to Michael Clark, Frohic Food, Clifford Wolf, RISC-V ISA Dev
for On 28 October 2017 at 21:19, Michael Clark <michae...@mac.com> wrote:
>
>
>> On 29/10/2017, at 12:38 AM, Frohic Food <fro...@gmail.com> wrote:
>>
>> Thanks all,
>>
>> Bit reversed addressing: Where can I access the draft B extension proposal? That sounds very interesting.
>>
>> I noted that Pulpino had a loop (although I've not studied it yet), and I expect implementations to fuse so instructions, but unless there is some semi-official list of suggested instructions to fuse, different implementations may choose to implement fuse different instructions. For instance one implementation may choose to pre-increment an address and another post-increment. I.e.
>>
>> Implementation A might fuse post-incrementing addresses
>> lw x1,x2
>> addi x2, x2, 4
>>
>> Whereas implementation B might fuse pre-incrementing addresses
>> addi x2, x2, 4
>> lw x1,x2
>>
>> Software engineers need to know how to write the code so that if the code is running on an implementation with fused instructions, it can exploit them; hardware engineers need to know whether they should implement pre or post incrementing addresses (for example)
>
> Yes. I’ve already noticed that arbitrary instruction scheduling can break many macro-op fusion candidates; given the requirement of an adjacency constraint for simplified decode of macro-op fusion pairs.
>
> There will eventually need to be an official list of macro-op fusion pairs so that compilers can schedule them appropriately.
>
> Chandler Carruth, one of the core LLVM developers mentioned at the RISC-V LLVM developers meeting that instruction scheduling was already hard enough and that pseudo instructions expanded in the assembler might be the way to go for macro-op fusion.There are macro-op fusion passes in the compiler but it’s a tough problem to maintain a code order through all of the passes due to the pass ordering problem. Interesting feedback from one of the core LLVM compiler developers…
>
> The current no-brainer is a ZEXT.W pseudo to match SEXT.W. There are already several other assembler pseudo instructions that expand into more than one instruction. Making macro-op fusion pseudo instructions at the assembler level would prevent code motion from interleaving independent instructions (a competing objective function from an optimisation perspective).

Certainly both options are available to the compiler writer. Once we
get to that point, we can see which works the best in LLVM. Anyone
interested in compiler support for macro-op fusion in LLVM may want to
read lib/Target/AArch64/AArch64MacroOpFusion.cpp,
lib/CodeGen/MacroFusion.cpp, the review thread at
https://reviews.llvm.org/D34144.

The issue of knowing which instruction pairs might be fused to your
target seems relevant to the previously mentioned idea of core
"profiles" which aims to describe common design points. It's been
mentioned a few times on the lists, but I don't think there have been
any concrete proposals yet.

As it hasn't been mentioned in this thread yet, see
https://arxiv.org/pdf/1607.02318.pdf for a discussion of macro-op
fusion for RISC-V.

Best,

Alex

Jacob Bachmeyer

unread,
Oct 28, 2017, 10:51:53 PM10/28/17
to Clifford Wolf, Frohic Food, RISC-V ISA Dev
GREVI is interesting, but OP-IMM is full, unless you plan to encode it
around the shift instructions, which could work using bits 27~30 as a
"funct4" field. In this encoding, "k" would fit into the "shamt" field,
even on RV128. I like GREV/GREVI, since it also provides BSWAP using
the same opcode if I understand correctly. If RVB can be kept small,
encoding all of RVB into this space might be possible with room left over.

-- Jacob

Guy Lemieux

unread,
Oct 28, 2017, 11:33:16 PM10/28/17
to Frohic Food, RISC-V ISA Dev
our lightweight vector extensions (LVE) support memory-to-memory
streaming operations with auto increment/decrement of src1/src2/dest
address pointers and hardware iteration counting. LVE can also do
fixed-point dot products (multiply + reduction-addition) in 1
instruction. the LVE re-uses the RISC V ALU, so it is extremely
lightweight. the memory-to-memory operations must take place in an
on-chip scratchpad memory that is tightly coupled with the CPU. LVE is
accessible through instructions added to the RISC V assembler and
intrinsics in C. we have observed performance of 12x on dot products,
for example.

there is not yet any support for bit reversed addressing. this
shouldn't be difficult to add.

for even more performance, we have a set of custom vector instructions
implemented by our MXP coprocessor. there can be multiple parallel
ALUs in MXP, and it supports subword SIMD for even higher performance
on 8b and 16b data. the MXP is loosely coupled to the host CPU,
implements its own internal scratchpad which is fully observable and
writable from the host CPU, and can run on a different clock than the
host CPU. it supports user-defined custom vector instructions. and
multiple different hosts (RISC V, ARM, MicroBlaze, Nios, can also add
MIPS etc). it does not require any changes to the compiler, but like
LVE we have an optimized RISC V mode that makes changes to the
assembler. we have observed speedups of ~1,000 using our vector
instructions, and ~10,000x by adding custom vector instructions. we've
created custom vector instructions for CNNs that can issue 2,000
operations per clock cycle; we've measured a soft processor system
running at 150MHz as >600x faster than the hard ARM core running at
666MHz.

here is info on LVE that I presented at the 4th RISC V Workshop:
https://riscv.org/wp-content/uploads/2016/07/Tue1515_VectorBlox_ORCA_with_LVE.pdf
https://www.youtube.com/watch?v=4f_rRNV-yzM&feature=youtu.be

g
> --
> You received this message because you are subscribed to the Google Groups
> "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at
> https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5fb64d92-409e-4e30-90c1-87eff817a7f0%40groups.riscv.org.

Michael Clark

unread,
Oct 28, 2017, 11:45:57 PM10/28/17
to Jacob Bachmeyer, Clifford Wolf, Frohic Food, RISC-V ISA Dev
Another approach is sharing the GREV functional unit for bit reverse and byte swap with the immediate multiplexed in hardware. The area is similar and opcode space could be conserved.

i.e. Substitute BREV for GREV and add BSWAP to this block diagram and multiplex the GREV immediate depending on function (CLZ/BSWAP/BREV):

- https://github.com/rv8-io/rv8/blob/master/doc/src/bitmanip.png

The question is how often will GREVI be used where k is not 31 or 24. I guess it could exist as a distinct instruction even if there are opcodes for BSWAP and BREV (which could also be pseudos for the GREVI values 31 and 24). My big question is do we make CLZ efficient given its popularity assuming we can re-use the functional units. The machine code sequence is too complex to macro-fuse given the need for a temporary register. I do hope we have CLZ/CTZ instructions and not 5 cycle instruction sequences.

However we shouldn’t be presuposing implementation approaches but at the same time they should be considered.

The first 9 of this set of instructions below maps precisely to compiler builtins and will magically make a lot of code faster, with the exception of the last 3, all of them have builtins or lifts.

## Candidate instructions for B extension (Bit Manipulation)

Opcode Long Name Description
RLL[W] rd,rs1,rs2 Rotate Left Logical Rotate bits in rs1 left by the amount in rs2
RRL[W] rd,rs1,rs2 Rotate Right Logical Rotate bits in rs1 right by the amount in rs2
RLLI[W] rd,rs1,shamt Rotate Left Logical Immediate Rotate bits in rs1 left by the immediate
RRLI[W] rd,rs1,shamt Rotate Right Logical Immediate Rotate bits in rs1 right by the immediate
BCLZ[W] rd,rs1 Bit Count Leading Zeros Count leading zero bits in rs1
BCTZ[W] rd,rs1 Bit Count Trailing Zeros Count trailing zero bits in rs1
BCNT[W] rd,rs1 Bit Count Count number of bits set in rs1
BREV[W] rd,rs1 Bit Reverse Reverse bits in rs1
BSWAP[W] rd,rs1 Byte Swap Swap byte order in rs1
BEXT[W] rd,rs1,rs2 Bit Extract Gather LSB justified bits to rd from rs1 using extract mask in rs2
BDEP[W] rd,rs1,rs2 Bit Deposit Scatter LSB justified bits from rs2 to rd using deposit mask in rs2
GREVI[W] rd,rs1,imm7 Generalized Reverse Generalized Reverse of bits in rs1 accoring to mask in imm7

One could possibly remove a rotate direction at the expense of a subtract for the non immediate form (which is certainly less frequent) but this list is pretty minimal compared to the large variety of rarely used bit manipulation instructions that are out there on other ISAs. I’d consider this list a relatively minimal list.

I’m mostly interested in the first 9 as a lot of code is already using the builtins/lifts…

The only field extract is BEXT. They all suffer from the same problem which is the limited space for an immediate. BEXT/BDEP are most useful in loops where the masks are loaded into registers.

Clifford Wolf

unread,
Oct 29, 2017, 7:44:13 AM10/29/17
to Jacob Bachmeyer, Frohic Food, RISC-V ISA Dev
On Sat, Oct 28, 2017 at 09:51:47PM -0500, Jacob Bachmeyer wrote:
> GREVI is interesting, but OP-IMM is full, unless you plan to encode
> it around the shift instructions

this is exactly what is being proposed.

Clifford Wolf

unread,
Oct 29, 2017, 7:47:46 AM10/29/17
to Michael Clark, Jacob Bachmeyer, Frohic Food, RISC-V ISA Dev
On Sun, Oct 29, 2017 at 04:45:49PM +1300, Michael Clark wrote:
> The question is how often will GREVI be used where k is not 31 or 24.

potentially a lot if you do things like arbitrary bit permutations:
http://nbviewer.jupyter.org/url/svn.clifford.at/handicraft/2017/permsyn/data.ipynb

Reply all
Reply to author
Forward
0 new messages