RISC V ISA for embedded systems


Amila Chandimal Jayawickrama

Nov 10, 2016, 12:33:28 AM
to RISC-V ISA Dev
Is there an existing customized ISA for embedded systems based on RISC V (RISC V embedded ISA)?

Samuel Falvo II

Nov 10, 2016, 12:36:53 AM
to Amila Chandimal Jayawickrama, RISC-V ISA Dev

RV32EC is probably what you're looking for.  16 GPRs, and 16-bit compressed instruction set.  I don't know if anyone is making such a processor though.

Otherwise, most folks just use RV32IM(C) or RV64IM(C) it seems.



--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/a4c828b1-69a5-468a-8046-d93ca031e761%40groups.riscv.org.

Peter Ashenden

Nov 10, 2016, 2:08:42 AM
to Samuel Falvo II, Amila Chandimal Jayawickrama, RISC-V ISA Dev
At ASTC (www.astc-design.com), we have an implementation of RV32EC as a synthesizable IP core intended for small embedded applications, such as smart sensors and IoT. Feature set:
  • RISC-V RV32E base instruction set, compliant with RISC-V User-Level ISA Version 2.1.
  • RVC standard 16-bit compressed instructions for common RV32 instructions, for reduced code size.
  • Optional "M" standard extension for integer multiplication and division instructions.
  • Provision for application-specific instruction set extensions, e.g. for DSP operations.
  • Simple machine-mode privileged architecture with direct physical addressing of memory, compliant with RISC-V Privileged Architecture Version 1.9.1.
  • Machine-mode timer and timer comparison.
  • 20 extended interrupts, plus timer and software interrupts.
  • All interrupts and exceptions vectored for fast interrupt response.
  • Wait-for-interrupt supporting clock gating for low-power idle state.
  • 2-stage pipeline comprising fetch and execute stages. Most instructions complete in one clock cycle.
  • Tightly-coupled memory interfaces for ASIC ROM and SRAM memories.
  • AHB-Lite interface for extended memory and memory-mapped I/O.
  • Firmware and virtual prototype development supported by ASTC’s VLAB system-level design tools.
Feel free to contact me for more info.

Cheers,

Peter Ashenden, CTO IC Design, ASTC


Stefan O'Rear

Nov 10, 2016, 2:32:19 AM
to Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama, RISC-V ISA Dev
On Wed, Nov 9, 2016 at 11:08 PM, Peter Ashenden
<peter.a...@gmail.com> wrote:
> Simple machine-mode privileged architecture with direct physical addressing
> of memory, compliant with RISCV Privileged Architecture Version 1.9.1.
> Machine-mode timer and timer comparison.

> Feel free to contact me for more info.

If any of y'all have time and inclination I'd be very interested to
see a brief write-up of what's good/bad/ugly about implementing 1.9.1
from a deeply embedded perspective. I and some others have been doing
initial exploration for a streamlined privilege architecture that
reduces complexity in medium and large systems, but I'd like a better
idea of what the main complexity drivers are for very small systems.

-s

Peter Ashenden

Nov 10, 2016, 3:49:54 AM
to Stefan O'Rear, Samuel Falvo II, Amila Chandimal Jayawickrama, RISC-V ISA Dev
I found implementation of the machine-mode privileged architecture in our core to be quite straightforward. Our simple 2-stage pipeline makes it easy, but I wouldn't imagine it would be much more complicated in a deeper pipeline (for machine-mode only). Dealing with exception handling worked out fairly nicely. I haven't done any actual measurements on apportionment of gate count to the privileged architecture versus the rest of the core, but my guess is that it's pretty small.

PA

Rogier Brussee

Nov 10, 2016, 7:13:25 AM
to RISC-V ISA Dev
Mainly as a comment on the RVC spec, I developed a standalone fixed-width 16-bit ISA, Xcondensed, that is essentially RVC but
without having to implement the 32-bit wide ISA as well as the 16-bit wide ISA.

Xcondensed is designed to use the first three quadrants only, which makes it
possible to use it in combination with the 32-bit wide ISA and then have essentially the same code compression characteristics (especially for a 32-bit
processor without soft floating point). It simply adopts the same encoding for all instructions that, according to the benchmarks in the RVC spec,
bring 85% of the compression; of the remaining 15%, at least 10% should be covered by alternative and additional 16-bit wide instructions,
with the exception of floating-point load and store instructions. I also expect that most of the remaining incompressible 32-bit wide instructions
can be emulated in just two 16-bit wide instructions.
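
For readers unfamiliar with the quadrant terminology: in the standard RISC-V encoding, the two low bits of the first 16-bit parcel select one of four quadrants. Quadrants 00, 01 and 10 hold the 16-bit RVC encodings, and 11 marks a 32-bit (or longer) instruction. A minimal Python sketch of that check follows; the function names are illustrative and not taken from any tool or from the Xcondensed proposal.

```python
def quadrant(parcel: int) -> int:
    """Return the quadrant (0-3) given by the two low bits of a 16-bit parcel."""
    return parcel & 0b11


def is_16bit(parcel: int) -> bool:
    """Quadrants 0, 1 and 2 hold 16-bit encodings; quadrant 3 is 32 bits or longer."""
    return quadrant(parcel) != 0b11


# 0x4501 is c.li a0, 0 (quadrant 1); 0x0513 is the low parcel of addi a0, x0, 0.
for parcel in (0x4501, 0x0513):
    print(hex(parcel), quadrant(parcel), is_16bit(parcel))

# Three of the four quadrants are 16-bit, i.e. 75% of the parcel space.
print(sum(is_16bit(q) for q in range(4)) / 4)
```

A fixed-width 16-bit ISA confined to quadrants 0 to 2 can therefore coexist with the 32-bit encodings, because a decoder can still tell the two apart from the low two bits alone.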

The main platforms where I think it would be useful to use Xcondensed as a standalone ISA are those where it is an alternative to RV32EC, RV32EMC, RV32IC or RV32IMC,
with only M and perhaps timer and instruction counter support, which is why I am reacting to your post. However, you can get quite far cramming in instructions:
just to show it's possible, I crammed in an instruction set that is functionally equivalent to RV64IMAFD (i.e. RV64G) with MUSH support.

Like RVC, each instruction in the Xcondensed ISA maps 1-1 to the 32-bit wide ISA. This means that, if Xcondensed is used as a code compression
mechanism and a 32-bit ISA decoder is also available, it can be used by only modifying the assembler. As a standalone instruction set it will
need compiler support, but this should be an easy modification of the existing support for RV and RVC because by design it is as similar as
possible given the width constraints.

See this thread on this list: 


and this spreadsheet which contains the complete suggested Xcondensed ISA, opcode formats. With CC licence. 


Rogier





Bruce Hoult

Dec 11, 2016, 4:22:42 PM
to RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com
I was thinking that in embedded CPUs which are typically 32 bit and don't implement hardware FP, there are a LOT of 16 bit instruction encodings going unused where the load/store double/quad/SF/DF go.

Maybe an extension to decrease code size a bit more by shortening function prologues and epilogues? Either load/store pair, or a full-on push/pop multiple?

Those kinds of instructions are very nasty to implement on wide-dispatch out of order CPUs -- but they wouldn't be present there. On a single issue in-order embedded CPU a little hardware sequencer possibly doesn't cost much at all.

There is a proposal to use "millicode" functions to do this, and it's even available in gcc with -msave-restore, but it has quite a big speed hit, especially if you don't hit (or don't have) branch prediction.
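
To make the code-size trade-off concrete, here is a back-of-the-envelope Python sketch comparing inline saves/restores with shared save/restore routines. Every byte count in it is an assumption chosen for illustration: each inline save or restore is taken to be a 2-byte compressed instruction, the stack adjustments and the return 2 bytes each, and the millicode variant is taken to cost one 4-byte call in the prologue and one 4-byte tail jump in the epilogue. The size of the shared routines themselves is ignored, on the assumption that it is amortized over many functions.

```python
def inline_bytes(saved_regs: int) -> int:
    """Prologue + epilogue bytes when every save/restore is emitted inline."""
    prologue = 2 + 2 * saved_regs        # stack adjust + one store per register
    epilogue = 2 * saved_regs + 2 + 2    # one load per register + stack adjust + ret
    return prologue + epilogue


def millicode_bytes(saved_regs: int) -> int:
    """Prologue + epilogue bytes when shared save/restore routines are used.

    Under the stated assumptions the cost does not depend on saved_regs.
    """
    return 4 + 4                         # one call in, one tail jump out


for n in (1, 2, 4, 8, 13):
    print(n, inline_bytes(n), millicode_bytes(n))
```

Under these assumptions a function that saves four registers spends 22 bytes inline versus 8 bytes with the shared routines, at the cost of the extra call and jump on every entry and exit.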

Sober Liu

Dec 11, 2016, 8:51:14 PM
to Bruce Hoult, RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com
Being an embedded system was the reason for me to ask for a 32-bit address range with RV64.
We don't want to use RV32, as we see RV64 get much better performance results than RV32 on the Dhrystone and CoreMark benchmarks.
As for hardware cost, I think double-width registers and more registers are worth it compared with the performance gain:
- 64-bit registers and load/store bandwidth give double the performance for memcpy and make it easier to trigger GCC inlining;
- Much less register spilling, e.g. for the soft-float library.
- MUL/DIV handling.

Andrew Waterman

Dec 11, 2016, 9:02:47 PM
to Sober Liu, Bruce Hoult, RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com
While RV64 can meaningfully improve performance on applications that
make use of the wider registers, the effect is usually less pronounced
than for e.g. ARM and x86, where the register count was also doubled.
I'd recommend you benchmark your applications of interest, rather than
making the decision based upon the microbenchmarks (and share your
results if you can!).

Sober Liu

Dec 11, 2016, 10:26:51 PM
to Andrew Waterman, Bruce Hoult, RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com
In our case we have an SoC in which many RISC-V cores will be instantiated. They are separate cores rather than multiple harts.
Our use cases include the following tasks:
- Soft floating-point math. Obviously RV64 does much better than RV32 for double-precision floating point.
- R-B tree handling with 64-bit key-start and key-end. For each node, we need to compare two 64-bit keys.
- Memory moves for some video/image blocks.

Maybe our cases are a little bit special, but here we are sure that RV64 gets much better performance than RV32.
BTW, the R-B tree node contains pointers like pLeft, pRight and pParent. We see much better cache performance if the pointer size is 32 bits.
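
The cache effect of the pointer size can be estimated with a small Python sketch. The node layout here is hypothetical: two 64-bit keys followed by the three pointers named above, with no colour field, payload or padding. The 64-byte cache line is likewise an assumption.

```python
import struct

# Two 64-bit keys followed by pLeft, pRight, pParent; "<" means no padding.
NODE_PTR64 = struct.calcsize("<QQQQQ")   # 64-bit pointers
NODE_PTR32 = struct.calcsize("<QQIII")   # 32-bit pointers

CACHE_LINE = 64  # bytes, an assumed line size

print(NODE_PTR64, NODE_PTR32)
print(CACHE_LINE // NODE_PTR64, CACHE_LINE // NODE_PTR32)
```

With this layout a node shrinks from 40 to 28 bytes, so two whole nodes fit in one assumed 64-byte line instead of one. A real node with extra fields or alignment padding would give different numbers.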

Rogier Brussee

Dec 12, 2016, 4:44:29 AM
to RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com

Trying to use the double/quad/fp single/fp double slots of the three quadrants reserved for the 16-bit wide ISA for the most general purpose (i.e. mostly embedded) to come up with a bare-bones but complete instruction set (i.e. usable for embedded) is exactly what Xcondensed is about. While it defines a bare-bones fp ISA, this is mainly to show that one can. The core of Xcondensed is the 32-bit processor IMA part. However, I felt it was essential to be able to use Xcondensed as the code compression ISA for both 32-bit and 64-bit processors as an addition to the 32-bit wide ISA, ensuring that efforts and benefits in compiler, linker, etc. infrastructure are shared by the whole ecosystem.

In fact the main point of the Xcondensed exercise is to show that RISC-V *is missing the opportunity to have a minimal, dense, fixed-width 16-bit wide standalone base instruction set that can be extended with a variable-width ISA and maps each instruction to the existing 32-bit IMA ISA*. In fact, I tried to show that the Cv1.9 extension can be changed into such a core, keeping a sane instruction encoding and with relatively minor (but unfortunately incompatible) changes, giving confidence that little if anything is sacrificed on code compression.

I am "secretly" hoping somebody with more experience and with more incentive than I runs with that idea and pushes to change the Cv2.0 extension in that direction and finds Xcondensed a useful starting point. 

However, I do not really know if such a fixed width base instruction set would have real world advantages over the (variable width) combination of the 32 bit ISA with v1.9 like 16 bit C extension ISA which clearly has a headstart. I would be interested to know.

Ciao
Rogier


Jacob Bachmeyer

Dec 12, 2016, 6:27:29 PM
to Bruce Hoult, RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com
Bruce Hoult wrote:
> Maybe an extension to decrease code size a bit more by shortening
> function prologues and epilogues? Either load/store pair, or a full-on
> push/pop multiple?

Generally speaking, load/store pair is an almost ideal candidate for an
expanded notion of macro-op fusion. If the processor can perform
LOAD-PAIR/STORE-PAIR more efficiently than two LOAD/STORE operations,
then the instruction decoder can recognize an appropriate sequence and
emit LOAD-PAIR or STORE-PAIR into the pipeline, even though those do not
exist in the ISA.

I suggest that advice be given along the lines of "when loading or
saving multiple registers, order the operations in a single monotonic
sequence by architectural register number" in the assembler programming
guide. Implementations can then recognize those sequences and
internally execute a LOAD-MULTIPLE/STORE-MULTIPLE, except that the
source/destination addresses would be fully general, a neat trick that
we otherwise would not be able to support.
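
As a rough illustration of what such a decoder-side check could look like, the Python sketch below scans a sequence of word loads and reports adjacent pairs that might be fused. The matching rule (same base register, consecutive destination registers, offsets one word apart, and the first load not overwriting the base) is an assumption made for this sketch, not something specified in the thread or in the ISA; the `Load` type and `fusible_pairs` name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Load(NamedTuple):
    rd: int      # destination register number
    base: int    # base address register number
    offset: int  # byte offset from the base


def fusible_pairs(seq: List[Load], width: int = 4) -> List[Tuple[int, int]]:
    """Return index pairs (i, i+1) of adjacent loads a decoder might fuse."""
    pairs = []
    i = 0
    while i + 1 < len(seq):
        a, b = seq[i], seq[i + 1]
        # The first load must not clobber the base used by the second.
        same_base = a.base == b.base and a.rd != a.base
        if same_base and b.rd == a.rd + 1 and b.offset == a.offset + width:
            pairs.append((i, i + 1))
            i += 2          # both loads are consumed by the fused op
        else:
            i += 1
    return pairs


# An epilogue restoring x8, x9 and x18 from the stack pointer (x2).
epilogue = [Load(8, 2, 0), Load(9, 2, 4), Load(18, 2, 8)]
print(fusible_pairs(epilogue))
```

Here only the x8/x9 loads qualify, which shows why a register-number ordering convention matters: it determines how many neighbouring accesses the hardware can recognise.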

This is orthogonal to the hardware-assisted context-save that I
previously proposed and am currently revising for a new proposal with
less baggage.

-- Jacob

Shumpei Kawasaki

Dec 12, 2016, 7:48:45 PM
to Bruce Hoult, RISC-V ISA Dev, peter.a...@gmail.com, sam....@gmail.com, amila...@gmail.com, Oleg Endo

Bruce, 

A friend, Oleg, ran code size benchmarks using RISC-V GCC and compared the results with Cortex-M series CPUs. I attach the results here. If someone has evaluated RISC-V LLVM compilers against Cortex-M, we are interested in hearing the results.

Considering ARM has worked on its GCC for code size for years and the RV32GC and RV64GC GCC is still nascent, I feel a 15-19% code size increase over Cortex-M is good. In addition, RISC-V shows good density for future embedded workloads (e.g. robotics numerical computing). The RISC-V C extension, I feel, is a fine solution, and more compiler optimizations might yield even higher density. The RISC-V C extension dedicates 75% of the opcode space to the 16-bit format and 25% to 32-bit or larger encodings. Having reviewed the SH, Thumb and MIPS-16 encodings and their shortcomings, I believe that the C extension might be a fine solution. We might consider waiting till the dust settles on their GCC and LLVM improvements.

In 1989, inspired by the "quantitative approach" advocated by Dr. Patterson and Dr. Hennessy, 21 of my friends and I spent a year designing the 16-bit fixed-length RISC ISA for the Hitachi SH architecture. We ran experiments on C compilation. Richard Stallman, before he became a big cheese, assisted us in evaluating this ISA. Stallman gave Hitachi a go-ahead on the ISA and introduced Michael Tiemann to Hitachi, and Hitachi became the first large customer of Cygnus Support, which later became Red Hat. The US patent office granted broad patent claims in US patent 5682545 A, "Microcomputer having 16 bit fixed length instruction format", which Hitachi filed in 1995. ARM and MIPS adopted 16-bit fixed-length ISAs, ARM Thumb and MIPS-16. Hitachi demanded that ARM licensees pay significant patent license fees, according to rumors among ARM employees. With Thumb 2, ARM worked around the patent. Since October 2014, when the last of the Hitachi patents expired, ARM restarted Thumb.

I wanted to share this history as the 16-bit fixed-length history discussion in the C extension gives ARM and MIPS too much credit. In my opinion ARM was not a company of architects when it began. It was a company that sold RTL. ARM, however, learned early on that an ISA change, or ISA bifurcation, would be an expensive proposition. Others, including Hitachi, never learned this.

Shumpei

RISC-V Code Size Comparison with Cortex M .pdf

Stefan O'Rear

Dec 12, 2016, 10:15:58 PM
to Jacob Bachmeyer, Bruce Hoult, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
On Mon, Dec 12, 2016 at 3:27 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Bruce Hoult wrote:
>>
>> Maybe an extension to >>>decrease code size<<< a bit more by shortening function
>> prologues and epilogues? Either load/store pair, or a full-on push/pop
>> multiple?
>
>
> Generally speaking, load/store pair is an almost ideal candidate for an
> expanded notion of macro-op fusion. If the processor can perform

Macro-op fusion does not address code size at all. (Millicode does,
as does subroutine creation at the compiler and source-code levels;
I'm guessing a functional return address stack pays large dividends in
systems that are _mostly_ optimized for size.)

-s

Jacob Bachmeyer

Dec 12, 2016, 11:01:42 PM
to Stefan O'Rear, Bruce Hoult, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
Oops, I had missed that the goal was to decrease code size. You are
correct that macro-op fusion has no effect there.

Perhaps an extension including last-access-relative memory accesses that
could be efficiently compressed with "restartable critical section"
semantics? (A partial "commit" may be visible to another thread or an
ISR, but after a trap, execution continues at the first instruction in
the group. Yes, this is a way to deal with the hidden state introduced by
"last-access-relative".)

-- Jacob

Bruce Hoult

Dec 12, 2016, 11:12:18 PM
to jcb6...@gmail.com, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
I suppose a push/pop multiple instruction can be implemented by suppressing instruction fetch and making the next opcode be the current opcode with a bit cleared for whichever register was pushed/popped. When all bits are cleared, resume normal instruction fetch.

Remembering we're talking specifically about small single issue in order machines here.

The same goes for a store/load pair. STP R8 (for example) can merely store R8 and auto-magically make the next opcode be a STW R9 (setting the low bit) instead of fetching an instruction.
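
The sequencing idea can be modelled in a few lines of Python. This is only a behavioural sketch under assumptions of my own: a hypothetical push-multiple opcode carries a register bitmask, registers are pushed lowest-numbered first, and each push moves the stack pointer down by 4 bytes. None of these details come from an existing RISC-V extension.

```python
def push_multiple(mask: int, sp: int, regs: list, memory: dict) -> int:
    """Store every register whose bit is set in mask, one per simulated cycle.

    Returns the final stack pointer. The mask plays the role of the register
    list held in the (hypothetical) opcode.
    """
    while mask:                                  # instruction fetch stays suppressed
        reg = (mask & -mask).bit_length() - 1    # lowest-numbered pending register
        sp -= 4
        memory[sp] = regs[reg]                   # one store per cycle
        mask &= mask - 1                         # clear that register's bit in the "opcode"
    return sp                                    # mask == 0: normal fetch resumes


regs = list(range(100, 116))                     # x0..x15 with recognisable values
memory = {}
sp = push_multiple((1 << 1) | (1 << 8) | (1 << 9), 0x1000, regs, memory)
print(hex(sp), memory)
```

The loop body corresponds to one pass through the execute stage with the rewritten opcode fed back in place of a fetched instruction, which is why the extra hardware on a single-issue in-order core could be small.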

Michael Clark

Dec 13, 2016, 2:23:21 AM
to Rogier Brussee, RISC-V ISA Dev
Hi Rogier,

This is quite interesting. The idea of distinct encodings for the same ISA is an interesting concept. I can imagine some interesting use cases.

You might be interested in a tool I have written. We can machine generate disassembler, interpreter and documentation from spreadsheet like input. I am working on an assembler at the moment. With the addition of the assembler we could get the RISC-V gcc or clang backend to use a different assembler and output for the RISC-V ISA binary encoding variant. The tool uses the RISC-V opcodes format and supports 16, 32, 48 and 64-bit instruction sizes.


I’m working on JIT and assembler at the moment. Assuming this is just a difference in encoding then it should require a very small amount of code changes to the toolchain from assembler down (once we’ve finished the assembler).


We might hit a speed bump with the relocation types and the linker. I am most likely only going to support static link in the short term, which should be fine for embedded.

There are two files I have not machine generated, but they are already in a format that can be easily machine generated as they are highly uniform and match the metadata. The bit encoder/decoders for the operands are machine generated but the codecs are not. I should machine generate these too:


I hadn’t seen your mail but fortuitously I had switched to the assembler and JIT as I’ve reached as far as I can get on implementing the documented side of the standard parts of the privileged spec.

Michael.


Jacob Bachmeyer

Dec 13, 2016, 9:35:41 PM
to Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
Yes, but the encoding space consumed affects the entire ISA.

> The same goes for a store/load pair. STP R8 (for example) can merely
> store R8 and auto-magically make the next opcode be a STW R9 (setting
> the low bit) instead of fetching an instruction.

Hmm, could the opcodes for LD/SD/LQ/SQ be reused for load/store multiple
on RV32E? So "SD x8, <address aligned to 64-bits>" on RV32E would be
executed as "SW x8, <addr>; SW x9, <addr+4>" while "LQ x8, <address
aligned to 128-bits>" would be "LW x8, <addr>; LW x9, <addr+4>; LW x10,
<addr+8>; LW x11, <addr+12>". The alignment requirements could be
relaxed, at the expense of either more complex hardware or a monitor
trap. Since RV32E has only 16 registers, the entire register file could
be saved or loaded with only four 32-bit instructions.
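
The proposed expansion can be written out as a short Python sketch. The helper name `expand`, the strict natural-alignment check, and the rule that a register group may not run past x15 are assumptions for illustration; the thread leaves those corner cases open.

```python
def expand(op: str, reg: int, addr: int) -> list:
    """Expand a reused SD/LD/SQ/LQ into the equivalent SW/LW sequence."""
    count = {"SD": 2, "LD": 2, "SQ": 4, "LQ": 4}[op]
    if addr % (4 * count):
        raise ValueError("address not naturally aligned")
    if reg + count > 16:
        raise ValueError("register group runs past x15")
    word_op = "SW" if op[0] == "S" else "LW"
    return [f"{word_op} x{reg + i}, {addr + 4 * i:#x}" for i in range(count)]


print(expand("SD", 8, 0x100))
print(expand("LQ", 8, 0x100))

# Four quad stores cover all 16 registers of RV32E.
whole_file = [line for r in (0, 4, 8, 12) for line in expand("SQ", r, 0x200 + 4 * r)]
print(len(whole_file))
```

The last three lines check the arithmetic behind the final sentence above: four quad-sized operations of four registers each account for the 16-entry register file.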

I am less sure about adding this to the RV*I bases, since limiting it to
RV32E avoids the problem of the same opcode having very different
meanings in different ISA widths, although it does introduce a new
issue--RV32E would no longer be 100% upwards compatible to RV32I.


-- Jacob

Andrew Waterman

Dec 13, 2016, 10:56:01 PM
to jcb6...@gmail.com, Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
Load-multiple/store-multiple were a conscious omission, and not for opcode space reasons. They would violate the property that RVC instructions are 1:1 with RV instructions, and would increase implementation complexity more than the rest of RVC.

The save/restore millicode routines get nearly all the code size benefit, at slight performance cost for simple implementations, but likely at higher performance for complex implementations.

Jacob Bachmeyer

Dec 13, 2016, 11:16:40 PM
to Andrew Waterman, Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
Andrew Waterman wrote:
> Load-multiple/store-multiple were a conscious omission, and not for
> opcode space reasons. They would violate the property that RVC
> instructions are 1:1 with RV instructions, and would increase
> implementation complexity more than the rest of RVC.

I was not suggesting making those part of RVC; rather I was suggesting
using the (32-bit) LD/SD/LQ/SQ opcodes for load/store pair/quad _on_
_RV32E_. The catch is that RV32E would no longer be 100% upwards
compatible to RV*I, so it was more or less a strawman.

> The save/restore millicode routines get nearly all the code size
> benefit, at slight performance cost for simple implementations,
> but likely at higher performance for complex implementations.

How does a complex implementation get higher performance from millicode
routines? How can that be optimized? Is the performance improvement
from better use of instruction cache lines? Does this scale enough that
desktops could plausibly have use for millicode?


-- Jacob

Andrew Waterman

Dec 13, 2016, 11:47:29 PM
to Jacob Bachmeyer, Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
On Tue, Dec 13, 2016 at 8:16 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Andrew Waterman wrote:
>>
>> Load-multiple/store-multiple were a conscious omission, and not for opcode
>> space reasons. They would violate the property that RVC instructions are 1:1
>> with RV instructions, and would increase implementation complexity more than
>> the rest of RVC.
>
>
> I was not suggesting making those part of RVC; rather I was suggesting using
> the (32-bit) LD/SD/LQ/SQ opcodes for load/store pair/quad _on_ _RV32E_. The
> catch is that RV32E would no longer be 100% upwards compatible to RV*I, so
> it was more or less a strawman.

Ah, I see.

>
>> The save/restore millicode routines get nearly all the code size benefit,
>> at slight performance cost for simple implementations, but likely at higher
>> performance for complex implementations.
>
>
> How does a complex implementation get higher performance from millicode
> routines? How can that be optimized? Is the performance improvement from
> better use of instruction cache lines? Does this scale enough that desktops
> could plausibly have use for millicode?

[Engage hand-waving] My intuition is that high-performance processors
will be optimized to handle procedure calls/returns, loads, and stores
very efficiently, and that's all the millicode routines are. These
processors tend to be decoupled enough that the burst of loads/stores
will queue up, and not interlock the pipeline.

By contrast, even dynamically scheduled superscalars are quite likely
to implement LDM/STM by interlocking decode and serially emitting
micro-ops. It's of course possible to do better, but it costs
hardware and complexity, whereas the millicode routines reuse
already-optimized pathways.


Cesar Eduardo Barros

Dec 14, 2016, 5:17:17 AM
to Andrew Waterman, Jacob Bachmeyer, Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
One moment. Are we talking in this thread about load/store *pair*, or
load/store *multiple*? I'd expect the implementation considerations to
be significantly different.

I'm not a hardware designer, but I'd expect a decoder to be able to
easily emit the two micro-ops from a load/store *pair* in parallel,
while for a load/store *multiple* I'd expect it to pause the decoding
and sit in a loop emitting one load/store micro-op per cycle.

Also, a non-standard load pair/store pair extension does not need to be
limited to RV32E; it can also be available for RV32I, and even emulated
if the core doesn't have it. The encoding is so obvious, that I'd expect
everyone who implements it to use the same encoding: load/store with
funct3=011, which on RV64 is LD/SD but on RV32 has no meaning, and
require the pair to start with an even-numbered register.
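
A small Python sketch of how a decoder or emulator might recognise this encoding follows. The field positions and the LOAD/STORE major opcodes are those of the standard 32-bit base formats; treating funct3=011 as "load/store pair" on RV32 and requiring an even data register is the non-standard proposal from this message, and `is_pair_candidate` is a hypothetical name.

```python
LOAD, STORE = 0b0000011, 0b0100011   # standard major opcodes


def is_pair_candidate(insn: int) -> bool:
    """True if a 32-bit word is a load/store with funct3=011 and an even data register."""
    opcode = insn & 0x7F
    funct3 = (insn >> 12) & 0x7
    if funct3 != 0b011:
        return False
    if opcode == LOAD:
        data_reg = (insn >> 7) & 0x1F     # rd
    elif opcode == STORE:
        data_reg = (insn >> 20) & 0x1F    # rs2
    else:
        return False
    return data_reg % 2 == 0


# On RV64 these words are ld x8, 0(x2), ld x9, 0(x2) and sd x8, 0(x2).
for word in (0x00013403, 0x00013483, 0x00813023):
    print(hex(word), is_pair_candidate(word))
```

A core without the extension could use the same test in its illegal-instruction handler to emulate the pair with two word accesses.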

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Christopher Celio

Dec 14, 2016, 5:26:18 AM
to Cesar Eduardo Barros, Andrew Waterman, Jacob Bachmeyer, Bruce Hoult, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
> Also, a non-standard load pair/store pair extension does not need to be limited to RV32E; it can also be available for RV32I, and even emulated if the core doesn't have it. The encoding is so obvious, that I'd expect everyone who implements it to use the same encoding: load/store with funct3=011, which on RV64 is LD/SD but on RV32 has no meaning, and require the pair to start with an even-numbered register.

There is no point in a 4-byte load-pair instruction -- you can effect the same outcome by using two 2-byte loads using macro-op fusion. And you avoid needless fragmentation of the ecosystem.

- https://arxiv.org/abs/1607.02318

As for load-multiple, Andrew has already mentioned it can be handled as fast as possible using millicode routines.

-Chris

Bruce Hoult

Jan 5, 2017, 9:50:24 AM
to Christopher Celio, Cesar Eduardo Barros, Andrew Waterman, Jacob Bachmeyer, Stefan O'Rear, RISC-V ISA Dev, Peter Ashenden, Samuel Falvo II, Amila Chandimal Jayawickrama
(back from vacation)

Yes, naturally load/store pair would need to be 16 bit instructions, or there is no point. Which means there is no room for two register fields, which means the pair would need to be adjacent -- which then means you need one fewer bit for the register field, if you also require the lower numbered register to be (say) even.
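
The bit saving can be shown with a tiny Python sketch. The function names and the choice to drop the low bit of the register number are assumptions for illustration, not an existing encoding.

```python
def encode_pair_reg(first_reg: int) -> int:
    """Map an even register number to a field one bit narrower."""
    if first_reg % 2:
        raise ValueError("pair must start at an even register")
    return first_reg >> 1


def decode_pair_reg(field: int) -> tuple:
    """Recover both register numbers of the pair from the field."""
    first = field << 1
    return first, first + 1


field = encode_pair_reg(8)
print(field, decode_pair_reg(field))
```

With the 16 registers of RV32E this leaves a 3-bit field naming one of eight adjacent pairs.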
