Architecture CALL / RETURN instructions

Robert Finch

Nov 7, 2024, 3:56:03 AM

to RISC-V ISA Dev

Does RISC-V have anything resembling an architecture call / return instruction? It would allow using the instruction set of a different architecture, or allow RISC-V to be used from another architecture. I think this is just two instructions (call and return) that do not use a lot of opcode space. The exact mechanics of switching architectures would not need to be fully defined. One approach might be to use a buffer for storing register contents.

Guy Lemieux

Nov 7, 2024, 3:59:19 AM

to Robert Finch, RISC-V ISA Dev

A TG has been formed to do this for the custom opcodes only:

https://github.com/riscv-admin/composable-custom-extensions

which is being built out of prior work done by the SoftCPU SIG, written up here:

https://github.com/grayresearch/CX

Guy

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/6064aa44-db22-4e12-9eaa-a75eaaddaf52n%40groups.riscv.org.

Robert Finch

Nov 7, 2024, 4:39:23 AM

to RISC-V ISA Dev, Guy Lemieux, Robert Finch

That is in the direction I was thinking. Thanks for the references. However, it appears to be somewhat RISC-V centric. It also seems to be pretty complex, though I have not read the whole thing yet. I was thinking, for instance, that multiple CPU types could be stored on a flash memory device and selected in an FPGA via a call instruction. One of my current projects has an i386-like core along with a native core. The i386 core does not know anything about CSRs, and neither does the native core. Because the cores do not know anything about the structure of another core, the interface has to be somewhat abstract / bland.

Guy Lemieux

Nov 7, 2024, 10:44:07 AM

to Robert Finch, RISC-V ISA Dev

it's complicated because it has to be. what we're doing is actually minimal and practical.

what you're trying to do is virtually impossible without setting some ground rules, like what happens to the register contents of the old ISA, what happens to anything running in the "background", and any interrupt responses etc. your ground rules might be pretty harsh which is more like "shutdown this processor and its ISA completely, and boot this other processor ISA". that's pretty easy to do on an FPGA -- probably easier by having two actual processors. good luck :-)

guy

BGB

Nov 8, 2024, 2:16:13 AM

to Guy Lemieux, Robert Finch, RISC-V ISA Dev

On 11/7/2024 9:43 AM, Guy Lemieux wrote:
> it's complicated because it has to be. what we're doing is actually
> minimal and practical.
>
> what you're trying to do is virtually impossible without setting some
> ground rules, like what happens to the register contents of the old
> ISA, what happens to anything running in the "background", and any
> interrupt responses etc. your ground rules might be pretty harsh which
> is more like "shutdown this processor and its ISA completely, and boot
> this other processor ISA". that's pretty easy to do on an FPGA --
> probably easier by having two actual processors. good luck :-)
>

In my case, I have a mechanism to call between different ISA modes in my
CPU core, but granted:
  Both ISAs share the same register space, with defined mapping rules;
  There is enough overlap between them that things aren't too weird;
  All other significant parts of the architecture are shared;
  It is essentially a single CPU core with multiple sets of decoders.

In my case, the inter-ISA call/return was implemented by using tag bits
in the function-pointer register and link-register (conceptually also
tied to the PC register, sorta).

Can't claim to be strictly original in this approach.

Granted, it would be a much uglier issue if the ISAs were entirely
dissimilar or ran on different processors. Not really sure how this
would be approached.

The cost of an inter-ISA call is essentially that it requires a pipeline
flush if the mode changes (it may be branch-predicted if no mode change
will happen). There is some tagging on cache lines in the L1 instruction
cache, but this should not matter unless the same line were being run
as two different ISAs.

Setting up a function pointer for the target mode often has to be done
manually. In my ISA, it is possible to use "LEAT.B (PC, Disp), Rn" to
capture a tagged function pointer, but currently no direct equivalent
exists in RV land (AUIPC+ADD will give an untagged pointer).

However, for the link register it is implicit (for code in RV64 mode, it
doesn't poke at the link-register contents, so doesn't notice that mode
tagging is being used). Trying to jump to the link register implicitly
restores the mode captured in the link register.

So:
  ( 0): 0=Same ISA, 1=Inter-ISA (always 1 for Link-Register values)
  (47: 1): PC Address
  (63:48): Mode (6-bit) and Status (T, S, and U bits).
  High bits are ignored if LSB is clear.

With Modes:
  000000: Baseline (No WEX)
  000001: Baseline (WEX)
  000010: RV64GC
  000011: RV64G + XG3RV (new)
  000100: XG2 (No WEX)
  000101: XG2 (WEX)
  000110: XG2RV (No WEX)
  000111: XG2RV (WEX)
  rest: Unused/Reserved for now.
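
As a rough illustration of the tagged-pointer layout and mode table above, here is a minimal Python sketch. The positions of bit 0, bits 47:1, and bits 63:48 are taken from the layout; placing the 6-bit mode in the low bits of the 63:48 field is an assumption (the post does not say where mode and status sit inside that field), and the names `make_tagged` and `decode` are made up for the example.

```python
# Sketch of the 64-bit tagged function-pointer / link-register layout.
# ASSUMPTION: the 6-bit mode occupies the low 6 bits of the (63:48) field,
# with the status bits above it. The post does not specify this.

MODE_NAMES = {
    0b000000: "Baseline (No WEX)",
    0b000001: "Baseline (WEX)",
    0b000010: "RV64GC",
    0b000011: "RV64G + XG3RV",
    0b000100: "XG2 (No WEX)",
    0b000101: "XG2 (WEX)",
    0b000110: "XG2RV (No WEX)",
    0b000111: "XG2RV (WEX)",
}

ADDR_MASK = (1 << 48) - 2  # bits 47:1 hold the PC address


def make_tagged(pc, mode, status=0):
    """Build an inter-ISA pointer: LSB set, mode/status in bits 63:48."""
    high = ((status << 6) | (mode & 0x3F)) & 0xFFFF
    return (high << 48) | (pc & ADDR_MASK) | 1


def decode(ptr):
    """Return (pc, mode); mode is None for a same-ISA (untagged) pointer."""
    pc = ptr & ADDR_MASK
    if (ptr & 1) == 0:
        return pc, None  # high bits are ignored when the LSB is clear
    return pc, (ptr >> 48) & 0x3F
```

For example, `decode(make_tagged(0x10000, 0b000010))` gives back the address `0x10000` and mode 2, which `MODE_NAMES` reports as RV64GC.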

Where:
  Baseline:
    Original ISA mode;
    16/32/64/96 bit instruction encodings;
    Nominally 32 GPRs.
    Has 16-bit instructions:
      For the most part, can only access R0..R15;
      Encoding and layout kinda resembles the SuperH ISA.
  RV64GC:
    Should be obvious enough
  RV64G + XG3RV:
    Newer mode (still very experimental).
    Has the RV64G encoding space.
    XG3 is a bit-repacked version of my ISA.
      Mostly to make the encoding less ugly;
      And to be able to exist in the same encoding space.
    Like with RV, XG3 relies on hardware superscalar.
    Internally, is repacked into a modified XG2RV during Fetch.
      Allowed reusing the existing instruction decoders.
  XG2:
    A modified form of Baseline
    Expanded register fields to 6 bits via XOR trickery;
    Expanded immediate fields in similar ways;
    The 16-bit instructions are N/E in this mode.
  XG2RV:
    XG2 encoding, but using the RISC-V register space.

As is, register mappings:
  RV:
    X0: ZZR (Zero)
    X1: LR (RA)
    X2: SP
    X3: GP (GBR)
    X4: TP (R4)
    X5..X13: R5..R13
    X14/X15: R2/R3
    X16..X31: R16..R31
    F0..F31: R32..R63
  Baseline / XG2:
    R0: DLR (N/E in RV)
    R1: DHR (N/E in RV)
    R2/R3 (X14/X15)
    R4..R13 (X4..X13)
    R14 (N/E in RV)
    R15 (SP)
    ...
  XG2RV/XG3RV:
    X0..X31: Same as RV
    X32..X63: R32..R63 / F0..F31
    Both of these modes use the RV64 LP64 ABI.
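
The "RV:" part of the mapping above can be written down as a small lookup. This is only a sketch of the table as listed; the function name `rv_to_shared` is hypothetical, and registers that the table names rather than numbers (ZZR, LR, SP, GP) are returned by name.

```python
# Sketch: map a RISC-V register name onto the shared register space,
# following the "RV:" rows of the table. The function name is hypothetical.

def rv_to_shared(reg):
    """Map 'X0'..'X31' or 'F0'..'F31' to the shared-space register name."""
    kind, num = reg[0].upper(), int(reg[1:])
    if not 0 <= num <= 31:
        raise ValueError(f"bad register: {reg}")
    if kind == "F":
        return f"R{32 + num}"  # F0..F31 -> R32..R63
    if kind != "X":
        raise ValueError(f"bad register: {reg}")
    # Rows of the table that do not map X<n> straight to R<n>.
    special = {0: "ZZR", 1: "LR", 2: "SP", 3: "GP", 14: "R2", 15: "R3"}
    if num in special:
        return special[num]
    return f"R{num}"  # X4..X13 and X16..X31 keep their number
```

So `rv_to_shared("X14")` yields `R2` and `rv_to_shared("F0")` yields `R32`, matching the rows above.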

Earlier on, X4 was mapped to TBR, but I ended up changing this when
trying to get RV64 Linux ELF binaries to work, realizing in this case
that the libc implementations try to set up and manage TP themselves.

XG2RV has not seen much use, as by itself it lacks a compelling use-case:
  There has been little real reason to use it by itself over XG2;
  The need for mode changing to interoperate with RV64 code was still a
    hassle;
  ...

XG3RV seems a bit more promising:
  Direct inter-operation with RV64G is possible without the use of
    function-pointer tagging;
  Both RV64G and XG3RV can coexist in the same encoding space;
  The compiler can freely mix/match instructions from both ISAs (at
    present, the output from my compiler is a confetti mix of both ISAs).

Also XG3RV seems to have succeeded in giving "performance that doesn't
suck", while also giving interop with RV64G that doesn't suck.

Can note, general instruction encoding scheme I ended up going with:
  XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3R)
  iiii-iiiiii-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3RI)
  iiii-iiiiii-iiiiii-aZZZ-nnnnnn-bY-YYPw (2RI Imm16)
  iiii-iiiiii-iiiiii-aZZZ-jjjjjj-bY-YYPw (~ JAL, +/- 16MB)
Where:
  X/Y/Z: Opcode
  n=Rd, m=Rs1, o=Rs2
  i=Immed
  Pw: 00/01: Predicated (currently unused), 10=XG3 Op, 11=RV Op
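
To make the 3R layout above concrete, here is a hedged Python sketch that splits a 32-bit word into the listed fields. It assumes the leftmost character of the pattern is bit 31 and the rightmost is bit 0, which is consistent with RV ops having low bits 11 but is not stated outright in the post. The meaning of the Q bit is not described, so it is just extracted as-is. The function names are made up for the example.

```python
# Sketch: field extraction for the 3R form
#   XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw
# ASSUMPTION: leftmost pattern character is bit 31, rightmost is bit 0.

def decode_3r(word):
    """Split a 32-bit instruction word into the 3R fields."""
    return {
        "X":   (word >> 28) & 0xF,   # opcode bits XXXX
        "rs2": (word >> 22) & 0x3F,  # oooooo
        "rs1": (word >> 16) & 0x3F,  # mmmmmm
        "Z":   (word >> 12) & 0xF,   # opcode bits ZZZZ
        "rd":  (word >> 6) & 0x3F,   # nnnnnn
        "Q":   (word >> 5) & 0x1,    # Q bit (meaning not given in the post)
        "Y":   (word >> 2) & 0x7,    # opcode bits YYY
        "Pw":  word & 0x3,           # 00/01 predicated, 10 XG3, 11 RV
    }


def encoding_class(word):
    """Classify a word by its low two bits (the Pw field)."""
    return {0b10: "XG3", 0b11: "RV"}.get(word & 0x3, "Predicated")
```

With this reading, an ordinary 32-bit RV instruction such as `0x00000013` classifies as RV, since its two low bits are 11.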

This ended up replacing the idea I posted about a little over a month
ago; I ended up redesigning things in a way that allowed me to leverage
my existing decoders.

It lacks the concept of WEX (explicit bundle encoding), as for XG3 I
went over to in-order superscalar. It does still have jumbo prefixes,
but uses a different set of prefixes than those used for RV64+Jx
encodings (and thus far, the two types of jumbo prefixes are not
interchangeable).

XG3RV would still have interop hassles with RV64GC though.

This hassle can be avoided via the RV64+Jumbo scheme, which has full
compatibility with RV64GC, but isn't quite as good in terms of performance;
Trying to use it as a generic 64-register ISA is worse on average than
using it as a 32 register ISA (one generally needs to keep integer
values in X registers and FPU values in F registers otherwise
code-density is negatively affected).

In contrast, XG3RV has native 6-bit register fields, and gets
better code density and performance when used as a flat 64 register
space (relative gains over the prior Jumbo-Prefix extension are much
smaller if used as a 32 register ISA). I suspect this is the major
"practical" difference here (otherwise, the functionality from my own
ISA that is being used in the case of Doom, is largely already present
in the 'B' extension).

Can note that XG3RV is within an 8% performance delta relative to XG2,
so there may not be much more to gain here. Thus far, my compiler isn't
using the full feature set of this mode; and some features (such as
predication) I am considering leaving as optional.

Where, I can note for ".text" size in Doom (along with fps at start of
E1M1, at 50 MHz):
  XG2      (BGBCC): 289K, 25 fps
  XG3RV    (BGBCC): 320K, 23 fps
  RV64G+Jx (BGBCC): 360K, 20 fps (*1)
  RV64GC   (GCC  ): 393K, --     (*2)
  RV64G    (BGBCC): 438K, 12 fps (*3)
  RV64G    (GCC  ): 445K, 17 fps

*1: Jumbo prefixes (expanded immediate and displacement fields),
register-indexed load/store, Zba instructions, and load/store pair.

*2: Builds but Doom crashes on start-up, may need more debugging.

*3: My compiler isn't doing so hot here...
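
For a quick side-by-side reading of the table, the ratios relative to XG2 can be worked out with a few lines of Python. The numbers are copied from the table; the helper name `relative_to` is hypothetical, and percentages are rounded to whole numbers.

```python
# Figures copied from the table: (.text size in KB, fps or None if no result).
results = {
    "XG2 (BGBCC)":      (289, 25),
    "XG3RV (BGBCC)":    (320, 23),
    "RV64G+Jx (BGBCC)": (360, 20),
    "RV64GC (GCC)":     (393, None),  # crashes on start-up, no fps figure
    "RV64G (BGBCC)":    (438, 12),
    "RV64G (GCC)":      (445, 17),
}


def relative_to(base, table):
    """Return {name: (size %, fps %)} relative to the base entry."""
    base_size, base_fps = table[base]
    out = {}
    for name, (size, fps) in table.items():
        size_pct = round(100 * size / base_size)
        fps_pct = None if fps is None else round(100 * fps / base_fps)
        out[name] = (size_pct, fps_pct)
    return out
```

With these figures, XG3RV comes out at about 111% of XG2's ".text" size and 92% of its frame rate, which is consistent with the 8% delta noted earlier.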

I suspect I may be near the limits of how much speed I can get out
of this (at least, for programs like Doom and similar).

Might also make sense to do similar comparisons for Quake.

Side note, in the past had also experimented with faster-clocked scalar
cores, but generally have had better results at slightly lower
clock-speeds (which generally allowed for things like bigger and better
performing L1 caches). Say: 32K L1 cache with consistent access latency,
beats a 4K L1 cache that needs to stall whenever there is a memory RAW
hazard or similar, ... What MHz gives, L1 misses and RAW hazards take away.

Most attempts to move to more MHz almost invariably hurt performance
more than they gain.

But, that said, going much bigger than 32K of L1 cache also doesn't gain
much.


Robert Lipe

Nov 8, 2024, 12:40:19 PM

to RISC-V ISA Dev, Robert Finch

Since this is inherently quite specific to your processor, can you just [ab]use an ECALL as a general purpose trap and then build whatever kind of API/calling convention (struct regs or jmp_buf, sweetened to taste) and have each side of the compiler aware of this? You're kind of doing a ring transition/syscall that's just preserving state and jumping, just like an exception (of which ECALL is one) would have to do anyway, so it might not be the most grotesque imaginable fit.

The referenced spec work seems more applicable to architectures that are at least somewhat similar, like RV32[E] and RV64 or variations of endianness, etc. If you're trying to link 8085 and RV64 into the same image and jump between them in a custom FPGA or device, it's not like you need industry consensus on how it's implemented, do you?

"Not all things worth doing are worth doing well."

BGB

Nov 8, 2024, 4:05:41 PM

to Robert Lipe, RISC-V ISA Dev, Robert Finch

On 11/8/2024 11:40 AM, Robert Lipe wrote:
> Since this is inherently quite specific to your processor, can you just
> [ab]use an ECALL as a general purpose trap and then build whatever kind
> of API/calling convention (struct regs or jmp_buf, sweetened to taste)
> and have each side of the compiler aware of this? You're kind of doing a
> ring transition/syscall that's just preserving state and jumping, just
> like an exception (of which ECALL is one) would have to do anyway, so it
> might not be the most grotesque imaginable fit.
>

Yeah, it doesn't need to be traditional system calls...

Ironically, in my case, it gets a little wonky:
My OS originally used a vaguely COM-like system call mechanism;
But, later tried adding Linux style syscalls (mostly in a stalled
attempt to get binaries built using either GNU libc or Musl-libc
working), but thus far without much success (stuff generally dies in the
ELF dynamic linking stage, and it is being too much of a pain to debug).

Where, say, my system calls (in RV Mode):
  X10: Object Handle
  X11: Method/Syscall Number
  X12: Address of Return Value
  X13: Address of Argument List
  X14..X16: Unused
  X17: Set to -1 (more recent requirement).

Object handle was 0 (NULL) for plain OS syscalls.
Would be non-zero for an inter-task method call.
With my ISA, it is similar, just using R4..R7 instead.

The syscall handler uses the ISA mode to know which registers to use.
So, a syscall from RV/XG3RV/XG2RV mode will need to use X10..X17.
Currently, code in my own ISA can't invoke Linux style syscalls.

With method calls, the syscall number essentially encodes an index into
the object's VTable (both ends assumed to have a matching VTable). The
object in the caller's task essentially has a vtable filled with wrapper
stubs which will perform a syscall to the corresponding method number,
capturing the argument list into a temporary buffer (also with a space
for the return value, a pointer to this buffer provided by the task
context).
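
The wrapper-stub arrangement described above might be sketched roughly as follows. This is a toy model, not the actual implementation: `make_proxy`, `do_syscall`, and the dictionary used as the temporary buffer are all hypothetical stand-ins, and the "syscall" is just a Python callback.

```python
# Toy model of a client-side proxy whose VTable slots are wrapper stubs.
# ASSUMPTION: names and the buffer layout are invented for illustration.

def make_proxy(handle, method_count, do_syscall):
    """Build one stub per VTable slot; each stub forwards via do_syscall."""
    def make_stub(index):
        def stub(*args):
            # Capture the arguments into a temporary buffer that also has
            # space for the return value.
            buf = {"ret": None, "args": list(args)}
            # The method number is the index into the object's VTable.
            do_syscall(handle, index, buf)
            return buf["ret"]
        return stub
    return [make_stub(i) for i in range(method_count)]


# Server side: a matching VTable, reached here through a fake "syscall".
server_vtable = [lambda a, b: a + b, lambda s: s.upper()]


def fake_syscall(handle, index, buf):
    buf["ret"] = server_vtable[index](*buf["args"])


proxy = make_proxy(handle=7, method_count=2, do_syscall=fake_syscall)
```

Calling `proxy[0](2, 3)` then goes through the stub, the fake syscall, and slot 0 of the server's table, which only works because both ends agree on the slot numbering.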

Whether or not the client-side object has any data, or is merely a stub,
will depend on the object in question (these object instances are local
to the client task).

Linux Syscalls:
  X10..X15: System Call Arguments
  X16: -
  X17: System Call Number

I ended up tweaking the system calls slightly, say:
  If X17>=0, Interpret as Linux style Syscall
  If X17==-1, Use my original mechanism
  Else: Unknown

Could have used 0, except 0 is a valid Linux system call.
Linux style syscalls here use X10 for the return value.
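
The X17 rule above amounts to a small dispatch on a signed value. A minimal Python sketch, assuming registers are passed as raw 64-bit values in a dictionary keyed by register name; the function names and the returned dictionary shape are invented for the example.

```python
# Sketch of the X17-based dispatch for a caller in RV mode.
# ASSUMPTION: register values arrive as unsigned 64-bit integers.

def as_signed64(value):
    """Interpret a raw 64-bit register value as a signed integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >> 63 else value


def classify_syscall(regs):
    """Decide which syscall convention a trap is using, based on X17."""
    x17 = as_signed64(regs["X17"])
    if x17 >= 0:
        # Linux style: X17 is the syscall number, X10..X15 the arguments.
        return {"kind": "linux", "number": x17,
                "args": [regs[f"X{i}"] for i in range(10, 16)]}
    if x17 == -1:
        # Original mechanism: handle, number, return and argument addresses.
        return {"kind": "native", "handle": regs["X10"],
                "number": regs["X11"], "ret_addr": regs["X12"],
                "args_addr": regs["X13"]}
    return {"kind": "unknown"}
```

Note that X17 equal to 0 lands in the Linux branch, which is why 0 could not serve as the marker for the original mechanism.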

OS syscalls were generally assumed to have access to userland memory,
whereas for inter-task method calls, the assumption would be that
"GlobalAlloc" memory is used, where GlobalAlloc memory is assumed to be
visible to all processes (by default, other memory is assumed to be
process-local).

But, yeah, in theory, could glue all sorts of stuff together with COM
style interfaces.

The destination need not necessarily be the same ISA, nor even
necessarily on the same machine. There is no networking support as of
yet, though there is basic socket support, handled internally by
treating everything as if it were IPv6 (including IPv4 and AF_UNIX
sockets).

Not used sockets much as of yet, as they would have higher overhead and
effective latency vs the use of COM style calls.

Or, pretty much any other mechanism could be devised.

In my case:
Memory management and file IO and similar are done via plain syscalls;
UI/Graphics, Sound/MIDI, and OpenGL, are done via COM-style interfaces.

When one wants an interface to an object, a system call is used to fetch
the interface.

Specifics differ slightly from actual COM:
  Rather than always using a GUID, a pair of 64 bit numbers is used;
  They may be interpreted as one of:
    A pair of FOURCC's;
    A pair of EIGHTCC's;
    A SIXTEENCC;
    A GUID.

It is generally possible to detect which it is by looking at the values.
In my case, FOURCC's and EIGHTCC's are generally used for public / OS
APIs. Traditional COM would use GUIDs for everything, but here I am
assuming GUIDs would primarily be for private / non-OS APIs.

( Decided to leave out an explanation for how I had implemented the
OpenGL API on top of this... )

...

> The referenced spec work seems more applicable to architectures that are
> at least somewhat similar, like RV32[E] and RV64 or variations of
> endianness, etc. If you're trying to link 8085 and RV64 into the same
> image and jump between them in a custom FPGA or device, it's not like
> you need industry consensus on how it's implemented, do you?
>

Ironically, FWIW, mixing RV32 and RV64 would be more of a stretch than
what I am doing with my mechanisms (since the difference in pointer
sizes and struct layouts would make direct data sharing impractical in
many cases).

For things further out, something different would be needed.

Things like different endianness would likely make direct interaction
between the ISAs impractical.

Some lighter-weight interfaces, namely calls with pointer-tagging, only
really make sense if both ends have the same data layouts and compatible
ABIs (or if one can do ABI translation via a thunk).

Using a thunk is a little easier at least if one has ABI variants where
the thunk generation does not need to know argument types (and so can
merely shuffle the registers around in a predefined manner).
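
A type-agnostic thunk of the kind described above is essentially a fixed register permutation followed by a jump. The Python sketch below models that with a dictionary as the register file. The move list used in the example (R10/R11 into R4/R5) is purely illustrative and not a statement about any real ABI pair, and `make_thunk` is a hypothetical name.

```python
# Sketch: a thunk that shuffles registers in a predefined way and then
# calls the target. It never looks at argument types.
# ASSUMPTION: the register file is modelled as a dict of name -> value.

def make_thunk(moves, target):
    """moves: list of (src, dst) register names, applied as a parallel move."""
    def thunk(regs):
        snapshot = dict(regs)  # read every source before writing any dest
        for src, dst in moves:
            regs[dst] = snapshot[src]
        return target(regs)
    return thunk


# Illustrative use: the callee expects its first two arguments in R4/R5,
# while the caller placed them in R10/R11.
callee = lambda regs: regs["R4"] - regs["R5"]
thunk = make_thunk([("R10", "R4"), ("R11", "R5")], callee)
```

Taking a snapshot first keeps the shuffle correct even when the move list swaps two registers, which a naive sequence of moves would get wrong.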

