RISC-V macro operation fusion + out of order execution


Juha Manninen

Feb 19, 2023, 7:51:42 AM
to RISC-V ISA Dev
Q: How does the macro operation fusion system work when a CPU implements out of order execution? I guess they are not exclusive.
Is it so that the out of order logic deals with combined macro commands?
Before that, it must then be a compiler's responsibility to place potentially fusable commands after each other, making the task easy for the CPU.

If that is true, it reminds me of Intel's Itanium VLIW CPU which delegated the ordering and joining of commands to a compiler. It failed because of reasons I still don't quite understand.
I am puzzled and my question is honest. The macro operation fusion appears like a super brilliant system to me.

Background:
I have followed RISC-V development for a few years already. Recently I watched Erik Engheim's presentation on YouTube: https://www.youtube.com/watch?v=L9jvLsvkmdM
I still didn't find an answer to my question. If this is the wrong place to ask, or if the question is already answered somewhere, please forward me to the proper place.

I am a 60 year old man, still following the development of CPUs and programming languages, and I participate in a FOSS SW project.

Regards,
Juha Manninen

MitchAlsup

Feb 19, 2023, 1:18:34 PM
to RISC-V ISA Dev, Juha Manninen
On Sunday, February 19, 2023 at 6:51:42 AM UTC-6 Juha Manninen wrote:
Q: How does the macro operation fusion system work when a CPU implements out of order execution?

A fused instruction is considered as if it were one instruction, and passes through the instruction queueing structure and on to calculation and memory-reference resources, (maybe) passing through the Reorder Buffer before final retirement.
 
I guess they are not exclusive.

Should be orthogonal.
 
Is it so that the out of order logic deals with combined macro commands?

That is why the instructions got fused--because there is a function unit that can handle both instructions at once.
 
Before that it must then be a compiler's responsibility to place potentially fusable commands after each other making the task easy for the CPU.

Instruction scheduling has fallen out of vogue with the rise of the Great-Big Out-Of-Order machine implementations. The HW is generally better at this than the code scheduler (certain caveats apply).

If that is true, it reminds me of Intel's Itanium VLIW CPU which delegated the ordering and joining of commands to a compiler. It failed because of reasons I still don't quite understand.

Itanic failed because there were too many architects combined between Intel and HP,......then they missed all semblance of schedules, power dissipation, and performance. Other than that, the Edsel was a great success.......

BGB

Feb 19, 2023, 6:19:56 PM
to isa...@groups.riscv.org
On 2/19/2023 6:51 AM, Juha Manninen wrote:
> Q: How does the macro operation fusion system work when a CPU
> implements out of order execution? I guess they are not exclusive.
> Is it so that the out of order logic deals with combined macro commands?
> Before that it must then be a compiler's responsibility to place
> potentially fusable commands after each other making the task easy for
> the CPU.
>

Could be done via an OoO CPU.

It could also be done in fetch/decode (similar to Superscalar) by
special-case recognition of a pair of instructions (and then treating
them like a single larger instruction).

Drawback is mostly that special-case pattern recognition is kind of
awkward and fairly expensive (and doesn't scale up very well).


> If that is true, it reminds me of Intel's Itanium VLIW CPU which
> delegated the ordering and joining of commands to a compiler. It failed
> because of reasons I still don't quite understand.
> I am puzzled and my question is honest. The macro operation fusion
> appears like a super brilliant system to me.
>

I am having moderate success with my ISA design (full custom), which is
(in some ways) sort of like a watered down Itanium...


Trying to deal with pipeline scheduling and instruction bundling in the
compiler is kind of a massive pain though (but, one would still need
this part to work well for optimizing for in-order superscalar to be
effective).

Some of the more powerful techniques for getting good performance out of
C code are currently out of reach of my compiler (and were apparently
for Intel as well). Otherwise, if one tries to write code in a way that
performs well on narrow x86 or 32-bit ARM machines, it performs poorly on
a VLIW. One needs to modify the coding style some; ironically enough, this
leads to code that would perform poorly on 32-bit x86 or ARM.
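
As a small illustration of the kind of restructuring involved (a generic sketch, not code from any particular compiler): splitting a reduction across independent accumulators exposes instruction-level parallelism that a VLIW or wide in-order machine can actually schedule, while on a register-starved 32-bit x86 it mostly adds code size and spills (and the reassociation slightly changes the floating-point result).

  #include <stddef.h>

  /* One serial dependency chain through 'sum': fine for a narrow machine. */
  float dot_serial(const float *a, const float *b, size_t n)
  {
      float sum = 0.0f;
      for (size_t i = 0; i < n; i++)
          sum += a[i] * b[i];
      return sum;
  }

  /* Four independent chains the compiler can bundle per cycle.
     (Assumes n is a multiple of 4, just to keep the sketch short.) */
  float dot_unrolled(const float *a, const float *b, size_t n)
  {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
      for (size_t i = 0; i < n; i += 4) {
          s0 += a[i + 0] * b[i + 0];
          s1 += a[i + 1] * b[i + 1];
          s2 += a[i + 2] * b[i + 2];
          s3 += a[i + 3] * b[i + 3];
      }
      return (s0 + s1) + (s2 + s3);
  }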


But, one also has to use some things conservatively: say, if one goes
and modulo-schedules all of the loops, one may lose more performance due
to the harm this causes to code density than one gains from the
"faster" loop.

So, one has to have additional information, say:
How often is this code likely to be executed;
Average case for how long the loop spins when the function is called;
...
And, typically, the compiler has no way of knowing any of this.


One area where Itanium suffered was in terms of large instruction words,
and cases where it was necessary to fill spots in bundles with NOPs,
etc. This combined to result in notoriously bad code density for IA-64.

In my case, I am using 32-bit instruction words with a daisy-chain
encoding. So, for VLIW style code, code-density remains more on par with
a more traditional fixed-length RISC (though, the ISA also has 16-bit
ops for use in size-optimized code).

Design was mostly influenced by SuperH and TMS320 and similar.


Besides this, IA-64 systems were expensive; things might have gone
differently had they been priced more competitively with x86.

Decided to leave out a more detailed comparison between the various ISAs.


> Background:
> I have followed RISC-V development for few years already. Recently I
> watched Erik Engheim's presentation in YouTube :
>  https://www.youtube.com/watch?v=L9jvLsvkmdM
> Still didn't find an answer for my question. If this is a wrong place to
> ask, or if the question is already answered somewhere, please forward me
> to a proper place.
>

I didn't really start looking much at RISC-V until after I had started
working on my own thing. But, granted, RISC-V has existed for a lot
longer than my ISA project.

I see it as OK, as both projects are trying to target different parts of
the ISA design space.


And, as noted, there are a lot of use-cases where RISC-V makes a lot
more sense.

For higher end processors, I suspect if Moore's law comes to an end,
this is going to be "not ideal" for OoO; there will be more pressure to
get more performance relative to area and more performance per Watt,
which is an area where VLIW has an advantage.

Even if OoO remains on top in an instructions-per-clock sense, it will
do so at the cost of needing more area per core and more power usage, ...


But, for lower-end processors and microcontrollers, a RISC-style ISA and
small in-order cores are pretty hard to beat.

I expect this to be the strong area for RISC-V, whereas on higher end, I
am less certain.

I suspect that, on the higher end, both ARM and x86-64 are in a bad spot
in terms of perf/area and perf/watt. This may eventually catch up with
them (enough to offset their advantages in other areas).



> I am a 60 year old man, still following the development of CPUs and
> programming languages, and I participate in a FOSS SW project.
>

In my case, I am still in the twilight of my 30s...


Itanium was sort of the "hot new thing" back when I was in high-school,
but after graduating high-school I ended up getting an Athlon64 based PC
(to replace the former Athlon Thunderbird running Win2K I had been using
during HS).

Admittedly, I first started getting into programming language and
compiler design when I was in high-school. At the time, writing a Scheme
interpreter, then later writing a JavaScript interpreter; then during
college classes, hacking the JS interpreter into a makeshift C compiler.

But, much of a lifetime later, my C compiler still kinda sucks (and
still doesn't support compiling C++, but this seems like too much
effort). But, still mostly works in a basic sense.

I didn't start getting into ISA design and FPGAs until a little more
recently (but was still early 30s when I started on this project).


What the future holds for this, I don't know.

I am not making any money on any of this (nor is it clear how I would do
so), which kind of makes things difficult in the sense of people being
expected to do things that pay towards their cost of living and similar...

But, at the same time, I also like to be able to do things that I find
"actually interesting", whether or not they are necessarily all that
useful in a "cost of living" sense.


> Regards,
> Juha Manninen
>

Juha Manninen

Feb 24, 2023, 9:04:00 AM
to RISC-V ISA Dev
Thanks for the answers.

I understood that a CPU implementation may require the potentially fusable instructions be placed successively by a compiler, but some other implementation may find them by OoO design.
In any case a compiler should plan for the fusion. As Erik Engheim's presentation shows, fusable instructions must have the same destination register. Assigning the registers in a certain way allowed fusing 3 instructions together.


On Monday, February 20, 2023 at 1:19:56 AM UTC+2 BGB wrote:
One area where Itanium suffered was in terms of large instruction words,
and cases where it was necessary to fill spots in bundles with NOPs,
etc. This combined to result in notoriously bad code density for IA-64.

OK, yes.


For higher end processors, I suspect if Moore's law comes to an end,
this is going to be "not ideal" for OoO; there will be more pressure to
get more performance relative to area and more performance per Watt,
which is an area where VLIW has an advantage.

Even if OoO remains on top in an instructions-per-clock sense, it will
do so at the cost of needing more area per core and more power usage, ...

Interesting. VLIW may come back.

In addition to Moore's law there is also (Niklaus) Wirth's law:
 "Software is getting slower more rapidly than hardware is becoming faster."
I am still amazed at how all that extra speed and memory are consumed so quickly.
This is going off topic though ...

Juha Manninen

MitchAlsup

Feb 24, 2023, 10:27:45 AM
to RISC-V ISA Dev, Juha Manninen
On Friday, February 24, 2023 at 8:04:00 AM UTC-6 Juha Manninen wrote:
Thanks for the answers.

I understood that a CPU implementation may require the potentially fusable instructions be placed successively by a compiler, but some other implementation may find them by OoO design.

Fusing successive instructions is BigO( n )
Fusing non-successive instructions is at least BigO( n^3 )

Instruction fusing is orthogonal to OoO execution.
 
In any case a compiler should plan for the fusion. As Erik Engheim's presentation shows, fusable instructions must have the same destination register.

And the result of the leading instruction is consumed as an operand to the trailing.
 
By assigning the registers in a certain way allowed fusing 3 instructions together.


On Monday, February 20, 2023 at 1:19:56 AM UTC+2 BGB wrote:
One area where Itanium suffered was in terms of large instruction words,
and cases where it was necessary to fill spots in bundles with NOPs,
etc. This combined to result in notoriously bad code density for IA-64.

OK, yes.

Yes, NOPs are bad for code density. 


For higher end processors, I suspect if Moore's law comes to an end,
this is going to be "not ideal" for OoO; there will be more pressure to
get more performance relative to area and more performance per Watt,
which is an area where VLIW has an advantage.

Even if OoO remains on top in an instructions-per-clock sense, it will
do so at the cost of needing more area per core and more power usage, ...

Interesting. VLIW may come back.

I seriously doubt that. 

In addition to Moore's law there is also (Niklaus) Wirth's law :
 "Software is getting slower more rapidly than hardware is becoming faster."

In 1981 I worked for a minicomputer company whose compiler ran at 100,000 lines (Fortran) per second: 150 ns clock; 0.3 I/C.
In 2023, even with computers 1,000+× faster, it is hard to achieve 100,000 lines per second compiling Fortran: 200 ps clock, 2.0 I/C.
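
Working those figures through (rounded):

  1981: 150 ns/cycle -> ~6.7M cycles/s; × 0.3 I/C -> ~2M instructions/s; at 100,000 lines/s that is roughly 20 instructions per Fortran line.
  2023: 200 ps/cycle -> 5G cycles/s; × 2.0 I/C -> ~10G instructions/s; merely matching 100,000 lines/s already allows roughly 100,000 instructions per line.

So the instruction budget per compiled line has grown by a factor on the order of 5,000, and compilers still struggle to stay inside it.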

BGB

Feb 24, 2023, 10:04:02 PM
to isa...@groups.riscv.org
On 2/24/2023 9:27 AM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
>
> On Friday, February 24, 2023 at 8:04:00 AM UTC-6 Juha Manninen wrote:
>
> Thanks for the answers.
>
> I understood that a CPU implementation may require the potentially
> fusable instructions be placed successively by a compiler, but some
> other implementation may find them by OoO design.
>
>
> Fusing successive instructions is BigO( n )
> Fusing non-successive instructions is at least BigO( n^3 )
>
> Instruction fusing is orthogonal to OoO execution.
>

Yep.
Not fusing instructions, and instead having more powerful instructions,
is O(1).

Bundling and jumbo encodings are also fairly cheap if encoded with
simple bit patterns. However, that ship has already sailed for RISC-V
(at least not without effectively nuking the 'C' extension; or having a
CPU with multiple operating modes; ...).


> In any case a compiler should plan for the fusion. As Erik Engheim's
> presentation shows, fusable instructions must have the same
> destination register.
>
>
> And the result of the leading instruction is consumed as an operand to
> the trailing.
>

Yes, and some pattern recognition in terms of which pairs of ops may be
fused.

The logic for op-fusing is likely to be a fair bit more expensive though
than that for superscalar. For superscalar, one merely needs to classify
ops in terms of whether they are valid as a prefix or suffix, and check
for non-colliding registers.

For fusion, one needs a whole bunch of specific-op equality checks.
Checking things for equality tends to be significantly more expensive
than using bit-patterns as an index into a case block to look up 3 or 4
flag bits.
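
As a rough illustration of the difference (the classification table and the pairing rules here are made up; the RISC-V field positions and the LUI/ADDI opcodes are used only for concreteness):

  #include <stdint.h>

  /* Superscalar pairing: classify each op once via a small flag-bit table,
     then check that no result is forwarded between the two lanes.
     (Ignores instruction formats without an rs2 field, etc.) */
  #define CLS_CAN_LEAD   1
  #define CLS_CAN_TRAIL  2
  static const uint8_t op_class[128] = { 0 };   /* would be filled per opcode */

  static int can_dual_issue(uint32_t a, uint32_t b)
  {
      int a_rd = (a >> 7) & 31, b_rs1 = (b >> 15) & 31, b_rs2 = (b >> 20) & 31;
      return (op_class[a & 0x7F] & CLS_CAN_LEAD)  &&
             (op_class[b & 0x7F] & CLS_CAN_TRAIL) &&
             a_rd != b_rs1 && a_rd != b_rs2;
  }

  /* Fusion: every fusable pattern needs its own set of exact-match checks. */
  static int can_fuse_lui_addi(uint32_t a, uint32_t b)
  {
      return (a & 0x7F) == 0x37 &&                    /* LUI                  */
             (b & 0x7F) == 0x13 &&                    /* ADDI (funct3 omitted) */
             ((a >> 7) & 31) == ((b >> 7) & 31) &&    /* same rd              */
             ((b >> 15) & 31) == ((b >> 7) & 31);     /* rs1 == rd            */
  }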


> By assigning the registers in a certain way allowed fusing 3
> instructions together.
>
>
> On Monday, February 20, 2023 at 1:19:56 AM UTC+2 BGB wrote:
>
> One area where Itanium suffered was in terms of large
> instruction words,
> and cases where it was necessary to fill spots in bundles with
> NOPs,
> etc. This combined to result in notoriously bad code density for
> IA-64.
>
>
> OK, yes.
>
>
> Yes, NOPs are bad for code density.
>

Avoiding an IA-64 style encoding scheme results in significantly fewer NOPs.

It also helps to have a CPU that supports interlock checks and stalls.
While an interlock stall is still a wasted clock cycle, it is at least
better than padding with a NOP.

Someone could have a cheaper CPU by avoiding both interlocks and
register forwarding (so, say, every instruction effectively has a
3-cycle latency or similar). However, this would be pretty terrible for
code density in a lot of code (NOPs, NOPs everywhere...).


I can note that early on in my ISA designs, I had considered possibly
using a more IA-64 like approach to the encoding (for a CPU that would
have had Thumb-style mode switches between a SuperH style ISA and an
IA-64 style ISA), but then ended up quickly dropping this idea.

Later on, this transitioned to a 16/32 encoding with the ability to
daisy chain the 32-bit ops. Kept interlocks and forwarding, and
eliminated branch delay slots (even if several of the ISAs which had
influenced the design had used branch delay slots).


>
> For higher end processors, I suspect if Moore's law comes to an
> end,
> this is going to be "not ideal" for OoO; there will be more
> pressure to
> get more performance relative to area and more performance per
> Watt,
> which is an area where VLIW has an advantage.
>
> Even if OoO remains on top in an instructions-per-clock sense,
> it will
> do so at the cost of needing more area per core and more power
> usage, ...
>
>
> Interesting. VLIW may come back.
>
>
> I seriously doubt that.
>

It is either that or stagnation in terms of processor speed and power
once Moore's Law hits the wall.

If the vise is pressing down in terms of inability to increase
transistor density further, and the desire for "more speed" continues,
this may create incentive to migrate to designs which can give more
speed with fewer transistors.

This may become a big enough issue to offset things like legacy software
compatibility in many use-cases (and, for the more important cases,
through the use of emulators).

For example, one can observe that the Apple ecosystem has already
survived 4 major ISA transitions (6502->M68K->PPC->x86->ARM). So, in the
longer term, switching between ISAs may not be too big of an issue.



Similarly, given enough years or decades, people may become better able to
make use of manycore processors, and a larger number of cores may start to
win out, even if each core (individually) can't match an OoO core in terms
of single-threaded performance.

In the next 5 or 10 years or so, I don't really expect much to happen
(x86 will likely see its market share eroded somewhat by ARM or similar,
but this is all I really expect on this timeframe). Where, the "VLIW
revolution" would likely be closer to 20+ years (say, with gains in
terms of transistor density likely dying off some time during the next
decade or so).


>
> In addition to Moore's law there is also (Niklaus) Wirth's law :
>  "/Software is getting slower more rapidly than hardware is
> becoming faster./"
>
>
> In 1981 I worked for a minicomputer company who's compiler ran at
> 100,000 lines (Fortran) per second: 150ns clock; 0.3 I/C
> In 2023, even with computers 1,000+× faster, it is hard to achieve
> 100,000 lines per second compiling Fortran, 200ps clock, 2.0 I/C
>

Also basically true.
There are probably a few orders of magnitude that can be squeezed out
just through small scale optimizations.


Though, I have found in some ways that there can be significant
differences between MHz values and what one could linearly extrapolate
in terms of performance.

For example:
My desktop PC (Ryzen 2700X) is only around 3x faster than a roughly 20
year old laptop in terms of CPU MHz;
By many measurements, the difference between them is quite considerably
larger than 3x (and, scarily enough, by some metrics, log-scale
performance on the laptop is closer to my BJX2 Core than it is to my
desktop PC...);
By some metrics, a RasPi2 is also faster than the laptop, despite the
"on paper" stats implying that the laptop should be faster (laptop
having a higher clock-speed, OoO vs 2-wide in-order superscalar, ...).

However, some other benchmarks show a much closer to linear correlation
with MHz.

Also nevermind if my PC has 96 times more RAM, 225 times more HDD space,
..., than the laptop.


Yet it still bogs down sometimes, and Firefox likes to eat RAM at levels
that would have seemed absurd in decades past, ...


Whereas, my custom CPU core can often execute command-line commands and
do other things in less time than it takes for a 60Hz screen refresh
(when interacting with the simulations, it is sometimes useful to count
the speed of events relative to how long it takes for the VGA refresh to
pass over it).


Whereas, my PC seemingly sometimes struggles to change input focus from
one program to another (or respond to other user input in the GUI) in a
good part of a second...

One can also often see the lag between typing something, and it actually
appearing on screen (more so if one sets a "Windows Explorer" window to
"tiles" mode to see thumbnails, or open an image in "Windows Photo
Viewer", the latter basically lagging out the whole Windows UI whenever
it is active, ...).

But, in an ideal world, trying to view a JPEG or similar wouldn't cause
one's PC to start lagging until it is closed.

...


> I am still amazed how all that extra speed and memory are consumed
> so quickly.
> This is going out of topic though ...
>
>
> Juha Manninen
>

Bruce Hoult

Feb 27, 2023, 4:50:58 PM
to Juha Manninen, RISC-V ISA Dev
On Mon, Feb 20, 2023 at 1:51 AM Juha Manninen <juha.ma...@gmail.com> wrote:
Q: How does the macro operation fusion system work when a CPU implements out of order execution? I guess they are not exclusive.

Simply treat the combined instructions as one longer instruction that the hardware implements even though the ISA doesn't formally contain it. OoO or not doesn't make any difference.

Macro-op fusion in RISC-V seems to have caught the imagination of the public. Quite ironically, since it is present in modern Arm and x86 CPUs (fusing a compare or other arithmetic with an immediately following conditional branch) but is not actually implemented in any RISC-V CPU I'm aware of. SiFive's U74 links a forward conditional branch over one instruction with that instruction as they proceed down the two execution pipelines, squashing the 2nd instruction if the branch ends up being taken (so no branch mispredict penalty in either case) but that's something different.

Macro-op fusion seems like a cute idea, as it effectively adds complex instructions that complex CPUs can take advantage of, but simple CPUs don't have to know exist. It isn't implemented in small CPUs (by definition). But it's not really useful in big OoO CPUs either. Which leaves medium-sized in-order CPUs that want to do something a bit more, but don't want to go OoO.

I feel as if the advent of dual-issue in-order CPUs with both "early" and "late" ALUs (SiFive 7-series, WD SWeRV, Arm A55) have pretty much addressed the same issue in a more flexible and general way.

 
I am a 60 year old man, still following the development of CPUs and programming languages, and I participate in a FOSS SW project.

I'm the same age. There are plenty of others here the same or older. Mitch, who has replied to you a couple of times in this thread, has a few years on us. Don't get the idea you are old!!

Philipp Tomsich

Feb 27, 2023, 4:57:21 PM
to Bruce Hoult, Juha Manninen, RISC-V ISA Dev
On Mon, 27 Feb 2023 at 22:50, Bruce Hoult <br...@hoult.org> wrote:
>
> On Mon, Feb 20, 2023 at 1:51 AM Juha Manninen <juha.ma...@gmail.com> wrote:
>>
>> Q: How does the macro operation fusion system work when a CPU implements out of order execution? I guess they are not exclusive.
>
>
> Simply treat the combined instructions as one longer instruction that the hardware implements even though the ISA doesn't formally contain it. OoO or not doesn't make any difference.
>
> Macro-op fusion in RISC-V seems to have caught the imagination of the public. Quite ironically, since it is present in modern Arm and x86 CPUs (fusing a compare or other arithmetic with an immediately following conditional branch) but is not actually implemented in any RISC-V CPU I'm aware of. SiFive's U74 links a forward conditional branch over one instruction with that instruction as they proceed down the two execution pipelines, squashing the 2nd instruction if the branch ends up being taken (so no branch mispredict penalty in either case) but that's something different.

At least one CPU implementing instruction fusion (and the associated
fusion patterns) for RISC-V is documented on the GCC mailing list:
https://patchwork.ozlabs.org/project/gcc/list/?series=327993

>
> Macro-op fusion seems like a cute idea, as it effectively adds complex instructions that complex CPUs can take advantage of, but simple CPUs don't have to know exist. It isn't implemented in small CPUs (by definition). But it's not really useful in big OoO CPUs either. Which leaves medium-sized in-order CPUs that want to do something a bit more, but don't want to go OoO.
>
> I feel as if the advent of dual-issue in-order CPUs with both "early" and "late" ALUs (SiFive 7-series, WD SWeRV, Arm A55) have pretty much addressed the same issue in a more flexible and general way.
>
>
>>
>> I am a 60 year old man, still following the development of CPUs and programming languages, and I participate in a FOSS SW project.
>
>
> I'm the same age. There are plenty of others here the same or older. Mitch, who has replied to you a couple of times in this thread, has a few years on us. Don't get the idea you are old!!
>

Bruce Hoult

Feb 27, 2023, 5:11:30 PM
to Philipp Tomsich, Juha Manninen, RISC-V ISA Dev
I stand corrected! I've always regarded Veyron as out of my price range :-)

Is there any word of fusion being implemented in the other super-wide cores such as MIPS, Rivos, Tenstorrent?

MitchAlsup

Feb 27, 2023, 5:15:27 PM
to RISC-V ISA Dev, Bruce Hoult, RISC-V ISA Dev, Juha Manninen
On Monday, February 27, 2023 at 3:50:58 PM UTC-6 Bruce Hoult wrote:
On Mon, Feb 20, 2023 at 1:51 AM Juha Manninen <juha.ma...@gmail.com> wrote:
Q: How does the macro operation fusion system work when a CPU implements out of order execution? I guess they are not exclusive.

<snip>
Macro-op fusion in RISC-V seems to have caught the imagination of the public. Quite ironically, since it is present in modern Arm and x86 CPUs (fusing a compare or other arithmetic with an immediately following conditional branch) but is not actually implemented in any RISC-V CPU I'm aware of. <snip>

Instruction fusing is a sign of one of two things: either a) the instruction set is not powerful enough, or b) you ran out of bits to cram both/all into one (1) instruction container; and sometimes a third thing: c) the ISA was developed before the pipelined and/or OoO processor design style evolved, and the designers have to put up with all the sins of the past.

A lot of the time, there is a subtle tradeoff between (a) and (b): MIPS and RISC-V take compare-and-branch as a single instruction, sacrificing branch target displacement, but also penalizing compares against an immediate (other than 0).

 <snip>
I am a 60 year old man, still following the development of CPUs and programming languages, and I participate in a FOSS SW project.

I'm the same age. There are plenty of others here the same or older. Mitch, who has replied to you a couple of times in this thread, has a few years on us. Don't get the idea you are old!!
 
I am 70 and still working on processor designs. You are only old when you really don't want to work on things you are good at.

Juha Manninen

Mar 4, 2023, 10:26:27 AM
to RISC-V ISA Dev, RISC-V ISA Dev
On Monday, February 27, 2023 at 11:57:21 PM UTC+2 Philipp Tomsich wrote:
At least one CPU implementing instruction fusion (and the associated
fusion patterns) for RISC-V is documented on the GCC mailing list:
https://patchwork.ozlabs.org/project/gcc/list/?series=327993

OK, cool. One question comes to mind:
Is the logic for keeping fusible sequences together really custom-made for a certain CPU implementation?
Do future implementations from other CPU manufacturers require a different logic?
I thought the rules for fusing instructions are generic and apply to all RISC-V CPUs.

Juha Manninen

MitchAlsup

Mar 4, 2023, 2:27:04 PM
to RISC-V ISA Dev, Juha Manninen, RISC-V ISA Dev
On Saturday, March 4, 2023 at 9:26:27 AM UTC-6 Juha Manninen wrote:
On Monday, February 27, 2023 at 11:57:21 PM UTC+2 Philipp Tomsich wrote:
At least one CPU implementing instruction fusion (and the associated
fusion patterns) for RISC-V is documented on the GCC mailing list:
https://patchwork.ozlabs.org/project/gcc/list/?series=327993

OK, cool. One question comes to mind:
Is the logic for keeping fusible sequences together really custom-made for a certain CPU implementation?

Yes, and no::
Yes:: consider that you have an AGEN unit capable of 3-input addition; here you would
fuse {slli rd,rs,3; add rd,rd,rbase; ld rld,disp(rd)} into a single instruction.

No:: consider that you have a compare-and-branch unit and a data path capable of
routing an immediate to the conditional-branch; here everyone should fuse
{li   rd,17; ble    rd,rs,label} into a single instruction.

Yes: not every implementation will have a 3-input adder in AGEN.
No: every implementation will have the ability to route an immediate to bcond.
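
For concreteness, a decode-side check for the first pattern might look something like this (a rough sketch, not any real core's logic; the decoded-instruction struct and opcode names are invented here):

  /* Recognize  slli rd,rs,sh ; add rd,rd,rbase ; ld rd,disp(rd)
     as one fused  rd = mem[rbase + (rs << sh) + disp]  for an AGEN
     unit with a 3-input adder. */
  typedef struct { int op, rd, rs1, rs2, shamt, disp; } DecInsn;
  enum { OP_SLLI, OP_ADD, OP_LD };

  static int can_fuse_indexed_load(const DecInsn *i0, const DecInsn *i1,
                                   const DecInsn *i2)
  {
      return i0->op == OP_SLLI &&
             i1->op == OP_ADD && i1->rd == i0->rd && i1->rs1 == i0->rd &&
             i2->op == OP_LD  && i2->rs1 == i0->rd &&
             i2->rd == i0->rd;   /* all three share rd, so the intermediate
                                    address value is dead after the load and
                                    only one result has to be written back */
  }

If the load wrote a different register, the address temporary would still be architecturally live and the fused operation would have to produce both results; sharing one destination register (as in Erik Engheim's three-instruction example) is what keeps the fusion simple.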

Do future implementations from other CPU manufacturers require a different logic?

This depends entirely on how the function units are constructed and how data-path
routing is constructed; AND whether decode logic can identify the situations.
 
I thought the rules for fusing instructions are generic and apply to all RISC-V CPUs.

Consider: a branch unit capable of comparing a floating point operand against zero
and then branching:: here you would fuse {fle.d  rd,ft1,ft2; bnez   rd,label}; other
implementations may choose not to have fp stuff in the compare-and-branch unit.


Juha Manninen

L Peter Deutsch

Mar 4, 2023, 2:37:04 PM
to Juha Manninen, RISC-V ISA Dev
> I thought the rules for fusing instructions are generic and apply to all
> RISC-V CPUs.

Unless I've missed something, instruction fusion is not part of any
RISC-V *specification*. Spec development is guided by experience in
implementation, but fusion is definitely an implementation construct.
(FWIW, that's why I'm a fan of JIT compilation, which can potentially
generate code optimized to the specific implementation.)

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

BGB

Mar 4, 2023, 5:50:37 PM
to isa...@groups.riscv.org
On 3/4/2023 1:36 PM, L Peter Deutsch wrote:
>> I thought the rules for fusing instructions are generic and apply to all
>> RISC-V CPUs.
>
> Unless I've missed something, instruction fusion is not part of any
> RISC-V *specification*. Spec development is guided by experience in
> implementation, but fusion is definitely an implementation construct.
> (FWIW, that's why I'm a fan of JIT compilation, which can potentially
> generate code optimized to the specific implementation.)
>

JIT can work, and can also help to separate a higher level "distribution
format" from the more specialized native ISA used by a given machine. A
bytecode format could work here, apart from the seeming inability of
anyone to develop and adopt a bytecode format that "doesn't suck".

Ideally, such a format should be able to:
Efficiently express C and C++ code;
Capable of supporting dynamic linking;
Be only modestly "higher level" than an actual machine ISA;
Able to easily express (relatively) free-form control flow;
Does not assume use of:
A garbage collector;
A managed type-system;
Any other heavyweight runtime facilities.
...

Forms it can take:
(A) RPN / Stack Machine:
  Popular, but requires more work from the backend;
  Typically converted into SSA or similar internally.
(B) SSA Form:
  Awkward; also requires a backend that can deal effectively with SSA.
(C) Large virtual register machine:
  Works relatively well;
  Typically, every local variable is assigned to a "register";
  One might easily have 256 or 512 virtual registers to work with;
  Backend deals with the stack-frame and mapping to CPU registers.

The difference between B and C is that B assumes that every variable is
immutable, and values need to be merged between control paths via "phi"
operators or similar.

In the case of C, each variable is assumed to be mutable and may be
assigned to as needed. This also allows for a simpler backend (which may
essentially "cache" the virtual registers in CPU registers; and then
freely evict them back into stack-frame memory or similar at the end of
each basic block). It is also possible to make tradeoffs depending on if
an interpreter is needed, and what style of interpreter one wants to use.
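
A tiny example of the same fragment in the two styles (the IR mnemonics in the comments are invented purely for illustration):

  int example(int a, int b, int c)
  {
      int x = a;
      if (c) x = b;
      return x;
      /* (B) SSA form: each name is defined once, the join needs a phi:
             x1 = a
             br c, then, join
         then:
             x2 = b
         join:
             x3 = phi(x2, x1)
             ret x3
         (C) Mutable virtual registers: r3 is simply reassigned, no merge:
             mov r3, r0      ; x = a   (a in r0, b in r1, c in r2)
             bz  r2, L
             mov r3, r1      ; x = b
         L:  ret r3                                                     */
  }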

One example of a possible instruction format for such an interpreter
could be, say (32-bit words, 256 logical registers):
00000000..7FFFFFFF: Single 32-bit word
0zddssii: Load/Store Constant Index (or field offset)
1zddsstt: Load/Store Variable Index
2zddssii: d = s op Imm (ALU ops)
3zddsstt: d = s op t (ALU ops)
...
80000000..FFFFFFFF: Two or more 32-bit words.

This would burn through the 32-bit space pretty fast (mandating the use
of 64 bit instructions for less common cases), though one doesn't
necessarily need good code density in such an IR. With 256 logical
registers, nearly all functions can be mapped directly to these
registers. Most likely, the register size would be assumed to be the
same size as the pointer size (probably 64 bits on a 64 bit target; on a
32-bit target, would likely use pairs for 64 bit types).

While it may be tempting to try to gloss over 32 vs 64 bit pointers, in
my experience trying to gloss over this when compiling C code is "non
trivial", so the IR would likely still need to assume either 32 or 64
bit pointers on the target.



Otherwise, for a CPU, I would more assume having multiple tiers of
implementation, and making some features optional at lower-stat tiers,
but potentially required for higher-stat tiers.


For example, one could have a low-end profile, say:
Aligned-only memory access;
Reg+Disp as the only address mode;
No multiply or divide, shift is optional;
...
Say, this would make more sense for small microcontrollers.

A baseline profile:
Memory access allows any alignment;
Reg+Disp as the only address mode;
Has multiply and shift (but no divide or modulo);
MMU is optional;
...
Say, this would make more sense for medium microcontrollers.


An expanded profile:
Memory access allows any alignment;
Has indexed addressing modes and similar;
Has multiply, shift, divide, an FPU, ...;
Has an MMU, ...
...

Say, this would make more sense for more general use.

Then maybe some performance-oriented profiles with SIMD, various more
specialized ISA extensions, and similar.


As-is, last I have heard there is still not even an extension in RISC-V
for indexed load/store, which in my experience tends to be the second
most common addressing mode after fixed displacement.

A general-use ISA without indexed addressing seems misguided, IMO.
Its costs (such as potentially needing to get a 3rd register read
port from somewhere) seem mostly moot by the time one has an MMU and FPU
(and almost entirely moot by the time one has superscalar, as by then
the register file already needs to have 4 or 6 read ports in any case).
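
A small example of the kind of access pattern where an indexed mode pays off (the per-access counts in the comment are approximate and assume a straightforward 64-bit compilation; a simple a[i] loop can often be strength-reduced to a pointer walk, but gather-style indexing cannot):

  #include <stddef.h>

  /* Gathering through an index array: every access needs base + (index << 3). */
  long gather_sum(const long *b, const int *idx, size_t n)
  {
      long s = 0;
      for (size_t i = 0; i < n; i++)
          s += b[idx[i]];   /* with only reg+disp addressing: slli + add + ld
                               with a scaled reg+reg mode:    a single load   */
      return s;
  }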


Meanwhile, opcode fusion is an expensive feature that doesn't scale
well. Not an ideal approach for either small or large implementations.


L Peter Deutsch

Mar 4, 2023, 6:21:16 PM
to BGB, isa...@groups.riscv.org
> JIT can work, and can also help to separate a higher level "distribution
> format" from the more specialized native ISA used by a given machine.

UNCOL. Go look up UNCOL. There's a lot of history around this.

> A bytecode format could work here, apart from the seeming inability of
> anyone to develop and adopt a bytecode format that "doesn't suck".

The JVM was a decent try, but its code was tied too tightly to the Java
language.

I think WebAssembly actually could have met most of the requirements, except
that unlike RISC-V, they apparently have an inadequate development and
review process, which has led to two very, very bad mistakes (IMO) in spec
version 2.0:

* 2.0 includes facilities very similar to a subset of the current (IMO
immature) draft RISC-V P extension. There are proposals floating around for
a much better WebAssembly vector facility, but there is no process for
*removing* anything from the spec, and the entire spec is mandatory --
there is no concept of RISC-V style modularization.

* 1.0 only required dynamic memory allocation at module instantiation time;
in 2.0, key data structures that were allocated at load or module
instantiation time in 1.0 can be resized dynamically. Again, implementing
the entire spec is mandatory. At least there's no requirement for
fine-grained memory management, and nothing that requires garbage
collection.

WebAssembly has some fine properties, but at this point, sadly, I consider
it a near miss.

WebAssembly's bytecode encoding is also inefficient, but just for my own
entertainment, I developed a different encoding of the same semantics that
was about 45% smaller and didn't require significant additional machine
state. I'm not even sure how important that is these days, though.

MitchAlsup

Mar 4, 2023, 6:48:40 PM
to RISC-V ISA Dev, BGB
On Saturday, March 4, 2023 at 4:50:37 PM UTC-6 BGB wrote:
On 3/4/2023 1:36 PM, L Peter Deutsch wrote:
JIT can work, and can also help to separate a higher level "distribution
format" from the more specialized native ISA used by a given machine. A
bytecode format could work here, apart from the seeming inability of
anyone to develop and adopt a bytecode format that "doesn't suck".

Is LLVM IR too high, or does it "suck too much"??
 
<snip>

While it may be tempting to try to gloss over 32 vs 64 bit pointers; in
my experience trying to gloss over this when compiling C code is "non
trivial", so the IR would likely still needed to assume either 32 or 64
bit pointers on the target.

No reason to gloss over displacement widths or pointer widths. 

Otherwise, for a CPU, I would more assume having multiple tiers of
implementation, and making some features optional at lower-stat tiers,
but potentially required for higher-stat tiers. 
<snip>

As-is, last I have heard there is still not even an extension in RISC-V
for indexed load/store, which in my experience tends to be the second
most common addressing mode after fixed displacement.

This is a RISC-V problem, not inherently a JIT problem.

A general-use ISA without indexed addressing personally seems misguided
IMO. Its costs (such as potentially needing to get a 3rd register read
port from somewhere) seem mostly moot by the time one has an MMU and FPU
(and almost entirely moot by the time one has superscalar; as by then
the register file already needs to have 4 or 6 read ports in any case).

As soon as you have FMAC (y = x×y+z) you have the 3-register operand
requirement in DECODE. [Rbase+Rindex<<scale+DISP] only has a 2-register
operand requirement in DECODE. AGEN must receive 3 operands, but
one of them does not arrive from the register file or need forwarding.

Meanwhile, opcode fusion is an expensive feature that doesn't scale
well. Not an ideal approach for either small or large implementations.

Whereas a well rounded ISA does not have these problems.......... 

L Peter Deutsch

Mar 4, 2023, 6:59:19 PM
to MitchAlsup, isa...@groups.riscv.org, cr8...@gmail.com
> Is LLVM IR to high or "sucks to much" ??

It's a hole in my technical knowledge. No opinion until I educate myself.
Thanks for reminding me it's something I should be familiar with.

BGB

Mar 4, 2023, 7:54:28 PM
to L Peter Deutsch, isa...@groups.riscv.org
On 3/4/2023 5:21 PM, L Peter Deutsch wrote:
>> JIT can work, and can also help to separate a higher level "distribution
>> format" from the more specialized native ISA used by a given machine.
>
> UNCOL. Go look up UNCOL. There's a lot of history around this.
>

Apparently it didn't amount to much; most later attempts also didn't
amount to much.


>> A bytecode format could work here, apart from the seeming inability of
>> anyone to develop and adopt a bytecode format that "doesn't suck".
>
> The JVM was a decent try, but its code was tied too tightly to the Java
> language.
>

Yeah.

JVM bytecode is almost useless for C code though.
Something like .NET would be better.

However, .NET assumes a GC and a bunch of other stuff that is irrelevant
and needless dead weight for a language like C.


I once attempted something I called "FRBC" (for "Fast Register
ByteCode"), which was modestly successful, but the design was
over-engineered and an interpreter backend for it ended up impractically
large and complicated (it had too many cases where X*Y*Z combinatorial
interactions between features came up to bite me).

Basically, it had 3 types of registers, too many primitive types, overly
bit-packed variable length instructions, ... this did not go well.
Ideally, one wants only a single type of "register" and also a fairly
minimal type system at the level of the bytecode IR.

It was at least "reasonably fast" for an interpreter, though I had later
figured out how to make stack-oriented interpreters fast. However, these
interpreters require an internal translation step; whereas direct
interpretation of a stack oriented bytecode is slow.

A 3-register bytecode can still get better performance for a direct
interpreter (mostly due to needing to execute a smaller number of
instructions); and for a direct interpreter, a major selling point is
mostly to try to limit code complexity and memory footprint from the
interpreter.


Parts of its compiler backend were later reused for my SH-4/BJX1 and
later BJX-2 compiler backend, because it wasn't a huge jump to go from
such an IR to a real machine ISA.

However, a machine ISA also has some drawbacks that make it not ideal as
an interpreter bytecode or as an IR for a JIT/AOT backend (and an
emulator is not the same as a compiler).


One possible goal could be something "vaguely similar", but reducing the
amount of needless complexities.


> I think WebAssembly actually could have met most of the requirements, except
> that unlike RISC-V, they apparently have an inadequate development and
> review process, which has led to two very, very bad mistakes (IMO) in spec
> version 2.0:
>
> * 2.0 includes facilities very similar to a subset of the current (IMO
> immature) draft RISC-V P extension. There are proposals floating around for
> a much better WebAssembly vector facility, but there is no process for
> *removing* anything from the spec, and the entire spec is mandatory --
> there is no concept of RISC-V style modularization.
>
> * 1.0 only required dynamic memory allocation at module instantiation time;
> in 2.0, key data structures that were allocated at load or module
> instantiation time in 1.0 can be resized dynamically. Again, implementing
> the entire spec is mandatory. At least there's no requirement for
> fine-grained memory management, and nothing that requires garbage
> collection.
>
> WebAssembly has some fine properties, but at this point, sadly, I consider
> it a near miss.
>

WASM's structure is weird.

I would have rather assumed that all the basic-blocks are interconnected
via labels and an if-goto mechanism. Where labels are explicit, and will
terminate one block and start another.

Last I checked, the design for WASM also made no provisions for separate
compilation or dynamic linking of modules.


> WebAssembly's bytecode encoding is also inefficient, but just for my own
> entertainment, I developed a different encoding of the same semantics that
> was about 45% smaller and didn't require significant additional machine
> state. I'm not even sure how important that is these days, though.
>

Yeah, probably doesn't matter that much. For an IR format, one can also
mostly "solve" the issue of lackluster code-density via LZ compression
and "lump caching", since with many interpreter and JIT designs, one
might only need to look at the IR once.

Unlike an actual CPU, one does have to care a bit more about how much
bit twiddling is necessary to unpack and interpret instructions.


And, for a JIT or AOT backend, one uses the bytecode once and then
doesn't need to look at it anymore. So, as long as it is not "absurd",
everything is good...


As for my scheme of using 32/64-bit ops and simplistic bit packing:
this will allow for a semi-efficient direct interpreter, say:

  opw = *cs++;                 /* fetch one 32-bit instruction word */
  op  = (opw >> 24) & 255;
  if (opw >> 31)               /* high bit set: a second word follows */
      opw2 = *cs++;
  rd = (opw >> 16) & 255;
  rs = (opw >>  8) & 255;
  rt = (opw >>  0) & 255;
  switch (op)
  {
      ...
      case 0x20:               /* d = s + Imm8 */
          regs[rd] = regs[rs] + rt;
          break;
      case 0x21:               /* d = s - Imm8 */
          regs[rd] = regs[rs] - rt;
          break;
      ...
      case 0x30:               /* d = s + t */
          regs[rd] = regs[rs] + regs[rt];
          break;
      case 0x31:               /* d = s - t */
          regs[rd] = regs[rs] - regs[rt];
          break;
      ...
  }

Though, one could also make a case for bytes:
  op = cs[3]; rd = cs[2];
  rs = cs[1]; rt = cs[0];
  if (op & 0x80)
  {
      /* second word present; meaning depends on main opcode */
      xi0 = cs[4]; xi1 = cs[5];
      xi2 = cs[6]; xi3 = cs[7];
      cs += 8;
  }else
  {
      cs += 4;
  }
  switch (op)
  {
      ...
  }

At least on my ISA, discrete byte loads would have some drawbacks vs
loading 32-bit values and then extracting fields using shifts and masks.


Note that the management of "registers" wouldn't actually be that much
different than how local variables were managed in the JVM. Each
function would likely have a header that encodes how many it needs, with
the first N being used to pass arguments (with '...' arguments being
folded into an array which is passed as an extra argument).

For a direct interpreter, the "registers" would likely be managed as a
linear stack-like structure (adjust the offset when calling a function,
copy over the arguments, and use it as an array; restoring the old
offset when the current function returns).
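
A minimal sketch of that frame scheme (the names, the fixed-size backing array, and the 64-bit register type are invented here for illustration):

  #include <stdint.h>

  typedef int64_t vreg_t;

  typedef struct {
      vreg_t regs[65536];   /* one linear array backs every frame */
      int    base;          /* current function's frame starts here */
  } VmState;

  struct VmFunc;                                  /* header gives reg count, code, ... */
  extern vreg_t vm_run(VmState *vm, const struct VmFunc *fn);

  static vreg_t vm_call(VmState *vm, const struct VmFunc *fn,
                        int caller_nregs, const vreg_t *args, int nargs)
  {
      int old_base = vm->base;
      vm->base += caller_nregs;                   /* callee frame follows the caller's */
      for (int i = 0; i < nargs; i++)
          vm->regs[vm->base + i] = args[i];       /* first N "registers" carry the args */
      vreg_t ret = vm_run(vm, fn);                /* interpret the callee's body */
      vm->base = old_base;                        /* restore the caller's frame */
      return ret;
  }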


L Peter Deutsch

Mar 4, 2023, 8:37:51 PM
to BGB, isa...@groups.riscv.org
> Last I checked, the design for WASM also made no provisions for separate
> compilation or dynamic linking of modules.

I believe this isn't correct. WASM is based on instantiable modules; each
module is compiled separately. Modules can import and export both data and
functions; if a computation isn't in progress, I believe it's even possible
to unload a module and replace it. It's up to the embedding environment
how to manage loading and instantiation: separate from the WASM definition,
there is a Web standard API for managing WASM modules from JS in a browser.
To some extent, WASM binaries are actually more like .o libraries than like
executables.

> Yeah, probably doesn't matter that much. For an IR format, one can also
> mostly "solve" the issue of lackluster code-density via LZ compression
> and "lump caching", since with many interpreter and JIT designs, one
> might only need to look at the IR once.

Or at least not very often. My SOWA (Space Optimized WebAssembly)
experiment wasn't quite space competitive with LZ compression, but it was a
lot simpler to decode (everything byte- not bit-oriented, no back-reference
tables, ...)

MitchAlsup

Mar 4, 2023, 8:54:40 PM
to RISC-V ISA Dev, BGB, isa...@groups.riscv.org, L Peter Deutsch
On Saturday, March 4, 2023 at 6:54:28 PM UTC-6 BGB wrote:
On 3/4/2023 5:21 PM, L Peter Deutsch wrote:
<snip>

As for my scheme of using 32/64 bit ops, and simplistic bit packing.
This will allow for a semi-efficient direct interpreter, say:
opw=*cs++;
op=(opw>>24)&255;
if(opw>>31)
opw2=*cs++;
rd=(opw>>16)&255;
rs=(opw>> 8)&255;
rt=(opw>> 0)&255;

Whatever intermediate is available, or yet to be developed, should have
the ability to directly express bit-fields!! So the above code would/could be
written::

opw = *cs++;
op = opw<31..24>;
if( opw<31>)
opw2 = *cs++;
rd = opw<23..16>;
rs = opw<15..8>;
rt = opw<7..0>;
<snip>

At least on my ISA, discrete byte loads would have some drawbacks vs
loading 32-bit values and then extracting fields using shifts and masks.

signed b = word<23..16>;
unsigned d = word<15..8>;

These are actually single instructions in my ISA.
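
For comparison, here is what those two extracts cost in portable C on a machine without bit-field extract instructions (a sketch; assumes a 32-bit word, 0 < width < 32, and the usual arithmetic right shift on signed values):

  #include <stdint.h>

  static inline int32_t extract_s(uint32_t word, int hi, int lo)
  {
      int width = hi - lo + 1;
      return (int32_t)(word << (31 - hi)) >> (32 - width);   /* sign-extending */
  }
  static inline uint32_t extract_u(uint32_t word, int hi, int lo)
  {
      return (word >> lo) & ((1u << (hi - lo + 1)) - 1u);
  }
  /*  signed   b = word<23..16>;  ->  int32_t  b = extract_s(word, 23, 16);
      unsigned d = word<15..8>;   ->  uint32_t d = extract_u(word, 15, 8);  */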

BGB

Mar 4, 2023, 11:48:38 PM
to isa...@groups.riscv.org
On 3/4/2023 5:48 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
>
> On Saturday, March 4, 2023 at 4:50:37 PM UTC-6 BGB wrote:
>
> On 3/4/2023 1:36 PM, L Peter Deutsch wrote:
> JIT can work, and can also help to separate a higher level
> "distribution
> format" from the more specialized native ISA used by a given machine. A
> bytecode format could work here, apart from the seeming inability of
> anyone to develop and adopt a bytecode format that "doesn't suck".
>
>
> Is LLVM IR to high or "sucks to much" ??
>

Unless you are using LLVM, LLVM IR is very painful to try to work
with. The text format is "slightly less evil" in some ways than
"bitcode", but still not ideal as a starting point (using a text-based
format and parser is needless overhead).

The bitcode is less a bytecode format, and more a binary pickling scheme
for LLVM's C++ classes being passed off as a bytecode.

For a small target machine, the memory overhead for LLVM would basically
also make it unusable.


If going the SSA route, something almost like SPIR-V might make sense as
a starting point, but as I see it, SSA is itself a big up front cost.

Unlike either RPN or a straightforward 3-register bytecode, SSA is also
not particularly usable for an interpreter; and only really makes sense
if one assumes always using a full compiler as a backend; but then it is
easy enough to convert a non-SSA bytecode into SSA form.


Something more like JVM bytecode makes sense for an interpreter, but
directly interpreting a design like JVM bytecode will give lackluster
performance. It needs a lot of operations to get much done, and a direct
interpreter is mostly bottle-necked by how quickly it can spin in a loop
and dispatch bytecode instructions.

Something like .NET bytecode makes less sense for a direct interpreter
due to the lack of type information for the instructions (trying to pass
types on the stack and deal with this for each operation, will ruin
performance).


So, say:
JVM like IR: OK for interpreter and compiler;
But, bad for most any language "not Java".
.NET like IR: Better for compiler, bad for interpreter;
SSA form: Good for compiler, bad for interpreter.
LLVM IR: OK for LLVM, bad for everything not LLVM.
Three Register IR (non SSA): OK for compiler, good for interpreter.


> <snip>
> While it may be tempting to try to gloss over 32 vs 64 bit pointers; in
> my experience trying to gloss over this when compiling C code is "non
> trivial", so the IR would likely still needed to assume either 32 or 64
> bit pointers on the target.
>
> No reason to gloss over displacement widths or pointer widths.
>

It is tempting to try to make a bytecode that works "natively" on both
32 bit and 64 bit machines.


But, making this work with typical C code and header files is basically
no go.

So, C is a leaky abstraction, and there is basically no real way within
the existing language to entirely sidestep things like "sizeof(void *)"
and "sizeof(long)" and similar.

There was once also an attempt by EFI ByteCode to address this, but had
some severe limitations and EBC was apparently mostly DOA (while still
not general purpose enough to be useful outside of trying to use it for
hardware drivers).


>
> Otherwise, for a CPU, I would more assume having multiple tiers of
> implementation, and making some features optional at lower-stat tiers,
> but potentially required for higher-stat tiers.
>
> <snip>
> As-is, last I have heard there is still not even an extension in RISC-V
> for indexed load/store, which in my experience tends to be the second
> most common addressing mode after fixed displacement.
>
>
> This is a RISC-V problem not inherently a JIT problem
>

Yeah.


>
> A general-use ISA without indexed addressing personally seems misguided
> IMO. Its costs (such as potentially needing to get a 3rd register read
> port from somewhere) seem mostly moot by the time one has an MMU and
> FPU
> (and almost entirely moot by the time one has superscalar; as by then
> the register file already needs to have 4 or 6 read ports in any case).
>
> As soon as yo have FMAC (y = x×y+z) you have the 3-register operand
> requirement in DECODE. [Rbase+Rindex<<scale+DISP] only has a 2-register
> operand requirement in DECODE. AGEN much receive 3-Operands, but
> one of them does not arrive from the register file or need forwarding.
>

In my case, AGU was handled in the EX1 (with immediate values handled
internally as a special register), and so things like memory store
needed 3 register ports.

So, as noted (in my case):
  RISC-like profile: 3R1W register file;
    Only one op at a time.
  2-wide: 4R2W
    But, MEM eats both lanes;
    Only being able to do "ALU | ALU" is lame...
  3-wide: 6R3W
    MEM eats 3rd lane.
    "ALU | MEM": fairly common.
    "ALU | ALU | ALU": occasionally happens.

So, I went with 3-wide because, it wasn't too much more expensive than
2-wide, but was in many ways significantly more useful.

For the "GPU Profile" core, the idea was to drop to a 6R2W register
file, with a limit of 2 instructions per clock. The use-case for the
last 2 register ports being exclusively to supply a 3rd port for the
other instructions. This allows memory ops and MAC/FMAC and similar to
not eat the second lane, and also still has enough ports for the current
set of 128-bit SIMD operations.

My current thinking would be also that this mode would be hard-wired to
use my "XG2" ISA mode (fixed-length 32-bit instructions with 64 GPRs).

Similarly, stripping off most other ISA features not directly relevant
to 3D rendering and similar. So, for example, it probably would still
keep the "LDTEX" instruction (Loads a texel value from a
block-compressed texture), but a lot of other stuff would go away.

If I had a dedicated GPU core, this might be reason to disable LDTEX in
the main core(s); it being a bit niche, and not useful for much other
than drawing texture-mapped spans (and the mechanism for this
instruction isn't entirely free either).


There are possible expansions to the SIMD ISA, such as possible debate
for whether to add, say:
Pxxx.H Rm, Imm52, Rn
Pxxx.F Xm, Imm52, Xn
SIMD ops, which use a 4x FP13 (S.E5.F7) immediate.

As-is, it is necessary to load a 4x Binary16 constant:
JLDI Imm64, Rn //64b as 4x Binary16
PLDCH Imm64, Xn //64b 4x FP16 convert to 4x Binary32
Which burns an extra clock cycle.

But, this is only really likely to make a difference for neural nets,
which is a bit niche (and would be partially counteracted in this case
by no longer being able to put vector shuffles in parallel with the SIMD
ops in the case of Binary16).

Encoding would be, essentially, sticking two Jumbo_Imm prefixes onto a
SIMD op; with the decoder then interpreting the presence of the jumbo
prefixes to mean to decode it as an immediate-form rather than as a 3R
register form (likely extending the mechanism for floating-point
immediate values to be able to handle SIMD vector immediate cases and
similar as well).


Interestingly, with the new FPGA board (with an XC7A200T-1), there is
enough space for two CPU cores and a GPU core. So, roughly triple core;
almost but not quite enough LUTs to go quad-core. Will see how long this
lasts...


Though, still working on trying to get the DDR3 RAM module and similar
working. And it seems like the XC7A200T -1 is more prone to fail timing
than the XC7A100T -1; which was in turn more prone to fail timing than
the XC7S50 -1 (at least, until the FPGAs start to run out of space or
have "routing congestion" warnings).

I guess, more space to put stuff into, means longer paths, and more "net
delay"...

Can put more stuff in the FPGA, but can't have more than 16K of L1
cache... Can at least sort of counter-balance the 16K L1's by now being
able to have 512K of L2 cache (still throwing most of the block RAM in
the FPGA at the L2 cache).

Would have gotten a nice speedup with a -2 FPGA, but the board I got was
a little cheaper and only has a -1 speed grade.


Then again, it is possibly all a little moot, as it has been a good
number of months since I last ran any testing on actual hardware, and
general stability on actual FPGAs seems to have declined slightly
during this time (stuff is annoyingly crash-prone at the moment, much
more so than in the simulations).



TBD whether doubling the size of the L2 cache will counter-balance the
RAM being roughly 50% slower in this case. Minimum CAS latency
effectively goes from 3 cycles to 6 cycles, but the RAM chip also goes
from 128MB to 256MB. Mostly it is that DDR3 was designed for a higher
minimum clock speed than DDR2 (it was designed for 667MHz, not
50MHz...).

Could in theory use Vivado MIG, but meh, still don't want to deal with
AXI or similar.


>
> Meanwhile, opcode fusion is an expensive feature that doesn't scale
> well. Not an ideal approach for either small or large implementations.
>
> Whereas a well rounded ISA does not have these problems..........
>

Yeah.

I am not needing to fuse any opcodes in BJX2...

Then again, I have a different set of issues...




BGB

unread,
Mar 5, 2023, 12:52:41 AM3/5/23
to L Peter Deutsch, isa...@groups.riscv.org
On 3/4/2023 7:37 PM, L Peter Deutsch wrote:
>> Last I checked, the design for WASM also made no provisions for separate
>> compilation or dynamic linking of modules.
>
> I believe this isn't correct. WASM is based on instantiable modules; each
> module is compiled separately. Modules can import and export both data and
> functions; if a computation isn't in progress, I believe it's even possible
> to unload a module and replace it. It's up to the embedding environment as
> to how to manage loading and instantiation: separate from the WASM definition,
> there is a Web standard API for managing WASM modules from JS in a browser.
> To some extent, WASM binaries are actually more like .o libraries than like
> executables.
>

OK, it has been a while since I really looked at it or messed with it
much. I think at the time I was messing with it (a few years back), it
was IIRC working mostly by using a blob of JS to unpack the WASM into
JS ("asm.js") and then feeding it into the browser's eval mechanism.

IIRC, it had also handled C functions by putting the function
references into a big array and then calling via this array, with
another big array representing the entire RAM space for the program; ...

Then again, all this was mostly before I got caught up in my ISA design
project (a number of years ago), so it is possible things have changed
since then.


>> Yeah, probably doesn't matter that much. For an IR format, one can also
>> mostly "solve" the issue of lackluster code-density via LZ compression
>> and "lump caching", since with many interpreter and JIT designs, one
>> might only need to look at the IR once.
>
> Or at least not very often. My SOWA (Space Optimized WebAssembly)
> experiment wasn't quite space competitive with LZ compression, but it was a
> lot simpler to decode (everything byte- not bit-oriented, no back-reference
> tables, ...)
>

I was thinking more like LZ4 or my own RP2 compression.
RP2 is roughly in a similar category to LZ4, but often compresses
slightly better while having fairly similar decode speeds; both roughly
2..3 GB/s on my Ryzen. Both are byte-oriented formats (no entropy coding).


In my case, I am using LZ4 for binaries, mostly as decompressing stuff
on load is faster than reading more data from an SDcard. This is much
more obvious in simulations, which are generally around 300x slower
than real-time.

I am using LZ4 mostly because it seemed to work better with binary code
in my ISA than my RP2 scheme (which tends to be better with most other
data, but worse with program binaries).

Or, essentially, the binaries are a hacked and LZ4-compressed variant
of PE/COFF.
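
For reference, decompression with the stock LZ4 library looks roughly
like the sketch below. This uses the reference liblz4 C API; whether
the loader here uses that exact library or a custom decoder isn't
stated, so treat the helper name as hypothetical:

  #include <stdlib.h>
  #include <lz4.h>

  /* Hypothetical loader helper: inflate one LZ4-compressed section that
     was read from storage, given its known decompressed size. */
  static void *load_lz4_section(const char *comp, int comp_len, int orig_len)
  {
      char *out = malloc(orig_len);
      if (!out)
          return NULL;

      /* LZ4_decompress_safe returns the number of bytes written, or a
         negative value on malformed input. */
      int n = LZ4_decompress_safe(comp, out, comp_len, orig_len);
      if (n != orig_len) {
          free(out);
          return NULL;
      }
      return out;
  }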

A Huffman-encoded format (like Deflate) is somewhat more complicated,
and decoding it would likely take longer than reading the data from the
SDcard or similar in the first place.

Actually, the pagefile uses LZ compression for similar reasons (though
this uses an LZ encoder which was designed more aggressively for "quick
and dirty" LZ compression).



But, in this case, an LZ-compressed IR could likely be decoded in
chunks, with the JIT or AOT compiling each chunk and then discarding it
afterwards. The binary could either be based on a TLV format (possibly
similar to the RIFF AVI format), or maybe built on something similar to
the Quake WAD2 format (I am already using a few variants of this format
in a few random places in my projects).

Possibly it could also be built around a 32K chunking scheme (with the
bytecode for smaller functions being lumped together into a shared
chunk), managed using a similar scheme to what the Doom Engine uses for
managing WAD lumps.
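
To make the container idea a little more concrete, here is a minimal
sketch of what a WAD-like lump directory could look like. The field
layout, names, and magic value are illustrative assumptions, not the
actual WAD2 layout or anything already defined in this project:

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical WAD-like container: header, lump data, then a
     directory of fixed-size entries describing each lump. */
  typedef struct {
      char     magic[4];    /* e.g. "IRW1" (made-up tag)           */
      uint32_t num_lumps;   /* number of directory entries         */
      uint32_t dir_offset;  /* offset of the directory in the file */
  } IrWadHeader;

  typedef struct {
      uint32_t offset;      /* offset of the (LZ-compressed) lump  */
      uint32_t comp_size;   /* stored size                         */
      uint32_t orig_size;   /* size after decompression            */
      char     name[16];    /* lump name (function / chunk id)     */
  } IrWadLump;

  /* Find a lump by name within an in-memory image of the file. */
  static const IrWadLump *irwad_find(const uint8_t *img, const char *name)
  {
      const IrWadHeader *hdr = (const IrWadHeader *)img;
      const IrWadLump *dir = (const IrWadLump *)(img + hdr->dir_offset);
      for (uint32_t i = 0; i < hdr->num_lumps; i++)
          if (!strncmp(dir[i].name, name, 16))
              return &dir[i];
      return NULL;
  }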


It is unclear whether an AOT would directly use the bytecode ops to
drive an instruction emitter, or would first build up an array of ASM
instructions which is then handed off to the assembler.

Say:
struct ScratchOp_s {
  short nmid;  //mnemonic
  byte fmid;   //instruction form
  byte rd;     //dest (CPU register)
  byte rs;     //source A (CPU register)
  byte rt;     //source B (CPU register)
  int imm;     //immediate (or label / symbol)
};

Though, the interpretation of this structure would depend somewhat on
the target ISA, and things like labels would exist as
pseudo-instructions. The emitter back-end would then figure out which
machine instruction to emit (keyed by nmid and fmid), fill in the
appropriate fields, and then emit the instruction into the output
section (as a working buffer).
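
As a rough sketch of what such an emitter back-end could look like
(reusing the ScratchOp_s struct above; the opcode values, field
positions, and helper names below are invented for illustration and are
not from the actual project):

  #include <stdint.h>

  /* Made-up mnemonic / form IDs, purely for illustration. */
  enum { NMID_ADD = 1, NMID_LW = 2 };
  enum { FMID_3R = 1, FMID_RI = 2 };

  /* Emits one 32-bit word into the current output section. */
  extern void emit_word32(uint32_t w);

  /* Turn one ScratchOp into a machine instruction, keyed by nmid/fmid. */
  static void emit_scratch_op(const struct ScratchOp_s *op)
  {
      uint32_t w = 0;
      switch (op->nmid) {
      case NMID_ADD:
          if (op->fmid == FMID_3R)   /* ADD Rd, Rs, Rt  */
              w = (0x01u << 26) | (op->rd << 20) | (op->rs << 14) |
                  (op->rt << 8);
          else                       /* ADD Rd, Rs, Imm */
              w = (0x02u << 26) | (op->rd << 20) | (op->rs << 14) |
                  ((uint32_t)op->imm & 0x3FFF);
          break;
      case NMID_LW:                  /* LW Rd, [Rs + Imm] */
          w = (0x03u << 26) | (op->rd << 20) | (op->rs << 14) |
              ((uint32_t)op->imm & 0x3FFF);
          break;
      /* ... other mnemonics; label pseudo-ops would be handled before
         reaching this point. */
      }
      emit_word32(w);
  }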

Possibly, the use of a single monolithic section buffer could be
avoided by using an array of "page buffers", which avoids needing to
"realloc()" buffers that are possibly 100s of kB or more.
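
A minimal sketch of that page-buffer idea (page size, limits, and names
are arbitrary assumptions):

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE_SIZE 4096          /* arbitrary page granule            */
  #define MAX_PAGES 4096          /* caps a section at 16MB here       */

  typedef struct {
      uint8_t *pages[MAX_PAGES];  /* allocated lazily, one at a time   */
      size_t   length;            /* total bytes emitted so far        */
  } SectionBuf;

  /* Append bytes to the section, allocating new pages as needed; no
     existing page ever has to be realloc'd or moved. */
  static int section_append(SectionBuf *sb, const void *src, size_t len)
  {
      const uint8_t *p = src;
      while (len > 0) {
          size_t page = sb->length / PAGE_SIZE;
          size_t off  = sb->length % PAGE_SIZE;
          size_t n    = PAGE_SIZE - off;
          if (n > len)
              n = len;
          if (page >= MAX_PAGES)
              return -1;
          if (!sb->pages[page]) {
              sb->pages[page] = calloc(1, PAGE_SIZE);
              if (!sb->pages[page])
                  return -1;
          }
          memcpy(sb->pages[page] + off, p, n);
          sb->length += n;
          p   += n;
          len -= n;
      }
      return 0;
  }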


With the AOT likely operating one function at a time.

Could possibly pull this off without too unreasonable of a memory footprint.

...


MitchAlsup

unread,
Mar 29, 2023, 7:45:45 PM3/29/23
to RISC-V ISA Dev, L Peter Deutsch, isa...@groups.riscv.org, BGB
Sorry for the length of the reply::

On Saturday, March 4, 2023 at 7:37:51 PM UTC-6 L Peter Deutsch wrote:
> Last I checked, the design for WASM also made no provisions for separate
> compilation or dynamic linking of modules.

I believe this isn't correct. WASM is based on instantiable modules; each
module is compiled separately. Modules can import and export both data and
functions; if a computation isn't in progress, I believe it's even possible
to unload a module and replace it.

Compilation units and Modules can be linked to form larger modules. 

Modules use a symbol table to expose the names they provide to other modules, and
use a Global Offset Table (GOT) to consume names exported by other modules in
a position-independent (PIC) fashion. Each module ends up with its own GOT and
PLT (Procedure Linkage Table). The link editor maintains the symbol tables while
converting compilation units into modules, and lists of small modules into larger
modules. The link editor compresses the GOT down to unique values so that all
references to 'extern name' use the same GOT[entry]; the link editor then has to
alter the offsets used by the call and memory reference instructions (below) so as
to use the correct GOT[entry] after said editing.

To access imported data by name, the compiler produces a load from GOT[n], which
gets the address of the named symbol, and then a second memory reference to
access the named extern data.

To call an external function, the compiler produces a load from GOT[n], which gets
the address of the function (if it has been linked, or the address of PLT[n] if it
has not). Then, the compiler produces a call-indirect instruction transferring
control either to the PLT or to the linked function.
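
Conceptually, something like the following (a hedged sketch in C; the
GOT slot names are invented, and a real compiler emits this as a couple
of loads and an indirect call in assembly rather than literal array
indexing):

  /* Hypothetical GOT slot indices for two imported names. */
  enum { GOT_FOO = 0, GOT_BAR = 1 };

  extern void *got[];               /* this module's Global Offset Table */

  /* Access to an imported variable: two memory references. */
  int read_extern_foo(void)
  {
      int *p = (int *)got[GOT_FOO]; /* 1st load: address of foo from GOT */
      return *p;                    /* 2nd load: the data itself         */
  }

  /* Call to an external function: load the address (or the PLT stub's
     address if not yet resolved), then call indirectly. */
  void call_extern_bar(void)
  {
      void (*fp)(void) = (void (*)(void))got[GOT_BAR];
      fp();
  }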

The dynamic linker (generally ld.so) has access to a list of files from the application
that it searches when an unlinked access is performed. Ld.so's main duty is to search
the symbol tables of the modules in that list of files to resolve names, and to position
the code and data in application virtual memory such that one module can access data
in another module (through the GOT) and one module can call entry-points in another
module (also by using the GOT and a PLT).

Code in the text section is not modified, leaving it sharable. The GOT exists in the
data section and is initialized (pre ld.so) such that memory references will fail and
function references will be routed first to ld.so for resolution of the name (through
the PLT). When the name has been resolved, ld.so updates the GOT to point at the name
(wherever it is in application virtual memory), and (in most architectures) ld.so
also alters the PLT such that it transfers control directly to the named entry-point.
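
A rough sketch of that lazy-binding flow, with everything (the slot
numbering, lookup_symbol(), and the patching itself) reduced to
hypothetical C for illustration; real PLT stubs do this in a few
instructions of assembly, and an unresolved slot initially points at
the resolver stub rather than holding zero:

  typedef void (*fn_t)(void);

  extern void *got[];                    /* per-module GOT             */
  extern void *lookup_symbol(int slot);  /* hypothetical ld.so lookup  */

  /* What the PLT stub + ld.so conceptually do for a call through an
     unresolved slot: resolve the name, patch the GOT, then call it. */
  static void call_through_slot(int slot)
  {
      if (got[slot] == 0)                   /* simplified "unresolved"  */
          got[slot] = lookup_symbol(slot);  /* ld.so patches the slot   */
      ((fn_t)got[slot])();                  /* later calls go straight  */
  }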

Let us postulate that we have an application module that has been link-edited into
a module that requires 5 dynamic libraries, and that somewhere in each of those
dynamic libraries exists a call to printf() in a 6th dynamic library. As the application
runs, it calls one of the dynamic library entry points:: ld.so steps in and reads the
dynamic module header and places a new copy of DATA and BSS in new memory in the
application's virtual address space; if the TEXT section is already in memory, ld.so
maps it into this application's VAS, otherwise ld.so reads the module into newly
allocated VAS. Then ld.so proceeds to decorate the PLT and GOT that caused this
loading of the module, and then calls the entry-point which started this work. As
each dynamic module is pulled in, it gets around to calling printf(). The first time,
ld.so maps the existing printf() image from some other application, but each ld.so
excursion fills in missing GOT and PLT table entries. At the end there are 5 GOT
tables pointing at printf().

In order to unload a module at run time, one simply has to undo the work performed
on the GOTs (and PLTs). So every GOT and PLT pointing at this module needs to
revert back to its original state. So, while linking is a 1:1 mapping (here is an
index into the symbol table, go resolve this name), unloading is a many:few mapping.
Not impossible, but way more complicated than what typical ld.so's do.

If you just want the memory available to another application (or process), this is best
handled by allowing the swapper to see that the memory is not being used, and swap 
out that portion.

If the application had been statically linked (except for the calls to printf() ), then
there would only be 1 GOT and 1 entry in the GOT pointing at printf(). So, careful
program analysis and linking could make unloading certain modules almost linear
and easy to accomplish.

{{This is in total contrast to the Burroughs 6000 series of stack machines, which could
replace a dynamic library while code was still executing in that library. When the
code returns from the library and makes a subsequent call into that library, control
gets transferred to the new library. We could mimic this using GOT and PLT surgery,
but it gets fraught with difficulties (especially with multiple threads within a process,
as updates to the GOT and PLT have to be ATOMIC.....)}}