
More of my philosophy about CISC and RISC instructions..


Amine Moulay Ramdane

Aug 4, 2023, 12:48:38 PM
Hello,


More of my philosophy about CISC and RISC instructions..


So we can generally consider CISC (Complex Instruction Set Computer)
instructions of the x86 architecture to be higher-level programming instructions compared to RISC (Reduced Instruction Set Computer) instructions, due to their complexity.

CISC instructions are designed to perform more complex operations in a single instruction. This complexity allows higher-level programming languages and compilers to generate fewer instructions to accomplish certain tasks. CISC architectures often have a broader range of instructions, some of which might even directly correspond to operations in high-level programming languages.

In contrast, RISC instructions are designed to be simpler and more streamlined, typically performing basic operations that can be executed in a single clock cycle. A RISC processor might therefore require more instructions to accomplish the same high-level task that a CISC instruction could handle in a single operation.
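As a small editorial illustration of that difference (a sketch: the function and variable names are hypothetical, and the instruction sequences in the comments are typical shapes, not exact compiler output):

    /* One C statement that a CISC ISA can lower to a single instruction,
       while a classic RISC needs a load/add/store sequence. */
    long counter;

    void bump(long delta) {
        counter += delta;
        /* x86-64 (CISC), one read-modify-write instruction:
               add qword ptr [rip + counter], rdi
           RISC-V (RISC), separate simple instructions:
               ld  t0, 0(a1)      # load counter
               add t0, t0, a0     # add delta
               sd  t0, 0(a1)      # store counter back
           (RISC-V also needs extra address-forming instructions,
            so the real sequence can be longer.) */
    }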

More of my philosophy about Arm Vs. X86 ..

I invite you to carefully read the following interesting article
to understand more:

Overhyped Apple Silicon: Arm Vs. X86 Is Irrelevant

https://seekingalpha.com/article/4447703-overhyped-apple-silicon-arm-vs-x86-is-irrelevant


More of my philosophy about code compression of RISC-V and ARM and more of my thoughts..

I think I am highly smart, and I have just read the following paper,
which says that RISC-V Compressed programs are 25% smaller than RISC-V programs, fetch 25% fewer instruction bits than RISC-V programs, and incur fewer instruction cache misses. Its code size is competitive with other compressed RISCs. RVC is expected to improve the performance and energy per operation of RISC-V.

Read more here:

https://people.eecs.berkeley.edu/~krste/papers/waterman-ms.pdf


So I think RVC has about the same compression as ARM Thumb-2, and I think
I was correct in my previous thoughts (read them below);
so we now have to look at whether x86 or x64 is still more cache friendly even with Thumb-2 compression or RVC.

More of my philosophy of who will be the winner, x86 or x64 or ARM and more of my thoughts..

I think I am highly smart, and since x86 and x64 have complex instructions while ARM has simple instructions, I think x86 and x64 are more cache friendly. ARM has tried to solve this problem by compressing code with Thumb-2, which I think shrinks code size by around 25%, so
we have to look at whether x86 or x64 is still more cache friendly even with Thumb-2 compression. I also think x86 and x64 will keep being optimized for power and energy efficiency, and since x86 and x64 have other big advantages, like the one I talk about below, I think they will remain successful big players in the future; that will be the "tendency". So I think x86 and x64 will be good for a long time for making money in business, and they will be good business for the USA, which makes the AMD and Intel CPUs.


More of my philosophy about x86 or x64 and ARM architectures and more of my thoughts..

I think I am highly smart, and I think that the x86 and x64 architectures
have another big advantage over the ARM architecture, and it is the following:


"The Bright Parts of x86

Backward Compatibility

Compatibility is a two-edged sword. One reason that ARM does better in low-power contexts is that its simpler decoder doesn't have to be compatible with large accumulations of legacy cruft. The downside is that ARM operating systems need to be modified for every new chip version.

In contrast, the latest 64-bit chips from AMD and Intel are still able to boot PC DOS, the 16-bit operating system that came with the original IBM PC. Other hardware in the system might not be supported, but the CPUs have retained backward compatibility with every version since 1978.

Many of the bad things about x86 are due to this backward compatibility, but it's worth remembering the benefit that we've had as a result: New PCs have always been able to run old software."

Read more at the following web link:

https://www.informit.com/articles/article.aspx?p=1676714&seqNum=6


So I think that you cannot compare x86 or x64 to ARM, since it is
not just a power-efficiency comparison, as some do when comparing
the Apple M1 Pro ARM CPU to x86 or x64 CPUs. That is why I think the x86 and x64 architectures will be here for a long time; they will be good for a long time for making money in business, and they are good business for the USA, which makes the AMD and Intel CPUs.

More of my philosophy about weak memory model and ARM and more of my thoughts..


I think the ARM hardware memory model is not good, since it is a
weak memory model, so ARM should provide us with a TSO memory
model that is compatible with the x86 TSO memory model. Read what Kent Dickey says about it in my following writing:


ProValid, LLC was formed in 2003 to provide hardware design and verification consulting services.

Kent Dickey, founder and President, has had 20 years experience in hardware design and verification. Kent worked at Hewlett-Packard and Intel Corporation, leading teams in ASIC chip design and pre-silicon and post-silicon hardware verification. He architected bus interface chips for high-end servers at both companies. Kent has received more than 10 patents for innovative work in both design and verification.

Read more here about him:

https://www.provalid.com/about/about.html


And read the following thoughts of Kent Dickey about weak memory models such as ARM's:

"First, the academic literature on ordering models is terrible. My eyes
glaze over and it's just so boring.

I'm going to guess "niev" means naive. I find that surprising since x86
is basically TSO. TSO is a good idea. I think weakly ordered CPUs are a
bad idea.

TSO is just a handy name for the Sparc and x86 effective ordering for
writeback cacheable memory: loads are ordered, and stores are buffered and will complete in order but drain separately from the main CPU pipeline. TSO can allow loads to hit stores in the buffer and see the new value, this doesn't really matter for general ordering purposes.

TSO lets you write basic producer/consumer code with no barriers. In fact, about the only type of code that doesn't just work with no barriers on TSO is Lamport's Bakery Algorithm since it relies on "if I write a location and read it back and it's still there, other CPUs must see that value as well", which isn't true for TSO.

Lock free programming "just works" with TSO or stronger ordering guarantees, and it's extremely difficult to automate putting in barriers for complex algorithms for weakly ordered systems. So code for weakly ordered systems tend to either toss in lots of barriers, or use explicit locks (with barriers). And extremely weakly ordered systems are very hard to reason about, and especially hard to program since many implementations are not as weakly ordered as the specification says they could be, so just running your code and having it work is insufficient. Alpha was terrible in this regard, and I'm glad it's silliness died with it.

HP PA-RISC was documented as weakly ordered, but all implementations
guaranteed full system sequential consistency (and it was tested in and
enforced, but not including things like cache flushing, which did need
barriers). No one wanted to risk breaking software from the original in-order fully sequential machines that might have relied on it. It wasn't really a performance issue, especially once OoO was added.

Weakly ordered CPUs are a bad idea in much the same way in-order VLIW is a bad idea. Certain niche applications might work out fine, but not for a general purpose CPU. It's better to throw some hardware at making TSO perform well, and keep the software simple and easy to get right.

Kent"


Read the rest on the following web link:

https://groups.google.com/g/comp.arch/c/fSIpGiBhUj0
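As an editorial aside, here is a minimal C11 sketch of the producer/consumer pattern Dickey describes (the names are hypothetical). On a TSO machine such as x86, the release store and acquire load compile to plain stores and loads with no barrier instructions; on a weakly ordered machine such as ARM, the compiler must emit real barrier or ordered-access instructions:

    #include <stdatomic.h>

    int payload;               /* ordinary data */
    atomic_int ready;          /* publication flag, initially 0 */

    void producer(void) {
        payload = 42;                            /* write the data first */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);  /* then publish */
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                    /* spin until published */
        return payload;        /* guaranteed to observe 42 */
    }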



Thank you,
Amine Moulay Ramdane.

MitchAlsup

Aug 8, 2023, 3:18:47 PM
On Friday, August 4, 2023 at 11:48:38 AM UTC-5, Amine Moulay Ramdane wrote:
> Hello,
>
> More of my philosophy about CISC and RISC instructions..
>
> So we can generally consider CISC (Complex Instruction Set Computer)
> instructions of x86 architecture to be higher-level programming instructions compared to RISC (Reduced Instruction Set Computer) instructions due to their complexity.
>
> CISC instructions are designed to perform more complex operations in a single instruction. This complexity allows higher-level programming languages and compilers to generate fewer instructions to accomplish certain tasks. CISC architectures often have a broader range of instructions, some of which might even directly correspond to operations in high-level programming languages.
>
> In contrast, RISC instructions are designed to be simpler and more streamlined, typically performing basic operations that can be executed in a single clock cycle. It might require more instructions to accomplish the same high-level task that a CISC instruction could handle in a single operation.
<
RISC<->CISC is a spectrum:: with things like Intel 432 on one end and MIPS R2000 on the other end.
<
My point here is that one can design a RISC-like ISA where 98% of the instructions are RISC in their purest form but by decorating the ISA with just a few CISC-tricks one can significantly reduce the number of instructions while reducing the code foot print and the number of cycles it takes to execute the program.
<
My 66000 ISA is a case in point. Over 580 subroutines compiled with Brian's compiler and compared with the same front end (LLVM), My 66000 takes only 70% (average; 68% Geomean, 67% Harmonic-Mean) of the number of instructions that RISC-V takes. The gain comes from 5 My 66000 components {and somewhat surprisingly, each component seems to gain about the same as the others}. These components are::
1) Constants
2) Prologue and Epilogue instructions (along with a co-developed ABI)
3) Predication (that does not eat bits in the typical instructions)
4) Vectorization (that adds only 6-bits of state to the implementation)
5) Scaled Index with Displacement Addressing
With a few tweaks on the side
6) Sign control over Operands
7) Transcendental instructions
<
So, basically, there are no (zero, zilch, nada, 0) instructions used to create constants (Integer, pointer, absolute, floating point,...);
There are almost no NEG, INV instructions--these tend to occur mostly when an operand is negated when used as an argument to or result from a function and there is no transmissibility of the negation to the other side of the interface.
There is 1 prologue instruction per call and one epilogue instruction per return.
<
A My 66000 System is resilient in the face of modern attack strategies; including
1) Return Oriented Programming
2) Buffer Overrun
3) RowHammer
4) Meltdown
5) Spectré
without incurring any execution overhead in doing so !
<
As a best case the r8_erf() function in polpack: My 66000 only requires 43.3% of the instructions RISC-V requires, only occupies 57.3% of memory (all forms) and only occupies 73% as much space on "the stack". (and when the compiler is tuned, it will approach 39% of the instruction count and 53% of the memory footprint.)
<
And it is a RISC-like instruction set.
>
<snip>
>
> So i think that you can not compare x86 or x64 to ARM, since it is
> not just a power efficiency comparison, like some are doing it by comparing
> the Apple M1 Pro ARM CPU to x86 or x64 CPUs, it is why i think that x86 or x64 architectures will be here for a long time, so i think that they will be good for a long time to make money in business, and they are a good business for USA that make the AMD or Intel CPUs.
<
The most power-efficient means of placing a static bit pattern as an operand into execution is to have the constant as part of the instruction stream, wasting no instructions, no cycles, and no registers in getting it there. My 66000 does this; RISC-V, ARM, x86,... do not (a 64-bit VAX would have, too).
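For illustration (an editorial sketch, not from the post; the comments show typical instruction shapes, not exact compiler output):

    /* Materializing a 64-bit constant. */
    unsigned long long k(void) {
        return 0x123456789ABCDEFULL;
        /* My 66000 (and x86-64's movabs): the 64-bit value rides along
           inline in the instruction stream.
           RISC-V: either a multi-instruction lui/addi/slli sequence,
           or a load from a constant pool in memory. */
    }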
>
> I think ARM hardware memory model is not good, since it is a
> weak memory model, so ARM has to provide us with a TSO memory
> model that is compatible with x86 TSO memory model,
<
My 66000 memory model depends on how the PTE translates the virtual address.
Cacheable memory is causally ordered
unCacheable memory is sequentially consistent
Memory Mapped I/O is sequentially consistent
Configuration access is strongly ordered
ROM is unordered;
AND
ATOMIC accesses automagically convert the memory model from causally ordered
to sequentially consistent, but only for the duration of the ATOMIC event. This gives
the advantage of faster memory the 99% of the time you are not doing ATOMICs,
and gives you the sequentially consistent memory order ATOMICs need without
having to have any FENCE instructions executed !
<
-----------------------------------------------------------------------------------------------------------------------------------
VAX-like instruction count
better than RISC-like performance
easy to pipeline
no microcode
30% of flow control instructions do not "take" any branch
vectorization suitable for vectorizing libm and math (that is byte/string manipulation)
-----------------------------------------------------------------------------------------------------------------------------------
>
You may be smart, but you remain a bit myopic.

pec...@gmail.com

Aug 13, 2023, 5:58:01 PM
MitchAlsup wrote:
> My 66000 ISA is a case in point. Over 580 subroutines compiled with Brian's compiler and compared to the same front ends (LLVM) My 66000 only takes 70% (average, 68% Geomean, 67% Harmonic-Mean) of the number of instructions that RISC-V takes. The gain comes from 5 My 66000 components {and somewhat surprisingly; each component seems to gain about the same as the others. These components are::
> 1) Constants
> 2) Prologue and Epilogue instructions (along with a co-developed ABI)
> 3) Predication (that does not eat bits in the typical instructions)
> 4) Vectorization (that adds only 6-bits of state to the implementation)
> 5) Scaled Index with Displacement Addressing
> With a few tweaks on the side
> 6) Sign control over Operands
> 7) Transcendental instructions
> <
> So, basically, there are no (zero, zilch, nada, 0) instructions used to create constants (Integer, pointer, absolute, floating point,...);
> There are almost no NEG, INV instructions--these tend to occur mostly when an operand is negated when used as a argument to or result from a function and there is no transmissibility of the negation to the other side of the interface.)

I think that you should reconsider instruction compression, because it is quite orthogonal to all these advantages.
More advanced instructions remove entropy from the correlation between operations.
Better instruction encoding works at the individual-instruction level.
With 16-bit parcels, the 32-bit header in the instructions should become 16/32(/48?), and smaller parcels could help for small constants.
All you need is to squeeze half a bit out of the encoding (for example, I think that one operand with value negation should be sufficient; almost all operations are symmetrical).

It won't be as efficient as in the typical RISC case, but do you really think that a marginally simpler decoder is more
valuable than 20-25% more instructions in the cache?
I know that you love your encoding, but it seems to me that it is a classical case of premature optimization.
The only advantage I can see for extremely cheap and fast decoding is that maybe there is no need for a uOP cache?


MitchAlsup

Aug 13, 2023, 7:49:17 PM
On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:
> MitchAlsup wrote:
> > My 66000 ISA is a case in point. Over 580 subroutines compiled with Brian's compiler and compared to the same front ends (LLVM) My 66000 only takes 70% (average, 68% Geomean, 67% Harmonic-Mean) of the number of instructions that RISC-V takes. The gain comes from 5 My 66000 components {and somewhat surprisingly; each component seems to gain about the same as the others. These components are::
> > 1) Constants
> > 2) Prologue and Epilogue instructions (along with a co-developed ABI)
> > 3) Predication (that does not eat bits in the typical instructions)
> > 4) Vectorization (that adds only 6-bits of state to the implementation)
> > 5) Scaled Index with Displacement Addressing
> > With a few tweaks on the side
> > 6) Sign control over Operands
> > 7) Transcendental instructions
> > <
> > So, basically, there are no (zero, zilch, nada, 0) instructions used to create constants (Integer, pointer, absolute, floating point,...);
> > There are almost no NEG, INV instructions--these tend to occur mostly when an operand is negated when used as a argument to or result from a function and there is no transmissibility of the negation to the other side of the interface.)
<
> I think that you should reconsider instruction compression, because it is quite orthogonal to all these advantages.
<
With the current encoding there is nowhere to put more than a few compressed instructions--one would have to get a significant advantage out of 12 holes in the ISA map and at the same time sacrifice expandability of the ISA in the future. This is something I am not willing to do at this time.
<
> More advanced instructions remove entropy from correlation between operations.
<
Easy access to all sizes of constants gets rid of many (MANY) LDs in FP codes used to access constants.
I found 1 string of 42 FP instructions in a row targeting the FMAC unit (no LDs, no STs, no branches, no integers,...) and several of these were transcendental instructions (instead of function calls).
<
> Better instruction encoding works on individual level.
> 16 bit parcels, 32 bit header in the instructions should become 16/32(/48?), and smaller parcels could help for small constants.
<
Many (MANY) times I have wanted a 48-bit constant, and I have not yet figured out how to cram it in.
Also note: My variable length decoder is 4 gates of delay, enabling decode widths of up to 16 IPC in a single pipe stage.
<
> All you need is to squeeze half of bit in the encoding (for example i think that one operand with value negation should be sufficient, almost all operations are symmetrical).
<
The words you are looking for is associative, commutative, and distributive.
And while my encodings may only lightly use the encoding space I allocated to all my ISA magic, they do get used:
1<<k        // a single set bit at position k
~0<<k       // ones at and above bit k
~(~0<<k)    // a mask of the k low bits
~0>>k       // ones in the low bits (logical shift right by k)
~A & ~B     // NOR, by De Morgan
~A | ~B     // NAND, by De Morgan
...
>
> It won't be as efficient as in typical RISC case, but do you really think that a marginally simpler decoders are more
> valuable than 20-25% more instructions in the cache?
<
The data I do have indicates not-so-much otherwise as it is closer to a complete wash::
<
Comparing 16KB 4-way set, 24KB 3-way set, and 32KB 4-way set caches measured on the eXcel simulator
{So, it's no better than hand-waving accuracy}
<
16K is 4.4% slower than 24K, which is 1.1% slower than 32K, when backed up by an 8-cycle 256KB 4-way set L2 and 100 cycles to DRAM. "Slower" incorporates all the pipeline delays such as cache misses, branch target latency, multi-cycle back-to-back instructions,...
<
32K is 25% bigger than 24K but only 1.1% faster, and likely burns more than 1.1% more power.
<
Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
<
> I know that you love your encoding, but it seems to me that it is a classical case of premature optimization.
<
There are RISC-V design groups out there talking about discarding RISC-V compression because it takes too much of the Major OpCode map*............So, even they who have it are not as enamored as academic literature tends to portray.
(*) which they want to use for other novel purposes.
<
On the other hand, if I took the RISC-V OpCode map and deleted all compression, I could fit my entire ISA in the missing holes.........
<
> The only advantage I can see for extremely cheap and fast decoding is that maybe there is no need for uOP cache ?
<
The advantages grow with decode width.
<
The shortness of the fetch-decode pipeline is paramount to low branch overhead (both taken and mispredicted).
<
I am working on a 6-wide OoO design that uses a 4 stage fetch-decode pipeline.
In comparison BOOM needs 8 stages for the equivalent fetch-decode pipeline.
{{In a different comparison:: Mc 88120 had a 2-stage fetch-decode pipeline and was similarly 6-wide}}
In addition, My branch mispredict repair is 0-cycles.
{The cycle after the mispredict is detected, you can be decoding instructions on the architectural path with the renamer and ROB up to date.}
This means that BOOM needs a branch predictor with better than ½ the misprediction rate as I will need.
<
My pipeline gives SRAM access 1 complete cycle, BOOM accesses SRAM, compares tag and TLB then multiplexes out the SRAM data in a single cycle. BOOM is getting 1 GHz± designs, I am looking at 3 GHz±.

BGB

Aug 14, 2023, 1:21:24 AM
Ironically, in my case, I went with a 16/32/48 encoding at first, which
has now mutated into 16/32/64/96 and 32/64/96 (when I switched over to
the current encoding scheme, the original set of 48-bit instructions
became non-encodable; but gained the ability to encode larger immediate
fields).

In a newer ISA variant, another tradeoff was made which loses the 16-bit
ops but allows for a more orthogonal expansion of the ISA to 64 GPRs.
The jury is still out as to which option is "better" (it is a similar
issue with 32 vs 64 GPRs, as there doesn't seem to be a clear winner in
terms of superiority on this point either).


Though, at present, the gain or loss of 16-bit ops in performance
oriented code is small, since if one optimizes towards executing as few
instructions as possible, the relative merit of 16-bit ops is diminished.

The relative advantage of, say, being able to express "MOV Reg,Reg" in
16-bits is lessened when the alternative is to have not needed to use
the instruction in the first place (as one goes from thinking "MOV is
common, better if it is small", to realizing that the majority of cases
where this instruction ends up being used, it is not strictly necessary).


This is in turn a merit for trying to minimize the number of "commonly
repeating multi-op sequences" (which typically point towards potential
inefficiencies); with the tradeoff that as one does so, their ISA
listing is no longer quite so concise or simple.

Well, and potentially intimidating as my core spec basically covers the
entire ISA, rather than folding off each set of optional features into
its own specifications.


This pain got a little annoying in a recent sub-project of trying to
write a new C compiler from the ground up, and then needing to fill in
the listing tables and define macros for each mnemonic; with several
hundred mnemonics (and ~700 entries in the listing table), this got
tedious. Possibly unsurprisingly, writing an "all new" C compiler from
the ground up (vs my existing "older" C compiler) is kind of a pain...


Well, then I got distracted from my "write a new C compiler" effort, by
going and writing a new MMIO module to perform edge-walking for
rasterizing primitives (and then debating someone for why I went for
edge-walking rather than edge-equations, ...).

Well, with the hope that doing edge-walking via an MMIO module will be
at least "slightly" faster than the existing software rasterization
loops (though, at present, it would be more of a hybrid strategy).

...

pec...@gmail.com

Aug 14, 2023, 6:45:10 AM
MitchAlsup wrote:
> On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:
> > It won't be as efficient as in typical RISC case, but do you really think that a marginally simpler decoders are more
> > valuable than 20-25% more instructions in the cache?
> <
> The data I do have indicates not-so-much otherwise as it is closer to a complete wash:: (...)
> 24K is 1.1% slower than 32K when backed up by an 8-cycle 256KB 4-way set L2 and 100 cycles to DRAM.
> "Slower" incorporates all the pipeline delays such as cache misses, branch target latency, multi-cycle back-to-back instructions,...
> 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more than 1.1% more power.
> Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.

That is the point.
So it seems that code compression is an important feature only for the low-end spectrum of processors, and totally irrelevant for big cores.

>Many (MANY) times I have wanted a 48-bit constant, and I have not yet figured out how to cram it in.

So what about the current encoding scheme on 16-bit parcels? It halves the range of control flow, but will also allow for 16-bit constants.
There is a chance of a small benefit.

> There are RISC-V design groups out there talking about discarding RISC-V compression because it takes too much of the Major OpCode map*............So, even they who have it are not as enamored as academic literature tends to portray.
> (*) which they want to use for other novel purposes.

I have started to think that RVC should be removed from the specification, and its opcode space should be essentially free for any use.
Code compression could be optional and vendor specific, performed during installation or loading/linking.
Compilers are unaware of it anyway, and it doesn't affect the size of the zipped binaries used for distribution.
The reserved part of the 16-bit space alone could double the available 32-bit opcode space.

> > The only advantage I can see for extremely cheap and fast decoding is that maybe there is no need for uOP cache ?
> <
> The advantages grow with decode width.
> <
> The shortness of the fetch-decode pipeline is paramount to low branch overhead (both taken and mispredicted).
> <
> I am working on a 6-wide OoO design that uses a 4 stage fetch-decode pipeline.
> In comparison BOOM needs 8 stages for the equivalent fetch-decode pipeline.
> {{In a different comparison:: Mc 88120 had a 2-stage fetch-decode pipeline and was similarly 6-wide}}
> In addition, My branch mispredict repair is 0-cycles.
> {The cycle after the mispredict is detected, you can be decoding instructions on the architectural path with the renamer and ROB up to date.}
> This means that BOOM needs a branch predictor with better than ½ the misprediction rate as I will need.

Impressive.

MitchAlsup

Aug 14, 2023, 12:32:56 PM
On Monday, August 14, 2023 at 5:45:10 AM UTC-5, pec...@gmail.com wrote:
> MitchAlsup wrote:
> > On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:
> > > It won't be as efficient as in typical RISC case, but do you really think that a marginally simpler decoders are more
> > > valuable than 20-25% more instructions in the cache?
> > <
> > The data I do have indicates not-so-much otherwise as it is closer to a complete wash:: (...)
> > 24K is 1.1% slower than 32K when backed up by an 8-cycle 256KB 4-way set L2 and 100 cycles to DRAM.
> > "Slower" incorporates all the pipeline delays such as cache misses, branch target latency, multi-cycle back-to-back instructions,...
> > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more than 1.1% more power.
> > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> It is the point.
> So it seems that code compression is an important feature only for the low-end spectrum of processors and totally irrelevant for
> big cores.
> >Many (MANY) times I have wanted a 48-bit constant, and I have not yet figured out how to cram it in.
> So what about current encoding scheme on 16 bit parcels? It halves a range of control flow, but will also allow for 16 bit constants.
<
I already have 16-bit displacements and immediates in the Major OpCode group--
these are all 1 word instructions:: Rd = Rs1 OP Imm16 or LD/ST Rd,[Rb+Disp16].
It is when these are insufficient that I have direct access to 32-bit and 64-bit forms.
<
> There is a chance for small benefit.
> > There are RISC-V design groups out there talking about discarding RISC-V compression because it takes too much of the Major OpCode map*............So, even they who have it are not as enamored as academic literature tends to portray.
> > (*) which they want to use for other novel purposes.
<
> I started to think that RVC should be removed from specification, and its opcode space should be essentially free for any use.
> Code compression could be optional and vendor specific, performed during installation or loading/linking.
> Compilers are unaware of it anyway and it doesn't affect the size of zipped binaries used for distribution
> Reserved part of 16-bit space alone could double available 32 bit opcode space.
<
RISC-V allocates 3/4 of the OpCode encoding to 16-bit stuff and gains all the complexity of variable length instructions but gains little of the benefits.

BGB

Aug 14, 2023, 2:28:58 PM
On 8/14/2023 5:45 AM, pec...@gmail.com wrote:
> MitchAlsup wrote:
>> On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:
>>> It won't be as efficient as in typical RISC case, but do you really think that a marginally simpler decoders are more
>>> valuable than 20-25% more instructions in the cache?
>> <
>> The data I do have indicates not-so-much otherwise as it is closer to a complete wash:: (...)
>> 24K is 1.1% slower than 32K when backed up by an 8-cycle 256KB 4-way set L2 and 100 cycles to DRAM.
>> "Slower" incorporates all the pipeline delays such as cache misses, branch target latency, multi-cycle back-to-back instructions,...
>> 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more than 1.1% more power.
>> Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
>
> It is the point.
> So it seems that code compression is an important feature only for the low-end spectrum of processors and totally irrelevant for
> big cores.
>

It matters if you care about a 10% to 30% delta in code-density, but is
mostly irrelevant for performance if one has a "sufficient" I-cache
(say, 16K or more).


Though, in my case, I am mostly using 1-way / direct-mapped caches (as
in my testing, the gains haven't usually been enough to offset the
costs; though yes, in theory a set-associative cache is better than a
direct-mapped cache).

For a few FPGA configurations I have:
XC7A200T: 512K L2 (1MB is possible, but "pushing it")
16K or 32K L1s, mostly depending on timing
XC7A100T: 256K L2
16K or 32K L1s
XC7S50: 64K L2, 8K L1s


In my case, due to my ISA design, the performance-oriented cases end up
mostly (or entirely) using 32-bit instructions.

One code size optimization feature that has ended up being used in
performance-oriented modes is prolog/epilog compression (via reuse),
mostly because this does end up saving enough to make it worthwhile, and
(with a core with a branch predictor) the relative cost is relatively low.


>> Many (MANY) times I have wanted a 48-bit constant, and I have not yet figured out how to cram it in.
>
> So what about current encoding scheme on 16 bit parcels? It halves a range of control flow, but will also allow for 16 bit constants.
> There is a chance for small benefit.
>

In my case, constant sizes are:
Baseline:
16-bit ops:
4 (various ops)
8 (ADD/SUB)
12 (MOV to fixed-register)
32-bit ops:
5/6 (3R/2R)
9/10 (3R/2R)
16 (limited 2R: MOV, ADD, LDSH/SHORI, etc)
24 (MOV to fixed-register)
64-bit ops (jumbo prefix + 32b)
5 -> 29 bits
9/10 -> 33 bits
48 (MOV to fixed-register, Branch Abs48)
96-bit ops (2x jumbo prefix + 32b)
5 -> 53 bits (optional)
9/10 -> 57 bits (optional)
16 -> 64 bits (limited 2R: MOV, ADD, ...)

XG2 Mode (newer, 32/64/96 only, 1)
32-bit ops:
6/7 (3R/2R, 2)
10/11 (3R/2R, 2)
16 (limited 2R, 3)
24 (MOV to fixed-register, 3)
64/96: Sizes mostly unchanged (3)

1: The 16-bit space was reclaimed for more bits.
For most ops, these extend register fields to 6 bits.
2: The added bit mostly serves as a sign-bit
Most Baseline ops are zero-extended only.
3: The additional bits are reserved for now
Likely to be used for more encoding space.

Jumbo prefixes can be used to expand the encoding space, though "for
reasons", the remaining 32-bit encoding space is "valuable real-estate".


>> There are RISC-V design groups out there talking about discarding RISC-V compression because it takes too much of the Major OpCode map*............So, even they who have it are not as enamored as academic literature tends to portray.
>> (*) which they want to use for other novel purposes.
>
> I started to think that RVC should be removed from specification, and its opcode space should be essentially free for any use.
> Code compression could be optional and vendor specific, performed during installation or loading/linking.
> Compilers are unaware of it anyway and it doesn't affect the size of zipped binaries used for distribution
> Reserved part of 16-bit space alone could double available 32 bit opcode space.
>

I would almost be inclined to agree, but more because the existing RVC
encoding scheme is *awful* (like, someone looked at Thumb and was then
like, "Hey man, hold my beer!").

Though, an irony was that, in my case, the 16-bit ops took a bigger part
of the opcode space (7/8). But, this so happened to be the right number
of bits so that reclaiming it for XG2 Mode allowed extending all the
register fields by 1 bit.


Both encoding schemes coexist for now, as they can share the same decoders.
My core also has an experimental RISC-V decoder, but this hasn't been
terribly useful thus far.

In most areas, my ISA is a superset of RISC-V. The one (potential) area
RISC-V has an advantage in, is in terms of having larger immediate and
displacement fields, but this advantage is mostly rendered moot by some
of its other deficiencies.


Well, there is also the seeming property that (if compiled with GCC),
Dhrystone gives a better looking DMIPS/MHz score with RISC-V. But,
seemingly pretty much everything else I had tested thus far seems to be
worse.


Similarly, C speed is closer to ASM speed, but the ASM can't really be
made any faster than the C, and programs like Doom and similar seem to
"kinda rather suck" with RISC-V (at least excluding all the fancy
proposed "opcode fusion" trickery and similar).

Like, if the Dhrystone results were to be believed, Doom should pull off
25-30 fps on a 50MHz in-order core with RISC-V... But, ... it doesn't...
(it was seemingly getting more like 10 to 14 fps).

Though, it's not exactly like SWeRV is a powerhouse of Doom
performance either.


Contrast this with my ISA, where rewriting blobs of code in ASM can often
yield a fairly decent speedup (well, partly counterbalanced by "slightly
naive" code generation by my C compiler).

Well, and the sort of irony that software-rasterized GLQuake is
currently faster than Software Quake on my ISA...

Well, and my sort of experimental "move inner rasterization logic into
an MMIO module" (still needs a lot more debugging) seems mostly able to
push GLQuake into high-single-digit to low double-digit territory (or,
around 8-12 fps). So, this at least seems promising...

Granted, this MMIO module is still using affine texturing, and currently
only does nearest-neighbor sampling (bilinear is still possible and may
be added; perspective-correct texturing would be more complicated). This
module would only handle fixed-function (so, if fragment shaders were
supported, they would still need to fall back to software rasterization).

This module will shift most of the "hot path" for TKRA-GL towards the
front-end logic though (projection and geometry subdivision), which
creates more incentive for perspective-correct texturing in the back-end
(since this would require less work in the geometry handling stage).

...



>>> The only advantage I can see for extremely cheap and fast decoding is that maybe there is no need for uOP cache ?
>> <
>> The advantages grow with decode width.
>> <
>> The shortness of the fetch-decode pipeline is paramount to low branch overhead (both taken and mispredicted).
>> <
>> I am working on a 6-wide OoO design that uses a 4 stage fetch-decode pipeline.
>> In comparison BOOM needs 8 stages for the equivalent fetch-decode pipeline.
>> {{In a different comparison:: Mc 88120 had a 2-stage fetch-decode pipeline and was similarly 6-wide}}
>> In addition, My branch mispredict repair is 0-cycles.
>> {The cycle after the mispredict is detected, you can be decoding instructions on the architectural path with the renamer and ROB up to date.}
>> This means that BOOM needs a branch predictor with better than ½ the misprediction rate as I will need.
>
> Impressive.

In my case, I am using an 8-stage pipeline, but this is more related to
EX logic than decoding or similar.

So, general properties:
Pipeline: PF IF ID1 ID2/RF EX1 EX2 EX3 WB
ID2 stage was named as such,
But calling it RF/"Register Fetch" would be more accurate.
3-wide, Strict In Order / VLIW, Stalling pipeline, ...
Native 64 GPRs at present.
Has a branch predictor
...

In my case, the branch predictor can turn most normal/direct branches
into a fixed 2-cycle latency (note that this ISA does not have
branch-delay slots, so this clock cycle is effectively lost; I do not
feel that the potential advantage of branch-delay-slots is worth the
mess they make in other areas).


Generally running at 50 MHz, but with some tweaking 75MHz is possible
(mostly at the cost of L1 cache size and similar, which hurts more than
the gains from a faster clock-speed, *1).

Mostly, this is (I suspect) because the stall signals from the L1 caches
effectively drive the entire rest of the CPU pipeline.

Though, in general, soft processors on this class of FPGAs
(Spartan/Artix) tend to be similarly slow (so, my project still isn't
bottom end on this metric).

OTOH, MicroBlaze runs at higher clock-speeds (eg, 100 MHz), but is also
effectively a 32-bit microcontroller...


*1: I did some early testing, and 50MHz 3-wide with bigger L1 caches
looked much more promising for performance than 100MHz 1-wide with
small L1 caches (with the L1 cache sizes being the main pivot point
both for MHz and for overall performance).

I can note that, say, if one has a 2K or 4K L1 I$, then most of the CPU
clock cycles will end up being spent waiting for L1 misses, with the L1
miss rate (and miss handling) entirely dominating everything else (and
the theoretical rate at which instructions could execute becomes moot).

With 16K or 32K, then L1 miss rate is no longer the primary dominating
force.


However, I had also evaluated:
25MHz with 64K or 128K L1's would *not* be an advantage over 50MHz with
32K, since while this "mostly" eliminates L1 misses, one sees a pretty
big hit from only being able to run instructions at half the speed.

So, 16K or 32K appears to be a local optimum here.


...


Brett

Aug 14, 2023, 7:17:09 PM
This is the killer argument that would have saved me from caring about
16-bit opcodes.
Only toy CPUs can care about 16-bit opcodes.

JimBrakefield

Aug 14, 2023, 8:01:44 PM
On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:
ISA orthogonality?
|>"instruction compression, because it is orthogonal to all these advantages"

FWIW, my RISC ISA (ROC24-regsiz) encodes instruction length in two bits: 24, 32, 40, or 48 bits.
Constant lengths are encoded in a 5-bit source register field; 12-bit constants take an additional byte.
And by adding a data-type code field to the register file registers, the basic op-codes (add, sub, mult, div, comp, and, or, xor)
have the same encoding throughout the ISA.
This gives 24-bit compression of the most-used instructions, with the longer instruction variants having more options and register fields.
Downsides: more difficult multiple issue, byte address displacements, and no 16-bit constant fields in 32-bit instructions.
Pluses: an approximate 25% reduction in code size helps in the low-performance arena, and "compression" without
giving up a full set of three five-bit register fields.
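A tiny editorial sketch of that length field in C (the 24/32/40/48 mapping is from the post; the field's position in the top two bits of the first byte is an assumption for illustration):

    #include <stdint.h>

    static unsigned insn_len_bits(uint8_t first_byte) {
        unsigned sel = (first_byte >> 6) & 3u;  /* 2-bit length code */
        return 24u + 8u * sel;                  /* 24, 32, 40, or 48 bits */
    }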

So, what is the value of instruction orthogonality?
Here there are three forms of orthogonality: instruction size, constant sizing, op-code reuse via register data type codes.
(and register fields at fixed locations within the instructions)

BGB

Aug 14, 2023, 11:53:02 PM
Yeah, byte aligned encodings kinda don't really seem worth it IMO.


> So, what is the value of instruction orthogonality?
> Here there are three forms of orthogonality: instruction size, constant sizing, op-code reuse via register data type codes.
> (and register fields at fixed locations within the instructions)


One wants enough orthogonality that there aren't major pain points, say:
Most core ISA features can be used in any combination.

If say, one can encode:
ADD 0x1234, R23
And:
ADD 0x0123, R45
But not:
ADD 0x1234, R45

Or:
ADD?T R21, R11, R30
And:
ADD R21, R43, R60
But not:
ADD?T R21, R43, R60

And so on...

This sort of thing kinda sucks...


But, then in my case it ends up being a tradeoff whether to have
orthogonality in these sorts of areas, or the ability to encode 16-bit ops.

...

MitchAlsup

Aug 15, 2023, 12:23:17 PM
Why not ?
<
ADD R7,R9,0x1234
ADD R7,0x1234,R9
ADD R7,R9,0x12345678
ADD R7,0x12345678,R9
ADD R7,R9,0x123456789abcdef
ADD R7,0x123456789abcdef,R9
>
> Or:
> ADD?T R21, R11, R30
> And:
> ADD R21, R43, R60
> But not:
> ADD?T R21, R43, R60
<
You only have that problem when you improperly encode predication.
<
Pcnd Rcnd,TTTEEE
Then-Inst
Then-Inst
Then-Inst
Else-Inst
Else-Inst
Else-Inst
unpredicated-Inst
>
> And so on...
>
> This sort of thing kinda sucks...
>
As Ivan would say:: "Fix it"

BGB

Aug 15, 2023, 1:30:21 PM
I was staying within the limits of 32-bit ops here.

Granted, one can encode:
ADD 0x1234, R45
And:
ADD 0x12345678, R45
In the Baseline mode.

But, at the cost of needing to use a jumbo prefix...


>>
>> Or:
>> ADD?T R21, R11, R30
>> And:
>> ADD R21, R43, R60
>> But not:
>> ADD?T R21, R43, R60
> <
> You only have that problem when you improperly encode predication.
> <
> Pcnd Rcnd,TTTEEE
> Then-Inst
> Then-Inst
> Then-Inst
> Else-Inst
> Else-Inst
> Else-Inst
> unpredicated-Inst

The scheme I am using allows me to execute the Then and Else branches at
the same time:
ADD?T R4, 1, R9 | ADD?F R4, -1, R9

This sort of thing is sometimes useful...
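In C terms, the bundle above is the branchless select below (an editorial sketch; the function name is hypothetical):

    #include <stdint.h>

    /* Effect of "ADD?T R4, 1, R9 | ADD?F R4, -1, R9": both arms issue in
       the same cycle, and the predicate bit T picks which result is kept. */
    static int64_t select_add(int64_t r4, int t) {
        return t ? r4 + 1 : r4 - 1;
    }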


>>
>> And so on...
>>
>> This sort of thing kinda sucks...
>>
> As Ivan would say:: "Fix it"

I have a fix for the above encoding issues...

I had called it XG2 Mode...

But, it has a tradeoff:
XG2 can't encode 16-bit ops;
The 40x2 bundles are also non encodable in this mode.
No huge loss, they were an ugly hack...
Main use case for 40x2 being to work around the issues fixed by XG2.


>>
>> But, then in my case it ends up being a tradeoff whether to have
>> orthogonality in these sorts of areas, or the ability to encode 16-bit ops.
>>

But, what I lacked, was a way to fix the register orthogonality issue
while also still being able to encode 16-bit ops.


I spent a while trying to think up new encoding schemes that could give
"everything I wanted", but couldn't come up with anything that wouldn't
also severely compromise opcode space while also still being able to fit
everything into a 32 bit instruction word.

So, XG2 ended up as an "well, at least I *can* pull this off" option
(even if it did still fall short of my "full wish list", *1).


As-is, Baseline mode seems better if one wants to optimize for code
density. XG2 Mode seems like a better option if performance is the
priority (and one wants to use all 64 GPRs).

Note that a subset of XG2 using only the first 32 GPRs is equivalent (in
terms of encoding) to a subset of the Baseline ISA not using any 16-bit
encodings.


*1: I had also wanted the ability to have multiple predicate registers,
and the ability to use predication+bundle encoding with the entire
opcode space, but this wasn't going to happen...

For example:
ADD?T 0x1234, R9 | ADD?F -999, R9
Remains non encodable, since the Imm16 instructions are not part of the
encoding space covered by PrWEX.

But, Imm10 encodings are still available, so:
ADD?T 0x123, R9 | ADD?F -999, R9
Can still be encoded.


Jumbo prefixes can't quite fix this, as Jumbo prefixes can't be used in
bundles in this way (the use of a jumbo prefix and a bundled instruction
being effectively mutually exclusive at present).


But, alas...

MitchAlsup

Aug 15, 2023, 4:14:43 PM
I have only 32-bit Instruction-specifiers here, the rest are appended constants.
>
> Granted, one can encode:
> ADD 0x1234, R45
> And:
> ADD 0x12345678, R45
> In the Baseline mode.
>
> But, at the cost of needing to use a jumbo prefix...
> >>
> >> Or:
> >> ADD?T R21, R11, R30
> >> And:
> >> ADD R21, R43, R60
> >> But not:
> >> ADD?T R21, R43, R60
> > <
> > You only have that problem when you improperly encode predication.
> > <
> > Pcnd Rcnd,TTTEEE
> > Then-Inst
> > Then-Inst
> > Then-Inst
> > Else-Inst
> > Else-Inst
> > Else-Inst
> > unpredicated-Inst
<
> The scheme I am using allows me to execute the Then and Else branches at
> the same time:
> ADD?T R4, 1, R9 | ADD?F R4, -1, R9
<
What makes you think I cannot ??
>
> This sort of thing is sometimes useful...
> >>
> >> And so on...
> >>
> >> This sort of thing kinda sucks...
> >>
> > As Ivan would say:: "Fix it"
> I have a fix for the above encoding issues...
>
> I had called it XG2 Mode...
>
> But, it has a tradeoff:
> XG2 can't encode 16-bit ops;
<
I fail to see a problem, here.
<
> The 40x2 bundles are also non encodable in this mode.
<
Bundles are so unnecessary.....
<
> No huge loss, they were an ugly hack...
> Main use case for 40x2 being to work around the issues fixed by XG2.
>
> But, alas...
<
and a lack.....

BGB-Alt

Aug 15, 2023, 5:27:26 PM
In my case, the 5/9/10/16 bit immediate encodings encode the immediate
within the same 32-bit instruction word. Larger encodings (such as
Imm33/Disp33) require a pair of 32-bit instruction words, ...

However, the 64-bit encodings are less desirable, since these can't
be organized into bundles.


Could potentially try to address this by changing some of the ISA rules
(to allow jumbo encodings within bundles), but this would make fetch and
decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
prefixes, this would require supporting a 128-bit instruction fetch, ...).

Though, a possible merit would be, say, if I could allow a
"FEii-iiii-FEii-iiii-FFw0-0iii-ZZnm-ZeiZ"
Special case to allow gluing a 64-bit immediate onto pretty much any
other instruction...

But, this falls more into a "possible but debatable if worth the cost"
category...
Assuming the decoder moves along at one instruction (or bundle) per
clock cycle, then encoding them end-to-end would take twice as long.

Granted, could maybe be more clever if an interleaved
then/else/then/else scheme were used (say, with Fetch and Decode
treating this as special).

>>
>> This sort of thing is sometimes useful...
>>>>
>>>> And so on...
>>>>
>>>> This sort of thing kinda sucks...
>>>>
>>> As Ivan would say:: "Fix it"
>> I have a fix for the above encoding issues...
>>
>> I had called it XG2 Mode...
>>
>> But, it has a tradeoff:
>> XG2 can't encode 16-bit ops;
> <
> I fail to see a problem, here.
> <
>> The 40x2 bundles are also non encodable in this mode.
> <
> Bundles are so unnecessary.....
> <

The logic for superscalar still isn't free...

Granted, a fancier core (or an OoO implementation) might design to
mostly ignore the existing bundling and then do its own thing. But, for
the class of hardware I am dealing with here, it makes sense.


>> No huge loss, they were an ugly hack...
>> Main use case for 40x2 being to work around the issues fixed by XG2.
>>
>> But, alas...
> <
> and a lack.....

The 40x2 encoding did allow a few special cases that can't be encoded
otherwise in a bundle, such as '?ST'/'?SF' predication (which would used
SR.S as the predicate bit rather than SR.T), along with a limited form
of "[Rm+Ri*Sc+Disp]" addressing, ...

Along with the encoding deficiencies in the baseline ISA, like the
inability to use predicated instructions with R32..R63, etc.


Otherwise, these instructions would require a 64-bit scalar encoding in XG2.

Could in theory re-add something similar to the 40x2 encoding, but would
need to find some other "otherwise illegal" encoding to use instead
(since the encoding change effectively knocked the 40x2 bundle prefix
out of existence).


MitchAlsup

Aug 15, 2023, 5:53:45 PM
I have 5-bit immediates::
<
FADD R8,#1,R9 // R8 = 1.0D0 + R9;
<
I have two 6-bit immediates::
<
SLL R8,R9,<17:28> // R8 = ~(~0<<17) & (R9>>28)
<
I have 16-bit immediates:
<
ADD R8,R9,0x1234
LD R8,[R9+0x1234]
<
And all of these fit in 1 word--as anyone from the 1st generation RISC camp would
see (except the SPARC guys...)
>
> However, using the 64-bit encoding are less desirable, since these can't
> be organized into bundles.
>
Easily fixed--get rid of the concept of bundles.
>
> Could potentially try to address this by changing some of the ISA rules
> (to allow jumbo encodings within bundles), but this would make fetch and
> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
> prefixes, this would require supporting a 128-bit instruction fetch, ...).
<
I started with the concept of a 64-bit computer with an inherently misaligned
memory model. Loading a misaligned 64-bit item requires fetching 128 bits
from the DCache. Then once you have a 128-bit DCache, add another instance and you
have a 128-bit instruction fetch. Presto, done.....
<
See how easy that is !!
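A small editorial sketch of that idea as a C model (little-endian; it always reads the aligned pair, mirroring the 128-bit fetch):

    #include <stdint.h>

    static uint64_t load_u64_misaligned(const uint64_t *mem, uint64_t byte_addr) {
        uint64_t lo = mem[byte_addr / 8];       /* first aligned word  */
        uint64_t hi = mem[byte_addr / 8 + 1];   /* second aligned word */
        unsigned sh = (unsigned)(byte_addr % 8) * 8;  /* bit offset */
        return sh ? (lo >> sh) | (hi << (64 - sh)) : lo;
    }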
>
> Though, a possible merit would be, say, if I could allow a
> "FEii-iiii-FEii-iiii-FFw0-0iii-ZZnm-ZeiZ"
> Special case to allow gluing a 64-bit immediate onto pretty much any
> other instruction...
<
instruction<15:13,11> contains the "Routing of operands to Function Units".
This includes sign control, immediates and their position,....
>
> But, this falls more into a "possible but debatable if worth the cost"
> category...
<
You are the one who can never let a thread die.........
That is not how the decoder moves along. At the PRED instruction, the
fetch-decode pipeline knows how long the then-clause is and how long
the else-clause is. Now remember I am fetching 128-bits per cycle in
my 1-wide machine and there is a limit of 8 instructions (max) in each
clause, so by the time the PRED condition resolves, all of the instructions
in the then-clause have been fetched, and the else-clause instructions
are being fetched. At this point all one has to do is cancel stuff, but you
still do not need to disrupt the fetch-decode pipeline. At most I decode
1 instruction in the then-clause that will not survive execution. And it
does not get to execute.
>
> Granted, could maybe be more clever if an interleaved
> then/else/then/else scheme were used (say, with Fetch and Decode
> treating this as special).
<
I started with this, and abandoned it due to bit count needed in the PRED
instruction.
<
<snip>
> >> The 40x2 bundles are also non encodable in this mode.
> > <
> > Bundles are so unnecessary.....
> > <
> The logic for superscalar still isn't free...
>
> Granted, a fancier core (or an OoO implementation) might design to
> mostly ignore the existing bundling and then do its own thing. But, for
> the class of hardware I am dealing with here, it makes sense.
<
You are making the same myopic mistakes I made in 1983 as the first
RISC generation was being brought to market. I am simply reliving all
my mistakes through your actions, but this time I am trying to lead you
forward.....
<
> >> No huge loss, they were an ugly hack...
> >> Main use case for 40x2 being to work around the issues fixed by XG2.
> >>
> >> But, alas...
> > <
> > and a lack.....
<
> The 40x2 encoding did allow a few special cases that can't be encoded
> otherwise in a bundle, such as '?ST'/'?SF' predication (which would used
> SR.S as the predicate bit rather than SR.T), along with a limited form
> of "[Rm+Ri*Sc+Disp]" addressing, ...
<
[Rbase+Rindex<<scale+Displacement] is the fundamental form of Address
Generation in my µArchitecture, and every memory reference has access
to this form--not just the ones for which it is easy to find examples of in code.
<
Oh, BTW, can you predicate memory reference instructions ??
>
> Along with the encoding deficiencies in the baseline ISA, like the
> inability to use predicated instructions with R32..R63, etc.
>
I don't even have those registers.......and I still find very tiny amounts of
spill/fill code.
>
> Otherwise, these instructions would require a 64-bit scalar encoding in XG2.
<
Perhaps this is where your ISA encoding went awry.

BGB

Aug 16, 2023, 2:29:37 AM
Granted, I am not really familiar with your instruction formats, since
you tend not to describe them in any detail here...


My usual descriptions should at least be easier to figure out, since
they tend to be effectively "notation-modified hexadecimal".

>>
>> However, using the 64-bit encoding are less desirable, since these can't
>> be organized into bundles.
>>
> Easily fixed--get rid of the concept of bundles.

Usual downsides of superscalar notwithstanding. Falling back to
scalar-only operation being similarly undesirable.

Though, granted, in theory a superscalar core would not need to worry
about things which lack a dedicated bundle encoding.


>>
>> Could potentially try to address this by changing some of the ISA rules
>> (to allow jumbo encodings within bundles), but this would make fetch and
>> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
>> prefixes, this would require supporting a 128-bit instruction fetch, ...).
> <
> I started with the concept of 64-bit computer with an inherently misaligned
> memory model. Loading a misaligned 64-bit item requires fetching 128-bits
> from DCache. Then once you have 128-bit DCache, another instance and you
> have a 128-bit instruction fetch. Presto, done.....
> <
> See how easy that is !!

In my case, the "freely aligned" cases only ended up going up to 64 bits.


A freely aligned 128-bit fetch would effectively require the L1 caches
to internally work with 256 bits at a time (rather than using a 128-bit
block).

Though, the instruction alignment is a little wonky: the 16/32/64-bit
cases have 16-bit alignment, but the 96-bit case requires 32-bit
alignment.

If I were to handle it the same way as my L1 D$, then a 128-bit
instruction fetch would need 64-bit alignment. This is basically a no-go.


So, would need to make this logic wider...

Granted, caches which work with even/odd pairs of 128-bit cache lines,
are at least wide enough to deal with unaligned 128-bit load/store
without needing to redesign the bus-side interface (it would mostly be
an issue of added cost).



>>
>> Though, a possible merit would be, say, if I could allow a
>> "FEii-iiii-FEii-iiii-FFw0-0iii-ZZnm-ZeiZ"
>> Special case to allow gluing a 64-bit immediate onto pretty much any
>> other instruction...
> <
> instruction<15:13,11> contains the "Routing of operands to Function Units".
> This includes sign control, immediates and their position,....
>>
>> But, this falls more into a "possible but debatable if worth the cost"
>> category...
> <
> You are the one who can never let a thread die.........
> <

Hmm...
Hmm...

OK.
My pipeline advances 1 bundle per clock cycle (excluding stalls or
branches).

So:
OP?T
OP?T
OP?T
OP?F
OP?F
OP?F

Would still require 6+ clock-cycles.

Whereas:
OP?T | OP?F
OP?T | OP?F
OP?T | OP?F

Would need 3+ clock cycles.


>>
>> Granted, could maybe be more clever if an interleaved
>> then/else/then/else scheme were used (say, with Fetch and Decode
>> treating this as special).
> <
> I started with this, and abandoned it due to bit count needed in the PRED
> instruction.
> <
> <snip>

In my case, predication is encoded for every instruction.

However, unlike 32-bit ARM, it was done with effectively a 2-bit scheme:
0,0: ?T (E0..E3, E8..EB)
0,1: ?F (E4..E7. EC..EF)
1,0: Scalar (F0..F3, F8/F9)
1,1: WEX / Bundle (F4..F7, FC/FD)

Though, the PrWEX encoding ended up in an "odd corner" of the encoding
space, hence why it only covers a subset of the ISA.

EA/EB,EE/EF

But, this was enough for the F0 and F2 blocks, which are the main
instruction blocks in use here.

The F1 block is N/E (not encodable) here, but it contains the Load/Store
ops, which are not allowed in this context anyway. The F8 block (with
the Imm16 ops) is an annoying lack.

The F3 and F9 blocks are, for now, still unused (but the intention for
now was to try to leave them unused if possible).


>>>> The 40x2 bundles are also non encodable in this mode.
>>> <
>>> Bundles are so unnecessary.....
>>> <
>> The logic for superscalar still isn't free...
>>
>> Granted, a fancier core (or an OoO implementation) might design to
>> mostly ignore the existing bundling and then do its own thing. But, for
>> the class of hardware I am dealing with here, it makes sense.
> <
> You are making the same myopic mistakes I made in 1983 as the first
> RISC generation was being brought to market. I am simply reliving all
> my mistakes through your actions, but this time I am trying to lead you
> forward.....
> <

So, you are arguing it would be better to just "bite the bullet" early
and go over to superscalar?...


>>>> No huge loss, they were an ugly hack...
>>>> Main use case for 40x2 being to work around the issues fixed by XG2.
>>>>
>>>> But, alas...
>>> <
>>> and a lack.....
> <
>> The 40x2 encoding did allow a few special cases that can't be encoded
>> otherwise in a bundle, such as '?ST'/'?SF' predication (which would use
>> SR.S as the predicate bit rather than SR.T), along with a limited form
>> of "[Rm+Ri*Sc+Disp]" addressing, ...
> <
> [Rbase+Rindex<<scale+Displacement] is the fundamental form of Address
> Generation in my µArchitecture, and every memory reference has access
> to this form--not just the ones for which it is easy to find examples of in code.
> <

I had built it originally around an [Rm+Ri<<Sc] model, where Ri could be
either a register or displacement, and Sc was typically hard-wired based
on the size of the element type (except when the base register was PC or
GBR, where a 1-byte scale was used).

With:
[Rm+Ri<<Sc+Disp]
Being an optional extension.

Though, these ops involve using a 64-bit encoding.
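
Or, in C terms (a sketch; the function name and the u64 typedef are
mine, with Disp being 0 for the base form and Sc normally hard-wired
from the element size):

typedef unsigned long long u64;

u64 EffectiveAddress(u64 rm, u64 ri, unsigned sc, long long disp)
{
    return rm + (ri << sc) + (u64)disp;
}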



> Oh, BTW, can you predicate memory reference instructions ??

Yes.


The general predication scheme covers (nearly) the entire ISA.

Main exception being:
LDIn Imm24, R0

Which are unconditional-scalar only.
Trying to predicate these instead yields the PrWEX block encodings.
Trying to bundle these instead yields a Jumbo prefix.

However, this was OK, given the main use-case for this instruction.


PrWEX only covers F0 and F2 blocks though.

So:
MOV.Q?T (R4, 0x120), R8 //Fine
But:
MOV.Q?T (R4, 0x120), R8 | ADD 123, R9 //Non-Encodable

Though, the Disp5 encodings are in F0 block, so:
MOV.Q?T (R4, 0x80), R8 | ADD 123, R9

Can be encoded, but is not valid (since Load/Store is only allowed in
Lane 1; or as the final instruction in the bundle).


>>
>> Along with the encoding deficiencies in the baseline ISA, like the
>> inability to use predicated instructions with R32..R63, etc.
>>
> I don't even have those registers.......and I still find very tiny amounts of
> spill/fill code.

They are niche, granted.

The inner rasterizer loops in TKRA-GL are one of the few cases where
they seem justified.

Programs like Doom and similar see very little difference, but Doom is
mostly devoid of a "small mountain of local variables" coding style.



Granted, grabbing the variable declarations from a function:
void (*DrawSpan)(u64 *parm,
tkra_rastpixel *dstc, tkra_zbufpixel *dstz, int cnt);
void (*DrawSpanZ)(tkra_zbufpixel *dstz, int cnt, u64 zpos, u64 zstep);

u64 dsparm[TKRA_DS_NPARM];
tkra_rastpixel *scr_cb, *cb0;
tkra_zbufpixel *scr_zb, *zb0;
tkra_rastpixel *cb;
tkra_zbufpixel *zb;
register s64 xzpos_l, xzstep_l;
s64 xzpos_r, xzstep_r;
register u64 tpos_l, tstep_l;
u64 tpos_r, tstep_r;
register u64 cpos_l, cstep_l;
u64 cpos_r, cstep_r;
u64 tstep_d, tstep_c;
u64 cstep_d, cstep_c;
int zstep_d, zstep_c;
int clip_x0, clip_x1, clip_y0, clip_y1;
int y, ymax, x0, x1, x2, x3, z0, xcnt, xrcp;
int xshl, yshl;
int scr_xs, scr_ys, scr_zxs;

Or, from another nearby function:

u64 e1_parm[TKRA_ES_NPARM];
u64 e2_parm[TKRA_ES_NPARM];
u64 e3_parm[TKRA_ES_NPARM];
void (*RasterWalkEdges)(TKRA_Context *ctx,
int ytop, u64 *edge_l, u64 *edge_r, int cnt);
u64 *tve0, *tve1, *tve2;
int y0, y1, y2, y3, y3m, y3n;
int x0, x1, x2, x3, x3m, x3n;
int y1cnt, y2cnt, y3cnt, y1rcp, y2rcp, y3rcp;
u64 t0, t1, t2;
s64 l1, l2, l3;

int x1pos, x2pos, x3pos, x1step, x2step, x3step;
int x1step_d, x1step_c;
int x2step_d, x2step_c;
int x3step_d, x3step_c;

int z1pos, z2pos, z3pos, z1step, z2step, z3step;
u64 t1pos, t2pos, t3pos, t1step, t2step, t3step;
u64 t1step_d, t1step_c;
u64 t2step_d, t2step_c;
u64 t3step_d, t3step_c;
u64 c1pos, c2pos, c3pos, c1step, c2step, c3step;
u64 c1step_d, c1step_c, c2step_d, c2step_c, c3step_d, c3step_c;
int z1step_d, z1step_c, z2step_d, z2step_c, z3step_d, z3step_c;
int xshl, yshl;

...


This sort of stuff does seem to be cases which benefit from having 64 GPRs.



Though, I guess now, this part is made less certain with the possibility
of the rasterizer loops being moved into a Verilog module (which as-is
still needs around 20x 64-bit MMIO registers to configure the working
parameters).

Still needs more work, but seems promising (raises hope that it may
still be possible to get Quake out of "single digit framerate" territory).


Granted, all the projection and geometric subdivision still need to be
handled CPU side (with it then feeding the parameters for the edges to
be drawn into a bunch of MMIO registers, and then setting a "Have At It"
bit, which is understood as a "Busy" or "FIFO Full" flag when the
register is read).

Or, basically, if the Busy flag is set, the CPU needs to spin until this
flag clears before it can submit anything more to draw. But, otherwise,
the CPU can operate semi-asynchronously with the edge-walker unit
(processing more primitives to be drawn in parallel with the previous
primitives being drawn into the framebuffer).
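
As a rough sketch of that submission loop (the register indices, flag
bit, and names here are placeholders, not the actual interface):

/* Hypothetical sketch of the submit protocol described above. */
#define RAST_STAT_IDX  31            /* assumed status/kick register */
#define RAST_STAT_BUSY 1ULL          /* assumed Busy / FIFO-Full bit */

void SubmitEdgeRequest(volatile unsigned long long *mmio,
                       const unsigned long long *parm, int nparm)
{
    while (mmio[RAST_STAT_IDX] & RAST_STAT_BUSY)
        ;                            /* FIFO full: spin until it clears */
    for (int i = 0; i < nparm; i++)  /* ~18-20 parameter registers */
        mmio[i] = parm[i];
    mmio[RAST_STAT_IDX] = 1;         /* set the "Have At It" bit */
}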


But, a module which also performs projection would be a bit more
complicated than one which only deals with edge-walking and
span-drawing. Also not so sure how similar or different the idea is from
early 3D cards.

Ironically, the module is also cheaper than initially expected.

...


>>
>> Otherwise, these instructions would require a 64-bit scalar encoding in XG2.
> <
> Perhaps this is where your ISA encoding went awry.

Dunno...

Trying to deal with 64 GPRs did sort of make things a little messy.
The original form of the ISA, as noted, was designed around the
assumption of 32 GPRs.

So, I have:
Baseline (pure): Limited to 32 GPRs

Baseline+XGPR: 64 GPRs, but only for unconditional ops in the F0/F1/F2
blocks. Reclaimed the 7zzz and 9zzz blocks from 16-bit land to make this
possible. Previously, these had been used for a short-lived "Op24"
experiment (which was quickly revealed to be a dog turd...).

Previously, 7zzz had contained "nothing of value", and 9zzz had
originally been used for SH-4 style FPU instructions (where I had then
later redesigned the FPU, and this encoding block fell into disuse).

pec...@gmail.com

Aug 16, 2023, 1:04:50 PM
BGB wrote:
> > I started to think that RVC should be removed from specification, and its opcode space should be essentially free for any use.
> > Code compression could be optional and vendor specific, performed during installation or loading/linking.
> > Compilers are unaware of it anyway and it doesn't affect the size of zipped binaries used for distribution
> > Reserved part of 16-bit space alone could double available 32 bit opcode space.
> >
> I would almost be inclined to agree, but more because the existing RVC
> encoding scheme is *awful* (like, someone looked at Thumb and was then
> like, "Hey man, hold my beer!").
That's why I wrote "vendor specific".
Generally, a compression scheme should be extension-agnostic (=orthogonal) and concentrated on low-end applications, because it is
the only performance-boosting feature in the ISA for this segment.

Unfortunately they (risc nazi) managed to add compressed floating point instructions.
The real irony is that it is the least important area. Most of the cores have no FPU at all. Big cores perform most of the floating point operations in the SIMD units. There is not much room in the market for a middle ground.
Moreover, floating point code is quite regular and concentrated in small loop kernels - the performance impact of compression will be negligible.


> So, 16K or 32K appears to be a local optimum here.
Advanced prediction definitely lowers the pressure on i-cache even further.

MitchAlsup

Aug 16, 2023, 2:15:42 PM
On Wednesday, August 16, 2023 at 1:29:37 AM UTC-5, BGB wrote:
> On 8/15/2023 4:53 PM, MitchAlsup wrote:
> > On Tuesday, August 15, 2023 at 4:27:26 PM UTC-5, BGB-Alt wrote:
>
> > I have 5-bit immediates::
> > <
> > FADD R8,#1,R9 // R8 = 1.0D0 + R9;
> > <
> > I have two 6-bit immediates::
> > <
> > SLL R8,R9,<17:28> // R8 = ~(~0<<17) & (R9>>28)
> > <
> > I have 16-bit immediates:
> > <
> > ADD R8,R9,0x1234
> > LD R8,[R9+0x1234]
> > <
> > And all of these fit in 1 word--as anyone from the 1st generation RISC camp would
> > see (except the SPARC guys...)
<
> Granted, I am not really familiar with your instruction formats, since
> you tend not to describe them in any detail here...
>
>
> My usual descriptions should at least be easier to figure out, since
> they tend to be effectively "notation-modified hexadecimal".
<
Inst<31:26> Major OpCode
Inst<25:21> Rd or Condition
Inst<20:16> Rs1
Inst<15:11> Instruction Modifiers
Inst<10:05> Minor OpCodes
Inst<04:00> Rs2
<
When Inst<31> == 1
Inst<15:0> IMM16
<
However when Major == 3-Operand
Inst<12:10> Minor OpCode
Inst<9:5> Rs3
<
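Read as C, the field layout above comes out to something like this (a
sketch; the struct and macro names are mine):

/* Sketch of extracting the fixed fields described above. */
typedef unsigned int u32;

#define BITS(x, hi, lo) (((x) >> (lo)) & ((1u << ((hi) - (lo) + 1)) - 1))

typedef struct {
    u32 major, rd, rs1, mod, minor, rs2, imm16;
} InstFields;

InstFields DecodeFields(u32 inst)
{
    InstFields f;
    f.major = BITS(inst, 31, 26);  /* Major OpCode */
    f.rd    = BITS(inst, 25, 21);  /* Rd or Condition */
    f.rs1   = BITS(inst, 20, 16);  /* Rs1 */
    f.mod   = BITS(inst, 15, 11);  /* Instruction Modifiers */
    f.minor = BITS(inst, 10,  5);  /* Minor OpCodes */
    f.rs2   = BITS(inst,  4,  0);  /* Rs2 */
    f.imm16 = BITS(inst, 15,  0);  /* IMM16, valid when Inst<31> == 1 */
    return f;
}
<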
> >>
> >> However, the 64-bit encodings are less desirable, since these can't
> >> be organized into bundles.
> >>
> > Easily fixed--get rid of the concept of bundles.
> Usual downsides of superscalar notwithstanding. Falling back to
> scalar-only operation being similarly undesirable.
<
Everyone and his brother have done superscalar without bundles.
Conversely all static VLIW forms have failed.
<
Now, what would you do if you got an FPGA with the resources to do
a 4-wide or 5-wide but not a 6-wide machine ??
>
> Though, granted, in theory a superscalar core would not need to worry
> about things which lack a dedicated bundle encoding.
> >>
> >> Could potentially try to address this by changing some of the ISA rules
> >> (to allow jumbo encodings within bundles), but this would make fetch and
> >> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
> >> prefixes, this would require supporting a 128-bit instruction fetch, ...).
> > <
> > I started with the concept of 64-bit computer with an inherently misaligned
> > memory model. Loading a misaligned 64-bit item requires fetching 128-bits
> > from DCache. Then once you have 128-bit DCache, another instance and you
> > have a 128-bit instruction fetch. Presto, done.....
> > <
> > See how easy that is !!
> In my case, the "freely aligned" cases only ended up going up to 64 bits.
>
>
> A freely aligned 128-bit fetch would effectively require the L1 caches
> to internally work with 256 bits at a time (rather than using a 128-bit
> block).
<
But Ifetch does not access the ICache misaligned--obviating that.
>
> Though, instruction alignment that is like:
> Well, 16/32/64 bit cases have a 16-bit alignment, but 96 bit requires
> 32-bit alignment, is a little wonky.
<
Variable issue with 16-bit quanta has 2× as many multiplexers as with
32-bit quanta.
>
> If I were to handle it the same way as my L1 D$, then a 128-bit
> instruction fetch would need a 64-bit alignment. This is basically no-go.
>
Sigh.........
SuperScalar pipelines do not have to have all instructions proceeding
at the same rate. OoO pipelines guarantee that they don't.
>
> So:
> OP?T
> OP?T
> OP?T
> OP?F
> OP?F
> OP?F
>
> Would still require 6+ clock-cycles.
<
Only mindless strictly-in-order pipelines mandate this.
>
> Whereas:
> OP?T | OP?F
> OP?T | OP?F
> OP?T | OP?F
>
> Would need 3+ clock cycles.
<
And twice the execution energy.
<
> >>
> >> Granted, could maybe be more clever if an interleaved
> >> then/else/then/else scheme were used (say, with Fetch and Decode
> >> treating this as special).
> > <
> > I started with this, and abandoned it due to bit count needed in the PRED
> > instruction.
> > <
> > <snip>
> In my case, predication is encoded for every instruction.
<
Eating those precious instruction bits on the 90% of instructions that are
not predicated. You are making those who do not use the feature pay for
the feature. Unsound tradeoff when bits are precious--like in an instruction.
> So, you are arguing it would be better to just "bite the bullet" early
> and go over to superscalar?...
<
If you EVER see your architecture being implemented as a 1-wide
or 4,5-wide or 7-8-wide :: yes absolutely. That is you are locking
in the concept of the bundle that may not be relevant in other
implementations.
<
Sure:: On the other hand, GPUs place the rasterizer and interpolator as
a single pipeline in a remote function unit so they can produce 8×{X,Y,Z,W}
per cycle continuously. For similar reasons, Texture is performed remotely
in a function unit designed for high performance texturing 8×{R,G,B,A}
per cycle continuously.
>
>
> Though, I guess now, this part is made less certain with the possibility
> of the rasterizer loops being moved into a Verilog module (which as-is
> still needs around 20x 64-bit MMIO registers to configure the working
> parameters).
<
More if you want multiple quanta per cycle.
>
> Still needs more work, but seems promising (raises hope that it may
> still be possible to get Quake out of "single digit framerate" territory).
>
Function-unit-ize raster and interpolation, and FUI texture, and Quake at
30 FPS should be quite reasonable.
>
> Granted, all the projection and geometric subdivision still need to be
> handled CPU side (with it then feeding the parameters for the edges to
> be drawn into a bunch of MMIO registers, and then setting a "Have At It"
> bit, which is understood as a "Busy" or "FIFO Full" flag when the
> register is read).
>
> Or, basically, if the Busy flag is set, the CPU needs to spin until this
> flag clears before it can submit anything more to draw. But, otherwise,
> the CPU can operate semi-asynchronously with the edge-walker unit
> (processing more primitives to be drawn in parallel with the previous
> primitives being drawn into the framebuffer).
>
In-memory HW queues: GPU gets read access, CPU gets write access;
the write stalls when full--no busy bits.
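
A minimal sketch of that queue shape (ring size, field layout, and the
software stall loop are all illustrative; the point is that the CPU
stalls on a full ring rather than polling an MMIO busy bit):

#define QN     256   /* assumed ring size (power of 2) */
#define QWORDS 20    /* assumed words per request */

typedef struct {
    volatile unsigned long long cmd[QN][QWORDS];
    volatile unsigned int head;   /* advanced by the CPU (producer) */
    volatile unsigned int tail;   /* advanced by the GPU (consumer) */
} HwQueue;

void Enqueue(HwQueue *q, const unsigned long long *parm)
{
    unsigned int h = q->head;
    while (((h + 1) & (QN - 1)) == q->tail)
        ;                                 /* ring full: write stalls */
    for (int i = 0; i < QWORDS; i++)
        q->cmd[h][i] = parm[i];
    q->head = (h + 1) & (QN - 1);         /* publish the entry */
}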

pec...@gmail.com

Aug 16, 2023, 2:18:25 PM
Brett wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
> > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
> > than 1.1% more power.
> > <
> > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> This is the killer argument that would have saved me from caring about 16
> bit opcodes.
> Only toy CPU’s can care about 16 bit opcodes.
Instruction compression still matters in embedded applications.
Until the mid-1990s, instruction compression was of great practical importance even on large machines. The first two processors that were "conscious RISCs", made by people who knew what they were doing, had a compact instruction format. Then came a bunch of imitators who, for purely religious reasons, insisted on a fixed instruction size.

MitchAlsup

Aug 16, 2023, 2:23:04 PM
On Wednesday, August 16, 2023 at 12:04:50 PM UTC-5, pec...@gmail.com wrote:
> BGB wrote:
> > > I started to think that RVC should be removed from specification, and its opcode space should be essentially free for any use.
> > > Code compression could be optional and vendor specific, performed during installation or loading/linking.
> > > Compilers are unaware of it anyway and it doesn't affect the size of zipped binaries used for distribution
> > > Reserved part of 16-bit space alone could double available 32 bit opcode space.
> > >
> > I would almost be inclined to agree, but more because the existing RVC
> > encoding scheme is *awful* (like, someone looked at Thumb and was then
> > like, "Hey man, hold my beer!").
> That's why I wrote "vendor specific".
> Generally compression scheme should be extension-agnostic (=orthogonal), and concentrated on low-end applications, because it is
> the only performance boosting feature in the ISA for this segment.
>
> Unfortunately they (risc nazi) managed to add compressed floating point instructions.
<
Which violates the main tenet of RISC ISA design::
<
"Architectural Tradeoffs in the Design of MIPS-X" Paul Chow and Mark Horowitz
<
....."The goal of any instruction format should be:
..........1. Simplify Decode
..........2. Simplify Decode
..........3. Simplify Decode
Any attempts at improved code density at the expense of CPU performance should
be ridiculed at every opportunity"
<
And RISC-V chewed up 3/4 of the Major OpCode space, lost 16-bit immediates,.....
for a gain of code density............and screwed up the Decoder at the same time.

Thomas Koenig

Aug 16, 2023, 3:00:04 PM
pec...@gmail.com <pec...@gmail.com> schrieb:
> Instruction compression still matters in embedded applications.

> Until the mid-1990s, instruction compression was of great
> practical importance even on large machines. The first two
> processors that were "conscious RISCs", made by people who knew
> what they were doing, had a compact instruction format.

The first real RISC was arguably the 801, and it had both 16-bit and
32-bit instructions, where the 32-bit instructions had 16-bit constants.
(It was also a 24-bit machine, which seems strange, but probably
due to IBM internal politics).

They did not have three-register instructions, which later became
the hallmark of RISC processors.

Did they, according to your definition, know what they were doing?

And who was the other machine?

>Then came a bunch of imitators who, for purely religious reasons,
>insisted on a fixed instruction size.

Branch range is one reason why a multiple of four for instruction
size can be useful.

MitchAlsup

Aug 16, 2023, 3:29:57 PM
On Wednesday, August 16, 2023 at 1:18:25 PM UTC-5, pec...@gmail.com wrote:
> Brett wrote:
> > MitchAlsup <Mitch...@aol.com> wrote:
> > > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
> > > than 1.1% more power.
> > > <
> > > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> > This is the killer argument that would have saved me from caring about 16
> > bit opcodes.
> > Only toy CPU’s can care about 16 bit opcodes.
> Instruction compression still matters in embedded applications.
<
Very true,
Is it true that only a vanishingly small set of embedded applications needs 64-bit registers or address spaces?
<
> Until the mid-1990s, instruction compression was of great practical importance even on large machines. The first two processors that were "conscious RISCs", made by people who knew what they were doing, had a compact instruction format. Then came a bunch of imitators who, for purely religious reasons, insisted on a fixed instruction size.
<
2nd == Clipper ?? Howard Sacks ?!?
<
On the other hand, I nominate CDC 6600 as a RISC processor design. It certainly
has a compact instruction set, LD/ST (in a way) memory access model, with 2
sizes of instructions (15-bit and 30-bit), it is certainly hardwired logic, with CPU
performance as its main goal.
<
On the other hand (2): some of us who went through that era learned, and no longer support
aligned-only memory models, or fixed-length-only instructions, or branch as the only means
of controlling which instructions execute. {{That is, we learned from our mistakes--it seems
Berkeley and Stanford did not}}.

pec...@gmail.com

Aug 16, 2023, 3:40:55 PM
Thomas Koenig wrote:
> pec...@gmail.com <pec...@gmail.com> schrieb:
> > Instruction compression still matters in embedded applications.
>
> > Until the mid-1990s, instruction compression was of great
> > practical importance even on large machines. The first two
> > processors that were "conscious RISCs", made by people who knew
> > what they were doing, had a compact instruction format.
> The first real RISC was arguably the 801, and it had both 16-bit and
> 32-bit instructions, where the 32-bit instructions had 16-bit constants.
> (It was also a 24-bit machine, which seems strange, but probably
> due to IBM internal politics).
>
> They did not have three-register instructions, which later became
> the hallmark of RISC processors.
>
> Did they, according to your definition, know what they were doing?
Sure.
>
> And who was the other machine?
Berkeley RISC Blue
> >Then came a bunch of imitators who, for purely religious reasons,
> >insisted on a fixed instruction size.
> Branch range is one reason why a multiple of four for instruction
> size can be useful.
Yes, but with code compression you can regain half of the effective "span".
The more important advantage is that instructions are aligned, but it is not worth 40-50% of code expansion.

Timothy McCaffrey

Aug 16, 2023, 4:09:23 PM
On Wednesday, August 16, 2023 at 2:18:25 PM UTC-4, pec...@gmail.com wrote:
> Then came a bunch of imitators who, for purely religious reasons, insisted on a fixed instruction size.
I don't think it was purely religious. If you have a fixed size (power of 2 size) instruction it will:
1) Never straddle a cache line.
2) Never straddle a page boundary.
a) That means you never get a page fault in the middle of instruction fetch.
b) Your page fault handler never has to test to see if this instruction straddles a page boundary and make sure both pages are present.

Are these a *real* problem? Probably not, since they all had to be solved for other processors. However, it may have allowed
some logic to be omitted (my feeling was that RISC was mostly about getting the most bang for the least transistors/chip area).
(Where the design point was around 200k-400k transistors for the CPU sans cache.)

- Tim

Timothy McCaffrey

Aug 16, 2023, 4:12:51 PM
On Wednesday, August 16, 2023 at 3:29:57 PM UTC-4, MitchAlsup wrote:
> On Wednesday, August 16, 2023 at 1:18:25 PM UTC-5, pec...@gmail.com wrote:
> > Brett wrote:
> > > MitchAlsup <Mitch...@aol.com> wrote:
> > > > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
> > > > than 1.1% more power.
> > > > <
> > > > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> > > This is the killer argument that would have saved me from caring about 16
> > > bit opcodes.
> > > Only toy CPU’s can care about 16 bit opcodes.
> > Instruction compression still matters in embedded applications.
> <
> Very true,
> Is it true that: only vanishingly small embedded applications need 64-bit registers or address spaces.
> <
> > Until the mid-1990s, instruction compression was of great practical importance even on large machines. The first two processors that were "conscious RISCs", made by people who knew what they were doing, had a compact instruction format. Then came a bunch of imitators who, for purely religious reasons, insisted on a fixed instruction size.
> <
> 2nd == Clipper ?? Howard Sacks ?!?
> <
> On the other hand, I nominate CDC 6600 as a RISC processor design. It certainly
> has a compact instruction set, LD/ST (in a way) memory access model, with 2
> sizes of instructions (15-bit and 30-bit), it is certainly hardwired logic, with CPU
> performance as its main goal.

I would second your nomination. Instructions could not straddle words (60 bit), and
branches were only to a word address. That did have an impact on code density
(programs had lots of NOPs to pad out words).

Timothy McCaffrey

Aug 16, 2023, 4:19:52 PM
On Wednesday, August 16, 2023 at 2:18:25 PM UTC-4, pec...@gmail.com wrote:
> Brett wrote:
> > MitchAlsup <Mitch...@aol.com> wrote:
> > > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
> > > than 1.1% more power.
> > > <
> > > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> > This is the killer argument that would have saved me from caring about 16
> > bit opcodes.
> > Only toy CPU’s can care about 16 bit opcodes.
> Instruction compression still matters in embedded applications.

Given a variable length instruction set, it seems to me it makes sense to encode the most used
instructions into small instructions, if possible. I believe I have read that the most used
instructions are load, compare, add and branch. The rest are in the single digits percentage wise.
(I wish I had a reference, so take the above with a rock sized grain of salt). Anyway, if
you could encode those instructions into a 16 bit word, and leave the longer instructions
for all the useful but not used that much remainder, wouldn't that basically "compress"
your instruction set (even if variants of the longer instructions "overlapped" the short instructions,
it would probably still be a win).
- Tim
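
[As a back-of-the-envelope check of the reasoning above, with a made-up
mix: if 75% of instructions fit a 16-bit form and the rest take 32 bits,
the average is 0.75*16 + 0.25*32 = 20 bits, i.e. roughly a 37% size
reduction versus a fixed 32-bit encoding.]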

Stephen Fuld

Aug 16, 2023, 4:38:45 PM
On 8/16/2023 1:19 PM, Timothy McCaffrey wrote:
> On Wednesday, August 16, 2023 at 2:18:25 PM UTC-4, pec...@gmail.com wrote:
>> Brett wrote:
>>> MitchAlsup <Mitch...@aol.com> wrote:
>>>> 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
>>>> than 1.1% more power.
>>>> <
>>>> Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
>>> This is the killer argument that would have saved me from caring about 16
>>> bit opcodes.
>>> Only toy CPU’s can care about 16 bit opcodes.
>> Instruction compression still matters in embedded applications.
>
> Given a variable length instruction set, it seems to me it makes sense to encode the most used
> instructions into small instructions, if possible.


Sure.



> I believe I have read that the most used
> instructions are load, compare, add and branch. The rest are in the single digits percentage wise.
> (I wish I had a reference, so take the above with a rock sized grain of salt).

I think that is at least approximately right.


> Anyway, if
> you could encode those instructions into a 16 bit word, and leave the longer instructions
> for all the useful but not used that much remainder, wouldn't that basically "compress"
> your instruction set (even if variants of the longer instructions "overlapped" the short instructions,
> it would probably still be a win).


Yes, but . . . For loads, you would be limited to a very short
displacement, limiting their usefulness. You almost certainly wouldn't
use three register specifiers, which limits adds to A=A+B, which isn't
terrible, but an annoyance. Having a small immediate field probably
isn't much of a problem, as I think many constant adds are of a small
number. Branches are probably OK with a smaller displacement, as I
suspect a lot of branches are to quite close targets. With compare, are you
proposing using condition codes? Otherwise you have the three register
specifier problem - eccch.

I think these considerations reduce (but probably don't eliminate) the
percentage of time the 16 bit instructions would be useful.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Aug 16, 2023, 5:30:07 PM
On Wednesday, August 16, 2023 at 3:19:52 PM UTC-5, Timothy McCaffrey wrote:
> On Wednesday, August 16, 2023 at 2:18:25 PM UTC-4, pec...@gmail.com wrote:
> > Brett wrote:
> > > MitchAlsup <Mitch...@aol.com> wrote:
> > > > 32K is 25% bigger than 24K but only 1.1% faster, and likely burns more
> > > > than 1.1% more power.
> > > > <
> > > > Comparing 64K 4-way to 48K 6-way:: 64K is only 0.7% faster; with 1M L2 only 0.4% faster.
> > > This is the killer argument that would have saved me from caring about 16
> > > bit opcodes.
> > > Only toy CPU’s can care about 16 bit opcodes.
> > Instruction compression still matters in embedded applications.
<
> Given a variable length instruction set, it seems to me it makes sense to encode the most used
> instructions into small instructions, if possible. I believe I have read that the most used
> instructions are load, compare, add and branch. The rest are in the single digits percentage wise.
<
My 66000 encodes {ADD, CMP, Bcnd} into the LOOP instruction.
<
> (I wish I had a reference, so take the above with a rock sized grain of salt). Anyway, if
<
Hennessy and Patterson (any revision) has data on this.
I have a spreadsheet that cooks all H&P data across several architectures.

Thomas Koenig

Aug 16, 2023, 5:30:46 PM
pec...@gmail.com <pec...@gmail.com> schrieb:
> Thomas Koenig wrote:
>> pec...@gmail.com <pec...@gmail.com> schrieb:
>> > Instruction compression still matters in embedded applications.
>>
>> > Until the mid-1990s, instruction compression was of great
>> > practical importance even on large machines. The first two
>> > processors that were "conscious RISCs", made by people who knew
>> > what they were doing, had a compact instruction format.
>> The first real RISC was arguably the 801, and it had both 16
>> and 32-bit instructoins, and that had both 16-bit and 32-bit
>> instructions, where the 32-bit instructions had 16-bit constants.
>> (It was also a 24-bit machine, which seems strange, but probably
>> due to IBM internal politics).
>>
>> They did not have three-register instructions, which have becomme
>> the hallmark of RISC processors later.
>>
>> Did they, according to your definition, know what they were doing?
> Sure.
>>
>> And who was the other machine?
> Berkeley RISC Blue

Register windows have proven not to be a very good idea, finally.

>> >Then came a bunch of imitators who, for purely religious reasons,
>> >insisted on a fixed instruction size.
>> Branch range is one reason why a multiple of four for instruction
>> size can be useful.
> Yes, but with code compression you can regain half of the effective "span".
> The more important advantage is that instructions are aligned, but it is not worth of 40-50% of code expansion.

RISC-V certainly took that path, and they spent a large part of their
opcode space on 16-bit instructions.

And this led to follow-on problems - lack of opcode space made
the designers choose small offsets for branches, leading to further
problems.

How did RISC-II address this issue?

MitchAlsup

Aug 16, 2023, 5:51:23 PM
On Wednesday, August 16, 2023 at 4:30:46 PM UTC-5, Thomas Koenig wrote:
> pec...@gmail.com <pec...@gmail.com> schrieb:
> > Thomas Koenig wrote:
> >> pec...@gmail.com <pec...@gmail.com> schrieb:
> >> > Instruction compression still matters in embedded applications.
> >>
> >> > Until the mid-1990s, instruction compression was of great
> >> > practical importance even on large machines. The first two
> >> > processors that were "conscious RISCs", made by people who knew
> >> > what they were doing, had a compact instruction format.
> >> The first real RISC was arguably the 801, and it had both 16
> >> and 32-bit instructoins, and that had both 16-bit and 32-bit
> >> instructions, where the 32-bit instructions had 16-bit constants.
> >> (It was also a 24-bit machine, which seems strange, but probably
> >> due to IBM internal politics).
> >>
> >> They did not have three-register instructions, which have becomme
> >> the hallmark of RISC processors later.
> >>
> >> Did they, according to your definition, know what they were doing?
> > Sure.
> >>
> >> And who was the other machine?
> > Berkeley RISC Blue
> Register windows have proven not to be a very good idea, finally.
<
Only those betting against the power of the optimizing compilers choose
register windows (and some that had to bet with them--Itanic for example)
<
> >> >Then came a bunch of imitators who, for purely religious reasons,
> >> >insisted on a fixed instruction size.
> >> Branch range is one reason why a multiple of four for instruction
> >> size can be useful.
> > Yes, but with code compression you can regain half of the effective "span".
> > The more important advantage is that instructions are aligned, but it is not worth of 40-50% of code expansion.
> RISC-V certainly took that path, and they spent a large part of their
> opcode space for 16-bit
>
> And this led to follow-on problems - lack of opcode space made
> the designers chose small offsets for branches, leading to futher
> problems.
<
It also caused them to a) have to expand the instructions back to size
and then b) fuse instructions together. Literature indicates 5% by
fusing. In contrast, My 66000 ISA only needs 70% of the instruction
count of RISC-V {average, 69% Geomean, 68% Harmonic Mean} over
the 560 subroutines I have spent the time to examine in fine detail.
<
EMBench seems to have several characteristics by which the RISC-V ISA
illustrates its own disadvantages.
A) many stack frames are big enough that the 12-bit displacement
is insufficient, but that a 16-bit displacement would have been.
B) a plethora of LUI Rt,hi(variable) followed by MEM Rd,lo(variable)[Rt]
C) a plethora of AUIPC Rt,hi(variable).....

BGB

Aug 16, 2023, 6:25:06 PM
Half the encoding space is Imm16 ops?...

But, yeah, I guess this works.

I guess, specifying everything with bit-ranges is easier to read than, say:
0zzzzz-ddddd-sssss-xxxx-yyyyyy-ttttt

...


>>>>
>>>> However, the 64-bit encodings are less desirable, since these can't
>>>> be organized into bundles.
>>>>
>>> Easily fixed--get rid of the concept of bundles.
>> Usual downsides of superscalar notwithstanding. Falling back to
>> scalar-only operation being similarly undesirable.
> <
> Everyone and his brother have done superscalar without bundles.
> Conversely all static VLIW forms have failed.
> <
> Now, what would you do if you got an FPGA with the resources to do
> a 4-wide or 5-wide but not a 6-wide machine ??

Dunno...

A lot of DSP's and some microcontrollers and similar have gotten along
OK with VLIW.


Under my existing practice, it would be:
Define a new WEX profile for the new rules;
Modify compiler to allow rules for new profile;
Deal with annoyance of resulting compatibility issues.


But, current thinking was more like:
I will canonize on 3-wide;
4+ wide, by that point, can probably afford superscalar...


There is little that should prevent superscalar, since the bundling
rules still require that the instruction sequence is also "sane" if the
instructions are executed sequentially.


As for 3-wide, at present, there doesn't seem to be enough "free ILP"
floating around to justify going wider. Even 3 is pushing it, but the
main advantage that 3-wide has over 2-wide is that it makes it easier to
justify a 6-port register file (which sidesteps some limitations which
result in my case from a 4-port register file).

Well, and a 2-wide configuration with a 6R register file costs almost as
much as a 3-wide configuration.


>>
>> Though, granted, in theory a superscalar core would not need to worry
>> about things which lack a dedicated bundle encoding.
>>>>
>>>> Could potentially try to address this by changing some of the ISA rules
>>>> (to allow jumbo encodings within bundles), but this would make fetch and
>>>> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
>>>> prefixes, this would require supporting a 128-bit instruction fetch, ...).
>>> <
>>> I started with the concept of 64-bit computer with an inherently misaligned
>>> memory model. Loading a misaligned 64-bit item requires fetching 128-bits
>>> from DCache. Then once you have 128-bit DCache, another instance and you
>>> have a 128-bit instruction fetch. Presto, done.....
>>> <
>>> See how easy that is !!
>> In my case, the "freely aligned" cases only ended up going up to 64 bits.
>>
>>
>> A freely aligned 128-bit fetch would effectively require the L1 caches
>> to internally work with 256 bits at a time (rather than using a 128-bit
>> block).
> <
> But Ifetch does not access the ICache misaligned--obviating that.

It has to deal with the minimum allowed alignment, which in my case is
16 bit.

So, say, one has a 128-bit block fetched with a 64b alignment (X=16
bits), with 64-bit fetch:
XXXXXXXX
IIII----
-IIII---
--IIII--
---IIII-
Everything fits.

96-bit:
XXXXXXXX
IIIIII--
-IIIIII-
--IIIIII
---IIIIIi //oh-no

Or, 128-bit:
XXXXXXXX
IIIIIIII
-IIIIIIIi //oh-no
--IIIIIIii //oh-no
---IIIIIiii //oh-no

So, errm, block needs to be bigger...


Say, we expand the internal fetch block to 192 bits with a 64-bit alignment;
then, for a 128-bit fetch:
XXXXXXXXXXXX
IIIIIIII----
-IIIIIIII---
--IIIIIIII--
---IIIIIIII-

This would work at least...


And, still maps to the "paired cache line" scheme.

EEEEEEEEOOOOOOOOEEEEEEEEOOOOOOOO
XXXXXXXXXXXX----________________
IIIIIIII--------
-IIIIIIII-------
--IIIIIIII------
---IIIIIIII-----
----XXXXXXXXXXXX________________
----IIIIIIII----
-----IIIIIIII---
------IIIIIIII--
-------IIIIIIII-
________XXXXXXXXXXXX----________
________IIIIIIII--------
________-IIIIIIII-------
________--IIIIIIII------
________---IIIIIIII-----
________----XXXXXXXXXXXX________
________----IIIIIIII----
________-----IIIIIIII---
________------IIIIIIII--
________-------IIIIIIII-


I guess, 192 bits is cheaper than 256, and sufficient to deal with free
alignment.
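
The arithmetic can be sanity-checked with a trivial brute-force loop
(units are 16-bit parcels; a throwaway sketch):

/* Check whether a fetch of 'len' parcels (16 bits each), starting at
   any of the four parcel offsets within a 64-bit-aligned window,
   always fits in a 'blk'-parcel fetch block. */
#include <stdio.h>

static int fits_all(int blk, int len)
{
    for (int off = 0; off < 4; off++)
        if (off + len > blk)
            return 0;
    return 1;
}

int main(void)
{
    printf("128b block, 128b fetch: %d\n", fits_all(8, 8));   /* 0: no  */
    printf("192b block, 128b fetch: %d\n", fits_all(12, 8));  /* 1: yes */
    return 0;
}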

>>
>> Though, instruction alignment that is like:
>> Well, 16/32/64 bit cases have a 16-bit alignment, but 96 bit requires
>> 32-bit alignment, is a little wonky.
> <
> Variable issue with 16-bit quanta has 2× as many multiplexers as with
> 32-bit quanta.

Yeah.

In this case, it is mostly just an issue that there exists a case where
a 96-bit fetch with a 128-bit fetch block would leave the final 16 bits
"hanging off the end".

Defining "96-bit op needs 32-bit alignment" was less wonky than "96-bit
op is not valid if ((PC&0x6)==0x6)..."


Granted, one could argue that "less bad" option is to require that the
I$ be wide enough to deal with any instruction at any alignment.


>>
>> If I were to handle it the same way as my L1 D$, then a 128-bit
>> instruction fetch would need a 64-bit alignment. This is basically no-go.
>>
> Sigh.........

I guess, technically, there is at least a workaround...
My imagination for a superscalar pipeline would have basically been to
have logic to detect valid prefix/suffix pairs and a lack of register
conflict, and then behave as if the WEX flag were set (*).


Granted, I had looked into modeling these semantics for before (mostly
for RISC-V, which lacks any equivalent of a WEX bit), and the results
were seemingly a little underwhelming.


*: This isn't too far off from how BGBCC's WEXifier works, just with the
WEXifier also having the ability to shuffle instructions around to look
for possible ILP (and the ability to model the interlock costs and
similar to try to figure which instructions to swap and bundle, ...).

>>
>> So:
>> OP?T
>> OP?T
>> OP?T
>> OP?F
>> OP?F
>> OP?F
>>
>> Would still require 6+ clock-cycles.
> <
> Only mindless strictly-in-order pipelines mandate this.

Cough... Yeah...


>>
>> Whereas:
>> OP?T | OP?F
>> OP?T | OP?F
>> OP?T | OP?F
>>
>> Would need 3+ clock cycles.
> <
> And twice the execution energy.
> <
>>>>
>>>> Granted, could maybe be more clever if an interleaved
>>>> then/else/then/else scheme were used (say, with Fetch and Decode
>>>> treating this as special).
>>> <
>>> I started with this, and abandoned it due to bit count needed in the PRED
>>> instruction.
>>> <
>>> <snip>
>> In my case, predication is encoded for every instruction.
> <
> Eating those precious instruction bits on the 90% of instructions that are
> not predicated. You are making those who do not use the feature pay for
> the feature. Unsound tradeoff when bits are precious--like in an instruction.

Goes and looks at Doom stats:
Scalar: ~ 69%
Predicated: ~ 19%
WEX: ~ 12%
PrWEX: ~ 0.14%

It is a little off-balance, but not *that* far off.

>> So, you are arguing it would be better to just "bite the bullet" early
>> and go over to superscalar?...
> <
> If you EVER see your architecture being implemented as a 1-wide
> or 4,5-wide or 7-8-wide :: yes absolutely. That is you are locking
> in the concept of the bundle that may not be relevant in other
> implementations.
> <
I have done a 1-wide implementation, but it was pretty limited.

But, within the limitations of a 1-wide context, there isn't really any
strong advantage over RISC-V.

I had also done a 2-wide implementation, but there is the annoyance of
binary compatibility issues.


4+ wide, will probably go superscalar at this point, in which case the
bundling can be ignored, by design.
OK.

No fancy pipelines in this case.

There is an LDTEX instruction though, which can do a nearest-neighbor
texel fetch (one needing multiple uses of LDTEX and some interpolation
handling steps to do bilinear though).


>>
>>
>> Though, I guess now, this part is made less certain with the possibility
>> of the rasterizer loops being moved into a Verilog module (which as-is
>> still needs around 20x 64-bit MMIO registers to configure the working
>> parameters).
> <
> More if you want multiple quanta per cycle.
>>
>> Still needs more work, but seems promising (raises hope that it may
>> still be possible to get Quake out of "single digit framerate" territory).
>>
> Function-unit-ize raster and interpolation and FUI texture and Quake at
> 30 FPs should be quite reasonable.


At the moment, a mock-up in my emulator gets Quake into 8-12 fps
territory. TKRA-GL still eats a lot of CPU time, but now it is mostly
focused in the projection and geometry-handling stages (the low-level
rasterization effectively "off the table").


This creates an incentive to add support for perspective-correct
texturing, mostly as a big chunk of cost here could potentially be
avoided if I can make this work (and I can probably "YOLO it" for the
"1/(1/w)" calculation).

Where, as-is:
Project vertices to screen space;
Figure out edge lengths;
Check if sum of edge lengths exceeds limit;
Split primitive into multiple parts;
Repeat this process for each part.
Else (sufficiently small), send it off to rasterizer.

If using perspective correct texturing, possibly:
Project vertices to screen space;
Send it off to rasterizer.
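
For reference, the usual perspective-correct scheme alluded to above
interpolates u/w, v/w, and 1/w linearly in screen space and divides per
pixel (a generic textbook sketch, not the TKRA-GL code; names are mine):

typedef struct { float uw, vw, iw; } SpanAttr;  /* u/w, v/w, 1/w */

void RecoverTexCoord(const SpanAttr *a, float *u, float *v)
{
    float w = 1.0f / a->iw;   /* the "1/(1/w)" step mentioned above */
    *u = a->uw * w;
    *v = a->vw * w;
}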



As-is, the module seems to be taking around 8 clock cycles per pixel on
average (in a unit test), but this is still fast enough that it should
be able to rasterize geometry about as fast as the front-end can throw
requests at it...

I suspect it may be slower "in practice" though due mostly to memory
bandwidth issues (where the "slowness" of accessing RAM via the L2 cache
was not modeled in the unit test...).

Either way, still probably going to be faster than rasterizer loops
running on the CPU (and its "fast path" can also deal with all of the
various blending modes and similar as well, ...).


>>
>> Granted, all the projection and geometric subdivision still need to be
>> handled CPU side (with it then feeding the parameters for the edges to
>> be drawn into a bunch of MMIO registers, and then setting a "Have At It"
>> bit, which is understood as a "Busy" or "FIFO Full" flag when the
>> register is read).
>>
>> Or, basically, if the Busy flag is set, the CPU needs to spin until this
>> flag clears before it can submit anything more to draw. But, otherwise,
>> the CPU can operate semi-asynchronously with the edge-walker unit
>> (processing more primitives to be drawn in parallel with the previous
>> primitives being drawn into the framebuffer).
>>
> In-memory HW queues, GPU gets read access, CPU gets write access
> write stalls when full--no busy bits.

OK.


I had used a bunch of MMIO registers, which are then cycled through
16-entry circular-arrays (LUT based). As-is, the request is expressed
with ~ 18x 64-bit registers (which express things like the general
parameters, and where the framebuffer and texture and similar are at,
the clip values for the viewport/scissor-test, ...).

Memory based arrays are possible but would be more complicated.

For now, there are two busy flags, one that indicates when the module is
working, and another that specifies when the FIFO is full (and thus, the
CPU needs to wait before it can submit any more requests).


MitchAlsup

Aug 16, 2023, 7:41:03 PM
Just under, yes.
>
> But, yeah, I guess this works.
>
> I guess, specifying everything with bit-ranges is easier to read than, say:
> 0zzzzz-ddddd-sssss-xxxx-yyyyyy-ttttt
<
There are 7 formats:: and in your form::
<
The Major OpCode escapes
>
000110-bcond-sssss-PRED-----imm4----imm4 // predication cast
000111-ddddd-sssss-SHF-s-wwwwww-llllll // shift with immediates
001001-ddddd-sssss-MODIF-MEMORY-sssss // 2-register memory references
001010-ddddd-sssss-MODIF-2OPRND-sssss // 2-register calculations
001100-ddddd-sssss-MOD-3OP-sssss-sssss // 3-register calculations
001101-ddddd-sssss-MODIF-1OPERND-sssss // 1-register calculations
<
Then the Major OpCodes
<
01100x-bonbit-sssss-displacement16
011010-bcond-sssss-displacement16
011011-hhhhh-sssss-displacement16 // Table Transfer (switch)
011110-displacement26 // branch
011111-displacement26 // CALL
100xxx-ddddd-sssss-immediate16 // LDs
101xxx-ddddd-sssss-immediate16 // STs
110xxx-ddddd-sssss-immediate16 // integer
111xxx-ddddd-sssss-immediate16 // logical
There is a time and place for fully-resourcing a machine--the register file
is not one of them.
<
A 3-wide machine which can provide 2-2-2 and 3-2-1 register operands
to any function units is far higher performing than a 2-wide machine with 3-3.
{Counting long constants} as much as 45% of instructions use a constant
and thereby don't need the second register port, while only 10%-ish need
all 3.
<
Also note: Storing of a constant becomes independent of any register dependency.
<
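[Taking the percentages above at face value: the average register-read
ports needed per instruction come to 0.45*1 + 0.45*2 + 0.10*3 = 1.65, so
three co-issued instructions need about 4.95 reads on average,
comfortably under the 6 ports of a 2-2-2/3-2-1 file.]
<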
> >>
> >> Though, granted, in theory a superscalar core would not need to worry
> >> about things which lack a dedicated bundle encoding.
> >>>>
> >>>> Could potentially try to address this by changing some of the ISA rules
> >>>> (to allow jumbo encodings within bundles), but this would make fetch and
> >>>> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
> >>>> prefixes, this would require supporting a 128-bit instruction fetch, ...).
> >>> <
> >>> I started with the concept of 64-bit computer with an inherently misaligned
> >>> memory model. Loading a misaligned 64-bit item requires fetching 128-bits
> >>> from DCache. Then once you have 128-bit DCache, another instance and you
> >>> have a 128-bit instruction fetch. Presto, done.....
> >>> <
> >>> See how easy that is !!
> >> In my case, the "freely aligned" cases only ended up going up to 64 bits.
> >>
> >>
> >> A freely aligned 128-bit fetch would effectively require the L1 caches
> >> to internally work with 256 bits at a time (rather than using a 128-bit
> >> block).
> > <
> > But Ifetch does not access the ICache misaligned--obviating that.
> It has to deal with the minimum allowed alignment, which in my case is
> 16 bit.
<
Simpler than that::
<
128-bits = ICache[ IP & ~15];
<
Instruction = 128-bits[ IP & 15 ];
>
> So, say, one has a 128-bit block fetched with a 64b alignment (X=16
> bits), with 64-bit fetch:
> XXXXXXXX
> IIII----
> -IIII---
> --IIII--
> ---IIII-
> Everything fits.
>
> 96-bit:
> XXXXXXXX
> IIIIII--
> -IIIIII-
> --IIIIII
> ---IIIIIi //oh-no
>
> Or, 128-bit:
> XXXXXXXX
> IIIIIIII
> -IIIIIIIi //oh-no
> --IIIIIIii //oh-no
> ---IIIIIiii //oh-no
>
> So, errm, block needs to be bigger...
>
No, FETCH needs to be more aggressive::
<
Fetch on cycle[1] :: IP + 00 ::: 0123456789abcdef
Fetch on cycle[2] :: IP + 16 ::: fedcba9876543210
<
inst-128 can be observed in cycle[2] but only after tag==TLB comparisons (late)
So, you have time to PARSE the instructions and determine boundaries but not
decode them or access register file ports.
<
So, DECODE is in cycle[3] and multiple decoders are handed 1 instruction-specifier
each {with or without appended constants}. By the end of cycle[3] the overhanging
constants on the subsequent FETCH will have arrived and can be routed directly into
execution (while arriving instructions-themselves-cannot.) So, yes, there is the
possibility of stutter-stepping due to instruction alignment, but because you are
fetching 4× your execution width, the front end plows far enough ahead that this
is seldom a problem. {Far less of a problem than branch target latency}
For your ISA and your execution width (only)
> >>
> >> Though, instruction alignment that is like:
> >> Well, 16/32/64 bit cases have a 16-bit alignment, but 96 bit requires
> >> 32-bit alignment, is a little wonky.
> > <
> > Variable issue with 16-bit quanta has 2× as many multiplexers as with
> > 32-bit quanta.
> Yeah.
>
> In this case, it is mostly just an issue that there exists a case where
> a 96-bit fetch with a 128-bit fetch block would leave the final 16 bits
> "hanging off the end".
<
But by the time you have figured that out, all you have to look at is
hit[+1] to know that the rest of the instruction has arrived.
>
> Defining "96-bit op needs 32-bit alignment" was less wonky than "96-bit
> op is not valid if ((PC&0x6)==0x6)..."
>
>
> Granted, one could argue that "less bad" option is to require that the
> I$ be wide enough to deal with any instruction at any alignment.
<
My requirement was dramatically simpler: I don't want to expend the
engineering resources to build an ICache when I can instantiate a
DCache and let the great Verilog gate-eater get rid of the unneeded
stuff. ICache is a degenerate subset of DCache functionality. Build
once, verify once, instantiate as many times as desired.
Add function unit conflict detect and you have the sequencer from Mc 88110.
>
<snip>
> >> So, you are arguing it would be better to just "bite the bullet" early
> >> and go over to superscalar?...
> > <
> > If you EVER see your architecture being implemented as a 1-wide
> > or 4,5-wide or 7-8-wide :: yes absolutely. That is you are locking
> > in the concept of the bundle that may not be relevant in other
> > implementations.
> > <
> I have done a 1-wide implementation, but it was pretty limited.
>
> But, within the limitations of a 1-wide context, there isn't really any
> strong advantage over RISC-V.
>
> I had also done a 2-wide implementation, but there is the annoyance of
> binary compatibility issues.
>
See, my ISA goes from the small through the large (mostly) effortlessly.
{Something I paid dearly for in the middle of my career having to engineer
my way out of the holes I had dug myself into earlier.}
<
Mental test cases show I can scale typical ICache-based Fetch-Insert
pipelines from 1 through 16 instructions per clock rather easily.
Exactly how wide a given technology can fit (or even be optimal for)
is a significantly harder question.

pec...@gmail.com

Aug 16, 2023, 8:25:06 PM
It is 7% for RISC-V common idioms. And for this purpose compressed instructions are very useful. In the future
it can be even more.

> In contrast, My 66000 ISA only needs 70% of the instruction
> count of RISC-V {average, 69% Geomean, 68% Harmonic Mean} over
> the 560 subroutines I have spent the time to examine in fine detail.

I think there is no point in comparing RISC-V to My66000; it is in a different league.
Let's compare it to ARMv8: I don't think all these shortcomings translate into a performance loss of more than 10% in any significant metric, which means they won't prevent the success of this architecture. For me, the bigger problem is the arrogance of the organization and its bureaucratic inefficiency.

> EMBench seems to have several characteristics that RISC-V ISA
> illustrates its own disadvantages.
> A) many stack frames are big enough that the 12-bit displacement
> is insufficient, but that a 16-bit displacement would have been.
> B) a plethora of LUI Rt,hi(variable) followed by MEM Rd,lo(variable)[Rt]
> C) a plethora of AUPIC Rt,hi(variable).....
A lot of idioms to fuse...

MitchAlsup

Aug 16, 2023, 9:25:13 PM
You cannot overcome incompetence with arrogance.
and
Leading with arrogance often implies a base of incompetence.

BGB

Aug 17, 2023, 1:39:27 AM
OK, doesn't seem like a whole lot of bits left for opcode in some cases
though...

Probably needs more looking at it.



As noted, I had generally smaller immediate and displacement fields to
leave more room for opcode bits and similar.


Only about 12.5% of the 32-bit instruction space had gone to Imm16 ops,
which were, as noted:
MOV Imm16u, Rn
MOV Imm16n, Rn
ADD Imm16s, Rn
LDSH Imm16u, Rn
FLDCH Imm16u, Rn

With a few more spots left over (the block was basically large enough to
encode 8 instruction spots). With XG2, this block implicitly expands to
32 spots (with the Ws/Wt bits being reserved as opcode).


So, within the F0 block, there were effectively 9 bits for opcode (for
the entire space). Though, part of this was carved off for branch ops,
and there are also 2R and 1R spaces carved off of this space (space
worth roughly 24 3R ops was carved off for the 2R spaces; which
currently allows for around 384 2R instruction spots).

The F1 (Load/Store) block had enough space for 32 unique spots.
Half this space was used for normal Load/Store ops (and LEA);
1/4 was used for more specialized Load/Store ops;
1/4 was used for Compare+Branch ops (like in RISC-V).

The F2 block was divided roughly in half:
Low part was used for 3RI Imm9 ops (~ 18 spots);
High part was used for "Imm10, Rn" ops (~ 224 spots).

...

Checking, currently there are roughly 391 unique mnemonics (though,
looks like around 30-40% of these are various SIMD ops and similar).



The remaining F3 and F9 blocks are both 24 bits, which if used in the
same way as the F0 block, could potentially each allow an additional
512 3R instruction spots (or 1024 spots in total). Though, my current
plan is to try to leave F3 unused (mostly to leave it for
implementation-defined instructions).


There is a bit more space left if one counts 64-bit encodings (but, thus
far, the definitions of instructions specific to 64-bit encodings has
ended up a little ad-hoc).

...
My layouts are, as-is:
1-wide: 3R1W and 6R2W (128-bit operand)
2-wide: 2x 3R1W
3-wide: 3x 2R1W

But, yeah, 3-wide with a 6R3W register file looked like the local optimum.

Given:
3R1W, generally needed for 1-wide;
4R2W (2-wide), kinda sucked due to limitations.
Though, slightly more capable than 1-wide;
Can be made to support SIMD and the MOV.X instructions.
6R2W (2-wide), almost as expensive as 3-wide;
6R3W (3-wide), only slightly more cost, but more capable.

Even if infrequent, in cases where they come up, 3-wide execution is
nice to have.

One "choke point" for ILP is only having a single memory port, which for
a lot of code, it seems like the only way to get more ILP would be to be
able to support 2 memory access per clock-cycle.

But, pulling this off effectively in the L1 D$ is a bit more of an issue
(I had looked before at a second read-only memory port, but the gains
weren't quite enough to justify the "fairly steep" cost increase).


>>>>
>>>> Though, granted, in theory a superscalar core would not need to worry
>>>> about things which lack a dedicated bundle encoding.
>>>>>>
>>>>>> Could potentially try to address this by changing some of the ISA rules
>>>>>> (to allow jumbo encodings within bundles), but this would make fetch and
>>>>>> decode more expensive (or, if I allowed a 2-wide case with 2 jumbo
>>>>>> prefixes, this would require supporting a 128-bit instruction fetch, ...).
>>>>> <
>>>>> I started with the concept of 64-bit computer with an inherently misaligned
>>>>> memory model. Loading a misaligned 64-bit item requires fetching 128-bits
>>>>> from DCache. Then once you have 128-bit DCache, another instance and you
>>>>> have a 128-bit instruction fetch. Presto, done.....
>>>>> <
>>>>> See how easy that is !!
>>>> In my case, the "freely aligned" cases only ended up going up to 64 bits.
>>>>
>>>>
>>>> A freely aligned 128-bit fetch would effectively require the L1 caches
>>>> to internally work with 256 bits at a time (rather than using a 128-bit
>>>> block).
>>> <
>>> But Ifetch does not access the ICache misaligned--obviating that.
>> It has to deal with the minimum allowed alignment, which in my case is
>> 16 bit.
> <
> Simpler than that::
> <
> 128-bits = ICache[ IP & ~15];
> <
> Instruction = 128-bits[ IP & 15 ];
?...
The above pattern would hold for 128-bit fetch, which is wider than my
current limit of 96 bits.


>>>>
>>>> Though, instruction alignment that is like:
>>>> Well, 16/32/64 bit cases have a 16-bit alignment, but 96 bit requires
>>>> 32-bit alignment, is a little wonky.
>>> <
>>> Variable issue with 16-bit quanta has 2× as many multiplexers as with
>>> 32-bit quanta.
>> Yeah.
>>
>> In this case, it is mostly just an issue that there exists a case where
>> a 96-bit fetch with a 128-bit fetch block would leave the final 16 bits
>> "hanging off the end".
> <
> But by the time you have figured that out, all you have to look at is
> hit[+1] to know that the rest of the instruction has arrived.

There wasn't really any workaround for this case, short of either fixing
it by making the cache internals wider (at a slight cost increase), or
by hacking around it in the compiler.

A stricter interpretation of the rule would effectively require any
context where a 96-bit encoding "could" occur to either force a 32-bit
alignment, or to not encode 96-bit ops when not 32-bit aligned.

Though, a "cheese it" solution is just sort of to detect the "(PC&6)==6"
case when emitting a 96-bit op and insert a NOP (noting that a
relatively small number of Op96 cases would fall at an offending alignment).
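
As a sketch, the emitter-side workaround amounts to something like this
(the helper name and the NOP encoding here are placeholders, not BGBCC's
actual ones):

#include <stdint.h>

#define NOP16 0x3000u                      /* placeholder 16-bit NOP encoding */
extern void emit_u16(uint16_t halfword);   /* assumed backend primitive */

/* Pad so a 96-bit op never starts at the offending "(PC&6)==6" slot. */
static uint64_t align_for_op96(uint64_t pc)
{
    if ((pc & 6) == 6) {
        emit_u16(NOP16);
        pc += 2;
    }
    return pc;   /* caller emits the 96-bit op at the returned PC */
}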


But, if I were to support 128-bit fetch, I would basically be required
to "fix" this case, as suddenly it goes from "(PC&6)==6" to "(PC&6)!=0",
and one can't cheese over this with a NOP.


>>
>> Defining "96-bit op needs 32-bit alignment" was less wonky than "96-bit
>> op is not valid if ((PC&0x6)==0x6)..."
>>
>>
>> Granted, one could argue that "less bad" option is to require that the
>> I$ be wide enough to deal with any instruction at any alignment.
> <
> My requirement was dramatically simpler: I don't want to expend the
> engineering resources to build an ICache when I can instantiate a
> DCache and let the great Verilog gate-eater get rid of the unneeded
> stuff. ICache is a degenerate subset of DCache functionality. Build
> once, verify once, instantiate as many times as desired.

I have two different caches.

The D$ has the stuff to support stores, and sign/zero extending various
sizes (though, I had debated if the sign/zero extension part would have
been better handled by dedicated logic in the EX3 stage).

The I$ has logic to figure out, given the operating mode and similar,
how big the instruction/bundle would be (so that it can figure out how
far to advance PC for the next clock cycle).
I don't really need this, per se, as whether or not a unit is available
depends on the lane, which is an implicit part of the prefix/suffix
determination.

The one likely exception here is that FPU and 64-bit FP-SIMD ops may be
initiated from either Lane 1 or Lane 2, but only from Lane 2 if there is
not already an FPU op in Lane 1.

So:
ALU | FPU //OK
FPU | MEM //OK
FPU | FPU //BAD
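
Stated as a predicate, the rule is just (my own phrasing of it, not
actual decoder logic):

enum unit { U_ALU, U_MEM, U_FPU };

/* FPU ops may initiate from Lane 1 or Lane 2, but not both at once. */
static int bundle2_ok(enum unit lane1, enum unit lane2)
{
    return !(lane1 == U_FPU && lane2 == U_FPU);
}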


>>
> <snip>
>>>> So, you are arguing it would be better to just "bite the bullet" early
>>>> and go over to superscalar?...
>>> <
>>> If you EVER see your architecture being implemented as a 1-wide
>>> or 4,5-wide or 7-8-wide :: yes absolutely. That is you are locking
>>> in the concept of the bundle that may not be relevant in other
>>> implementations.
>>> <
>> I have done a 1-wide implementation, but it was pretty limited.
>>
>> But, within the limitations of a 1-wide context, there isn't really any
>> strong advantage over RISC-V.
>>
>> I had also done a 2-wide implementation, but there is the annoyance of
>> binary compatibility issues.
>>
> See, my ISA goes from the small through the large (mostly) effortlessly.
> {Something I paid dearly for in the middle of my career having to engineer
> my way out of the holes I had dug myself into earlier.}
> <
> Mental test cases show I can scale typical ICache-based Fetch-Insert
> pipelines from 1 through 16 instructions per clock rather easily.
> Exactly how wide a given technology can fit (or even be optimal for)
> is a significantly harder question.


I would still have other potential issues with things like optional ISA
features and similar.

But, yeah, I will concede at least that superscalar does avoid the
annoyance of needing to specialize code for a given pipeline width and
having potential compatibility issues in cases when trying to run it on
a core with a different pipeline width (or diminished performance, if it
was built for a more conservative profile).


Also, I will note that while "sorta possible", the XC7S25 is effectively
too small for a "practically usable" implementation of BJX2, and even on
the XC7S50 it is a bit of a strain (and compromises need to be made).

The usable range is smaller if one limits things to "FPGAs big enough
that the configurations can be binary compatible with each other", where
my current "mainline" configurations basically set something like an
XC7A100T as the bar.


Granted, this is assuming "feature creep" doesn't eat too much LUT
budget (but, at least for the CPU core, this is counter-balanced by LUT
cost being negatively correlated with its ability to pass timing).


Though, at least, I can still burn LUTs on things like an external
rasterizer module, which is not directly tied to the CPU core (and thus
has less effect on timing within the CPU core).

Instead, it is mostly glued on to my display module (it repurposing the
memory interface from the VRAM MMIO interface as a general RAM
interface). This does implicitly assume that one doesn't try to use the
hardware rasterizer and draw something into VRAM via the MMIO interface
at the same time.


...


Anton Ertl
Aug 17, 2023, 4:48:39 AM
MitchAlsup <Mitch...@aol.com> writes:
>Only those betting against the power of the optimizing compilers choose
>register windows (and some that had to bet with them--Itanic for example)

One could say that for Berkeley RISC vs. Stanford MIPS (and the
commercial architectures that were derived from them, SPARC/29K and
MIPS).

But as you note, the IA-64 architects very much bet on optimizing
compilers, so why did they choose the register stack, and why didn't
the ARM A64 and RISC-V architects?

While in SPEC89/92/95 benchmarketing with statically-linked
link-time-optimized and profile-feedback-optimized binaries the
overhead of saving and restoring the registers at call boundaries is
small, in production settings on an in-order machine it is
significantly larger, because of more object-oriented code than in
SPEC89-95, dynamic linking (static linking costs development time and
space as well as end-user space, especially in the 1990s), no LTO
(costs too much link time), and no profile-feedback (costs too much
developer and compile time). And while the IA-64 architects thought
they could win with their architectural ideas and optimizing compilers
within functions, they obviously thought that it would be a good idea
to provide architectural support for fast calls rather than relying on
the power of benchmarketing compiler settings. They also had numbers
for the amount of loads and stores saved by the register stack.

When A64 and RISC-V were designed, the high-performance cores all used
out-of-order execution. OoO (especially sophisticated OoO with
hardware alias prediction) reduces the cost of saving and restoring
registers compared to in-order CPUs. The saves depend on the stack
pointer and the saved data, the loads just depend on the stack
pointer. The stack pointer is not updated that often on these
architectures, so it is likely available relatively early, so the
loads can be performed relatively early, and the results are therefore
also available early, meaning that the restoring had little influence
on execution time (and the saving had little influence anyway).

If a save has to wait long for its data, and the restore of that data
is relatively shortly after (the only case where the load latency
could be a problem), the advanced store-to-load-forwarding mechanisms
of modern CPUs will rename the saved value directly into the target
register of the load, without incurring the save-load roundtrip
latency. In essence, these combinations of microarchitectural
mechanisms achieve what register windows was designed to achieve, only
coming from the other side.

On A64, the store-pair and load-pair instructions also reduce the
load-store units' resource usage from the saving and restoring by
up to 50%.

Overall, I think that register windows/stack is a valid design choice
for an architecture designed for in-order execution, but OoO has won,
and there you don't need this feature.

>It also caused them to a) have to expand the instructions back to size
>and then b) fuse instructions together. Literature indicates 5% by
>fusing. In contrast, My 66000 ISA only needs 70% of the instruction
>count of RISC-V {average, 69% Geomean, 68% Harmonic Mean} over
>the 560 subroutines I have spent the time to examine in fine detail.

It's easy to win on a metric the RISC-V architects are not optimizing
for.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Anton Ertl
Aug 17, 2023, 11:39:46 AM
Thomas Koenig <tko...@netcologne.de> writes:
[IBM 801]
>(It was also a 24-bit machine, which seems strange, but probably
>due to IBM internal politics).

According to
<http://www.bitsavers.org/pdf/ibm/system801/The_801_Minicomputer_an_Overview_Sep76.pdf>,
Page 9:

|[...] why we did not go to 32 bit registers. Primarily the reason is
|that a technical case is hard to make for the additional cost.
[...]
|The CPU cost will grow from 7,600 gates to about 10,000 gates.

John Levine
Aug 17, 2023, 2:07:40 PM
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>Thomas Koenig <tko...@netcologne.de> writes:
>[IBM 801]
>>(It was also a 24-bit machine, which seems strange, but probably
>>due to IBM internal politics).
>
>According to
><http://www.bitsavers.org/pdf/ibm/system801/The_801_Minicomputer_an_Overview_Sep76.pdf>,
>Page 9:
>
>|[...] why we did not go to 32 bit registers. Primarily the reason is
>|that a technical case is hard to make for the additional cost.
>[...]
>|The CPU cost will grow from 7,600 gates to about 10,000 gates.

The 801 project, probably more than any CPU design before or after,
didn't do anything in hardware if they could do it in the compiler,
and their PL.8 compiler was very good. That's why they didn't have
register windows or even load/store multiple. It was the first
compiler to do graph coloring and managed loads and stores well enough
that fancy register instructions weren't needed. Its descendants
compromised with the reality that not all compilers were as good as
PL.8, so the ROMP was 32 bits and had LM/STM.

The Berkeley RISC people were using the old PCC compiler which did
Sethi-Ullman numbering to get expressions to fit into the registers
available and not much else, and let you use register declarations
to tell it to put variables in the registers. No wonder they invented
register windows.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Stephen Fuld
Aug 17, 2023, 2:17:19 PM
On 8/17/2023 8:31 AM, Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
> [IBM 801]
>> (It was also a 24-bit machine, which seems strange, but probably
>> due to IBM internal politics).
>
> According to
> <http://www.bitsavers.org/pdf/ibm/system801/The_801_Minicomputer_an_Overview_Sep76.pdf>,
> Page 9:
>
> |[...] why we did not go to 32 bit registers. Primarily the reason is
> |that a technical case is hard to make for the additional cost.
> [...]
> |The CPU cost will grow from 7,600 gates to about 10,000 gates.
>

Also, remember that the original 801 was designed for an embedded
application, specifically a telephone switch, so there was probably no
need for 32 bit registers, etc.

https://en.wikipedia.org/wiki/IBM_801#Original_concept

BGB
Aug 18, 2023, 2:10:37 AM
Yeah.

Realistically, a few major things make sense as 16-bit ops:
MOV Reg, Reg
ADD Reg, Reg
MOV Imm, Reg
ADD Imm, Reg
A selection of basic Load/Store ops;
A few short branch encodings;
...

It makes sense to give the instructions which appear in the highest
densities the shorter encodings, and one can gloss over everything else.


Also preferably without the encoding scheme being a dog-chewed mess.
Granted, my own ISA is not entirely free of dog-chew, but both it and
RISC-V sort of have this in common.

Mine has some encoding wonk from its origins as an ISA originally with
16-bit instructions (which, ironically, has been gradually migrating
away from its 16-bit origins).



Having recently seen some of Mitch's encoding, I can admit that it is at
least "not dog chewed".

Though, it does seem to lean a little further in the direction of
immediate bits at the expense of opcode bits.


But, OTOH, there are tradeoffs here.



And, admittedly, on the other side, not as many people are likely to
agree with my sentiment that 9 bits for immediate and displacement
fields is "mostly sufficient".

Well, and my instruction listing has also gotten bigger than I would
prefer, ...

Where, as can be noted, if expressed in bits (this for the XG2 variant):
NMOp ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
NMYp ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
NMIp ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9 / 10s)
NMIp ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10 / 11s)
NYYp 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)
YYYp 1p1Z iiii iiii iiii iiii iiii iiii //Imm24/Jumbo/PrWEX

Where, Z is the bits effectively used as part of the opcode.
n/m/o: Register, i=immediate, p=predicate.
M/N/O: Register (high inverted bit)
Y: Reserved for Opcode (future, must be 1 for now).

Or, for Baseline:
111p ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
111p ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
111p ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9)
111p ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10)
111p 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)

Where, as noted, the baseline encoding has 5-bit register fields.


There are limits though to what is possible within a 32 bits layout.

And, I had made what tradeoffs I had made...


>
>> So, 16K or 32K appears to be a local optimum here.
> Advanced prediction definitely lowers the pressure on i-cache even further.
>

Yeah.


Predication can help to reduce the overall "branchiness" of the code:
Average trace-length gets longer;
The number of branch ops goes down;
One can save a lot of cycles with short if-expressions;
...

Some tasks that are painfully slow on more conventional processors can
see a nice speed boost:
Range-clamping expressions;
The PNG Paeth filter;
Things like range coders;
...
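
For concreteness, the Paeth case (this is the standard predictor from
the PNG specification): all three compares depend directly on pixel
data, so a branch predictor sees near-random outcomes, while predication
or conditional-select handles them in a fixed handful of cycles.

#include <stdlib.h>

/* PNG Paeth predictor: pick whichever of a (left), b (above),
   c (upper-left) is closest to p = a + b - c. */
static int paeth(int a, int b, int c)
{
    int p  = a + b - c;
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}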

Granted, a compiler can't always know which is better, since knowledge
about whether or not a given branch is predictable is not known at
compile time.

...



Otherwise, I have been distracted recently.
There was my "new C compiler" sub-effort, and then going and
implementing a rasterizer module.

Also recently, working in the shop (my "day job"), where I have recently
gotten (and am setting up) a newer/fancier CNC milling machine.

A few pictures here:
https://twitter.com/cr88192/status/1692344456337375421

Admittedly, wasn't a particularly cheap machine though.
When fully assembled, it will be one of the ones with an enclosed
cabinet and a flood-cooling system (and will have an automatic tool
changer, but this part is still in the boxes). At present, still
unboxing and assembling stuff.


Lots of random stuff going on at the moment...


Anton Ertl
Aug 18, 2023, 4:17:10 AM
John Levine <jo...@taugh.com> writes:
>According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>Thomas Koenig <tko...@netcologne.de> writes:
>>[IBM 801]
>>>(It was also a 24-bit machine, which seems strange, but probably
>>>due to IBM internal politics).
>>
>>According to
>><http://www.bitsavers.org/pdf/ibm/system801/The_801_Minicomputer_an_Overview_Sep76.pdf>,
>>Page 9:
>>
>>|[...] why we did not go to 32 bit registers. Primarily the reason is
>>|that a technical case is hard to make for the additional cost.
>>[...]
>>|The CPU cost will grow from 7,600 gates to about 10,000 gates.
>
>The 801 project, probably more than any CPU design before or after,
>didn't do anything in hardware if they could do it in the compiler,
>and their PL.8 compiler was very good. That's why they didn't have
>register windows or even load/store multiple. It was the first
>compiler to do graph coloring and managed loads and stores well enough
>that fancy register instructions weren't needed. Its descendants
>compromised with the reality that not all compilers were as good as
>PL.8, so the ROMP was 32 bits and had LM/STM.

I don't think that the 24-bit architecture has anything to do with the
quality of the compiler. The 801 was a research project, and building
it as a 24-bit architecture was good enough as a proof-of-concept (and
for a number of applications).

The 32-bitness of the ROMP was probably needed for the step from
research project to product. The load/store-multiple may also have to
do with that: Another poster here explained how load/store-multiple on
ARM increased the block-copy performance by close to a factor of 4,
and this may be relevant for a product like ROMP that does not have an
I-cache (unlike the MIPS R2000, which typically had one).

Interestingly, Power kept load/store-multiple despite having an
instruction cache already in its first incarnation. Maybe the large
number of callee-saved registers in the Power(PC) calling convention
has to do with the availability of load/store-multiple.

It's interesting that Power and MIPS, both with lots of compiler
expertise, ended up on opposite extremes of the number of callee-saved
registers in their calling conventions.

>The Berkeley RISC people were using the old PCC compiler which did
>Sethi-Ullman numbering to get expressions to fit into the registers
>available and not much else, and let you use register declarations
>to tell it to put variables in the registers. No wonder they invented
>register windows.

They certainly must have used registers also for local variables in
Pascal (and auto variables in C). If they had kept auto variables in
memory like the original PCC, there would be little point to register
windows.

Scott Lurndal
Aug 18, 2023, 12:05:20 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, August 14, 2023 at 5:45:10 AM UTC-5, pec...@gmail.com wrote:

>> Reserved part of 16-bit space alone could double available 32 bit opcode space.
><
>RISC-V allocates 3/4 of the OpCode encoding to 16-bit stuff and gains all the complexity of variable length instructions but gains little of the benefits.

ARM has the Thumb32 instruction set, which I just finished a simulator for,
which reserves three of the 16-bit encodings to indicate 32-bit instructions.

It also includes the rather unusual T16 IT instruction (If-Then) which, as a form
of predication, can cover up to four subsequent T16 instructions.

It's worth noting that the IT instruction was deprecated in the thumb
support for AArch32 in ARMv8+.
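
A toy software model of the idea (a simplification of mine; the
architectural ITSTATE encoding packs the base condition and mask
together differently):

#include <stdbool.h>

/* One IT block: a base condition evaluated at the IT instruction,
   plus up to four then(1)/else(0) slots, consumed one per T16 op. */
typedef struct { bool cond; unsigned mask; int remaining; } it_state;

static bool it_should_exec(it_state *it)
{
    if (it->remaining == 0)
        return true;                      /* not inside an IT block */
    bool then_slot = (it->mask >> --it->remaining) & 1;
    return then_slot ? it->cond : !it->cond;
}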

Scott Lurndal
Aug 18, 2023, 12:31:09 PM
I had to fix a bug in the Moto 88100 version of PCC related to the
temporary register allocation back in 1990 - I don't recall if it
was a defect in the implementation of the Sethi-Ullman algorithms,
or if it just plain wasn't there in the 88100 version. My recollection
is that it was the latter case.

PCC in this case was fed the output from cfront which was prone to
generating long statements with multiple comma operators resulting
in large expression trees and failure allocating temp registers.

BGB
Aug 18, 2023, 1:25:33 PM
Yeah, you aren't getting much over a 3 or 4 bit displacement.

In my case, I had:
MOV.x (Rm), Rn //Load, no offset
MOV.x Rn, (Rm) //Store, no offset
MOV.x (Rm, R0), Rn //Load, R0 is offset
MOV.x Rn, (Rm, R0) //Store, R0 is offset

MOV.{L/Q} (SP, Disp4), Rn //SP+Disp4*(4|8)
MOV.{L/Q} Rn, (SP, Disp4) //SP+Disp4*(4|8)

MOV.L (Rm, Disp3), Rn //Rm+Disp3*4
MOV.L Rn, (Rm, Disp3) //Rm+Disp3*4


In my case, the ISA uses an "SR.T" flag bit as a general-purpose
true/false flag, which at least has fewer issues than condition codes.
Still less ideal for purists.


A few ops were defined for the range of R0..R31:
MOV Rm,Rn
MOV Imm4,Rn
MOV.(L/Q) {SP,Disp4},Rn
...

Nearly everything else in the 16-bit part of the ISA is limited to
R0..R15. This part of the ISA does not have any access to R32..R63.


Unlike some other "compressed" ISAs, I didn't even really bother with
3-register ALU ops.

Partly this was based on noting that limiting them to 8 registers, even
if selecting the most commonly used registers, would still leave them
"borderline useless". Well, at least short of the compiler aggressively
trying to keep variables in these particular registers (likely
increasing spill rate as a result).


> I think these considerations reduce (but probably don't eliminate) the
> percentage of time the 16 bit instructions would be useful.
>

Early on, it was closer to 60% 16-bit, 40% 32-bit...

But, as things "evolved" it has drifted closer to 15% 16-bit, 85% 32-bit
(speed optimized code), and 35% 16-bit, 65% 32-bit, for size-optimized code.

Reasons are "various".

Partly it is that, generating "denser" / "more efficient" code has
(ironically enough) reduced many of the situations where the existing
16-bit ops were useful.


Note, for 16-bit percentage vs purely 32-bit:
60%: 70% original size.
35%: 82% original size.
15%: 93% original size.
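(This follows from assuming 2-byte vs 4-byte encodings: average size is
2*f + 4*(1-f) bytes, so f=60% gives 2.8/4 = 70%, f=35% gives 3.3/4 ~= 82%,
and f=15% gives 3.7/4 ~= 93%.)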


Any performance advantage is mostly negligible with a 32K L1 I$.

Main area it does have an effect is mostly related to how much code I
can fit into a 32K Boot ROM.


Generated ".text" size is generally smaller than x86-64, A64, and SH-4.

Though, it does not quite match i386 or Thumb (there tending to be a
fairly large code-size delta between i386 and x86-64, *).
But, appears "competitive".

If compiling equivalent code, I seem to be getting smaller code than for
RISC-V.


*: If "hand-compiling" code, x86-64 would be more size competitive, but
it seems like modern compilers (particularly MSVC) tend towards "bulky"
code generation (beyond what could be attributed to the presence of
absence of a REX prefix on each instruction).



Though, one thing that can throw off measurements for "naive comparison"
is that in my case, I also tend to effectively link the whole OS kernel
to the binaries in many cases (which adds a fair bit of code-size
overhead; but in the emulator, allows running them without also needing
to use the shell).



Well, also naive file-size comparisons also can't really be used, given
BGBCC tends to produce LZ4 compressed binaries by default.

But, yeah, otherwise I could be like "Hey, check it out, my binaries are
smaller than i386 binaries" (even when comparing a binary with a
static-linked kernel with a shared-object libc...).

...



Though, in terms of practical tradeoffs, who "wins" in terms of
code-density likely doesn't matter that much, and as long as one isn't
"massively worse", it is probably OK.


So, say, ".text" size for Doom:
BJX2: ~ 500K (with the kernel linked in, ...)
Probably fine...
MSVC (x64): ~ 1400K
What exactly is going on here?...
GCC (x86-64) + SDL (Linux, shared objects): ~ 800K
Original DOS version: Also ~ 500K-ish
(EXE: ~700K, ~200K initialized data + strings)
Thumb2 (with dynamic C library):
~ 360K.
RISC-V (RV64IMA, basic C library only, static-linked):
~ 700K
...


Meanwhile:
Doom in 7..9MB: "Oh dear, what is this crap?!"
If it looks anything like this, there is a problem...


Not everything is equivalent between programs.


Have noted that despite similar ".text" sizes, both i386 and Thumb code
seem to LZ compress better than BJX2 code, so something is going on
here as well. Both i386 and Thumb seem to be readily compressible.

In general though, LZ4 seems to be reasonably well suited to binaries
(my RP2 format does better for general data compression, but compiler
output across several ISA's seems to more favor LZ4 for whatever reason).


...

MitchAlsup
Aug 18, 2023, 1:52:49 PM
This is a consequence of me having done a moderately dog-chewed ISA
in 1983, worked on SPARC for 9 years, then in x86-64 for 7 years,
then having done a GPU ISA, and then retired from working for corporations.
<
What you see is an attempt to combine the best features of RISC with the
best features of CISC (and there are some--much to the chagrin of the
puritans) into a cohesive and mostly orthogonal ISA.
>
> Though, it does seem to lean a little further in the direction of
> immediate bits at the expense of opcode bits.
<
Because it was here that pure RISC ISAs waste so many instructions on
pasting bits together only to use them once as operands. So by inventing
universal constants, all of these bit-pasting instructions vanish from the
instruction stream.
>
>
> But, OTOH, there are tradeoffs here.
>
>
>
> And, admittedly, on the other side, not as many people are as likely to
> agree to my sentiment that 9-bits for more immediate and displacement
> fields is "mostly sufficient".
<
I agree it is "mostly sufficient", but wouldn't you rather have "almost entirely
sufficient" instead of "mostly sufficient" ?? i.e., 16-bits
>
> Well, and my instruction listing has also gotten bigger than I would
> prefer, ...
>
> Where, as can be noted, if expressed in bits (this for the XG2 variant):
> NMOp ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
> NMYp ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
> NMIp ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9 / 10s)
> NMIp ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10 / 11s)
> NYYp 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)
> YYYp 1p1Z iiii iiii iiii iiii iiii iiii //Imm24/Jumbo/PrWEX
>
> Where, Z is the bits effectively used as part of the opcode.
> n/m/o: Register, i=immediate, p=predicate.
> M/N/O: Register (high inverted bit)
> Y: Reserved for Opcode (future, must be 1 for now).
>
> Or, for Baseline:
> 111p ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
> 111p ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
> 111p ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9)
> 111p ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10)
> 111p 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)
>
> Where, as noted, the baseline encoding has 5-bit register fields.
>
>
> There are limits though to what is possible within a 32 bits layout.
<
I am on record that the ideal instruction size is 34-36-bits.
>
> And, I had made what tradeoffs I had made...
> >
> >> So, 16K or 32K appears to be a local optimum here.
> > Advanced prediction definitely lowers the pressure on i-cache even further.
> >
> Yeah.
>
>
> Predication can help to reduce the overall "branchiness" of the code:
> Average trace-length gets longer;
> The number of branch ops goes down;
> One can save a lot of cycles with short if-expressions;
> ...
>
> Some tasks that are painfully slow on more conventional processors can
> see a nice speed boost:
> Range-clamping expressions;
> The PNG Paeth filter;
> Things like range coders;
> ...
>
> Granted, a compiler can't always know which is better, since knowledge
> about whether or not a given branch is predictable is not known at
> compile time.
>
It often changes from predictable and back based on the data being processed
by the application.

John Levine
Aug 18, 2023, 5:14:04 PM
According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:
>>>According to
>>><http://www.bitsavers.org/pdf/ibm/system801/The_801_Minicomputer_an_Overview_Sep76.pdf>,
>>>Page 9:
>>>
>>>|[...] why we did not go to 32 bit registers. Primarily the reason is
>>>|that a technical case is hard to make for the additional cost.
>>>[...]
>>>|The CPU cost will grow from 7,600 gates to about 10,000 gates.

>I don't think that the 24-bit architecture has anything to do with the
>quality of the compiler. The 801 was a research project, and building
>it as 24-bit architecture was good enough as a proof-of-concept (and
>for a number of applications).

We don't have to guess. They explain the rationale on page 9 of that
paper. In an era when addresses were 24 bits or shorter, there wasn't
anything that was significantly easier to do in 32 bits than 24, given
the excellent optimization their compiler did.

>The 32-bitness of the ROMP was probably needed for the step from
>research project to product. The load/store-multiple may also have to
>do with that: Another poster here explained how load/store-multiple on
>ARM increased the block-copy performance by close to a factor of 4,

Having written a fair amount of the kernel design and toolchain for
ROMP AIX, I hope I have some insights here.

I think the main reason is that by the late 1970s it was apparent that
24 bits of address wasn't enough, and registers definitely had to be
big enough to contain an address. The 801 depended on the compiler to
enforce code safety and that wasn't going to work when the code wasn't
all written in PL.8. (There were some efforts to put a C front end on
the compiler but they gave up when I explained how thoroughly confused
C's pointers and arrays are.) They added an MMU which I think was
the first reverse mapped one, with TLB misses trapping and handled
by software.

Flushing the entire MMU on a context switch would have been horrible
so they came up with a hack which I think was carried over into POWER.
Virtual addresses were 40 bits, with the high 12 bits considered the
segment number and the low 28 the address. They had a 16 entry fast
RAM of 12 bit segment numbers, so the high 4 bits of each virtual
address were mapped to a 12 bit segment number, and the 40 bit address
looked up in the MMU. Each process saw 16 segments of up to 256M, and
context switches just had to reload the 16 entry RAM. This also made
it easy to share segments. My recollection is that in AIX, segment 0
was the VM kernel, segment 1 was the Unix kernel, segments 2 and 3
were the program's code and data segments and I think we had another
segment for a large static shared C library.
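
In C terms, the translation step described here is roughly (field widths
as given above; the names are mine):

#include <stdint.h>

/* Map a 32-bit process address to the 40-bit virtual address:
   the high 4 bits index a 16-entry RAM of 12-bit segment numbers,
   which replace them above the low 28 address bits. */
static uint64_t to_virt40(uint32_t addr, const uint16_t seg_ram[16])
{
    uint64_t seg = seg_ram[addr >> 28] & 0xFFFu;   /* 12-bit segment number */
    return (seg << 28) | (addr & 0x0FFFFFFFu);     /* 40-bit virtual address */
}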

JimBrakefield
Aug 18, 2023, 8:50:16 PM
Ugh, possibilities:
|>I am on record that the ideal instruction size is 34-36-bits.

Seven 9-bit "bytes" will fit into 64-bits. So one could do one 27-bit instruction and one 36-bit instruction in 64-bits.
Given some flexibility in their placement, and three configurations, one can have: 32-32, 27-36 and 36-27, less the configuration bit(s).
So there is a way, if one will go into uncharted territory?
And, what is the percentage of 32 or 36 bit compiler generated instructions that will easily fit into 27-bits??

MitchAlsup
Aug 18, 2023, 9:10:38 PM
Insightful--thanks
<
> So there is a way, if one will go into uncharted territory?
<
At this point:: Might as well.
<
> And, what is the percentage of 32 or 36 bit compiler generated instructions that will easily fit into 27-bits??
<
My guess (1st order) is "enough" will, compared to the times one needs 36-bits for a big instruction.
{This comes with the implication that 36-bit instructions are less than 20% of instruction stream}
<
But how do you take a trap and get back between the 27-bit and the 36-bit instruction ??
Or between the 36-bit instruction and the 27-bit instruction ??
{{And a few other questions along the same lines}}

JimBrakefield
Aug 18, 2023, 10:01:53 PM
Three design choices off the top of my head:
|> But how do you take a trap and get back between the 27-bit and the 36-bit instruction ??
|> Or between the 36-bit instruction and the 27-bit instruction ??
|> {{And a few other questions along the same lines}}

The simplest is a configuration/status bit that switches between 32-32 and 27-36
The 27-36 has a 64th bit that can select between 27-36 and 36-27
And the 36-bit instructions could contain a bit to switch back to 32-32?

Otherwise the 32-bit instructions need to be modified to provide one or two bits for configuration of the current 64-bits?
(one bit taken from each 32-bit instruction and one bit from the 36-bit instruction;
giving four instruction format state/state-changes of your choosing)

Or more generally with nine-bit instruction parts, any combination of 18, 27 and 36 bit instructions decoded serially.
The 64th bit could be used to help with the decode, for instance, whether the instructions extend into the next 64-bits.
{{Instruction addresses and displacements would have one unused value in the three LSBs}}
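
A sketch of the 27-36/36-27 case (bit positions are my own assumption;
the thread doesn't pin down an exact layout, but 27 + 36 + 1 tag bit
fills the 64-bit container exactly):

#include <stdint.h>

/* Split one 64-bit container into its two instructions; bit 63
   selects whether the 27-bit or the 36-bit half comes first. */
static void decode64(uint64_t w, uint64_t *first, uint64_t *second)
{
    if (w >> 63) {   /* 36-27 order */
        *first  = (w >> 27) & ((1ull << 36) - 1);
        *second =  w        & ((1ull << 27) - 1);
    } else {         /* 27-36 order */
        *first  = (w >> 36) & ((1ull << 27) - 1);
        *second =  w        & ((1ull << 36) - 1);
    }
}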

Scott Lurndal
Aug 19, 2023, 10:30:32 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, August 18, 2023 at 7:50:16 PM UTC-5, JimBrakefield wrote:

>> And, what is the percentage of 32 or 36 bit compiler generated instructions that will easily fit into 27-bits??
><
>My guess (1st order) is "enough" will, compared to the times one needs 36-bits for a big instruction.
>{This comes with the implication that 36-bit instructions are less than 20% of instruction stream}
><
>But how do you take a trap and get back between the 27-bit and the 36-bit instruction ??
>Or between the 36-bit instruction and the 27-bit instruction ??

Add a bit to the PC to record which part is next? Use something
like the PDP-8 link register? Record it in the processor status
register (e.g. like ARM Thumb IT instruction state)?

BGB
Aug 19, 2023, 12:10:41 PM
Hmm, what about an ISA where instructions are mostly a prime number of
bytes:
2, 3, 5, 7, 11.

xxxx-xxxx xxxx-xxx0
xxxx-xxxx xxxx-xxxx xxxx-xx01
xxxx-xxxx xxxx-xxxx xxxx-xxxx xxxx-xxxx xxxx-x011
...
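
The low bits then give a simple length decode, roughly (a sketch; the
7- and 11-byte patterns aren't spelled out above, so they fall to a
default here):

/* Instruction length from the low bits of the first 16-bit parcel. */
static int insn_len_bytes(unsigned first16)
{
    if ((first16 & 1) == 0) return 2;   /* ....xxx0 */
    if ((first16 & 3) == 1) return 3;   /* ....xx01 */
    if ((first16 & 7) == 3) return 5;   /* ....x011 */
    return 0;                           /* x111: 7/11-byte forms, etc. */
}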

Then, say:
16 bit ops have 2 4-bit register fields.
24 bit ops have 3 5-bit register fields.
40 bit ops have 3 6-bit register fields.

zzzz-ssss nnnn-zzz0
tttt-tsss sszn-nnnn zzzz-zz01
zzzz-zztt tttt-zsss ssnn-nnnn zzzz-zzzz ppzz-z011


The 16-bit ops would mostly hold a collection of 2R ops.

The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
iiii-isss ss0n-nnnn zzz0-1001 //ALU Rs, Imm5u, Rn
tttt-tsss ss1n-nnnn zzz0-1001 //ALU Rs, Rt, Rn
tttt-tsss ss0n-nnnn zzz1-1001 //Misc (3R)
zzzz-zsss ss1n-nnnn zzz1-1001 //Misc (2R ops)
...
iiii-isss sszn-nnnn zzzz-0101 //LD (Rs, Disp5)
iiii-iiii iiii-iiii zz11-1101 //Branch (Disp16s)

The 16 and 24 bit ops could be defined as (hopefully straightforward)
unpacking rules into the 40 bit format (they can be considered as
"compressed", but in the sense that one needs to define bit-for-bit
mapping rules to the larger formats).

The 56 and 88 bit formats would mostly add immediate bits or similar
onto the 40 bit format.

In this case, branch displacements would be in terms of bytes.

...



Probably not terribly sensible as an ISA design, but could be kinda
amusing I think.

Also funny if one could do a superscalar implementation of such an ISA...


MitchAlsup
Aug 19, 2023, 12:31:27 PM
I see not giving full access to the whole RF as a poor choice.
Feel free to disagree with me. {There are enough register
allocation problems already without adding artificial boundaries to
the use of registers. You might have set up a situation where you
have to register allocate from one virtual RF space to another
virtual RF space before allocating into the physical RF space.}
>
> zzzz-ssss nnnn-zzz0
> tttt-tsss sszn-nnnn zzzz-zz01
> zzzz-zztt tttt-zsss ssnn-nnnn zzzz-zzzz ppzz-z011
>
>
> The 16-bit ops would mostly hold a collection of 2R ops.
>
> The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
> iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
> iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
> tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
> tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
<
I think you have sacrificed too much entropy to this particular encoding.
Consider a 32-bit RISC LD/ST instruction can have a 16-bit displacement
So a 24-bit one should be able to have an 8-bit displacement.

BGB
Aug 19, 2023, 12:40:27 PM
Fair enough.


>>
>> Though, it does seem to lean a little further in the direction of
>> immediate bits at the expense of opcode bits.
> <
> Because it was here that pure RISC ISAs waste so many instructions on
> pasting bits together only to sue them once as operands. So by inventing
> universal constants all of these bit pasting instructions vanish from the
> instruction stream.

Yeah, this is why I ended up adding jumbo prefixes...


Even within a pure RISC, there are better/worse:
OK: LDSH/SHORI
Worse: LUI+ADD or similar;
BAD: PC-relative Load

Main advantage of LDSH/SHORI being that it expands easily to 64-bit
constants, whereas LUI doesn't.


Ironically, despite being a microcontroller RISC, the IMM
prefix-instruction in MicroBlaze is also functionally similar to a jumbo
prefix.


>>
>>
>> But, OTOH, there are tradeoffs here.
>>
>>
>>
>> And, admittedly, on the other side, not as many people are as likely to
>> agree to my sentiment that 9-bits for more immediate and displacement
>> fields is "mostly sufficient".
> <
> I agree it is "mostly sufficient", but wouldn't you rather have "almost entirely
> sufficient" instead of "mostly sufficient" ?? i.e., 16-bits


It is mostly a difference of a few percent if going by my stats.
9 bits still "wipes the floor" with 5 or 6 bit displacement fields.
12 (scaled) does a little better, but not enough to justify 33% more bits.

The practical difference between 96.9% and 99.5% is "not that huge",
whereas the difference from 60% (scaled) or 20% (unscaled) for a 5u or
6s displacement is quite a bit more significant.


Though, the 9-bit cases effectively expand to 10-bit signed in XG2,
partly because, while 9-bit unsigned won out over 9-bit signed, 10-bit
signed wins out over 10-bit unsigned (but, it was pretty close here).


Ironically, both 9-bit unsigned and 10-bit signed, with a displacement
scale, manage to slightly beat out the 12-bit signed/unscaled
displacement style used by RISC-V.

Say, Disp12s can reach +/-2K. Whereas, scaled Disp9u (for QWORD) can
reach 4K.
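(That is: 2^9 = 512 displacement slots, times the 8-byte QWORD scale,
gives 4K of positive reach, vs +/-2K for the 12-bit unscaled case.)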


Granted, the RISC-V strategy (unscaled displacements) would be more of a
win if the general case use of packed structs or similar was "actually a
thing".


For ALU immediate values, 9 bits still gets ~ 95%, 12 bits would get ~
97%. Both beat out Imm5 at roughly 54%, ...

Then, with fallback cases:
Load an Imm25s into R0, use R0 instead of an immediate (*1);
Or:
Use a jumbo prefix, now it is Imm33s.

If 5% of the time, one needs to use a jumbo prefix or similar, this
isn't all that terrible.


*1: Many of the immediate or displacement values that blew out
Imm9/Disp9 also often blow out Imm16s, having originally needed ~24u/25s
bits as the "covers most cases" fallback case. This was designed before
I later added jumbo prefixes.

There is a little layout wonkiness due to the deprecated/dropped BT/BF
encodings as well, but this was because I had designed branches before I
had added predication.

In retrospect, it might have made more sense to have put the branches
into the remaining space in the F8 block. Say:
F8Ei-iiii BRA Disp20s
F8Fi-iiii BSR Disp20s
E8Ei-iiii BT Disp20s
ECEi-iiii BF Disp20s

But, such a change would break my existing code (and would require new
relocs/... as well).


As noted, BGBCC (and the ABI) ended up treating R0 and R1 as special
registers that may be stomped without warning. However, since then, R1
ended up being reclaimed more as a scratch/auxiliary link register or
scratch branch-address register.

If writing ASM code, one needs to tread carefully if using these
registers (partly due to possibly wonky cases in the assembler, and
partly as they are sometimes treated as special case encodings in the
instruction decoder for certain ops).


>>
>> Well, and my instruction listing has also gotten bigger than I would
>> prefer, ...
>>
>> Where, as can be noted, if expressed in bits (this for the XG2 variant):
>> NMOp ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
>> NMYp ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
>> NMIp ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9 / 10s)
>> NMIp ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10 / 11s)
>> NYYp 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)
>> YYYp 1p1Z iiii iiii iiii iiii iiii iiii //Imm24/Jumbo/PrWEX
>>
>> Where, Z is the bits effectively used as part of the opcode.
>> n/m/o: Register, i=immediate, p=predicate.
>> M/N/O: Register (high inverted bit)
>> Y: Reserved for Opcode (future, must be 1 for now).
>>
>> Or, for Baseline:
>> 111p ZpZZ nnnn mmmm ZZZZ Znmo oooo ZZZZ //3R
>> 111p ZpZZ nnnn mmmm ZZZZ ZnmZ ZZZZ ZZZZ //2R
>> 111p ZpZZ nnnn mmmm ZZZZ Znmi iiii iiii //3RI (Imm9)
>> 111p ZpZZ nnnn ZZZZ ZZZZ Znii iiii iiii //2RI (Imm10)
>> 111p 1p00 ZZZn nnnn iiii iiii iiii iiii //2RI (Imm16)
>>
>> Where, as noted, the baseline encoding has 5-bit register fields.
>>
>>
>> There are limits though to what is possible within a 32 bits layout.
> <
> I am on record that the ideal instruction size is 34-36-bits.

Yes, but memory being built around 8-bit bytes kinda precludes this.

Fixed-length 40 or 48 bit instructions "ain't gonna fly".


>>
>> And, I had made what tradeoffs I had made...
>>>
>>>> So, 16K or 32K appears to be a local optimum here.
>>> Advanced prediction definitely lowers the pressure on i-cache even further.
>>>
>> Yeah.
>>
>>
>> Predication can help to reduce the overall "branchiness" of the code:
>> Average trace-length gets longer;
>> The number of branch ops goes down;
>> One can save a lot of cycles with short if-expressions;
>> ...
>>
>> Some tasks that are painfully slow on more conventional processors can
>> see a nice speed boost:
>> Range-clamping expressions;
>> The PNG Paeth filter;
>> Things like range coders;
>> ...
>>
>> Granted, a compiler can't always know which is better, since knowledge
>> about whether or not a given branch is predictable is not known at
>> compile time.
>>
> It often changes from predictable and back based on the data being processed
> by the application.
>

Yeah, either way, the compiler isn't going to know.


BGB
Aug 19, 2023, 1:21:10 PM
This is less of a problem if every shorter encoding has a corresponding
encoding in a wider format (and the effects of instruction size are not
explicit at the ASM level).

In this case, the wonky register sizes become merely an size
optimization issue, where one can have the compiler prioritize the
registers that can use shorter formats over the ones that need longer
formats.


Though, in this case, for this combination, it would likely make sense
to keep a similar register layout to BJX2, which ironically mostly has
the needed layout as a side-effect of having "grown out" of an
earlier 16-register layout (the 64 GPR layout being effectively the 16
GPR layout repeated 4 times...).


...

Still better than having an encoding where a bunch of combinations are
non-encodable in the ISA (and the code-generator needs to have a bunch
of wonky edge cases to work around being unable to encode the offending
cases...).

But, yeah, I had been down the above road both with R16..R31 in BJX1,
and with R32..R63 in BJX2's Baseline encoding (eg: the fun of trying to
work with 64 GPRs in an ISA encoding designed around 5-bit register fields).


At least the XG2 encoding sort of "fixes" the above issue, albeit at the
potential cost of code density due to the loss of 16-bit encodings.

Then again, given Doom with the XG2 encoding is still smaller than
either RV64IMA or x86-64 builds, I don't think it is doing too horribly
(even if XG2 is roughly 11% worse than the Baseline encoding in terms of
code-density).

I guess the main tradeoff here being whether one wants to build programs
for 32 or 64 GPRs (I may consider splitting my A and G/H profiles along
these lines, possibly with the G/H profiles assuming XG2 encoding as the
default, but A assuming Baseline; partly as using XG2 in a 32 GPR
configuration gains nothing; but using Baseline in a 64 GPR
configuration sucks due to non-orthogonality issues...).


>>
>> zzzz-ssss nnnn-zzz0
>> tttt-tsss sszn-nnnn zzzz-zz01
>> zzzz-zztt tttt-zsss ssnn-nnnn zzzz-zzzz ppzz-z011
>>
>>
>> The 16-bit ops would mostly hold a collection of 2R ops.
>>
>> The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
>> iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
>> iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
>> tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
>> tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
> <
> I think you have sacrificed too much entropy to this particular encoding.
> Consider a 32-bit RISC LD/ST instruction can have a 16-bit displacement
> So a 24-bit one should be able to have an 8-bit displacement.
> <

Then for this encoding block, you would have *nothing* apart from LD/ST
ops...

One could note that Disp5u still typically hits roughly 50% of the time
in my stats. This is probably enough for the encoding to still be "useful".

Granted, half the time, one would still need to use the 40-bit format...

BGB
Aug 19, 2023, 1:46:19 PM
On 8/18/2023 11:05 AM, Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Monday, August 14, 2023 at 5:45:10 AM UTC-5, pec...@gmail.com wrote:
>
>>> Reserved part of 16-bit space alone could double available 32 bit opcode space.
>> <
>> RISC-V allocates 3/4 of the OpCode encoding to 16-bit stuff and gains all the complexity of variable length instructions but gains little of the benefits.
>
> ARM has the Thumb32 instruction set, which I just finished a simulator for,
> which reserves three of the 16-bit encodings to indicate 32-bit instructions.
>

Having developed along a vaguely similar trajectory, I had ended up with
a similar scheme (to Thumb2) in my case.


> It also includes the rather unusual T16 IT instruction (If-Then) which, as a form
> of predication, can cover up to four subsequent T16 instructions.
>
> It's worth noting that the IT instruction was deprecated in the thumb
> support for AArch32 in ARMv8+.

I would guess that this mechanism would have required a way to preserve
and restore this state during interrupts, which could be "rather
annoying" to deal with.

Probably combined with limited use by compilers compared with normal
branches.

Conventional wisdom is usually that "branch predictor makes branches not
slow" so "one does not need predication".


Except now the CPU performance may "eat it" when trying to deal with a
PNG Paeth filter or bitwise range coder or similar (which effectively
feed raw entropy from the data stream into the branch hit/miss
handling). Likewise for things like alpha-testing pixels in a software
rasterizer, etc.

But, a lot of people (including compiler writers) seem inclined to
ignore these cases.

But, then CPU designers are like "well, we will interpret a short
forward branch as predicating the next N instructions rather than doing
a branch", ...

...

MitchAlsup
Aug 19, 2023, 3:12:40 PM
Only when "done wrong".
<
LDD R7,[IP,0x1234]
<
Is one <word> instruction using R0 as a proxy for IP when used as a base register.
>
> Main advantage of LDSH/SHORI being that it expands easily to 64-bit
> constants, whereas LUI doesn't.
>
LDSH = Load Signed Half Word ??
SHORI = Store Half OR Immediate ??
>
> Ironically, despite being a microcontroller RISC, the IMM
> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
> prefix.
<
STD 3.141592653589278643,[R3,R7<<3,DISP64]
<
Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
That is, you can store an arbitrary constant anywhere in memory
using any addressing mode at any time with a single instruction.
> >>
> >>
> >> But, OTOH, there are tradeoffs here.
> >>
> >>
> >>
> >> And, admittedly, on the other side, not as many people are as likely to
> >> agree to my sentiment that 9-bits for more immediate and displacement
> >> fields is "mostly sufficient".
> > <
> > I agree it is "mostly sufficient", but wouldn't you rather have "almost entirely
> > sufficient" instead of "mostly sufficient" ?? i.e., 16-bits
<
> It is mostly a difference of a few percent if going by my stats.
> 9 bits still "wipes the floor" with 5 or 6 bit displacement fields.
> 12 (scaled) does a little better, but enough to justify 33% more bits.
<
I don't think you could point to a place where I sacrificed anything to enable
almost all integer and memory references getting 16-bit immediates.
Whereas, EMBench demonstrates that RISC-V's 12-bit displacements
are insufficient for most memory accesses. {Almost as if EMBench
had been designed to illustrate that disparity.}
>
> The practical difference between 96.9% and 99.5% is "not that huge",
> whereas the difference from 60% (scaled) or 20% (unscaled) for a 5u or
> 6s displacement is, quite a bit more significant.
>
You are still operating under the assumption that I had to sacrifice
anything.
>
> Though, the 9-bit cases effectively expand to 10-bit signed in XG2,
> partly because, while 9-bit unsigned won out over 9-bit signed, 10-bit
> signed wins out over 10-bit unsigned (but, it was pretty close here).
>
>
> Ironically, both 9-bit unsigned and 10-bit signed, with a displacement
> scale, manage to slightly beat out the 12-bit signed/unscaled
> displacement style used by RISC-V.
>
> Say, Disp12s can reach +/-2K. Whereas, scaled Disp9u (for QWORD) can
> reach 4K.
<
This is the same argument I used on the Mc 88100, arguing that displacement
arithmetic need not be signed (ala IBM 360): the arithmetic was
congruent (could be rephrased in the same number of instructions),
and allowed certain linker tricks.
<
The compiler people wouldn't even discuss it.
>
>
> Granted, the RISC-V strategy (unscaled displacements) would be more of a
> win if the general case use of packed structs or similar was "actually a
> thing".
>
What you are saying is that "If RISC-V hadn't screwed up so many things it
would have been a significantly better ISA". And no one could possibly disagree
with you.
>
> For ALU immediate values, 9 bits still gets ~ 95%, 12 bits would get ~
> 97%. Both beat out Imm5 at roughly 54%, ...
<
But now you have to route all sorts of different sizes from the instruction
to various operand busses, whereas I only have to route {16,32-64}-bits.
This takes less decode logic and less multiplexing logic in the <time
critical> forwarding "loop".
<
RISC-V then compounds this problem by adding compression.
>
><snip>
>
> As noted, BGBCC (and the ABI) ended up treating R0 and R1 as special
> registers that may be stomped without warning. However, since then, R1
> ended up being reclaimed more as a scratch/auxiliary link register or
> scratch branch-address register.
<
I have no registers that any external force can stomp on
>
> If writing ASM code, one needs to tread carefully if using these
> registers (partly due to possibly wonky cases in the assembler, and
> partly as they are sometimes treated as special case encodings in the
> instruction decoder for certain ops).
<
I don't have these issues.
<snip>
> > I am on record that the ideal instruction size is 34-36-bits.
> Yes, but memory being built around 8-bit bytes kinda precludes this.
Somewhat of a shame, actually.........
>
><snip>
> >> Granted, a compiler can't always know which is better, since knowledge
> >> about whether or not a given branch is predictable is not known at
> >> compile time.
> >>
> > It often changes from predictable and back based on the data being processed
> > by the application.
> >
> Yeah, either way, the compiler isn't going to know.
<
If it weren't for benchmarketeering, the compiler would never have had to know.

MitchAlsup
Aug 19, 2023, 3:17:31 PM
On Saturday, August 19, 2023 at 12:21:10 PM UTC-5, BGB wrote:
> On 8/19/2023 11:31 AM, MitchAlsup wrote:

> >> The 16-bit ops would mostly hold a collection of 2R ops.
> >>
> >> The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
> >> iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
> >> iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
> >> tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
> >> tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
> > <
> > I think you have sacrificed too much entropy to this particular encoding.
> > Consider a 32-bit RISC LD/ST instruction can have a 16-bit displacement
> > So a 24-bit one should be able to have an 8-bit displacement.
> > <
> Then for this encoding block, you would have *nothing* apart from LD/ST
> ops...
<
2 flavors
a) MEM Rd,[Rb,DISP16]
b) MEM Rd,[Rb,Ri<<s] // which have optional displacements {32,64}
>
> One could note that Disp5u still typically hits roughly 50% of the time
> in my stats. This is probably enough for the encoding to still be "useful".
<
Whereas, my encoding gives that "flavor" 16 bits, which, as you stated, is good
to the 99% level. 99% > 50%, to the point the compiler does not need the
intermediate pattern-recognition cases.

MitchAlsup
Aug 19, 2023, 3:24:18 PM
Extract and Insert Instructions simplify the encoding of these.
>
> But, a lot of people (including compiler writers) seem inclined to
> ignore these cases.
<
Often disguised as a series of shifts, (a << const1) >> const2, because
the underlying language does not express variable-length bit-fields
efficiently.
>
> But, then CPU designers are like "well, we will interpret a short
> forward branch as predicating the next N instructions rather than doing
> a branch", ...
<
AND WHY NOT ??
>
> ...

BGB
Aug 19, 2023, 4:54:44 PM
LDSH = Load-via-Shift (the name I originally came up with for BJX1).
SHORI = Shift-with-OR (the name Hitachi came up with for SH5).

Both basically being the same mechanism:
Rn = (Rn<<16)|Imm16u;
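
Modeled in C, a full 64-bit constant then takes four such steps (an
illustrative model of the semantics above, not actual compiler output):

#include <stdint.h>

static uint64_t shori(uint64_t rn, uint16_t imm16)
{
    return (rn << 16) | imm16;
}

/* Build an arbitrary 64-bit constant, high halfword first. */
static uint64_t load_const64(uint64_t c)
{
    uint64_t rn = 0;
    rn = shori(rn, (uint16_t)(c >> 48));
    rn = shori(rn, (uint16_t)(c >> 32));
    rn = shori(rn, (uint16_t)(c >> 16));
    rn = shori(rn, (uint16_t)c);
    return rn;
}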

In both ISA branches, Load/Store (from memory) uses a MOV.x name, such as:
MOV.W (R4), R9 //BJX2 notation
MOV.W @R4, R9 //SuperH notation



The original SH-2/4 ISA had instead used dedicated PC-relative load
instructions, IIRC:
MOV.W (PC, Disp8), R0
MOV.L (PC, Disp8), R0

But, these were a pain...

Basically, the assembler would need to find spots to silently dump a
blob of whatever constants were pending, sometimes in the middle of
the instruction stream if the distance got close to the hard limit
of 512 bytes.

Typically, this would also involve emitting a branch over the blob of
constants (along with a NOP since branches in SH had a delay slot), ...


>>
>> Ironically, despite being a microcontroller RISC, the IMM
>> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
>> prefix.
> <
> STD 3.141592653589278643,[R3,R7<<3,DISP64]
> <
> Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
> That is, you can store an arbitrary constant anywhere in memory
> using any addressing mode at any time with a single instruction.

Possible.

Pulling similar off in my case would likely require 3 instructions
(assuming the RiMOV extension), or 4 (otherwise).

But, this is not a common case...


>>>>
>>>>
>>>> But, OTOH, there are tradeoffs here.
>>>>
>>>>
>>>>
>>>> And, admittedly, on the other side, not as many people are as likely to
>>>> agree to my sentiment that 9-bits for more immediate and displacement
>>>> fields is "mostly sufficient".
>>> <
>>> I agree it is "mostly sufficient", but wouldn't you rather have "almost entirely
>>> sufficient" instead of "mostly sufficient" ?? i.e., 16-bits
> <
>> It is mostly a difference of a few percent if going by my stats.
>> 9 bits still "wipes the floor" with 5 or 6 bit displacement fields.
>> 12 (scaled) does a little better, but enough to justify 33% more bits.
> <
> I don't think you could point to a place where I sacrificed anything to enable
> almost all integer and memory references getting 16-bit immediates.
> Whereas; EMBench demonstrates that RISC-V's 12-bit displacements
> are insufficient for most memory accesses. {Almost as if EMBench
> had been designed to illustrate that disparity.}


I will not claim that 9 bits gets universal coverage, but in the
programs I have been running thus far, it has good coverage (and just
slightly better than the RISC-V strategy on average despite having 3
fewer bits).

Granted... packed structures in my case would require displacements to
be shuffled through R0 (there is a special case for unscaled R0
displacements, like in its SuperH ancestors...).


>>
>> The practical difference between 96.9% and 99.5% is "not that huge",
>> whereas the difference from 60% (scaled) or 20% (unscaled) for a 5u or
>> 6s displacement is, quite a bit more significant.
>>
> You are still operating under the assumption that I had to sacrifice
> anything.

There is less space for opcode bits.

Something like x86 SSE or ARM NEON style SIMD would likely be an issue
for encoding space, at least in terms of 32-bit ops... Granted, I am
guessing you probably also have an "escape hatch" for more opcode space?...


>>
>> Though, the 9-bit cases effectively expand to 10-bit signed in XG2,
>> partly because, while 9-bit unsigned won out over 9-bit signed, 10-bit
>> signed wins out over 10-bit unsigned (but, it was pretty close here).
>>
>>
>> Ironically, both 9-bit unsigned and 10-bit signed, with a displacement
>> scale, manage to slightly beat out the 12-bit signed/unscaled
>> displacement style used by RISC-V.
>>
>> Say, Disp12s can reach +/-2K. Whereas, scaled Disp9u (for QWORD) can
>> reach 4K.
> <
> This is the same argument I used in Mc 88100 arguing that displacement
> arithmetic need not be signed (ala IBM 360), that the arithmetic was
> congruent (could be rephrased in the same number of instructions),
> and would allow certain linker tricks.
> <
> The compiler people wouldn't even discuss it.


The arithmetic is not unsigned, but the 9-bit displacements are.

Argument is that this last bit can increase hit-rate from around 88%
to 97%, whereas a sign bit would have gained only a fraction of a percent.

However, from 9 to 10 bits, sign won out (in both cases, it was only by
a fraction of a percent).

Reaching 100% would still require a significantly larger displacement.
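
Say, as a rough C sketch of the reach difference (function names are
mine, purely illustrative): a scaled 9-bit unsigned displacement vs. an
unscaled 12-bit signed one, for 8-byte (QWORD) elements:

#include <stdint.h>

uint64_t ea_disp9u_scaled(uint64_t base, uint32_t disp9u)
{
    return base + ((uint64_t)(disp9u & 0x1FF) << 3); //0..4088, 8-byte steps
}

uint64_t ea_disp12s_unscaled(uint64_t base, int32_t disp12s)
{
    return base + (int64_t)disp12s; //-2048..+2047, any byte offset
}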


>>
>>
>> Granted, the RISC-V strategy (unscaled displacements) would be more of a
>> win if the general case use of packed structs or similar was "actually a
>> thing".
>>
> What you are saying is that "If RISC-V hadn't screwed up so many things it
> would have been a significantly better ISA". And no one could possibly disagree
> with you.

Probably.

Scaled displacements, Register-indexed Load/Store, Constant-loading that
"doesn't suck", ...

Yet, it is seemingly the most popular open ISA at this point.


In most other regards, I would put my bet instead on ARMv8, except:
Not an open ISA;
ALU condition codes, bleh...


>>
>> For ALU immediate values, 9 bits still gets ~ 95%, 12 bits would get ~
>> 97%. Both beat out Imm5 at roughly 54%, ...
> <
> But now you have to route all sorts of different sizes from the instruction
> to various operand busses, whereas I only have to route {16,32-64}-bits.
> This takes less decode logic and less multiplexing logic in the <time
> critical> forwarding "loop".

After the decode stage, all the pipeline sees is a 33-bit value...

Granted, during decode, the decoder needs to deal with all of the
various possible instruction layouts.

So, say:
Lookup opcode based on the various bits;
Finds where it is routed to;
Finds the "FormID" which tells which instruction layout was used/...
Unpack instruction based on FormID rules;

Outer decoder maps the decoded instruction decoder's outputs to the
pipeline's lanes and register ports.
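
As a hedged C sketch of that two-step flow (every name, opcode value,
and bit position here is invented for illustration, not the actual
decoder):

#include <stdint.h>

enum { FORM_3R, FORM_IMM9 };

typedef struct { int unit; int form; } DecInfo;

static DecInfo lookup_opcode(uint32_t instr)
{
    DecInfo di;
    di.unit = (instr >> 26) & 63;  //major opcode bits
    di.form = (di.unit == 0x3A) ? FORM_IMM9 : FORM_3R;
    return di;
}

static void unpack(uint32_t instr, DecInfo di,
                   int *rm, int *ro, int *rn, int32_t *imm)
{
    *rm = (instr >> 12) & 63;  //register fields at fixed positions
    *rn =  instr        & 63;
    switch(di.form)
    {
    case FORM_3R:   *ro = (instr >> 6) & 63; *imm = 0;                  break;
    case FORM_IMM9: *ro = 0;                 *imm = (instr >> 6) & 511; break;
    }
}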

> <
> RISC-V then compounds this problem by adding compression.

Yeah.
It also looks simple, but in some ways is kind of a pain.

RVC is "kinda evil" in some ways. Doesn't map easily to "hey, unpack
this instruction according to this particular layout", as there are a
number of one-off deviations, ...


>>
>> <snip>
>>
>> As noted, BGBCC (and the ABI) ended up treating R0 and R1 as special
>> registers that may be stomped without warning. However, since then, R1
>> ended up being reclaimed more as a scratch/auxiliary link register or
>> scratch branch-address register.
> <
> I have no registers that any external force can stomp on
>>
>> If writing ASM code, one needs to tread carefully if using these
>> registers (partly due to possibly wonky cases in the assembler, and
>> partly as they are sometimes treated as special case encodings in the
>> instruction decoder for certain ops).
> <
> I don't have these issues.

OK.

It was a tradeoff...


> <snip>
>>> I am on record that the ideal instruction size is 34-36-bits.
>> Yes, but memory being built around 8-bit bytes kinda precludes this.
> Somewhat of a shame, actually.........
>>
>> <snip>
>>>> Granted, a compiler can't always know which is better, since knowledge
>>>> about whether or not a given branch is predictable is not known at
>>>> compile time.
>>>>
>>> It often changes from predictable and back based on the data being processed
>>> by the application.
>>>
>> Yeah, either way, the compiler isn't going to know.
> <
> If it weren't for benchmarketeering, the compiler would never have had to know.

Granted.

Scott Lurndal

Aug 19, 2023, 4:54:44 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Saturday, August 19, 2023 at 11:10:41 AM UTC-5, BGB wrote:
>> On 8/19/2023 9:30 AM, Scott Lurndal wrote:
>> > MitchAlsup <Mitch...@aol.com> writes:
>> >> On Friday, August 18, 2023 at 7:50:16 PM UTC-5, JimBrakefield wrote
>> >
>> >>> And, what is the percentage of 32 or 36 bit compiler generated
>> >>> instructions that will easily fit into 27-bits??
>> >> <
>> >> My guess (1st order) is "enough" will compared to the times one
>> >> needs 36-bits for a big instruction.
>> >> {This comes with the implication that 36-bit instructions are less
>> >> than 20% of instruction stream}
>> >> <
>> >> But how do you take a trap and get back between the 27-bit and the
>> >> 36-bit instruction ??
>> >> Or between the 36-bit instruction and the 27-bit instruction ??
>> >
>> > Add a bit to the PC to record which part is next? Use something
>> > like the PDP-8 link register? Record it in the processor status
>> > register (e.g. like ARM Thumb IT instruction state)?
>> Hmm, what about an ISA where instructions are mostly a prime number of
>> bytes:
>> 2, 3, 5, 7, 11.
>>
>> xxxx-xxxx xxxx-xxx0
>> xxxx-xxxx xxxx-xxxx xxxx-xx01
>> xxxx-xxxx xxxx-xxxx xxxx-xxxx xxxx-xxxx xxxx-x011
>> ...
>>
>> Then, say:
>> 16 bit ops have 2 4-bit register fields.
>> 24 bit ops have 3 5-bit register fields.
>> 40 bit ops have 3 6-bit register fields.
><
>I see not giving full access to the whole RF as a poor choice,

That's one of the "features" of ARM's Thumb32. The 32-bit
instructions have access to all 16 registers, while the
16-bit instructions only access the first 8.

MitchAlsup

Aug 19, 2023, 5:54:36 PM
On Saturday, August 19, 2023 at 3:54:44 PM UTC-5, BGB wrote:
> On 8/19/2023 2:12 PM, MitchAlsup wrote:

> >> Main advantage of LDSH/SHORI being that it expands easily to 64-bit
> >> constants, whereas LUI doesn't.
> >>
> > LDSH = Load Signed Half Word ??
> > SHORI = Store Half OR Immediate ??
> LDSH = Load-via-Shift (the name I originally came up with for BJX1).
> SHORI = Shift-with-OR (the name Hitachi came up with for SH5).
>
> Both basically being the same mechanism:
> Rn = (Rn<<16)|Imm16u;
<
lack of addition is problematic.
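
(For reference, a minimal C sketch of the quoted mechanism, with an
arbitrary constant: each step shifts the register left 16 and ORs in
16 more bits, so a full 64-bit constant takes four steps.)

#include <stdint.h>

uint64_t shori_load_example(void)
{
    uint64_t rn = 0;
    rn = (rn << 16) | 0x0123; //SHORI #0x0123, Rn
    rn = (rn << 16) | 0x4567; //SHORI #0x4567, Rn
    rn = (rn << 16) | 0x89AB; //SHORI #0x89AB, Rn
    rn = (rn << 16) | 0xCDEF; //SHORI #0xCDEF, Rn
    return rn;                //0x0123456789ABCDEF
}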
>
> In both ISA branches, Load/Store (from memory) uses a MOV.x name, such as:
> MOV.W (R4), R9 //BJX2 notation
> MOV.W @R4, R9 //SuperH notation
>
An inbound memory reference should be spelled LD
An outbound memory reference should be spelled ST
>

> >> Ironically, despite being a microcontroller RISC, the IMM
> >> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
> >> prefix.
> > <
> > STD 3.141592653589278643,[R3,R7<<3,DISP64]
> > <
> > Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
> > That is, you can store an arbitrary constant anywhere in memory
> > using any addressing mode at any time with a single instruction.
> Possible.
>
> Pulling similar off in my case would likely require 3 instructions
> (assuming the RiMOV extension), or 4 (otherwise).
<
RISC-V typically uses 3 or 4 instructions:
AUIPC; LD const; LDHI; ST location
>
> But, this is not a common case...
<
I can show subroutines with 10 of these in a row.
<
-------------
> > You are still operating under the assumption that I had to sacrifice
> > anything.
<
> There is less space for opcode bits.
<
Not compared to MIPS, Mc 88100, RISC-V.
And every OpCode Group has space remaining.
>
> Something like x86 SSE or ARM NEON style SIMD would likely be an issue
> for encoding space, at least in terms of 32-bit ops... Granted, I am
> guessing you probably also have an "escape hatch" for more opcode space?...
<
All of the unused OpCode Groups are reserved for the future. There are 22
(out of 64) Major OpCodes for future expansion. Given that I consumed
21 for 16-bit immediates, I think there is plenty (at least for the rest of
my lifetime.) Also notice I got Vectorization and SIMD into 2 instructions.
> >>
> >> Though, the 9-bit cases effectively expand to 10-bit signed in XG2,
> >> partly because, while 9-bit unsigned won out over 9-bit signed, 10-bit
> >> signed wins out over 10-bit unsigned (but, it was pretty close here).
> >>
> >>
> >> Ironically, both 9-bit unsigned and 10-bit signed, with a displacement
> >> scale, manage to slightly beat out the 12-bit signed/unscaled
> >> displacement style used by RISC-V.
> >>
> >> Say, Disp12s can reach +/-2K. Whereas, scaled Disp9u (for QWORD) can
> >> reach 4K.
> > <
> > This is the same argument I used in Mc 88100 arguing that displacement
> > arithmetic need not be signed (ala IBM 360), that the arithmetic was
> > congruent (could be rephrased in the same number of instructions),
> > and would allow certain linker tricks.
> > <
> > The compiler people wouldn't even discuss it.
> The arithmetic is not unsigned, but the 9-bit displacements are.
>
> Argument is that this last bit can increase hit-rate from around 88%
> to 97%, whereas a sign bit would have gained only a fraction of a percent.
<
My measured data indicates that about 93% of integer constants are positive.
And that positive has a much wider span than negatives:: such that if one
dedicated 7/8th of the constants to positive and 1/8 to negatives, both
sides would be served better than 1/2 and 1/2.
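
As a rough C sketch of that split (my reading of the idea, not a
documented encoding): a 16-bit field where 7/8 of the codes mean
0..57343 and the top 1/8 mean -8192..-1:

#include <stdint.h>

int32_t decode_imm16_7to1(uint16_t imm16)
{
    return (imm16 < 57344) ? (int32_t)imm16 : (int32_t)imm16 - 65536;
}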
>
> However, from 9 to 10 bits, sign won out (in both cases, it was only by
> a fraction of a percent).
>
> Reaching 100% would still require a significantly larger displacement.
> >>
> >>
> >> Granted, the RISC-V strategy (unscaled displacements) would be more of a
> >> win if the general case use of packed structs or similar was "actually a
> >> thing".
> >>
> > What you are saying is that "If RISC-V hadn't screwed up so many things it
> > would have been a significantly better ISA". And no one could possibly disagree
> > with you.
> Probably.
Only Probably ?!?
>
> Scaled displacements, Register-indexed Load/Store, Constant-loading that
> "doesn't suck", ...
>
> Yet, it is seemingly the most popular open ISA at this point.
>
An Open ISA simply means somebody else can come in and dump a crapload
of new OpCodes where you wanted to put your next generation feature that
had been so carefully worked out to fit exactly right there without adding
any gates to the decoder.
>
> In most other regards, I would put my bet instead on ARMv8, except:
> Not an open ISA;
> ALU condition codes, bleh...
> >>
> >> For ALU immediate values, 9 bits still gets ~ 95%, 12 bits would get ~
> >> 97%. Both beat out Imm5 at roughly 54%, ...
> > <
> > But now you have to route all sorts of different sizes from the instruction
> > to various operand busses, whereas I only have to route {16,32-64}-bits.
> > This takes less decode logic and less multiplexing logic in the <time
> > critical> forwarding "loop".
> After the decode stage, all the pipeline sees is a 33-bit value...
<
On a 64-bit machine ?!?
>
> Granted, during decode, the decoder needs to deal with all of the
> various possible instruction layouts.
>
> So, say:
> Lookup opcode based on the various bits;
<
if( 6 <= inst.major <= 14 ) then OpCode format is from OpCode
else OpCode format is from Major
// but the important thing is that all register specifiers are always in the same
// bit positions
<
> Finds where it is routed to;
if( 9<= inst.major <= 10) MODIF determines routing
if( inst.major == 12 ) MOD determines routing
> Finds the "FormID" which tells which instruction layout was used/...
> Unpack instruction based on FormID rules;
>
> Outer decoder maps the decoded instruction decoder's outputs to the
> pipeline's lanes and register ports.
> > <
> > RISC-V then compounds this problem by adding compression.
> Yeah.
> It also looks simple, but in some ways is kind of a pain.
>
> RVC is "kinda evil" in some ways. Doesn't map easily to "hey, unpack
> this instruction according to this particular layout", as there are a
> number of one-off deviations, ...
<
Just wait until people add their own OpCodes to this compressed space.

BGB

Aug 19, 2023, 11:19:17 PM
On 8/19/2023 4:54 PM, MitchAlsup wrote:
> On Saturday, August 19, 2023 at 3:54:44 PM UTC-5, BGB wrote:
>> On 8/19/2023 2:12 PM, MitchAlsup wrote:
>
>>>> Main advantage of LDSH/SHORI being that it expands easily to 64-bit
>>>> constants, whereas LUI doesn't.
>>>>
>>> LDSH = Load Signed Half Word ??
>>> SHORI = Store Half OR Immediate ??
>> LDSH = Load-via-Shift (the name I originally came up with for BJX1).
>> SHORI = Shift-with-OR (the name Hitachi came up with for SH5).
>>
>> Both basically being the same mechanism:
>> Rn = (Rn<<16)|Imm16u;
> <
> lack of addition is problematic.
>>
>> In both ISA branches, Load/Store (from memory) uses a MOV.x name, such as:
>> MOV.W (R4), R9 //BJX2 notation
>> MOV.W @R4, R9 //SuperH notation
>>
> An inbound memory reference should be spelled LD
> An outbound memory reference should be spelled ST

In some naming conventions (in many traditional RISC's).


Less true of where I started from...

Major influences on the design were:
SuperH, TMS320, MSP430, ...

How do these ISA's name their Load/Store ops?
MOV.x
...

Seemingly, M68K and similar also followed a lot of the same conventions.


So, the design didn't originally "evolve" out of something like RISC-V
or similar, rather, it evolved out of SuperH with influence from TMS320,
but then managed to go in a convergent direction towards RISC-V in some
areas...


>>
>
>>>> Ironically, despite being a microcontroller RISC, the IMM
>>>> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
>>>> prefix.
>>> <
>>> STD 3.141592653589278643,[R3,R7<<3,DISP64]
>>> <
>>> Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
>>> That is, you can store an arbitrary constant anywhere in memory
>>> using any addressing mode at any time with a single instruction.
>> Possible.
>>
>> Pulling similar off in my case would likely require 3 instructions
>> (assuming the RiMOV extension), or 4 (otherwise).
> <
> RISC-V typically uses 3 or 4 instructions:
> AUIPC; LD const; LDHI; ST location

You would need more than this to represent such an address mode.

I would estimate this case would need more like 8 instructions for RISC-V:
AUIPC; LDD; AUIPC; LDD; SLL; ADD; ADD; STD


My case, it is mostly the addressing mode:
MOV Imm64, R16
MOV Disp64, R17
ADD R3, R17, R17
MOV.Q R16, (R17, R7)

If it were a Disp33:
MOV Imm64, R16
LEA.B (R3, Disp33s), R17
MOV.Q R16, (R17, R7)


>>
>> But, this is not a common case...
> <
> I can show subroutines with 10 of these in a row.
> <
> -------------
>>> You are still operating under the assumption that I had to sacrifice
>>> anything.
> <
>> There is less space for opcode bits.
> <
> Not compared to MIPS, Mc 88100, RISC-V.
> And every OpCode Group has space remaining.

OK.

As noted, I had concerns before about the encoding space left over in
RISC-V once the various extensions were considered.


But, yeah, many of my existing encoding blocks are already mostly full.
Apart from F3 and F9, both of which have ~ 24 bits of unassigned space.



Within the F0 block, remaining space is:
F0-7 (partial)
F0-9/A/B (enough here for ~ 96 3R ops)
F0-E/F (Reclaimed), ~ 64 more 3R ops.

Most of the F0-3-(8..F) 2R space remains free:
Around 256 2R ops;
All of the F0-7-(8..F) 2R space remains free:
Another 256 2R ops;

So, could add around 160 3R ops and 512 2R ops, then F0 would be full.


As is, there are around 264 3R ops, and around 272 2R ops.

The number of mnemonics is a little less as some encodings share mnemonics.



The F2 block still has enough space reserved for around 96 more
"Imm10,Rn" ops. Currently, all assigned "Rm,Imm9,Rn" spots are in use.


The F1 block has 1 spot available (out of 32).
This was used for the Disp9 Load/Store ops.


>>
>> Something like x86 SSE or ARM NEON style SIMD would likely be an issue
>> for encoding space, at least in terms of 32-bit ops... Granted, I am
>> guessing you probably also have an "escape hatch" for more opcode space?...
> <
> All of the unused OpCode Groups are reserved for the future. There are 22
> (out of 64) Major OpCodes for future expansion. Given that I consumed
> 21 for 16-bit immediates, I think there is plenty (at least for the rest of
> my lifetime.) Also notice I got Vectorization and SIMD into 2 instructions.


As noted, I would have assumed having enough opcode space to fit
ideally, say, several thousand unique instructions.

But, wanting some Imm9/Disp9 encodings eats into things a fair bit.

>>>>
>>>> Though, the 9-bit cases effectively expand to 10-bit signed in XG2,
>>>> partly because, while 9-bit unsigned won out over 9-bit signed, 10-bit
>>>> signed wins out over 10-bit unsigned (but, it was pretty close here).
>>>>
>>>>
>>>> Ironically, both 9-bit unsigned and 10-bit signed, with a displacement
>>>> scale, manage to slightly beat out the 12-bit signed/unscaled
>>>> displacement style used by RISC-V.
>>>>
>>>> Say, Disp12s can reach +/-2K. Whereas, scaled Disp9u (for QWORD) can
>>>> reach 4K.
>>> <
>>> This is the same argument I used in Mc 88100 arguing that displacement
>>> arithmetic need not be signed (ala IBM 360), that the arithmetic was
>>> congruent (could be rephrased in the same number of instructions),
>>> and would allow certain linker tricks.
>>> <
>>> The compiler people wouldn't even discuss it.
>> The arithmetic is not unsigned, but the 9-bit displacements are.
>>
>> Argument is that this last bit can increase hit-rate from around 88%
>> to 97%, whereas a sign bit would have gained only a fraction of a percent.
> <
> My measured data indicates that about 93% of integer constants are positive.
> And that positive has a much wider span than negatives:: such that if one
> dedicated 7/8th of the constants to positive and 1/8 to negatives, both
> sides would be served better than 1/2 and 1/2.

I was running stats separately between displacements and integer
immediate values.

But, yeah, this is how it ended up.


Apart from ADD (and, implicitly, SUB); most of the ALU ops ended up with
positive-only immediate values. These values remained unsigned in XG2,
but were expanded to 10 bits.

Load/Store displacements ended up becoming signed though.




>>
>> However, from 9 to 10 bits, sign won out (in both cases, it was only by
>> a fraction of a percent).
>>
>> Reaching 100% would still require a significantly larger displacement.
>>>>
>>>>
>>>> Granted, the RISC-V strategy (unscaled displacements) would be more of a
>>>> win if the general case use of packed structs or similar was "actually a
>>>> thing".
>>>>
>>> What you are saying is that "If RISC-V hadn't screwed up so many things it
>>> would have been a significantly better ISA". And no one could possibly disagree
>>> with you.
>> Probably.
> Only Probably ?!?
>>
>> Scaled displacements, Register-indexed Load/Store, Constant-loading that
>> "doesn't suck", ...
>>
>> Yet, it is seemingly the most popular open ISA at this point.
>>
> An Open ISA simply means somebody else can come in and dump a crapload
> of new OpCodes where you wanted to put your next generation feature that
> had been so carefully worked out to fit exactly right there without adding
> any gates to the decoder.

Possibly true.

But also the option to be like "that design sucks, I am not going to
adopt it".

And, no one needs to pay royalties.



Like, say, if someone wanted to use BJX2 in their own project, they are
not under any obligation to pay me royalties, this is "just how it is".

Even if, granted, this does mean one needs a "day job" to pay for ones'
cost of living and similar.


>>
>> In most other regards, I would put my bet instead on ARMv8, except:
>> Not an open ISA;
>> ALU condition codes, bleh...
>>>>
>>>> For ALU immediate values, 9 bits still gets ~ 95%, 12 bits would get ~
>>>> 97%. Both beat out Imm5 at roughly 54%, ...
>>> <
>>> But now you have to route all sorts of different sizes from the instruction
>>> to various operand busses, whereas I only have to route {16,32-64}-bits.
>>> This takes less decode logic and less multiplexing logic in the <time
>>> critical> forwarding "loop".
>> After the decode stage, all the pipeline sees is a 33-bit value...
> <
> On a 64-bit machine ?!?

Yeah. If you want to pass a 64-bit immediate, it eats multiple lanes...
The decoders cooperate to produce a 64-bit value split across two 33-bit
immediate fields (which may then be glued back together at a later stage).

"A 33 bit immediate should be big enough for anyone..."


Basically, the decoder deals with 64-bit values in a similar way to how
the ALU ops deal with 128-bit values, namely by having multiple narrower
units cooperate and give the illusion of a wider unit.
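
As a sketch of the gluing step as I read it (field names invented):
Lane 1 carries the low 32 bits in its Imm33 field, Lane 2 the high 32
bits, and a later stage re-forms the full 64-bit value:

#include <stdint.h>

uint64_t glue_imm64(uint64_t lane1_imm33, uint64_t lane2_imm33)
{
    return ((lane2_imm33 & 0xFFFFFFFFu) << 32) |
            (lane1_imm33 & 0xFFFFFFFFu);
}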


>>
>> Granted, during decode, the decoder needs to deal with all of the
>> various possible instruction layouts.
>>
>> So, say:
>> Lookup opcode based on the various bits;
> <
> if( 6 <= inst.major <= 14 ) then OpCode format is from OpCode
> else OpCode format is from Major
> // but the important thing is that all register specifiers are always in the same
> // bit positions
> <
>> Finds where it is routed to;
> if( 9<= inst.major <= 10) MODIF determines routing
> if( inst.major == 12 ) MOD determines routing

My instruction format wasn't organized based on where the instruction is
routed. In some cases, this routing has changed around based on design
changes within the core (adding or removing units, ...).

Rather, things were more organized by instruction format, so 3R
instructions are near other 3R instructions, most 2R instructions are
consolidated into big blocks, ...


In effect, there is a giant set of nested "casez" blocks for every
instruction in the ISA.

Checking, this chunk of code is roughly 7000 lines of Verilog...


The combination of BJX2 + RISC-V decoders is a little over 6k LUT.

Seemingly, more of the LUTs go into the logic for unpacking stuff based
on "FormID" though, than on the massive 'casez' blocks.


>> Finds the "FormID" which tells which instruction layout was used/...
>> Unpack instruction based on FormID rules;
>>
>> Outer decoder maps the decoded instruction decoder's outputs to the
>> pipeline's lanes and register ports.
>>> <
>>> RISC-V then compounds this problem by adding compression.
>> Yeah.
>> It also looks simple, but in some ways is kind of a pain.
>>
>> RVC is "kinda evil" in some ways. Doesn't map easily to "hey, unpack
>> this instruction according to this particular layout", as there are a
>> number of one-off deviations, ...
> <
> Just wait until people add their own OpCodes to this compressed space.
>

Yeah...

I preferred SuperH in this area.

It was 16-bit ops, but the 16-bit ops were not some ugly dog-chewed mess
(unlike Thumb and RVC).


Though, opcodes were basically squirreled off in random corners all over
the place, which is design pattern that ended up carried over into BJX2.

Scott Lurndal

Aug 20, 2023, 11:19:45 AM
ARM T32 has:

MOVT = Move a 16-bit immediate value to the top halfword of the destination register.

ARM A64 has:

MOVK = Move a 16-bit immediate value to anywhere in the destination register.

Operation
bits(datasize) result;
result = X[d, datasize];
result<pos+15:pos> = imm16;
X[d, datasize] = result;

MitchAlsup

Aug 20, 2023, 11:29:18 AM
MOV is only appropriate when you can combine both the LD and ST in a
single instruction:: MOV @R4,@R6
>
> So, the design didn't originally "evolve" out of something like RISC-V
> or similar, rather, it evolved out of SuperH with influence from TMS320,
> but then managed to go in a convergent direction towards RISC-V in some
> areas...
> >>
> >
> >>>> Ironically, despite being a microcontroller RISC, the IMM
> >>>> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
> >>>> prefix.
> >>> <
> >>> STD 3.141592653589278643,[R3,R7<<3,DISP64]
> >>> <
> >>> Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
> >>> That is, you can store an arbitrary constant anywhere in memory
> >>> using any addressing mode at any time with a single instruction.
> >> Possible.
> >>
> >> Pulling similar off in my case would likely require 3 instructions
> >> (assuming the RiMOV extension), or 4 (otherwise).
> > <
> > RISC-V typically uses 3 or 4 instructions:
> > AUIPC; LD const; LDHI; ST location
> You would need more than this to represent such an address mode.
<
Where you is not me but is most other RISCs.
>
> I would estimate this case would need more like 8 instructions for RISC-V:
> AUIPC; LDD; AUIPC; LDD; SLL; ADD; ADD; STD
<
Which is why RISC-V is mediocre at best.
>
>
> My case, it is mostly the addressing mode:
> MOV Imm64, R16
> MOV Disp64, R17
> ADD R3, R17, R17
> MOV.Q R16, (R17, R7)
<
Still 1 instruction in my ISA
<
STD 3.141592653589278643,[R3,R7<<3,DISP64]
>
> If it were a Disp33:
> MOV Imm64, R16
> LEA.B (R3, Disp33s), R17
> MOV.Q R16, (R17, R7)
<
DISP32 form saves 1 word::
<
STD 3.141592653589278643,[R3,R7<<3,DISP32]

> > All of the unused OpCode Groups are reserved for the future. There are 22
> > (out of 64) Major OpCodes for future expansion. Given that I consumed
> > 21 for 16-bit immediates, I think there is plenty (at least for the rest of
> > my lifetime.) Also notice I got Vectorization and SIMD into 2 instructions.
> As noted, I would have assumed having enough opcode space to fit
<
> ideally, say, several thousand unique instructions.
<
Certainly my ISA has room, but remember I get both vectorization and SIMD
out of exactly 2 instructions--instead of 1300..........
>

> >> After the decode stage, all the pipeline sees is a 33-bit value...
> > <
> > On a 64-bit machine ?!?
> Yeah. If you want to pass a 64-bit immediate, it eats multiple lanes...
<
Then it is not really a 64-bit machine in a similar manner that Mc 68000
was a 16-bit machine that could perform 32-bit calculations.
<
> The decoders cooperate to produce a 64-bit value split across two 33-bit
> immediate fields (which may then be glued back together at a later stage).
>
> "A 33 bit immediate should be big enough for anyone..."
>
Even Floating Point ??
>
> Basically, the decoder deals with 64-bit values in a similar way to how
> the ALU ops deal with 128-bit values, namely by having multiple narrower
> units cooperate and give the illusion of a wider unit.
> >>
> >> Granted, during decode, the decoder needs to deal with all of the
> >> various possible instruction layouts.
> >>
> >> So, say:
> >> Lookup opcode based on the various bits;
> > <
> > if( 6 <= inst.major <= 14 ) then OpCode format is from OpCode
> > else OpCode format is from Major
> > // but the important thing is that all register specifiers are always in the same
> > // bit positions
> > <
> >> Finds where it is routed to;
> > if( 9<= inst.major <= 10) MODIF determines routing
> > if( inst.major == 12 ) MOD determines routing
> My instruction format wasn't organized based on where the instruction is
> routed. In some cases, this routing has changed around based on design
> changes within the core (adding or removing units, ...).
<
No, you misunderstand:: it is not where instructions are routed to that I am
talking about, it is where OPERANDS are routed from.
>
> Rather, things were more organized by instruction format, so 3R
> instructions are near other 3R instructions, most 2R instructions are
> consolidated into big blocks, ...
>
Yes, I have this setup, too, but INS and FMAC sit in the same subGroup.
>
> In effect, there is a giant set of nested "casez" blocks for every
> instruction in the ISA.
<
I do this with tabularized subroutines:: three_operand[opcode](arguments);
where the routing <from> is performed as setup to arguments.
>

BGB

Aug 20, 2023, 3:39:06 PM
I didn't make this convention.
It was more one of those things that ended up "grandfathered in".


I just sort of made the stylistic change from "@Reg" to "(Reg)", but for
the most part the assembler will still accept "@Reg".

BGBCC will also accept auto-increment notation as well, but these are
faked with a multi-op sequence:
MOV.L R4, @-R6
Emitted as:
ADD -4, R6
MOV.L R4, (R6)

Could have faked a "MOV.L @R4, @R5" instruction, but didn't, as the
ancestor ISA didn't have this either.


In my newer (still incomplete) TKUCC effort, the handling for most of
these "fake" instructions was dropped, so the ASM will need to be
written more in terms of what instructions actually exist.

Though, one other difference was that for TKUCC, it was generally
assuming that jumbo prefixes always exist.


>>
>> So, the design didn't originally "evolve" out of something like RISC-V
>> or similar, rather, it evolved out of SuperH with influence from TMS320,
>> but then managed to go in a convergent direction towards RISC-V in some
>> areas...
>>>>
>>>
>>>>>> Ironically, despite being a microcontroller RISC, the IMM
>>>>>> prefix-instruction in MicroBlaze is also functionally similar to a jumbo
>>>>>> prefix.
>>>>> <
>>>>> STD 3.141592653589278643,[R3,R7<<3,DISP64]
>>>>> <
>>>>> Is 1 instruction, issues in 1 cycle, wastes no temporary registers,.......
>>>>> That is, you can store an arbitrary constant anywhere in memory
>>>>> using any addressing mode at any time with a single instruction.
>>>> Possible.
>>>>
>>>> Pulling similar off in my case would likely require 3 instructions
>>>> (assuming the RiMOV extension), or 4 (otherwise).
>>> <
>>> RISC-V typically uses 3 or 4 instructions:
>>> AUIPC; LD const; LDHI; ST location
>> You would need more than this to represent such an address mode.
> <
> Where you is not me but is most other RISCs.

I meant on RISC-V...

That addressing mode kinda "steps in it".


>>
>> I would estimate this case would need more like 8 instructions for RISC-V:
>> AUIPC; LDD; AUIPC; LDD; SLL; ADD; ADD; STD
> <
> Which is why RISC-V is mediocre at best.
>>
>>
>> My case, it is mostly the addressing mode:
>> MOV Imm64, R16
>> MOV Disp64, R17
>> ADD R3, R17, R17
>> MOV.Q R16, (R17, R7)
> <
> Still 1 instruction in my ISA
> <
> STD 3.141592653589278643,[R3,R7<<3,DISP64]
>>
>> If it were a Disp33:
>> MOV Imm64, R16
>> LEA.B (R3, Disp33s), R17
>> MOV.Q R16, (R17, R7)
> <
> DISP32 form saves 1 word::
> <
> STD 3.141592653589278643,[R3,R7<<3,DISP32]
>

Here it saves a constant load, since the largest allowed displacement
encoding is 33 bits. While it could theoretically be encoded, a larger
fixed displacement would not easily be supported with the current
implementation.


If one were to use a simpler addressing mode, this case could drop to 2
instructions.

There is not currently any way to directly store a constant to memory.
Similarly, there are still some other implementation limits at present,
like there is only support for a single immediate/displacement for a
given instruction (at least short of using multiple lanes and some
additional decoder hackery).


With the RiMOV extension, there is, however:
MOV.Q R2, (R3, R7, Disp11u)

But, unlike the normal displacements, this displacement is unscaled and
can't currently be expanded with a jumbo prefix.

A similar encoding was used for instructions like:
DMACS.L R4, R5, R6, R7 //R7=R4*R5+R6
But, this feature is also an optional extension.


>>> All of the unused OpCode Groups are reserved for the future. There are 22
>>> (out of 64) Major OpCodes for future expansion. Given that I consumed
>>> 21 for 16-bit immediates, I think there is plenty (at least for the rest of
>>> my lifetime.) Also notice I got Vectorization and SIMD into 2 instructions.
>> As noted, I would have assumed having enough opcode space to fit
> <
>> ideally, say, several thousand unique instructions.
> <
> Certainly my ISA has room, but remember I get both vectorization and SIMD
> out of exactly 2 instructions--instead of 1300..........


There end up needing to be a lot of special cases even for integer ops, say:
ADD Rm, Ro, Rn
ADD Rm, Imm9u, Rn //zero extended
ADD Rm, Imm9n, Rn //one extended
ADD Imm16u, Rn
ADD Imm16n, Rn

ADDS.L Rm, Ro, Rn //sign-extend result from 32-bits
ADDS.L Rm, Imm9u, Rn //zero extended
ADDS.L Rm, Imm9n, Rn //one extended

ADDU.L Rm, Ro, Rn //zero-extend result from 32-bits
ADDU.L Rm, Imm9u, Rn //zero extended
ADDU.L Rm, Imm9n, Rn //one extended

...

Or, variant semantics:
FADD Rm, Ro, Rn //FPU ADD, Binary64, fixed RNE
FADDG Rm, Ro, Rn //FPU ADD, Binary64, dynamic rounding mode
FADDA Rm, Ro, Rn //FPU ADD, Binary64, fake Binary32 RNE
...

Or:
FADD Rm, Imm5fp, Rn //FPIMM
...


SIMD ops, eg:
PADD.H Rm, Ro, Rn //Packed ADD 4x Binary16
PADD.F Rm, Ro, Rn //Packed ADD 2x Binary32
PADDX.F Xm, Xo, Xn //Packed ADD 4x Binary32

PADD.W Rm, Ro, Rn //Packed ADD 4x Int16
PADD.L Rm, Ro, Rn //Packed ADD 2x Int32
...

Didn't bother with signed and unsigned saturate variants, at least for
32-bit encodings (things like "PADDSS.W"/"PADDUS.W"/... would add a lot
of ops).


And, a lot of format converter ops, ...

PLDCH Rm, Rn //2x Binary16 (Low bits) to 2x Binary32
PLDCHH Rm, Rn //2x Binary16 (High bits) to 2x Binary32
PLDCXH Rm, Xn //4x Binary16 to 4x Binary32

PSTCH Rm, Rn //2x Binary32 to 2x Binary16
...

RGB5UPCK64 Rm, Rn //Unpack RGB555 to 64-bit (16b per component)
RGB5PCK64 Rm, Rn //Pack 64-bit to RGB555
...

Though, these sorts of converter ops have resulted in a fair number of
mnemonics.



But, yeah, as noted, assuming that no more "heavy eaters" are added, the
remaining F3 and F9 blocks have theoretically enough space for 1024 more
3R ops (or 32768 if one wanted to use it all for 2R ops...).


Potentially, relocating BRA/BSR to the F8 block could free up 64 more 3R
spots in the F0 block, but would be a pretty major "breaking change".


And, potentially, one could need some more Imm16 ops, and there was
debate over the possibility of, say, adding "BRGT Rn, Disp12s" ops and
similar (say, because usefulness of the existing Disp8s ops are limited
by the small displacement size; and "Conditional branch that doesn't
stomp SR.T" is potentially useful for combining predication with
modulo-scheduling, ...).

Granted, one could do the latter case by faking:
BRGT R4, .L0
As:
BRLE R4, .L1
BRA .L0
.L1:
In cases where .L0 is outside the 256 byte limit.

But, this is bulkier and less efficient.

The 2-register cases, eg:
BREQ R4, R5, Label
Would not see such an upgrade (if done, it would only be for the
"compare register with 0" cases).

Well, and then there was debate for, if added, whether to put these in
the F8 block or in the reclaimed space in the F0 block.

A lot comes down to the uncertainty of, whether in the future, I might
need any more Imm16 ops than the ones I have already (since, as noted,
this block is already 5/8 full, or 6/8 if including the space reserved
for the Disp12s compare-with-0 branches).


Partly, it is a case that it is not exactly difficult to write loop
bodies which exceed the Disp8s limit.

In which case, the current typical fallback being:
CMPGT 0, R4
BT Label //Encoded as BRA?T
Which is, technically, 2 ops and stomps SR.T, but can branch +/- 1MB.


>>
>
>>>> After the decode stage, all the pipeline sees is a 33-bit value...
>>> <
>>> On a 64-bit machine ?!?
>> Yeah. If you want to pass a 64-bit immediate, it eats multiple lanes...
> <
> Then it is not really a 64-bit machine in a similar manner that Mc 68000
> was a 16-bit machine that could perform 32-bit calculations.
> <

The registers and ops are still 64-bits...

Just the immediate field from the decoders remain 33 bits.


For reasons, the width of the immediate field has a disproportionate
impact on LUT cost (so, it was "better" to have the decoders spit out
33-bit halves and glue them together later, than have each decoder emit
a full-fledged 64-bit immediate).

It sorta works:
No (single) 32-bit instruction can produce more than 33 bits.


For the Jumbo96 encodings, one can special case it, with the Lane1
decoder dealing with the low 32 bits, and the Lane2 decoder with the
remaining 32 bits.

Imm57 and Imm53 cases add ugly (and not cheap) special cases, which is
why I was on the fence about them.


They require the other decoder to know what is going on in Lane 1:
F0 block: Imm53
F1 block: Imm57 (Invalid)
F2 block: Imm57
F3 block: Imm53?
F4..F7: Invalid
F8 block: Imm64
F9 block: Imm53?
FA/FB: Imm48
FC..FF: Invalid
...

Naturally, otherwise, all Lane 2 would see would be a pair of Jumbo
prefixes:
Lane 2 sees the prefixes in Lane 2 and Lane 3 spots;
Lane 1 would see the instruction word and the prefix in Lane 2.

If Lane3 sees a jumbo prefix, or Lane2 a solitary prefix (excluding
Imm48), it does nothing (just sorta behaves as if it were a NOP).



>> The decoders cooperate to produce a 64-bit value split across two 33-bit
>> immediate fields (which may then be glued back together at a later stage).
>>
>> "A 33 bit immediate should be big enough for anyone..."
>>
> Even Floating Point ??

Say:
MOV Imm64, Rn

Can load full Binary64, but is technically a 2-lane operation (that two
32 bit halves are glued together in the pipeline is invisible).

Or:
FLDCH Imm16, Rn //Load immediate as Binary16 to Binary64
Routes the immediate through a format converter.


For the FpImm experiment, had ended up needing to make the decoders
perform a 5-bit to Binary16 conversion, with Binary16 to Binary64
converters shoved into the register-file module (only valid on certain
register ports).

These sorts of cases are handled with "fake" internal registers that
essentially tell the register-file "Hey, there is a Binary16 value in
the Imm33 field, get the value of this having been converted to Binary64".


>>
>> Basically, the decoder deals with 64-bit values in a similar way to how
>> the ALU ops deal with 128-bit values, namely by having multiple narrower
>> units cooperate and give the illusion of a wider unit.
>>>>
>>>> Granted, during decode, the decoder needs to deal with all of the
>>>> various possible instruction layouts.
>>>>
>>>> So, say:
>>>> Lookup opcode based on the various bits;
>>> <
>>> if( 6 <= inst.major <= 14 ) then OpCode format is from OpCode
>>> else OpCode format is from Major
>>> // but the important thing is that all register specifiers are always in the same
>>> // bit positions
>>> <
>>>> Finds where it is routed to;
>>> if( 9<= inst.major <= 10) MODIF determines routing
>>> if( inst.major == 12 ) MOD determines routing
>> My instruction format wasn't organized based on where the instruction is
>> routed. In some cases, this routing has changed around based on design
>> changes within the core (adding or removing units, ...).
> <
> No, you misunderstand:: it is not where instructions are routed to that I am
> talking about, it is where OPERANDS are routed from.

OK.

In my case, how to decode the operands is determined by the FormID,
which is determined based on looking up the opcode bits.


Basically, it looks up a few parameters:
NMID (6b): Major opcode (function unit)
FMID (5b): Major instruction layout
UCMDIX(6b): Minor Opcode / Control Bits
ITY (4b): Layout sub-type (*1)
BTY (3b): (Load/Store ops): Data type for memory access
UCTY (3b): Control for multi-lane/conditional operations, etc.


*1: Ordering of register ports, zero/one extension for immeds, ...
So, say, 3R ops: (Rm,Ro,Rp,Rn)
Rm, Ro, Rn, Rn
Ro, Rm, Rn, Rn
2R ops:
Rn, Rm, Rn, Rn
Rm, Rn, Rn, Rn
ZR, Rm, Rn, Rn
Rm, ZR, Rn, Rn
Cm, ZR, Rn, Rn
Rm, ZR, Cn, Cn
...
So, the ITY field is a "necessary evil" here.


Then another set of blocks does all the unpacking based on the FMID and
similar.

Some preceding logic unpacks all the register fields and possible
immediate values (based on mode and presence/absence of jumbo-prefixes,
...).

So, the FMID logic is basically a big mess of case blocks to plug the
correct unpacked values into the correct output ports (I suspect this
part is where the bulk of the LUTs is going).


Granted, I guess an alternate strategy could have been to specify a
per-port permutation field, say:
Rm/Ro/Rp/Rn outputs: Select from Rm/Ro/Rn/Cm/Cn/IMM/ZR/...
Imm33 output: Select from
0/Imm5u/Imm5n/Imm6u/Imm6n/Imm9u/Imm9n/Imm10u/Imm10n/Imm16u/Imm16n/Imm16s/Imm20s/Imm24u/Imm24n

But, likely this would have ended up needing more LUTs than the FMID+ITY
approach.

Though, having the FMID drive a selector for the Imm33 output vs handle
the Immed bits directly could be worth looking at (could potentially
save some LUTs).


Inner decoder Outputs:
Rm, Ro, Rp, Rn: Registers, each 7 bits
Rm/Ro/Rp: Source Ports (elsewhere Rs/Rt/Rp)
Rn: Destination Port
For most ops, Rp==Rn.
Imm: 33 bits
UCmd: 9 bits (6b major op, 3b control)
UIxt: 9 bits (6b minor op, 3b control)
uFl: 20 bits
Control-flags for decoding multi-lane ops
Secondary Load/Store operation or inline-shuffle value.
(The values for these are stashed in Lane 3).


The outer part of the decoder then packs these outputs into the pipeline
outputs (based on the bundle layout and similar), and also deals with a
lot of the special handling for multi-lane operations.

Generally, in the outer stage, there are, say:
3x BJX2 Op32 decoders
1x BJX2 Op16 decoder
1x (or 2x) RISC-V Op32 decoders.
1x RISC-V Op16/RVC decoder (incomplete, RVC = blarg).


The RISC-V decoder uses a similar 2-stage approach to the BJX2 decoder
(but then annoys me with its dog-chewed immediate fields).

Started on an RVC decoder, but, its encoding is dog chewed and there are
too many "special case" encodings, ... So, I just sort of gave up. My
preference is to avoid the proliferation of "one off" cases.


>>
>> Rather, things were more organized by instruction format, so 3R
>> instructions are near other 3R instructions, most 2R instructions are
>> consolidated into big blocks, ...
>>
> Yes, I have this setup, too, but INS and FMAC sit in the same subGroup.

OK.


Apart from the F8/Imm16 block (which has its own layout), the other
blocks have the same layout, so theoretically nothing would have
prevented putting F1 or F2 style ops in F0, or F0 style ops in F2, ...

But, I had consolidated all the Disp9 LD/ST ops and Imm9/Imm10 ops into
larger blocks for organizational reasons.


Also I found it preferable to have most of the 2R ops consolidated
rather than spread all over the place. I suspect also this sort of
consolidation is likely better for LUT cost in the decoder as well.

Say, the "casez" doesn't need to check any of the opcode bits for 2R ops
if no 2R ops are in the area (Vivado appears to do a fair job in this area).


>>
>> In effect, there is a giant set of nested "casez" blocks for every
>> instruction in the ISA.
> <
> I do this with tabularized subroutines:: three_operand[opcode](arguments);
> where the routing <from> is performed as setup to arguments.

OK.


>>
>

MitchAlsup

Aug 20, 2023, 5:23:35 PM
On Sunday, August 20, 2023 at 2:39:06 PM UTC-5, BGB wrote:
> On 8/20/2023 10:29 AM, MitchAlsup wrote:
>
> > Then it is not really a 64-bit machine in a similar manner that Mc 68000
> > was a 16-bit machine that could perform 32-bit calculations.
> > <
> The registers and ops are still 64-bits...
>
> Just the immediate field from the decoders remain 33 bits.
>
So, in order to use a 64-bit constant you consume 2/3rds of your execution lanes ?!?
>

> >>
> > Even Floating Point ??
> Say:
> MOV Imm64, Rn
>
> Can load full Binary64, but is technically a 2-lane operation (that two
> 32 bit halves are glued together in the pipeline is invisible).
<
Right, so using a 64-bit constant eats 2/3rds of your execution width.
>
> Or:
> FLDCH Imm16, Rn //Load immediate as Binary16 to Binary64
> Routes the immediate through a format converter.
>
I do similar with 5-bit immediates in FP.
<
FDIV R9,#5,R16 // R9 = 5.0D0 / R16

BGB

Aug 20, 2023, 5:29:05 PM
Paeth filter (from memory) is something like:
  P  = A + B - C
  dA = abs(P - A)
  dB = abs(P - B)
  dC = abs(P - C)
  if(dA < dB)
  {
    if(dA < dC)
      { D = A; }
    else if(dB < dC)
      { D = B; }
    else
      { D = C; }
  }else
  {
    if(dB < dC)
      { D = B; }
    else
      { D = C; }
  }

But, it then gets evaluated for nearly every component of nearly every
pixel in an image (because it tends to do better than the other
filters), and it is also the slowest (though, there are "faster" ways to
do it by turning it into a mess of subtraction, shifts, and bitwise
operators).
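
For instance, a sketch of one such branch-free rewrite (mine, not
necessarily the fastest form; note it uses <= tie-breaking as in the
PNG spec, unlike the < chain above):

#include <stdlib.h>

unsigned char paeth_branchless(int a, int b, int c)
{
    int p  = a + b - c;
    int da = abs(p - a), db = abs(p - b), dc = abs(p - c);
    int ma = -((da <= db) & (da <= dc)); //all-ones mask if A wins
    int mb = -(db <= dc) & ~ma;          //all-ones if B wins and A lost
    return (unsigned char)((a & ma) | (b & mb) | (c & ~(ma | mb)));
}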


Range-coding would be something like (also from memory):
#include <stdint.h>  //for uint16_t/uint32_t

typedef unsigned char byte;

//weight adjustment tables (S-curve shaped, see note further down)
extern byte w_inctab[256];
extern byte w_dectab[256];

struct RangeCtx_s {
  byte wvals[65536];  //probability weights
  uint16_t wctx;      //context of previous bits
  uint32_t rhi;       //high value of range
  uint32_t rlo;       //low value of range
  uint32_t rmid;      //midpoint (encoded range)
  byte *cs;           //encoded bitstream
};
typedef struct RangeCtx_s RangeCtx;

int DecodeBit(RangeCtx *ctx)
{
  uint32_t m, r;
  byte w, b;

  w = ctx->wvals[ctx->wctx];  //fetch weight
  r = ctx->rhi - ctx->rlo;    //size of range
  m = ctx->rlo + ((r>>8)*w);  //calc midpoint based on weight

  if(ctx->rmid >= m)
  {
    b = 1;           //if >= midpoint, we have a 1 bit
    ctx->rlo = m;    //cut off low part of range
    w = w_inctab[w]; //adjust weight for 1 bit
  }else
  {
    b = 0;           //if < midpoint, we have a 0 bit
    ctx->rhi = m;    //cut off high part of range
    w = w_dectab[w]; //adjust weight for 0 bit
  }
  ctx->wvals[ctx->wctx] = w;    //update probability weight
  ctx->wctx = (ctx->wctx<<1)|b; //update context

  //check and renormalize as range converges
  if(!((ctx->rhi ^ ctx->rlo)>>24))
  {
    ctx->rhi = (ctx->rhi<<8)|0xFF;
    ctx->rlo = (ctx->rlo<<8)|0x00;
    ctx->rmid = (ctx->rmid<<8)|(*ctx->cs++);
  }
  return(b);
}

int DecodeByte(RangeCtx *ctx)
{
  int b;
  b = DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  b = (b<<1)|DecodeBit(ctx);
  return(b);
}


Where, say, this sort of thing can get an entropy coder which compresses
stuff reasonably well, but is painfully slow (meanwhile, a Huffman style
entropy coder is significantly faster).

One can argue, "why not avoid encoding 1 bit at a time, and instead
encode a whole symbol?"; this works, but ironically tends to be slower
than encoding/decoding things 1 bit at a time...


There is a lot of variation on this sort of idea.

...


>>
>> But, a lot of people (including compiler writers) seem inclined to
>> ignore these cases.
> <
> Often disguised as a series of shifts (a << const1)>>const2 because
> the underlying language does not express variable length bit-fields
> efficiently.

For bitfields and similar...
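
Say, as a hedged sketch (my illustration) of the shift-pair idiom:

#include <stdint.h>

//extract bits [lsb, lsb+width) of x, sign-extended: shift left to put
//the field's top bit at bit 63, then arithmetic-shift back down.
int64_t extract_signed(uint64_t x, int lsb, int width)
{
    return (int64_t)(x << (64 - lsb - width)) >> (64 - width);
}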


Less sure about Paeth or bitwise range coders, which are more hurt by
needing to deal with all the branching and similar.

Granted, there are faster alternatives in both cases.


>>
>> But, then CPU designers are like "well, we will interpret a short
>> forward branch as predicating the next N instructions rather than doing
>> a branch", ...
> <
> AND WHY NOT ??

I guess, probably true enough...


>>
>> ...

BGB

Aug 20, 2023, 6:27:02 PM
On 8/20/2023 4:23 PM, MitchAlsup wrote:
> On Sunday, August 20, 2023 at 2:39:06 PM UTC-5, BGB wrote:
>> On 8/20/2023 10:29 AM, MitchAlsup wrote:
>>
>>> Then it is not really a 64-bit machine in a similar manner that Mc 68000
>>> was a 16-bit machine that could perform 32-bit calculations.
>>> <
>> The registers and ops are still 64-bits...
>>
>> Just the immediate field from the decoders remain 33 bits.
>>
> So, in order to use a 64-bit constant you consume 2/3rds of your execution lanes ?!?

Actually, encoding an instruction with a 64-bit constant eats *all* of
the lanes...

How much space does it take to encode a 64-bit constant?
96 bits.
How wide is the fetch?
96 bits.
How many more ops *could* I have bundled here?
0.


The 32-bit encodings can be bundled, but none of them is capable of
producing a full 64-bit value in the first place.

I could almost have gotten away with a 25-bit field here...
The largest 32-bit encodings only encode a 25-bit immed;
All larger values could have been multi-lane.

But, for other reasons, 33-bit made more sense here than 25 bit.

Though, FWIW, in early versions the 64-bit constant load did split the
value like: 24 bits in lanes 2 and 3; 16 bits in Lane 1.


If I supported hybrid bundles, say:
FE-Op32 | Op32
Or similar.

One could potentially have jumbo encodings in a bundle, but the largest
cases would have still been 33 bits.

Only real way for this to become a limiting factor would be to support
larger bundles.


>>
>
>>>>
>>> Even Floating Point ??
>> Say:
>> MOV Imm64, Rn
>>
>> Can load full Binary64, but is technically a 2-lane operation (that two
>> 32 bit halves are glued together in the pipeline is invisible).
> <
> Right, so using a 64-bit constant eats 2/3rds of your execution width.

Yeah.


>>
>> Or:
>> FLDCH Imm16, Rn //Load immediate as Binary16 to Binary64
>> Routes the immediate through a format converter.
>>
> I do similar with 5-bit immediates in FP.
> <
> FDIV R9,#5,R16 // R9 = 5.0D0 / R16

I had interpreted the 5-bit values as E3.F2, had tried various schemes,
but E3.F2 ended up with the best overall hit-rate among the
possibilities tested.

Hit rate still isn't particularly high though.


Meanwhile, it turns out Binary16 can exactly represent a majority of the
floating point constants which appear in code, so the operation to
express a Binary16 value directly has a fairly good hit rate.
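
A rough C check of that property (a sketch, mine; ignores subnormals,
Inf, and NaN for brevity):

#include <math.h>
#include <stdbool.h>

bool fits_binary16(double x)
{
    if (x == 0.0) return true;
    int e;
    double m = frexp(fabs(x), &e);       //x = m * 2^e, 0.5 <= m < 1
    if (e < -13 || e > 16) return false; //binary16 normal exponent range
    double scaled = ldexp(m, 11);        //hidden bit + 10 fraction bits
    return scaled == floor(scaled);      //mantissa fits in 11 bits
}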


>>
>> For the FpImm experiment, had ended up needing to make the decoders
>> perform a 5-bit to Binary16 conversion, with Binary16 to Binary64
>> converters shoved into the register-file module (only valid on certain
>> register ports).
>>
>> These sorts of cases are handled with "fake" internal registers that
>> essentially tell the register-file "Hey, there is a Binary16 value in
>> the Imm33 field, get the value of this having been converted to Binary64".
>


Meanwhile, checking some other stats:
Only around 5% of function-local branches are within +/- 256 bytes.
But, the vast majority (96%) are within +/- 4K.

This implies that a 12-bit branch displacement would be a fair bit more
useful than an 8 bit displacement.


Meanwhile, looking at my compiler, it had somehow slipped my mind that I
also already have "BRcc Rn, Disp33s" encodings via jumbo prefixes, which
end up being the main form used if this feature is enabled in my
compiler (but... I had forgotten it seems...).

So, it is more a tradeoff between burning encoding space, vs needing a
64-bit encoding for these.



MitchAlsup

Aug 20, 2023, 6:57:16 PM
On Sunday, August 20, 2023 at 5:27:02 PM UTC-5, BGB wrote:
> On 8/20/2023 4:23 PM, MitchAlsup wrote:
> > On Sunday, August 20, 2023 at 2:39:06 PM UTC-5, BGB wrote:
> >> On 8/20/2023 10:29 AM, MitchAlsup wrote:
> >>
> >>> Then it is not really a 64-bit machine in a similar manner that Mc 68000
> >>> was a 16-bit machine that could perform 32-bit calculations.
> >>> <
> >> The registers and ops are still 64-bits...
> >>
> >> Just the immediate field from the decoders remain 33 bits.
> >>
> > So, in order to use a 64-bit constant you consume 2/3rds of your execution lanes ?!?
> Actually, encoding an instruction with a 64-bit constant eats *all* of
> the lanes...
>
> How much space does it take to encode a 64-bit constant?
> 96 bits.
> How wide is the fetch?
> 96 bits.
> How many more ops *could* I have bundled here?
> 0.
>
OK, I see the disconnect. I am fetching 128-bits wide on a 1-wide machine
so that I can use excess I$ bandwidth to do other things (including power
savings), while you are fetching only as wide as you can issue. Secondarily
I am designing a scalable ISA, whereas you are designing an ISA targeting a
particular data path design.
>
> The 32-bit encodings can be bundled, but none of them is capable of
> producing a full 64-bit value in the first place.
<
value = Operand or value = result ?
>
> I could almost have gotten away with a 25-bit field here...
> The largest 32-bit encodings only encode a 25-bit immed;
> All larger values could have been multi-lane.
<
And you are naming lanes of decode not lanes of execution. Gotcha.
>
>

> > I do similar with 5-bit immediates in FP.
> > <
> > FDIV R9,#5,R16 // R9 = 5.0D0 / R16
> I had interpreted the 5-bit values as E3.F2, had tried various schemes,
> but E3.F2 ended up with the best overall hit-rate among the
> possibilities tested.
>
> Hit rate still isn't particularly high though.
>
This would have caused problems in assembly and disassembly. So,
after looking at the data, we chose that the expansions from int->fp
would be just like (double)int_constant. Sure, it limited use, but there are
a lot of 1,2,5,10s in FP codes, and while we missed things like 0.5,...
what we did was a pure win, as we still have float->double conversions
in the "routing".
>
> Meanwhile, it turns out Binary16 can exactly represent a majority of the
> floating point constants which appear in code, so the operation to
> express a Binary16 value directly has a fairly good hit rate.
<
I would have done something like this, but I don't have the ability to
spontaneously poof a 16-bit immediate onto a FP instruction.
<
On the other hand, having universal constants means I save crap_loads
of instructions delivering constants as FP Operands.
<
>
> >
> Meanwhile, checking some other stats:
> Only around 5% of function-local branches are within +/- 256 bytes.
> But, the vast majority (96%) are within +/- 4K.
<
An even larger number are within ¼Mb. In fact, I don't think Brian's compiler
has run into a subroutine large enough to need a backup plan in this area.
>
> This implies that a 12-bit branch displacement would be a fair bit more
> useful than an 8 bit displacement.
>
My argument is that 16-bits is even more useful than 12. Although Thomas'
work in binutils is now compressing halfword table jumps (switch) into
byte jumps when all the labels are within range--making switch tables much
more compact.
>
> Meanwhile, looking at my compiler, it had somehow slipped my mind that I
> also already have "BRcc Rn, Disp33s" encodings via jumbo prefixes, which
> end up being the main form used if this feature is enabled in my
> compiler (but... I had forgotten it seems...).
<
We all do that now and again.....
>
> So, it is more a tradeoff between burning encoding space, vs needing a
> 64-bit encoding for these.
<
I don't see it as an encoding space issue, I see it as a variable length constant
routing problem from instruction buffer to function unit as part of "forwarding".
So, the majority of instructions (able to be encoded) have a routing OpCode
in addition to a Calculation OpCode. Instructions with 16-bit immediates have
a canned routing OpCode.
<
You can consider the routing OpCode as treating "forwarding" as another
calculation performed prior to execution. {Not dissimilar to how DG NOVA
had shifts with integer arithmetic}

MitchAlsup

Aug 20, 2023, 7:03:38 PM
On Sunday, August 20, 2023 at 4:29:05 PM UTC-5, BGB wrote:
> On 8/19/2023 2:24 PM, MitchAlsup wrote:
>
> Range-coding would be something like (also from memory):
> struct RangeCtx_s {
> byte wvals[65536]; //probability weights
> uint16_t wctx; //context of previous bits
> uint32_t rhi; //high value of range
> uint32_t rlo; //low value of range
> uint32_t rmid; //midpoint (encoded range)
> byte *cs; //encoded bitstream
> };
>
> int DecodeBit(RangeCtx *ctx)
> {
> uint32_t m, r;
> byte w, b;
// type mismatch between return(b); below and typeof(b)
intriguing.

BGB

Aug 21, 2023, 3:22:53 AM
I can't say for certain I remembered it correctly (getting bitwise range
coders to encode and decode correctly is rather fiddly).

But, yeah, a similar sort of algorithm was used in LZMA.


Several video codecs also use a hybrid of range-coding and Huffman
coding, where feeding a Huffman coded stream through a range-coder can
get some compression improvements with less of a performance impact.


General idea is that weights are all initialized at a neutral value
(say, 0x80), and then adjusted based on each bit, such that more common
bits shrink the range more slowly, while less common ones shrink it more
rapidly (and, as the high and low values converge, the encoder pushes
out the converged bits, and the decoder reads in more bits, causing the
range to expand again).

Both the encoder and decoder operate as mirrors of each other.

I had missed handling a case which can pop up sometimes, where the high
and low values fail to converge to the same value and the range
collapses to the point that bits can no longer be unambiguously encoded.
Usually in this case, the entire range needs to be emitted or read-in
before the process can continue.


Note that the inctab/dectab would not be +1 or -1, but usually more of
an S-curve shape. Weight will move more quickly near the center, and
more slowly near the extremes. Tables would be set up to keep the
weights in a range of say, 0x10<=w<=0xF0 or similar, ...
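
As a hedged sketch of how such tables might be built (my own guess at
the curve, not the actual tables; this would fill the w_inctab/w_dectab
arrays used by the earlier DecodeBit snippet): step size is largest near
the neutral weight 0x80 and shrinks toward the extremes, with results
clamped into 0x10..0xF0:

#include <stdlib.h>

void init_weight_tables(unsigned char *inctab, unsigned char *dectab)
{
    int w, step, wi, wd;
    for(w = 0; w < 256; w++)
    {
        step = 1 + ((128 - abs(w - 128)) >> 4); //bigger near the center
        wi = w + step; if(wi > 0xF0) wi = 0xF0; //clamp high side
        wd = w - step; if(wd < 0x10) wd = 0x10; //clamp low side
        inctab[w] = (unsigned char)wi;
        dectab[w] = (unsigned char)wd;
    }
}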

...

BGB

Aug 21, 2023, 12:04:42 PM
On 8/19/2023 2:17 PM, MitchAlsup wrote:
> On Saturday, August 19, 2023 at 12:21:10 PM UTC-5, BGB wrote:
>> On 8/19/2023 11:31 AM, MitchAlsup wrote:
>
>>>> The 16-bit ops would mostly hold a collection of 2R ops.
>>>>
>>>> The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
>>>> iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
>>>> iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
>>>> tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
>>>> tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
>>> <
>>> I think you have sacrificed too much entropy to this particular encoding.
>>> Consider a 32-bit RISC LD/ST instruction can have a 16-bit displacement
>>> So a 24-bit one should be able to have an 8-bit displacement.
>>> <
>> Then for this encoding block, you would have *nothing* apart from LD/ST
>> ops...
> <
> 2 flavors
> a) MEM Rd,[Rb,DISP16]
> b) MEM Rd,[Rb,Ri<<s] // which have optional displacements {32,64}

There are reasons to have other types of ops as well, say, 3R ALU.

>>
>> One could note that Disp5u still typically hits roughly 50% of the time
>> in my stats. This is probably enough for the encoding to still be "useful".
> <
> Whereas, My encoding gives that "flavor" 16-bits which as you stated is good
> to the 99% level. 99% > 50% to the point the compiler does not need the
> intermediate pattern recognition cases.
>

But, Disp16 would not be viable with a 24-bit instruction format.

Disp8 would still leave "only" LD/ST ops in this case.

Where, say, LD/ST also needs 3 bits to encode the type of value to be
loaded/stored.

In my conventions:
000=SB (8-bit)
001=SW (16-bit)
010=SL (32-bit)
011=SQ (64-bit)
100=UB
101=UW
110=UL
111=X (128-bit)

Unsigned is usually N/A for store, so I had often interpreted the
unsigned store cases as LEA, though this creates a conflict with X.


Granted, the idea of an ISA with prime-number-of-bytes sized
instructions is likely DOA anyways, so...


MitchAlsup

unread,
Aug 21, 2023, 2:09:11 PM8/21/23
to
On Monday, August 21, 2023 at 11:04:42 AM UTC-5, BGB wrote:
> On 8/19/2023 2:17 PM, MitchAlsup wrote:
> > On Saturday, August 19, 2023 at 12:21:10 PM UTC-5, BGB wrote:
> >> On 8/19/2023 11:31 AM, MitchAlsup wrote:
> >
> >>>> The 16-bit ops would mostly hold a collection of 2R ops.
> >>>>
> >>>> The 24-bit ops hold a selection of Ld/St and 3R ALU ops.
> >>>> iiii-isss ss0n-nnnn zzz0-0001 //LD (Rs, Disp5)
> >>>> iiii-isss ss1n-nnnn zzz0-0001 //ST (Rs, Disp5)
> >>>> tttt-tsss ss0n-nnnn zzz1-0001 //LD (Rs, Rt)
> >>>> tttt-tsss ss1n-nnnn zzz1-0001 //ST (Rs, Rt)
> >>> <
> >>> I think you have sacrificed too much entropy to this particular encoding.
> >>> Consider a 32-bit RISC LD/ST instruction can have a 16-bit displacement
> >>> So a 24-bit one should be able to have an 8-bit displacement.
> >>> <
> >> Then for this encoding block, you would have *nothing* apart from LD/ST
> >> ops...
> > <
> > 2 flavors
> > a) MEM Rd,[Rb,DISP16]
> > b) MEM Rd,[Rb,Ri<<s] // which have optional displacements {32,64}
<
> There are reasons to have other types of ops as well, say, 3R ALU.
<
In the 1980s I used the nR notation and after a while I found it confused
the readers. So I switched to the nO notation so 3R (Rd=Rs1 OP Rs2) is
now 2O (2 Operands) and since almost every instruction delivers a
result the destination field can be omitted from the notation. Readers
of my literature have found this notation less confusing. Does 3R
means 3 register operands and 1 result Rd = FMAC(Rs1,Rs2,Rs3)
or 3-1 register operands and +1 register result Rd = OP(Rs1,Rs2) ??
<
What notation would you use if an instruction delivered 2 results ??
<
> >>
> >> One could note that Disp5u still typically hits roughly 50% of the time
> >> in my stats. This is probably enough for the encoding to still be "useful".
> > <
> > Whereas, My encoding gives that "flavor" 16-bits which as you stated is good
> > to the 99% level. 99% > 50% to the point the compiler does not need the
> > intermediate pattern recognition cases.
> >
> But, Disp16 would not be viable with a 24-bit instruction format.
<
One of the reasons 24-bits was never considered.
>
> Disp8 would still leave "only" LD/ST ops in this case.
>
> Where, say, LD/ST also needs 3 bits to encode the type of value to be
> loaded/stored.
<
LD needs 3 bits, ST only needs 2. Actually LD only needs 2.8 bits
(7 kinds of load: log2(7) ≈ 2.8) since we don't need both signed and
unsigned 64-bit items. Stores do not need signed and unsigned, just
an indication of how much to store.

Hogege NaN

unread,
Aug 21, 2023, 3:14:48 PM8/21/23
to
On Saturday, August 5, 2023 at 1:48:38 AM UTC+9, Amine Moulay Ramdane wrote:
> Hello,
> More of my philosophy about CISC and RISC instructions..
> So we can generally consider CISC (Complex Instruction Set Computer)
> instructions of x86 architecture to be higher-level programming instructions compared to RISC (Reduced Instruction Set Computer) instructions due to their complexity.
>
> CISC instructions are designed to perform more complex operations in a single instruction. This complexity allows higher-level programming languages and compilers to generate fewer instructions to accomplish certain tasks. CISC architectures often have a broader range of instructions, some of which might even directly correspond to operations in high-level programming languages.
> In contrast, RISC instructions are designed to be simpler and more streamlined, typically performing basic operations that can be executed in a single clock cycle. It might require more instructions to accomplish the same high-level task that a CISC instruction could handle in a single operation.

CISC vs. RISC was a discussion of the 80s; you can find the debate between the RISC-I designer (now Esperanto Tech's CEO) and the VAX designers in the IEEE Transactions.

I think that it is no longer the better point of view, because:
1) Hardware design tools have advanced:
- they support the design of complex architectures and logic circuits with reasonable performance,
- they support analysis of complex designs.

2) Semiconductor process nodes have advanced:
- they provide enough transistors to realize complex designs,
- they offer lower fabrication cost on old processes (28nm or less has the best cost/performance).

3) Needs have changed:
- application/domain-specific architectures have risen,
- every domain requires different specifications.

4) Logic circuit design philosophy is now separate from architecture design philosophy:
- logic circuits should be simpler (mainly for verification cost and integration density); this has always been true (and not only of computers),
- but the architecture (or mechanisms for processing) can be complex (per the first two reasons).

-
S. Takano

ps.
tired of social network services, so back to here :)

> Thank you,
> Amine Moulay Ramdane.

BGB

unread,
Aug 21, 2023, 3:26:58 PM8/21/23
to
OK, I used 3R, for "Three Register"; always implicitly 2-source and 1
destination.

A few instructions in BJX2 are 4R (3-source, 1 destination).

In these cases, the internal Rp and Rn ports are separate, but in most
cases Rp and Rn are assumed to be equivalent.

These generally involve an Op64 encoding, with a field that is usually
one of either:
A 4th register field;
An extra load/store displacement (RiMOV);
A rounding mode (some FPU and SIMD ops);
More opcode bits (some other ops).
Mostly depends on which "primary opcode" this prefix is used with.

At present, no instructions deliver 2 (independent) results, but it is
theoretically possible that this could be done with multi-lane ops.


>>>>
>>>> One could note that Disp5u still typically hits roughly 50% of the time
>>>> in my stats. This is probably enough for the encoding to still be "useful".
>>> <
>>> Whereas, My encoding gives that "flavor" 16-bits which as you stated is good
>>> to the 99% level. 99% > 50% to the point the compiler does not need the
>>> intermediate pattern recognition cases.
>>>
>> But, Disp16 would not be viable with a 24-bit instruction format.
> <
> One of the reasons 24-bits was never considered.


I did briefly experiment with it in BJX2 (as a possible code-size saving
feature for microcontroller like profiles), but quickly dropped the idea
due to it being fairly quickly revealed to be "a dog turd":
Code size savings fell short of expectations;
Byte alignment within the instruction stream added a whole new mess of
issues (in an ISA not otherwise designed to deal with free-form byte
alignment in the instruction stream);
...

So, it has basically been entirely dropped from the ISA.


The encoding space had later been reused to glue XGPR support onto the
BJX2 Baseline encoding.

This space entirely goes away in XG2 Mode though, which is part of what
eliminated the 40x2 encoding (which was built on top of an "invalid edge
case" of the XGPR encoding). Could maybe revive the idea at some point
under a slightly different encoding.


Then again, when 48-bit ops got knocked out of the ISA due to an
encoding change, they still haven't been revived. But, the original
form, even if it were revived, would not have fit what the ISA has
become. I would effectively need "something new", and what this would be
exactly has not taken shape (given the relative rareness of Op64 ops,
they wouldn't save much space; and their original role has effectively
been entirely subsumed by Jumbo/Op64 encodings... Given their relative
infrequency, the Op64 encodings being 33% bigger likely doesn't matter
all that much to the overall size of the binary).

Well, and also they would only be encodable in Baseline mode (the
relevant encoding space does not exist in XG2).

...



Partial issue:
Yeah, Disp5u is "not really sufficient".

This is part of why, admittedly, the original form of BJX2 had used
Disp5u Load/Store encodings, but I (fairly early) added Disp9u encodings
because of Disp5u being mostly insufficient.

For a while, had dropped the Disp5 encodings, but ended up reviving them
again because there kept being a non-zero number of edge cases where I
still ended up needing them to exist (even if the "general case" is
dominated by Disp9u and similar).

It is like:
"Why does 32-bit 'ADD Rm, Rn' exist if 'ADD Rm, Ro, Rn' also exists?";
(Pause) "Reasons..."


>>
>> Disp8 would still leave "only" LD/ST ops in this case.
>>
>> Where, say, LD/ST also needs 3 bits to encode the type of value to be
>> loaded/stored.
> <
> LD needs 3 bits, ST only needs 2. Actually LD only needs 2.8 bits
> since we don't need both signed and unsigned 64-bit items. Stores
> do not need signed and unsigned, just an indication of how-much
> to store.
>

That is why I usually ended up putting LEA's there...

A LEA operation is nice to have, but is an issue if one has an 'X'
(paired) case.

Though, depending on the ISA rules, one could skip a byte LEA and
instead encode this case as an ADD.

STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX

...


MitchAlsup

unread,
Aug 21, 2023, 6:03:24 PM8/21/23
to
On Monday, August 21, 2023 at 2:26:58 PM UTC-5, BGB wrote:
> On 8/21/2023 1:09 PM, MitchAlsup wrote:
> >
>
> It is like:
> "Why does 32-bit 'ADD Rm, Rn' exist if 'ADD Rm, Ro, Rn' also exists?";
> (Pause) "Reasons..."
<
Some things "fall out for free", such as ADD Rd,Rs1,#0 as a MOV inst
{along with similar arithmetic identities} and disallowing these costs
gates and design time for no gain.
<
> >>
> >> Disp8 would still leave "only" LD/ST ops in this case.
> >>
> >> Where, say, LD/ST also needs 3 bits to encode the type of value to be
> >> loaded/stored.
> > <
> > LD needs 3 bits, ST only needs 2. Actually LD only needs 2.8 bits
> > since we don't need both signed and unsigned 64-bit items. Stores
> > do not need signed and unsigned, just an indication of how-much
> > to store.
> >
> That is why I usually ended up putting LEA's there...
>
> A LEA operation is nice to have, but is an issue if one has an 'X'
> (paired) case.
>
> Though, depending on the ISA rules, one could skip a byte LEA and
> instead encode this case as an ADD.
>
> STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX
<
For the MEM Rd,[Rb,Disp16] case I use the signed LDD as the EXIT
instruction, and the similar place in STD as the ENTER instruction.
There is no need for LEA, here, as it is redundant with ADD.
<
For the MEM Rd,[Rb,Ri<<sc] case I use the signed LDD as LEA,
because ADD with 2 operands will be seen to be less costly in
emulating LEA, so LEA is basically reserved for 3-Operand ADDs.
>
> ...

BGB

unread,
Aug 21, 2023, 6:21:01 PM8/21/23
to
On 8/20/2023 5:57 PM, MitchAlsup wrote:
> On Sunday, August 20, 2023 at 5:27:02 PM UTC-5, BGB wrote:
>> On 8/20/2023 4:23 PM, MitchAlsup wrote:
>>> On Sunday, August 20, 2023 at 2:39:06 PM UTC-5, BGB wrote:
>>>> On 8/20/2023 10:29 AM, MitchAlsup wrote:
>>>>
>>>>> Then it is not really a 64-bit machine in a similar manner that Mc 68000
>>>>> was a 16-bit machine that could perform 32-bit calculations.
>>>>> <
>>>> The registers and ops are still 64-bits...
>>>>
>>>> Just the immediate field from the decoders remain 33 bits.
>>>>
>>> So, in order to use a 64-bit constant you consume 2/3rds of your execution lanes ?!?
>> Actually, encoding an instruction with a 64-bit constant eats *all* of
>> the lanes...
>>
>> How much space does it take to encode a 64-bit constant?
>> 96 bits.
>> How wide is the fetch?
>> 96 bits.
>> How many more ops *could* I have bundled here?
>> 0.
>>
> OK, I see the disconnect. I am fetching 128-bits wide on a 1-wide machine
> so that I can use excess I$ bandwidth to do other things (including power
> savings), while you are fetching only as wide as you can issue. Secondarily
> I am designing a scalable ISA where you are designing an ISA targeting a
> particular data path design.

Yeah, something to this effect.

A hypothetical "future machine" could do things well beyond what my
current implementation could do.


But, in the current implementation, getting 2 or 3 wide is only possible
if using 32-bit ops.

It is basically sort of like:
Fetch 96 bits;
Shove it through three 32-bit decoders;
Pick the outputs corresponding to the current bundle format.

The 16-bit decoder, RISC-V decoders, etc, also see the same input
bundle. But, if the bundle or mode doesn't match, these decoder outputs
are ignored (and any unused lanes get filled with NOPs).


There is possible wonk in my case that lane numbering is in reverse order.
Op1
Op2 | Op1
Op3 | Op2 | Op1

But, I have my reasons (my initial conclusion was that reverse-ordering
the lanes was "less bad" than forward ordering would have been, even if
forward ordering could have been cheaper).


>>
>> The 32-bit encodings can be bundled, but none of them is capable of
>> producing a full 64-bit value in the first place.
> <
> value = Operand or value = result ?
>>
>> I could almost have gotten away with a 25-bit field here...
>> The largest 32-bit encodings only encode a 25-bit immed;
>> All larger values could have been multi-lane.
> <
> And you are naming lanes of decode not lanes of execution. Gotcha.

The lanes for decode and execute are equivalent in this case.

A fancier core could have them separate, but at present, they are
equivalent.


>>
>>
>
>>> I do similar with 5-bit immediates in FP.
>>> <
>>> FDIV R9,#5,R16 // R9 = 5.0D0 / R16
>> I had interpreted the 5-bit values as E3.F2, had tried various schemes,
>> but E3.F2 ended up with the best overall hit-rate among the
>> possibilities tested.
>>
>> Hit rate still isn't particularly high though.
>>
> This would have caused problems in assembly and disassembly. So,
> after looking at the data we chose that the expansions from int->fp
> were just like (double)int_constant. Sure it limited use, but there are
> a lot of 1,2,5,10s in FP codes and while we missed things like 0.5,...
> what we did was a pure win as we still have float->double conversions
> in the "routing".

Floating point constants in ASM are represented as raw binary numbers in
my case...

I guess, assembler support for expressing floating point numbers in
decimal notation could have been possible, didn't think of or consider
it though...

But, say, maybe could be possible to, instead of writing, say:
MOV 0x3FF0000000000000, R4
FLDCH 0x3C00, R5
One could write:
MOV 1.0D, R4 //Binary64
FLDCH 1.0H, R5 //Binary16


Where, say, the ASM parser behaves as-if a hexadecimal version of the
constant had been used.



>>
>> Meanwhile, it turns out Binary16 can exactly represent a majority of the
>> floating point constants which appear in code, so the operation to
>> express a Binary16 value directly has a fairly good hit rate.
> <
> I would have done something like this, but I don't have the ability to
> spontaneously poof a 16-bit immediate onto a FP instruction.
> <
> On the other hand, having universal constants means I save crap_loads
> of instructions delivering constants as FP Operands.
> <

In my case, originally it is a "mostly normal" converter op, just with
the input routed from an immediate rather than a register.

>>
>>>
>> Meanwhile, checking some other stats:
>> Only around 5% of function-local branches are within +/- 256 bytes.
>> But, the vast majority (96%) are within +/- 4K.
> <
> An even larger number are within ¼Mb. In fact, I don't think Brian's compiler
> has run into a subroutine large enough to need a backup plan in this area.

With 1MB, it reaches 100% of all branches in most of my current test
programs (excluding some combinations of options for ROTT which can
exceed the 1MB limit).


12-bits is 96% of local (intra function) branches, but only 19% of
global branches (a mixture of function calls, and the backwards branches
for prolog/epilog compression).


>>
>> This implies that a 12-bit branch displacement would be a fair bit more
>> useful than an 8 bit displacement.
>>
> My argument is that 16-bits is even more useful than 12. Although Thomas'
> work in binutils is now compressing halfword table jumps (switch) into
> byte jumps when all the labels are within range--making switch tables much
> more compact.

Possibly.

Though, my experience seems to imply that 8-bit displacements are fairly
limited if one does displacement calculations based on a 16-bit
instruction word. Would be limited mostly to fairly small switch blocks
(and moderately small loop bodies and similar).

Could be a little better if the 8-bit displacements are unsigned
(forward only) and assuming a 32-bit word, increasing the reach from 64
to 256 instruction words.


Granted, in my case, using 32-bit "BRA Disp20s" instructions as the
jump-table entries is probably not ideal in this sense, but was easiest
to implement (well, and also avoids having any "non-instruction" data
in the ".text" section).


>>
>> Meanwhile, looking at my compiler, it had somehow slipped my mind that I
>> also already have "BRcc Rn, Disp33s" encodings via jumbo prefixes, which
>> end up being the main form used if this feature is enabled in my
>> compiler (but... I had forgotten it seems...).
> <
> We all do that now and again.....

Yeah.

Though, one other tradeoff is that these ops would mostly useful for
loops like:
while(n--) { ... }
Or:
while(p) { ... }

But, not so much:
for(i=0; i<n; i++)
{ ... }

Where, in this case, the relative usefulness of a dedicated Disp12
compare-with-0 branch would also depend on the relative usage of the
former vs the latter.


>>
>> So, it is more a tradeoff between burning encoding space, vs needing a
>> 64-bit encoding for these.
> <
> I don't see it as an encoding space issue, I see it as a variable length constant
> routing problem from instruction buffer to function unit as part of "forwarding".
> So, the majority of instructions (able to be encoded) have a routing OpCode
> in addition to a Calculation OpCode. Instructions with 16-bit immediates have
> a canned routing OpCode.
> <
> You can consider the routing OpCode as treating "forwarding" as another
> calculation performed prior to execution. {Not dissimilar to how DG NOVA
> had shifts with integer arithmetic}


Hmm... And/or (partially) separating the matter of instruction-layout
from opcode semantics?...

So, the instruction is expressed as a combination of "layout" (explicit
in the encoding) and "opcode" (which instruction should be applied to
these parameters).

This could be possible, just sort of implies that all of the major
function units accept the same general interface internally.

Seems like this would have a higher demand for encoding bits than the
strategy I had used, and would lead to a lot of combinations which "are
possible to encode but do not make sense". Though, an intermediate (more
practical) option being to define the table of opcodes per layout.

Say:
fffff-pp-oooo-nnnnn-oo-sssss-ttttt-oooo //3R
fffff-pp-oooo-nnnnn-oo-sssss-iiiii-iiii //Imm9
fffff-pp-oooo-nnnnn-ii-iiiii-iiiii-iiii //Imm16

Where:
fffff: Layout / Category
pp: Predicate
o: Opcode
n: Dest
s: Source A
t: Source B
i: Immed
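Purely as an illustration of that per-layout idea (the layout codes
0/1/2, the struct, and the field names here are invented for the
example, not from any real ISA):

#include <stdint.h>

typedef struct {
    uint32_t layout, pred, opc, rn, rs, rt;
    int32_t  imm;
} DecodedOp;

void Decode32(uint32_t insn, DecodedOp *op)
{
    op->layout = (insn >> 27) & 0x1F;  //fffff: layout/category
    op->pred   = (insn >> 25) & 0x03;  //pp: predicate
    op->rn     = (insn >> 16) & 0x1F;  //n: dest
    op->rs     = (insn >>  9) & 0x1F;  //s: source A

    switch (op->layout)
    {
    case 0: //3R: opcode bits gathered from all three 'o' fields
        op->opc = (((insn >> 21) & 0x0F) << 6) |
                  (((insn >> 14) & 0x03) << 4) | (insn & 0x0F);
        op->rt  = (insn >> 4) & 0x1F;  //t: source B
        break;
    case 1: //Imm9
        op->opc = (((insn >> 21) & 0x0F) << 2) | ((insn >> 14) & 0x03);
        op->imm = insn & 0x1FF;
        break;
    case 2: //Imm16 (sign-extended here, as an assumption)
        op->opc = (insn >> 21) & 0x0F;
        op->imm = (int16_t)(insn & 0xFFFF);
        break;
    }
}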


EricP

unread,
Aug 21, 2023, 6:32:39 PM8/21/23
to
MitchAlsup wrote:
> On Monday, August 21, 2023 at 2:26:58 PM UTC-5, BGB wrote:
>>
>> Though, depending on the ISA rules, one could skip a byte LEA and
>> instead encode this case as an ADD.
>>
>> STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX
> <
> For the MEM Rd,[Rb,Disp16] case I use the signed LDD as the EXIT
> instruction, and the similar place in STD as the ENTER instruction.
> There is no need for LEA, here, as it is redundant with ADD.

Except if Rb is r0 it means the RIP for a LD/ST instruction,
but the data r0 value for an ADD (which would probably be the
return RIP but you can't assume that).

> <
> For the MEM Rd,[Rb,Ri<<sc] case I use the signed LDD as LEA,
> because ADD with 2 operands will be seen to be less costly in
> emulating LEA, so LEA is basically reserved for 3-Operand ADDs.
>> ...

Again for Rb is r0


BGB

unread,
Aug 21, 2023, 7:03:21 PM8/21/23
to
On 8/21/2023 5:03 PM, MitchAlsup wrote:
> On Monday, August 21, 2023 at 2:26:58 PM UTC-5, BGB wrote:
>> On 8/21/2023 1:09 PM, MitchAlsup wrote:
>>>
>>
>> It is like:
>> "Why does 32-bit 'ADD Rm, Rn' exist if 'ADD Rm, Ro, Rn' also exists?";
>> (Pause) "Reasons..."
> <
> Some things "fall out for free", such as ADD Rd,Rs1,#0 as a MOV inst
> {along with similar arithmetic identities} and disallowing these costs
> gates and design time for no gain.
> <

Yeah.
MOV Rm, Rn
ADD Rm, 0, Rn
OR Rm, 0, Rn
...
All exist as semantically equivalent ways to do the same thing.

Similarly, "ADD Rm, Rn" is semantically equivalent to "ADD Rn, Rm, Rn".
But, there may be secondary reasons for such things to exist (such as
interactions between other parts of the ISA, or between the ISA and the
compiler).

In a strict sense, you don't need:
MOV.L (Rm, Disp5u), Rn
If you also have:
MOV.L (Rm, Disp9u), Rn

The former can't express anything that the latter can't.

But, then XGPR came along:
With the former, there was an ability to express negative displacements
(a formerly unserved use-case), but the encoding scheme could not
extend 9u in a similar way.

Also quirks with RiMOV, where for the Jumbo and Op64 prefixes it ended
up making sense to have different semantics when applied to the Disp5u
and Disp9u cases (so the operations differ when prefixed, but are
redundant in the form of a basic 32-bit instruction word).


Technically, the 5u encodings also exist in PrWEX whereas the 9u
encodings do not, but this case would only matter if the ability to
encode memory ops in Lane 2 were a thing (this was experimented with
though).


But, as can be noted, BJX2 isn't really "minimalistic" in the same sense
as something like RISC-V or similar (and some instructions that would
have been unnecessary in RISC-V were necessary in BJX2, due to the lack
of an architectural zero register, ...).


>>>>
>>>> Disp8 would still leave "only" LD/ST ops in this case.
>>>>
>>>> Where, say, LD/ST also needs 3 bits to encode the type of value to be
>>>> loaded/stored.
>>> <
>>> LD needs 3 bits, ST only needs 2. Actually LD only needs 2.8 bits
>>> since we don't need both signed and unsigned 64-bit items. Stores
>>> do not need signed and unsigned, just an indication of how-much
>>> to store.
>>>
>> That is why I usually ended up putting LEA's there...
>>
>> A LEA operation is nice to have, but is an issue if one has an 'X'
>> (paired) case.
>>
>> Though, depending on the ISA rules, one could skip a byte LEA and
>> instead encode this case as an ADD.
>>
>> STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX
> <
> For the MEM Rd,[Rb,Disp16] case I use the signed LDD as the EXIT
> instruction, and the similar place in STD as the ENTER instruction.
> There is no need for LEA, here, as it is redundant with ADD.
> <
> For the MEM Rd,[Rb,Ri<<sc] case I use the signed LDD as LEA,
> because ADD with 2 operands will be seen to be less costly in
> emulating LEA, so LEA is basically reserved for 3-Operand ADDs.

OK.

I have LEA, but it isn't quite so useful as a 3-operand ADD mostly
because it does a "zero extend from low 48 bits" thing (so would only be
useful here for 32-bit unsigned operations; but then may produce
out-of-range results rather than the proper 32-bit wrapping behavior).


Well, and this is why ADDS.L / ADDU.L / SUBS.L / SUBU.L exist, to
preserve the expected wrapping semantics (and not require explicit sign
or zero extensions following the various operations).


Well, and one could argue:
EXTS.L Rm, Rn
Is unnecessary because:
ADDS.L Rm, 0, Rn
Does basically the same thing...
...

Well, and stupid stuff, like there has ended up being two semi-redundant
encodings for the BLKUTX2 instruction (as an earlier attempt to relocate
the encoding got botched in my compiler), but at present, I can't remove
either of them without breaking something.

...

Some things could have been a little better here.


But, alas...




>>
>> ...

BGB

unread,
Aug 21, 2023, 7:52:25 PM8/21/23
to
On 8/21/2023 5:32 PM, EricP wrote:
> MitchAlsup wrote:
>> On Monday, August 21, 2023 at 2:26:58 PM UTC-5, BGB wrote:
>>>
>>> Though, depending on the ISA rules, one could skip a byte LEA and
>>> instead encode this case as an ADD.
>>> STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX
>> <
>> For the MEM Rd,[Rb,Disp16] case I use the signed LDD as the EXIT
>> instruction, and the similar place in STD as the ENTER instruction.
>> There is no need for LEA, here, as it is redundant with ADD.
>
> Except if Rb is r0 it means the RIP for a LD/ST instruction,
> but the data r0 value for an ADD (which would probably be the
> return RIP but you can't assume that).
>

Similar applies in my case as well.

Base register:
R0 -> PC
R1 -> GBR

Index register (Rb != R0|R1):
R0: R0, scaled by element size.
R1: R0, but unscaled.

Or, combined:
(R0, R0) -> (PC, R0)
(R0, R1) -> (R0)
(R1, R0) -> (GBR, R0)
(R1, R1) -> (TBR, R0)


At present, the above is the only way to encode TBR as a base, but this
isn't a huge loss as typically the only reason to use TBR as a
base-register is to access context variables or TLS or similar (serving
a similar role to the FS/GS segments on x86).


>> <
>> For the MEM Rd,[Rb,Ri<<sc] case I use the signed LDD as LEA,
>> because ADD with 2 operands will be seen to be less costly in
>> emulating LEA, so LEA is basically reserved for 3-Operand ADDs.
>>> ...
>
> Again for Rb is r0
>
>

Same...

LEA and ADD will have different behaviors here in my case as well...


MitchAlsup

unread,
Aug 21, 2023, 7:56:28 PM8/21/23
to
On Monday, August 21, 2023 at 5:21:01 PM UTC-5, BGB wrote:
> On 8/20/2023 5:57 PM, MitchAlsup wrote:
>
>
> But, say, maybe could be possible to, instead of writing, say:
> MOV 0x3FF0000000000000, R4
> FLDCH 0x3C00, R5
> One could write:
> MOV 1.0D, R4 //Binary64
> FLDCH 1.0H, R5 //Binary16
<
In my case, the space efficient code is:
<
CVTSD Rd,#1 // ConVerT signed to double 1-word
or
CVTFD Rd,13.7E0 // Convert float to double 2-words
<
In practice, these rarely show up except when passing arguments to
subroutines or results back from functions.
>
>

>
> 12-bits is 96% of local (intra function) branches, but only 19% of
> global branches (a mixture of function calls, and the backwards branches
> for prolog/epilog compression).
<
Is this statically linked or dynamically linked ??

>
> Though, one other tradeoff is that these ops would mostly useful for
> loops like:
> while(n--) { ... }
> Or:
> while(p) { ... }
>
> But, not so much:
> for(i=0; i<n; i++)
> { ... }
<
My LOOP OpCodes cover all of these.
>
> Where, in this case, the relative usefulness of a dedicated Disp12
> compare-with-0 branch would also depend on the relative usage of the
> former vs the latter.
<
Compare with anything you want, use any integer comparison you like
{#0, #integer, Rc},....
> >>
> >> So, it is more a tradeoff between burning encoding space, vs needing a
> >> 64-bit encoding for these.
> > <
> > I don't see it as an encoding space issue, I see it as a variable length constant
> > routing problem from instruction buffer to function unit as part of "forwarding".
> > So, the majority of instructions (able to be encoded) have a routing OpCode
> > in addition to a Calculation OpCode. Instructions with 16-bit immediates have
> > a canned routing OpCode.
> > <
> > You can consider the routing OpCode as treating "forwarding" as another
> > calculation performed prior to execution. {Not dissimilar to how DG NOVA
> > had shifts with integer arithmetic}
<
> Hmm... And/or (partially) separating the matter of instruction-layout
> from opcode semantics?...
<
To do this efficiently in smaller implementations, the decode of this set of
bits has to be of small gate count.
>
> So, the instruction is expressed as a combination of "layout" (explicit
> in the encoding) and "opcode" (which instruction should be applied to
> these parameters).
<
I just use the word "modifiers" to access constants, change the sign,
specify which operand the constant is routed to,....
>
> This could be possible, just sort of implies that all of the major
> function units accept the same general interface internally.
<
Not at all, I have FUs that accept {1,2,3}-operand, and deliver {0,1,2}-results.
The 2nd result is special and is used to support CARRY without adding
register ports to the design.
>
> Seems like this would have a higher demand for encoding bits than the
> strategy I had used, and would lead to a lot of combinations which "are
> possible to encode but do not make sense". Though, an intermediate (more
> practical) option being to define the table of opcodes per layout.
<
It is the mapping of the bits to the decoded table of "what to do" to "where
to do it" that is important. As you should have garnered in the OpCode layout
I illustrated a couple of days ago.
>

MitchAlsup

unread,
Aug 21, 2023, 7:58:32 PM8/21/23
to
On Monday, August 21, 2023 at 5:32:39 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, August 21, 2023 at 2:26:58 PM UTC-5, BGB wrote:
> >>
> >> Though, depending on the ISA rules, one could skip a byte LEA and
> >> instead encode this case as an ADD.
> >>
> >> STB, STW, STL, STQ, LEAW, LEAL, LEAQ, STX
> > <
> > For the MEM Rd,[Rb,Disp16] case I use the signed LDD as the EXIT
> > instruction, and the similar place in STD as the ENTER instruction.
> > There is no need for LEA, here, as it is redundant with ADD.
<
> Except if Rb is r0 it means the RIP for a LD/ST instruction,
> but the data r0 value for an ADD (which would probably be the
> return RIP but you can't assume that).
<
An accepted liability.
>
> > <
> > For the MEM Rd,[Rb,Ri<<sc] case I use the signed LDD as LEA,
> > because ADD with 2 operands will be seen to be less costly in
> > emulating LEA, so LEA is basically reserved for 3-Operand ADDs.
> >> ...
> Again for Rb is r0
<
Since R0 arrives at a subroutine carrying the return address, Brian's
compiler seldom finds a need to use R0 as a GPR. So, this seldom
falls from grace.

Terje Mathisen

unread,
Aug 22, 2023, 3:04:03 PM8/22/23
to
BGB wrote:
> Paeth filter (from memory) is something like:
>   P=A+B-C
>   dA=abs(P-A)
>   dB=abs(P-B)
>   dC=abs(P-C)
>   if(dA<dB)
>   {
>      if(dA<dC)
>        { D=A; }
>      else if(dB<dC)
>        { D=B; }
>      else
>        { D=C; }
>   }else
>   {
>      if(dB<dC)
>        { D=B; }
>      else
>        { D=C; }
>   }

So effectively (using 0/-1 for false/true)

a_less_b = dA<dB
a_less_c = dA<dC
b_less_c = dB<dC

select_a = a_less_b & a_less_c
select_b = ^a_less_b & b_less_c
select_c = ^a_less_c & ^b_less_c

I.e. you find the smallest of the three dX values and pick the
corresponding X?

If you have a vector MIN/MAX which is twice as wide as the values
involved, then it is tempting to put the dX values in the top half and X
in the bottom, and then just return the bottom half?

This presumes that it would be OK to return the smaller value if two
deltas are equal!

Terje



--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

unread,
Aug 22, 2023, 4:50:49 PM8/22/23
to
Yeah.

Paeth is basically "pick whichever of the 3 inputs is closest to the
target A+B-C prediction..."

There are ways to do it faster than the use of if/else branches on more
conventional targets, granted.

An ISA with conditional select or predication though can handle this
transform more efficiently without a need to resort to implementing it
via bit-masking or similar.

Would have also been nice if PNG also had a plain A+B-C predictor, but
alas...



Decided to leave out going into a thing about various approaches to
lossy and lossless image compression.

Eg (small summary):
PNG like, optimized for synthetic and lossless (normal PNG)
PNG like, but more optimized for natural images and lossy
No "real world" examples of this category, but can "sorta work" (*1)
JPEG like, optimized for lossless
Typically replacing DCT with WHT or similar.
JPEG like, but optimized for natural images and lossy
Eg: T.81 JPEG, some of the newer "JPEG replacements" (like WebP)
Wavelet-based formats (eg: JPEG-2000)
...

*1: Can basically end up looking sort of like PNG with some parts from
FLAC and ADPCM glued on (namely a small FIR filter and dynamic adaptive
quantization; possibly using a Rice-coder, ...).
Have had OK results with some past experiments in these areas, but no
mainstream image formats seem to work this way.

Granted, one isn't terribly likely to dethrone JPEG in either speed or
Q/bpp with this, but it is possible to pull something like this off with
significantly less code (IME, one is looking usually at roughly 2kLOC or
so for something like a T.81 JPEG codec; vs, say, something one can
implement in around 500 lines or so).


> If you have a vector MIN/MAX which is twice as wide as the values
> involved, then it is tempting to put the dX values in the top half and X
> in the bottom, and then just return the bottom half?
>
> This presumes that it would be OK to return the smaller value if two
> deltas are equal!
>

Yeah. The "what happens if two deltas are equal" case is something one
has to get correct if they want a PNG implementation to be able to
encode/decode images without them turning into an ugly looking mess.


> Terje
>
>
>

MitchAlsup

unread,
Aug 22, 2023, 6:39:32 PM8/22/23
to
Just for fun::
<
CMP Rab,Ra,Rb
CMP Rac,Ra,Rc
CMP Rbc,Rb,Rc
SLA Ralb,Rab,<1,LT>
SLA Ralc,Rac,<1,LT>
SLA Rblc,Rbc,<1,LT>
AND Rsa,Ralb,Ralc
AND Rsb,~Ralb,Rblc
AND Rsc,~Ralb,~Rblc
// but we have not selected D yet.
<
Presto !!
<
But it occurs to me that this is even better::
<
CMP Rab,Ra,Rb
CMP Rac,Ra,Rc
CMP Rbc,Rb,Rc
SLL Ralb,Rab,<1,LT>
SLL Rblc,Rbc,<1,LT>
CMOV Rd,Ra,Rb,Ralb
CMOV Rd,Rd,Rc,Rblc
// and we have selected D
>
> I.e. you find the smallest of the three dX values and pick the
> corresponding X?
>
> If you have a vector MIN/MAX which is twice as wide as the values
> involved, then it is tempting to put the dX values in the top half and X
> in the bottom, and then just return the bottom half?
<
MIN Rd,Ra,Rb
MIN Rd,Rd,Rc
<
And we have a winner. Moral: express your code correctly.
>
> This presumes that it would be OK to return the smaller value if two
> deltas are equal!
<
Exactly what do you think "equal" means--in almost all circumstances
equal means one can replace the other (except IEEE ±0)

MitchAlsup

unread,
Aug 22, 2023, 6:42:37 PM8/22/23
to
It occurs to me that if you are doing enough of these to matter, that in
the same way one would offload Texture, or Cyphers, one would offload
image compression. Then the nuances of ISA don't matter.

Paul A. Clayton

unread,
Aug 22, 2023, 7:04:34 PM8/22/23
to
On 8/19/23 12:31 PM, MitchAlsup wrote:
[snip]
> I see not giving full access to the whole RF as a poor choice,
> Feel free to disagree with me. {There are too many register
> allocation problems without having artificial boundaries in
> use of registers. You might have set up a situation where you
> have to register allocate from one virtual RF space to another
> virtual RF space before allocating into the physical RF space.}

I disagree, but I also think the preference depends on the weight
given to various tradeoffs.

The tradeoffs will vary based on architecture targets. A more
specialized architecture (e.g., microcontroller-only) or more
focused architecture (e.g., primarily "server workloads" but with
adequate function for personal computing) could favor different
tradeoffs.

For some targets (many microcontroller uses) total code size
(including constant data) is very important. For some targets
instruction bandwidth and possibly size for cold and/or luke-warm
code is significant. For some targets code size is not
significant.

Since the compiler's work in register allocation can be
"cached"/reused for many executions, I feel spending more work at
compile time (and compiler development time) can be justified.

Limiting register names seems least problematic for uncommonly
used operations. Having a longer form of all operations that
includes all the register names would also seem to moderate the
negative effect of shorter encodings at the cost of more complex
decoding and opcode space (which can then take back some of the
code density advantage).

While you, Mitch, have argued persuasively for a unified register
set, there are some benefits to architectural specialization. Of
course, microarchitectural specialization can be applied if there
is a natural idiom which can be easily detected. An artificial
convention (optimization recommendation) can also provide such an
idiom.

E.g., providing a stack cache (or partial frame cache) would be
easier if the stack pointer was known to the microarchitecture.
(In theory, a stack pointer register could be "predicted" by
looking at the memory access pattern, but that seems pointlessly
complex and would probably make microarchitectural optimizations
based on that information not worthwhile.) In this case, there
seems little (no?) difference between convention and architecture,
but in other cases there would be.

(Even software idioms can be almost as difficult to change as
explicit interfaces. One programming concept that came to mind
which _might_ moderate this issue would be presenting a generic
expression of intent and "overloading" expressions with valid
specific implementations. This seems a little like runtime
dispatch choice for supporting non-universal features, though
such have the choice based on feature absence/presence rather
than a compiler choice based on optimization goals presented
at compile time.)

Paul A. Clayton

unread,
Aug 22, 2023, 7:05:49 PM8/22/23
to
On 8/18/23 1:52 PM, MitchAlsup wrote:
> On Friday, August 18, 2023 at 1:10:37 AM UTC-5, BGB wrote:
[snip]
>> And, admittedly, on the other side, not as many people are as likely to
>> agree to my sentiment that 9-bits for more immediate and displacement
>> fields is "mostly sufficient".
> <
> I agree it is "mostly sufficient", but wouldn't you rather have "almost entirely
> sufficient" instead of "mostly sufficient" ?? i.e., 16-bits

I think it also depends on the cost of going beyond the base
level. If one needs to use several instructions to "paste
together" a larger immediate, then "mostly sufficient" is likely
to be excessively painful. If it means a 48-bit instruction rather
than a 32-bit instruction, then "mostly sufficient" might be more
reasonable. If it means a 64-bit instruction ...

Since base immediate sizes (and ways of extending immediates)
interacts with other aspects of instruction encoding, the
tradeoffs do not seem limited to instruction count and code size
for instructions using immediates. For variable length encoding,
parcel size would influence choices. For packet-oriented encodings
(even with cross-packet borrowing of immediate bits), the choices
would likely be different.

Bits not used for immediates are available to other operand value
encodings and to opcodes. Bit field arrangement will influence
decode/operand routing complexity. (The choices are further
complicated by the possibility — for some targets — of caching a
predecoded version of instructions. Increasing cache miss latency
and cache size from predecoding might be worthwhile, depending not
only on the specific microarchitectural targets and the benefit of
the specific predecode but also considering partial cost overlaps
for other changes. Engineering seems to get complex very easily.)

Paul A. Clayton

unread,
Aug 22, 2023, 7:05:56 PM8/22/23
to
On 8/16/23 9:25 PM, MitchAlsup wrote:
[snip]
> You cannot overcome incompetence with arrogance.

That is true with respect to overcoming the _engineering_ effect
of incompetence. However, organizationally, arrogance — at least
the self-confidence aspect — seems rather effective in the short
term.

> and
> Leading with arrogance often implies a base of incompetence.

Arrogance also seems to promote incompetence by discouraging
disagreement and exploration of different perspectives.

BGB

unread,
Aug 22, 2023, 8:54:51 PM8/22/23
to
If they are on the same side...

One can have cases where two deltas are equal, but on opposite sides of
the predictor, in which case the relative order in which the selections
are chosen will matter.

Say:
A=128, B=176, C=160,
P=144

Both dA and dC would be 16, but results would differ if one selects 128
or 160.

One version may result in a correct image, another with a progressively
increasing error (starting at first as a colored streak which then
steadily increases in intensity as more errors accumulate and then
garbles the whole rest of the image).

BGB

unread,
Aug 22, 2023, 10:58:30 PM8/22/23
to
Dedicated Paeth instruction? Probably doable...
Full image codec, a little harder.


If one wanted to design a codec to make it easy to pull off a lot of the
implementation in hardware, and wanted something "sort of JPEG-like",
could make sense to build the codec around Rice-coding and a 4x4 WHT (a
4x4 WHT being a bit cheaper/easier to pull off in hardware vs an 8x8 DCT).

Partly to compensate for the smaller block size, could make sense to use
a Paeth predictor for the block DC coefficients rather than simply
encoding the difference from the previous DC.

If the format supports 1:1:1 sub-sampling and a reversible color
transform, the format can be made lossless as well.

Eg:
Y=(2*G+R+B)/4, U=(B-G)+128, V=(R-G)+128
Or:
Y=(8*G+5*R+3*B)/16, U=(B-Y)+128, V=(R-Y)+128
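As a quick sanity check, the first of those transforms can be inverted
exactly in integer arithmetic, since floor((4*G+u+v)/4) = G +
floor((u+v)/4). A small sketch (helper names are mine; floor division
is made explicit since '>>' on negative values is not portable C):

static int floordiv4(int x)
{
    return (x >= 0) ? (x >> 2) : -((-x + 3) >> 2);
}

void rgb2yuv(int R, int G, int B, int *Y, int *U, int *V)
{
    *Y = floordiv4(2*G + R + B);
    *U = (B - G) + 128;
    *V = (R - G) + 128;
}

void yuv2rgb(int Y, int U, int V, int *R, int *G, int *B)
{
    int u = U - 128, v = V - 128;
    int g = Y - floordiv4(u + v);  //recovers G exactly
    *G = g;
    *B = u + g;  //B = (U-128) + G
    *R = v + g;  //R = (V-128) + G
}

(Note U/V span more than 8 bits for arbitrary RGB inputs, so a fully
lossless form needs 9-bit chroma or similar care; the /16 variant would
also need a more careful inverse.)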


>>
>> *1: Can basically ends up looking sort of like PNG with some parts from
>> FLAC and ADPCM glued on (namely a small FIR filter and dynamic adaptive
>> quantization; possibly using a Rice-coder, ...).
>> Have had OK results with some past experiments in these areas, but no
>> mainstream image formats seem to work this way.
>>
>> Granted, one isn't terribly likely to dethrone JPEG in either speed or
>> Q/bpp with this, but it is possible to pull something like this off with
>> significantly less code (IME, one is looking usually at roughly 2kLOC or
>> so for something like a T.81 JPEG codec; vs, say, something one can
>> implement in around 500 lines or so).

Or, something like the above, which also shouldn't be too difficult to
hardware-accelerate.

Pseudocode for an encoder being something like:
step=0;
for(y=0; y<height; y++)
  for(x=0; x<width; x++)
    for(c=0; c<4; c++)
    {
      pr=doFilt(img, x, y, c, xstr);
      px=img[(y*xstr+x)*4+c];
      d=px-pr;

      //if lossy
      q=(d*stepRcpTab[step])>>15; //reciprocals for each step
      qa=abs(q);
      if((qa<qLoThresh) && (step>0))
        step--;
      if(qa>qHiThresh)
        step++;

      emitResidual(q); //AdRice+RLE or similar.
    }

Thresholds would be used to tune quality, and would need to be known by
the decoder.


Filter could be something like:
if((x>0) && (y>0))
{
  //P=A+B-C
  pr= img[((y  )*xstr+x-1)*4+c]+
      img[((y-1)*xstr+x  )*4+c]-
      img[((y-1)*xstr+x-1)*4+c];
}else
{
  pr=0;
}

A naive entropy scheme being something like:
void emitResidual(int q)
{
  int qf;
  qf=(q<0)?(((-q)<<1)|1):(q<<1);

  if(qf)
  {
    if(runZeroCount)
    {
      if(runZeroCount>1)
      {
        emitAdRice(0);
        emitAdRice(runZeroCount+1);
      }else
      {
        emitAdRice(1);
      }
      runZeroCount=0; //flush the pending zero-run before the residual
    }
    emitAdRice(qf+1);
  }else
  {
    if(runZeroCount>=RUNZEROMAX)
    {
      emitAdRice(0);
      emitAdRice(runZeroCount+1);
      runZeroCount=0;
    }
    runZeroCount++;
  }
}

Rest of the codec mostly being stuff for the AdRice/bitstream handling
and similar.

Note that 0 could escape both an RLE run, but also be used to further
escape meta-commands or control parameters.

Basic:
1+: single residual
0, 2+: run of zeroes.
0, 1, param, value: Update control parameter
0, 0, x, ...: Command Escape.


Decoding could be single-pass with finite-state-machines for the
entropy/residual stages.

Would likely also use a length-limited AdRice encoding, where for Q:
0: Decrement Rk (if Rk>0)
1: Leave Rk as-is
2..6: Increment Rk
7: Escape, full-length N-bit symbol follows.

Where, one emits each symbol as a Q+1 bit prefix, and an Rk bit suffix.
Say, Q=2, Rk=2: 110zz (Encodes a value of 8..11).
But, after encountering this symbol, Rk would increase to 3.
Then, Q=0, Rk=3: 0zzz (0..7)
Would cause Rk to drop back to 2.
...
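A sketch of the emit side of that scheme (emitBits() is an assumed
MSB-first bit-writer and Rk is the adaptive state; the description
above does not pin down what Rk does on the escape path, so it is
simply left unchanged here, and N=16 is an assumed escape width):

extern void emitBits(unsigned bits, unsigned n); //assumed MSB-first writer
static unsigned Rk = 2;  //adaptive Rice parameter

void emitAdRice(unsigned val)
{
    unsigned q = val >> Rk;
    if (q >= 7)
    {
        //escape: 7 ones plus terminator ("11111110"), then N bits
        emitBits(0xFE, 8);
        emitBits(val, 16);
    }else
    {
        emitBits(((1u << q) - 1) << 1, q + 1); //q ones, then a 0
        emitBits(val & ((1u << Rk) - 1), Rk);  //Rk-bit suffix
        if (q == 0)      { if (Rk > 0) Rk--; } //0: decrement Rk
        else if (q >= 2) Rk++;                 //2..6: increment Rk
        //q == 1: leave Rk as-is
    }
}

(Checking against the example above: with Rk=2, a value of 8..11 gives
q=2, hence "110zz", and Rk then increases to 3.)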

Where, Rice decoding can be helped along with a CTNZ instruction (Count
Trailing Non-Zero), but failing this, lookup tables also work, ...
(length-limited variants can also be decoded similar to Huffman if
needed; but unlike Huffman, the lookup tables are constant).
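And the matching decode side (getBit()/getBits() are assumed readers;
the count-of-ones loop is where a CTNZ-style instruction or a small
constant lookup table would come in):

extern unsigned getBit(void);
extern unsigned getBits(unsigned n);
static unsigned Rk = 2;  //must track the encoder's state

unsigned decodeAdRice(void)
{
    unsigned q = 0, val;
    while (getBit())  //count ones up to the 0 terminator (max 7)
        q++;
    if (q >= 7)
        return getBits(16);  //escape: full-length symbol follows
    val = (q << Rk) | getBits(Rk);
    if (q == 0)      { if (Rk > 0) Rk--; }
    else if (q >= 2) Rk++;
    return val;
}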



For simplicity's sake, assume RGBA32 / RGBA8888.

For simplicity, if developed into a format, would likely borrow the
DIB/BMP packaging.



Though, I guess this does sort of point out the relative lack of formats
between "simple" formats (like TGA or PCX) and "more complex" formats
(like PNG or JPEG).

Terje Mathisen

unread,
Aug 23, 2023, 7:29:43 AM8/23/23
to
MitchAlsup wrote:
>> I.e. you find the smallest of the three dX values and pick the
>> corresponding X?
>>
>> If you have a vector MIN/MAX which is twice as wide as the values
>> involved, then it is tempting to put the dX values in the top half and X
>> in the bottom, and then just return the bottom half?
> <
> MIN Rd,Ra,Rb
> MIN Rd,Rd,Rc
> <
> And we have a winner. Moral: express your code correctly.

We do need to merge the deltas with the original values, and mask away
the top at the end, but it is obviously very fast.

>> This presumes that it would be OK to return the smaller value if two
>> deltas are equal!
> <
> Exactly what do you think "equal" means--in almost all circumstances
> equal means one can replace the other (except IEEE ±0)

The problem with a merged key is that in the case where dA == dB but A <
B, the original logic says that B should be selected (since the test on
dA uses strict '<'), but now we will end up with A since ((dA<<32)|A) <
((dB<<32)|B).

BGB mentioned that this difference would mess up PNG decoding.

We need something similar to CAS2 (CMPXCHG8B) where you have both a
32-bit key and a 32-bit payload which is not part of the comparison.

BGB

unread,
Aug 24, 2023, 3:23:16 AM8/24/23
to
On 8/21/2023 6:56 PM, MitchAlsup wrote:
> On Monday, August 21, 2023 at 5:21:01 PM UTC-5, BGB wrote:
>> On 8/20/2023 5:57 PM, MitchAlsup wrote:
>>
>>
>> But, say, maybe could be possible to, instead of writing, say:
>> MOV 0x3FF0000000000000, R4
>> FLDCH 0x3C00, R5
>> One could write:
>> MOV 1.0D, R4 //Binary64
>> FLDCH 1.0H, R5 //Binary16
> <
> In my case, the space efficient code is:
> <
> CVTSD Rd,#1 // ConVerT signed to double 1-word
> or
> CVTFD Rd,13.7E0 // Convert float to double 2-words
> <
> In practice, these rarely show up except when passing arguments to
> subroutines or results back from functions.

OK.


>>
>>
>
>>
>> 12-bits is 96% of local (intra function) branches, but only 19% of
>> global branches (a mixture of function calls, and the backwards branches
>> for prolog/epilog compression).
> <
> Is this statically linked or dynamically linked ??
>

BGBCC is a "compile everything all at once" compiler design, and in this
case, static linked.


Though, something does seem anomalous here: the "2 back branches per
function" amounting to roughly 80% of the total branches seems a
little suspect...


But, I can't really otherwise explain why the displacements show a
global pattern something like (roughly):
GLQuake: 8s=0.96% 12s=19.96% 16s=0.04% 20s=79.04%
Doom: 8s=1.08% 12s= 7.58% 16s=0.03% 20s=91.23% 24s=0.08%

This seems like there is a clear split between local branches and global
branches.


OTOH:
These stats are also based on a conservative "branch length estimator"
model (which selects which branch type to use by making an "educated
guess"), rather than the stats from the final binary's relocs.


May need to gather stats based on reloc time statistics as well (which
may well give a different pattern).


Goes and adds stats logic for this...

GLQuake, modeled based on distances while applying relocs:
8s=40.39% 12s=5.48% 16s=4.11% 20s=48.57% 24s=1.45% 33s=0.00%

This is, a bit different...


This stat paints an entirely different picture about the value of adding
Disp12s instructions... (And, that the branch-length estimator may be
significantly underestimating the number of short branches).

Granted, the relative cost of overestimating the required branch length
is significantly worse than underestimating.


>>
>> Though, one other tradeoff is that these ops would mostly useful for
>> loops like:
>> while(n--) { ... }
>> Or:
>> while(p) { ... }
>>
>> But, not so much:
>> for(i=0; i<n; i++)
>> { ... }
> <
> My LOOP OpCodes cover all of these.

These branches would be limited to what can be made to fit the pattern:
if(x CMP 0)
goto Lbl;


>>
>> Where, in this case, the relative usefulness of a dedicated Disp12
>> compare-with-0 branch would also depend on the relative usage of the
>> former vs the latter.
> <
> Compare with anything you want, use any integer comparison you like
> {#0, #integer, Rc},....


There are also "compare two-registers and branch" ops, but no encoding
space to expand these to a larger displacement.

Also no immediate-form.

But, an immediate form would be a problem, as the immediate field is
already in use holding the branch displacement.

Reason both "compare two regs" and "compare with zero" variants exist,
is because of a lack of an architectural zero register.


Though, if one needs two ops anyways, and doesn't care about preserving
SR.T:
CMPxx Imm, Rn
BT/BF Lbl
Also works well...


>>>>
>>>> So, it is more a tradeoff between burning encoding space, vs needing a
>>>> 64-bit encoding for these.
>>> <
>>> I don't see it as an encoding space issue, I see it as a variable length constant
>>> routing problem from instruction buffer to function unit as part of "forwarding".
>>> So, the majority of instructions (able to be encoded) have a routing OpCode
>>> in addition to a Calculation OpCode. Instructions with 16-bit immediates have
>>> a canned routing OpCode.
>>> <
>>> You can consider the routing OpCode as treating "forwarding" as another
>>> calculation performed prior to execution. {Not dissimilar to how DG NOVA
>>> had shifts with integer arithmetic}
> <
>> Hmm... And/or (partially) separating the matter of instruction-layout
>> from opcode semantics?...
> <
> To do this efficiently in smaller implementations, the decode of this set of
> bits has to be of small gate count.

OK.

>>
>> So, the instruction is expressed as a combination of "layout" (explicit
>> in the encoding) and "opcode" (which instruction should be applied to
>> these parameters).
> <
> I just use the word "modifiers" to access constants, change the sign,
> specify which operand the constant is routed to,....

OK.


>>
>> This could be possible, just sort of implies that all of the major
>> function units accept the same general interface internally.
> <
> Not at all, I have FUs that accept {1,2,3}-operand, and deliver {0,1,2}-results.
> The 2nd result is special and is used to support CARRY without adding
> register ports to the design.

I meant, say, if one has different FUs that expect different input and
output layouts, mix/match may result in a whole lot of "this doesn't
make sense" combinations.

3R doesn't make sense for a branch, and 1R doesn't make sense for most
ALU ops, ...


>>
>> Seems like this would have a higher demand for encoding bits than the
>> strategy I had used, and would lead to a lot of combinations which "are
>> possible to encode but do not make sense". Though, an intermediate (more
>> practical) option being to define the table of opcodes per layout.
> <
> It is the mapping of the bits to the decoded table of "what to do" to "where
> to do it" that is important. As you should have garnered in the OpCode layout
> I illustrated a couple of days ago.


I didn't entirely understand it...


Your approach to the encoding sounds like it is likely very different
from my approach in these areas.

In my case, opcode drives instruction layout and unpacking, not the
other way around.


So, every possible combination of operation and instruction layout
effectively requires its own opcode (and, there not necessarily being a
correlation between where an instruction is located, and its decoding
pattern; with the partial exception of the F8 block).


Though, in the listings, I mostly left out jumbo encodings mostly except
in cases where "new" semantics were expressed. Partly because these did
follow a more straightforward pattern.

Say, I don't necessarily need to list out every combination when most
cases can be summarized as, say, "If it is Imm9/Disp9, the Imm/Disp goes
from 9 bits to 33 bits with an FE jumbo prefix", ...


...

MitchAlsup

unread,
Aug 24, 2023, 12:48:52 PM8/24/23
to
Yes, this is what I have been trying to explain all these months. It is the
modifiers, and their need to be compact but do everything I wanted done
that drove their position in the instruction. The OpCodes, then, were fitted
in the space remaining.
>
>
> So, every possible combination of operation and instruction layout
> effectively requires its own opcode (and, there not necessarily being a
> correlation between where an instruction is located, and its decoding
> pattern; with the partial exception of the F8 block).
>
This is where doing the modifiers first wins--when you need that 3rd operand
register, it goes where it goes, and then the 3-operand instruction is then
crammed in the space which is left over; and the guiding principle is that
the gates decoding the modifier are not perturbed by all of this.
>
> Though, in the listings, I mostly left out jumbo encodings mostly except
> in cases where "new" semantics were expressed. Partly because these did
> follow a more straightforward pattern.
>
> Say, I don't necessarily need to list out every combination when most
> cases can be summarized as, say, "If it is Imm9/Disp9, the Imm/Disp goes
> from 9 bits to 33 bits with an FE jumbo prefix", ...
>
>
> ...
You have gone to great lengths to get 3-wide running at 50 MHz. I wonder if
a 1-wide at 100 MHz would actually perform better ???

BGB

unread,
Aug 24, 2023, 1:50:56 PM8/24/23
to
OK.

>>
>>
>> So, every possible combination of operation and instruction layout
>> effectively requires its own opcode (and, there not necessarily being a
>> correlation between where an instruction is located, and its decoding
>> pattern; with the partial exception of the F8 block).
>>
> This is where doing the modifiers first wins--when you need that 3rd operand
> register, it goes where it goes, and then the 3-operand instruction is then
> crammed in the space which is left over; and the guiding principle is that
> the gates decoding the modifier are not perturbed by all of this.

OK.

It may make sense to consider something like this if I were doing it
over again.


>>
>> Though, in the listings, I mostly left out jumbo encodings mostly except
>> in cases where "new" semantics were expressed. Partly because these did
>> follow a more straightforward pattern.
>>
>> Say, I don't necessarily need to list out every combination when most
>> cases can be summarized as, say, "If it is Imm9/Disp9, the Imm/Disp goes
>> from 9 bits to 33 bits with an FE jumbo prefix", ...
>>
>>
>> ...
> You have gone to great lengths to get 3-wide running at 50 MHz. I wonder if
> a 1-wide at 100 MHz would actually perform better ???

It is the other way around, actually:
I investigated 1-wide 100MHz first, but then dropped to 50MHz and went
3-wide after the fact.

The big issue seems to be the L1 caches: one is hard-pressed to get a
core with a non-trivial L1 cache running at 100 MHz, short of adding
extra cycles to the array loads;
For smaller L1 cache sizes, L1 miss latency dominates the overall
performance.


I have been faced with the usual issues that, with 1-cycle access to the
BRAM arrays, I can't make 16K or 32K L1 caches work at 100MHz.

Only way the L2's work is by being able to add a bunch of buffer cycles
to the BRAM array loads/stores.


It was only recently that I have started getting past the cores being
memory-access bound (where other concerns, such as bundling and
interlock penalties, have started to matter).



Semi-relatedly, in working on debugging my hardware rasterizer module, I
noted that I could nearly double its performance by widening the RAM bus
access (for RGB and Z buffers) from 64 to 128 bits.

Well, and also added an internal cache-array for texture blocks:
256x 64-bit, holds texture blocks.
Enough to store a 64x64 texture entirely.
Along with the needed 4-way buffer:
4x 64-bit, assoc 4-way block buffer.
Where, 4-way is needed for bilinear filtering to work (also added).

With the cache-array, in this case, if the 4-way buffer misses, but the
cache array hits, it loads the block from the cache-array, else it sends
a memory request (with the returned result being stored back into the
array).

The 4-way buffer may hold the needed texture blocks (in no particular
order), and may do bilinear via parallel block decoders (with inputs
routed from the matching blocks in the 4-way cache).
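
Roughly, the lookup path as I understand it (a C sketch under
assumptions: the mapping and replacement choices below are
placeholders, and fetch_mem_block() stands in for the actual memory
request):

  #include <stdint.h>

  #define ARR_SLOTS 256            /* 256x 64-bit block cache array */

  typedef struct {
      uint32_t btag[4];  uint64_t bdat[4];           /* 4-way buffer */
      uint32_t atag[ARR_SLOTS]; uint64_t adat[ARR_SLOTS];
  } TexCache;

  extern uint64_t fetch_mem_block(uint32_t addr);    /* hypothetical */

  static uint64_t tex_block(TexCache *tc, uint32_t addr)
  {
      /* 1: 4-way block buffer (bilinear needs up to 4 blocks live). */
      for (int w = 0; w < 4; w++)
          if (tc->btag[w] == addr) return tc->bdat[w];

      /* 2: 256-entry cache array (direct-mapped in this sketch). */
      uint32_t slot = addr % ARR_SLOTS;
      uint64_t blk;
      if (tc->atag[slot] == addr) {
          blk = tc->adat[slot];
      } else {
          blk = fetch_mem_block(addr);     /* miss: memory request   */
          tc->atag[slot] = addr;           /* store result back into */
          tc->adat[slot] = blk;            /* the array              */
      }
      int w = addr & 3;                    /* placeholder victim     */
      tc->btag[w] = addr; tc->bdat[w] = blk;
      return blk;
  }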


I am left wondering whether it might have been better, though, to have
just put it directly on the ringbus rather than trying to route
requests through the VRAM module (though doing so would further
increase ringbus latency...).


Does have the limitation that it only does textures that are square and
compressed, whereas supporting non-square and uncompressed textures
would allow using it for all of the rasterization.

Mostly, in GLQuake, it is the console background, status bar, some
window textures, etc, that end up being non-square.

Textures with a 2:1 aspect could be handled, but not all of these are
2:1 either (eg, the status bar background image; GL 1.x merely requiring
that texture dimensions be power-of-2).

...

MitchAlsup

unread,
Aug 24, 2023, 3:02:54 PM8/24/23
to
On Tuesday, August 22, 2023 at 6:04:34 PM UTC-5, Paul A. Clayton wrote:
> On 8/19/23 12:31 PM, MitchAlsup wrote:
[snip]
>
> While you, Mitch, have argued persuasively for a unified register
> set, there are some benefits to architectural specialization. Of
> course, microarchitectural specialization can be applied if there
> is a natural idiom which can be easily detected. An artificial
> convention (optimization recommendation) can also provide such an
> idiom.
<
Allow me to clarify::
<
I am not trying to create an architecture which is
a) a marvelous microcontroller CPU
b) a marvelous vector supercomputer CPU
What I am trying to do is
c) a marvelous general purpose CPU
with
d) an actually Reduced instruction set.
<
And in this domain I think a unified register set is closer to optimal
than a similar ISA with separate or specialized register sets.
<
Outside of this domain other design points/decisions take over.
<
But within this domain I am getting VAX instruction counts with
RISC pipelineability and pipeline efficiency, and my ISA requires
only 70%± of the instruction count of RISC-V, which should translate
into nearly a 40% performance advantage {under a whole
slew of necessary caveats} at the same operating frequency....
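(Sanity-checking that figure: if cycles scale directly with instruction
count at a fixed frequency and comparable CPI, then speedup ≈ 1 / 0.70
≈ 1.43, i.e. roughly 40%.)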
>
> E.g., providing a stack cache (or partial frame cache) would be
> easier if the stack pointer was known to the microarchitecture.
<
My ISA understands that R31=SP and that if one allocates space
{SP-=128} and then deallocates space {SP+=128}, then the associated
cache lines do not need to be pushed into the memory hierarchy. They
CAN be, but they do not NEED to be. Whereas: those same manipulations
on any other register do not have that property--even when SW might
want that behavior. {{Different implementations are allowed to do
different things, here, just like they are allowed different cache sizes,
and sets of associativity.}}
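The shape of code that benefits, sketched in C (illustrative only; just
the SP-=128 / SP+=128 bracketing comes from the above):

  /* entry:  SP -= 128   -- frame lines become "new"
     body:   loads/stores within [SP, SP+128)
     exit:   SP += 128   -- frame is dead; those lines CAN be written
                            back to the hierarchy, but need not be.   */
  int sum_of_squares(const int *v)
  {
      int tmp[16];                 /* lives entirely in the frame */
      int s = 0;
      for (int i = 0; i < 16; i++) tmp[i] = v[i] * v[i];
      for (int i = 0; i < 16; i++) s += tmp[i];
      return s;                    /* frame dies at SP += 128 */
  }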
<
> (In theory, a stack pointer register could be "predicted" by
> looking at the memory access pattern, but that seems pointlessly
> complex and would probably make microarchitectural optimizations
> based on that information not worthwhile.) In this case, there
> seems little (no?) difference between convention and architecture,
> but in other cases there would be.
<
In general:: The stack pointer is manipulated once at the entry of a
subroutine and once at the exit of a subroutine and is constant over
the execution of a subroutine. This is RISC philosophy, foreign to
PDP-11 and VAX philosophy; neither of which lived long enough to
witness the transition x86 made from VAX philosophy to RISC philosophy
at the switchover from x86 to x86-64. {{In block structured
languages, the subroutine boundary changes to a block boundary.}}
>
> (Even software idioms can be almost as difficult to change as
> explicit interfaces. One programming concept that came to mind
> which _might_ moderate this issue would be presenting a generic
> expression of intent and "overloading" expressions with valid
> specific implementations. This seems a little like runtime
> dispatch choice for supporting non-universal features, though
> such have the choice based on feature absence/presence rather
> than a compiler choice based on optimization goals presented
> at compile time.)
<
All of the overload resolution takes place in the compiler before
code generation and linking; and whatever code sequences the
compiler chooses becomes an idiom which could be recognized
and optimized later.
<
But you are correct in your assumption that "some of this stuff"
makes its way back into the ISA--and it is one of the reasons I think
Quadriblock's gyrations are misguided. My 66000 ISA has improved
markedly since Brian (and now Thomas) have been contributing
{Brian doing the compiler and Thomas binutils}. Switch statements
have a single instruction that performs range checking, table
access, default qualification; in a way that remains position
independent. Likewise dynamic linking uses a single instruction
to CALL an external subroutine in a way that is not attackable by the
current attack strategies, nor does it use a trampoline to get there and back.
Fewer instructions, fewer cycles, retaining all the desired properties.
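For concreteness, the source pattern being collapsed (plain C; the
comment sketches the conventional multi-instruction lowering that the
single table-branch instruction is said to replace):

  /* A conventional RISC lowers this to: compare/branch for the range
     check, a table load, an indirect jump, plus PIC fixups. The claim
     above is that one instruction covers range check, table access,
     and default qualification, position-independently. */
  int classify(int x)
  {
      switch (x) {
      case 0:  return 10;
      case 1:  return 20;
      case 2:  return 30;
      default: return -1;
      }
  }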
<
Along the way, several things were "invented", then later modified,
then later parts of them discarded for other "inventions". These
are the things one has to gyrate through before the ISA gets
"public" because afterwards the original mistakes become frozen
{and from x86 experience--frozen for at least 40 years}
<
I could not have understood the proper properties of these "things"
without the feedback of reading the code, relearning the how and why of
that functionality, so that I could then figure out what embodiment
was required to make them small, fast, and efficient. And
this is where I think Quadriblock is going astray--you need guidance
from compiler and runtime development to make the ISA "correct".

Stephen Fuld

unread,
Aug 26, 2023, 10:04:22 AM8/26/23
to
On 8/24/2023 12:02 PM, MitchAlsup wrote:
> On Tuesday, August 22, 2023 at 6:04:34 PM UTC-5, Paul A. Clayton wrote:
>> On 8/19/23 12:31 PM, MitchAlsup wrote:
> [snip]
>>
>> While you, Mitch, have argued persuasively for a unified register
>> set, there are some benefits to architectural specialization. Of
>> course, microarchitectural specialization can be applied if there
>> is a natural idiom which can be easily detected. An artificial
>> convention (optimization recommendation) can also provide such an
>> idiom.
> <
> Allow me to clarify::
> <
> I am not trying to create an architecture which is
> a) a marvelous microcontroller CPU
> b) a marvelous vector supercomputer CPU

OK, I will ask. If you were trying to create a marvelous vector
supercomputer CPU, how would it be different from MY66000?

Specifically, would you still use the VVM mechanism? How different
would the ISA be (or would it just use more FP functional units and
perhaps more/bigger buffers for VVM use?)? Would you provide full
support for 128 bit FP? etc., etc.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Aug 26, 2023, 3:33:52 PM8/26/23
to
On Saturday, August 26, 2023 at 9:04:22 AM UTC-5, Stephen Fuld wrote:
> On 8/24/2023 12:02 PM, MitchAlsup wrote:
> > On Tuesday, August 22, 2023 at 6:04:34 PM UTC-5, Paul A. Clayton wrote:
> >> On 8/19/23 12:31 PM, MitchAlsup wrote:
> > [snip]
> >>
> >> While you, Mitch, have argued persuasively for a unified register
> >> set, there are some benefits to architectural specialization. Of
> >> course, microarchitectural specialization can be applied if there
> >> is a natural idiom which can be easily detected. An artificial
> >> convention (optimization recommendation) can also provide such an
> >> idiom.
> > <
> > Allow me to clarify::
> > <
> > I am not trying to create an architecture which is
> > a) a marvelous microcontroller CPU
> > b) a marvelous vector supercomputer CPU
<
> OK, I will ask. If you were trying to create a marvelous vector
> supercomputer CPU, how would it be different from MY66000?
<
What an intriguing question !!
>
> Specifically, would you still use the VVM mechanism? How different
> would the ISA be (or would it just use more FP functional units and
> perhaps more/bigger buffers for VVM use?)? Would you provide full
> support for 128 bit FP? etc., etc.
<
After thinking about this for an hour::
<
ISA would probably be pretty much the same, the memory system and
interconnect would be vastly beefier. I would shoot for a cache-line
width of FPUs (8× {FADD, FMAC, FDIV/SQRT}), 8 to 16 cache-line staging
buffers, 4 AGENs per cycle, all feeding off the 1 MB 16-banked L2, taking
4 cache misses per cycle. Then over in the memory/DRAM area there
would be a minimum of 16 DIMMs (or HBMs) operating at 2 speed
grades below maximum BW DDR <of that generation> could muster.
Every lane would be capable of integer and logical calculations.
<
After writing the above and more thought, I can't see any changes in
the ISA, as we already get gather (LDD->LD) and scatter (LDD->ST) falling
out for free.
<
My hope would be that it would not melt when operating at full throughput.
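
For reference, the kind of loop in question (plain C; a sketch, with
the claim being that VVM vectorizes the scalar loop as written, so no
ISA change is needed for a wide implementation):

  /* DAXPY with an indexed (gather) load: y[i] += a * x[idx[i]].
     Under VVM the scalar loop itself is what gets vectorized, so a
     wide machine (8 FPU lanes, cache-line staging buffers as above)
     can retire many iterations per cycle from the same code. */
  void daxpy_gather(double *y, const double *x, const int *idx,
                    double a, int n)
  {
      for (int i = 0; i < n; i++)
          y[i] += a * x[idx[i]];     /* LDD->LD: gather for free */
  }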

luke.l...@gmail.com

unread,
Aug 26, 2023, 5:54:46 PM8/26/23
to
On Monday, August 14, 2023 at 12:49:17 AM UTC+1, MitchAlsup wrote:
> On Sunday, August 13, 2023 at 4:58:01 PM UTC-5, pec...@gmail.com wrote:

> > I think that you should reconsider instruction compression, because it is quite orthogonal to all these advantages.
> <
> With the current encoding there is nowhere to put more than a few compressed instructions--one would have to get a significant advantage out of 12 holes in the ISA map and at the same time sacrifice expandability of the ISA in the future. This is something I am not willing to do at this time.

you can do the trick that Motorola did with VLE-Book for Power ISA:
mark a TLB page as "supporting compressed binaries" and for that
virtual memory page and that page only have an *entirely different*
ISA encoding.

of course if that compressed encoding is incomplete (does not cover
the same capability or ops as the uncompressed encoding) you then have
to have the fun-and-games of jumping to *another page* just to call
the non-compressed encoding and back again, but hey, omelette
and eggs.
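
a toy sketch of the mechanism in C (the field names here are invented;
Power ISA VLE uses a real page attribute in the same spirit):

  #include <stdint.h>

  typedef struct {
      uint64_t phys_base;
      unsigned compressed : 1;  /* hypothetical "VLE-style" page bit */
  } TlbEntry;

  /* Assumed decoders, one per encoding; which one runs is chosen
     purely by the attribute of the page being fetched from. */
  extern uint32_t decode_compressed(const uint8_t *ifetch);
  extern uint32_t decode_normal(const uint8_t *ifetch);

  static uint32_t fetch_decode(const TlbEntry *pte, const uint8_t *ifetch)
  {
      return pte->compressed ? decode_compressed(ifetch)
                             : decode_normal(ifetch);
  }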

l.