On 6/22/2022 8:03 PM, MitchAlsup wrote:
> I could not find the question asking me to make a list of why My 66000
> instruction set architecture is like and unlike the tenets of the original
> RISC. So I spent some time looking up what the internet is currently saying
> about RISCs. There is a short list, but I will start with a few statements
> from Hennessy and Patterson::
>
> Hennessy:: The goal of any instruction format should be: 1. simple decode,
> 2. simple decode, and 3. simple decode. Any attempts at improved code
> density at the expense of CPU performance should be ridiculed at every
> opportunity.
>
> Patterson:: more is not better -- microcode is bad
> Subroutines need low overhead
>
> RISC axioms:
> a) the ISA is primarily designed to make the pipeline simple.
> b) the ISA is primarily designed as a target for compilers.
> c) instructions only exist if they add performance.
> d) frequently accessed data is kept in registers.
>
BJX2 generally upholds the above.
While some instructions are pretty niche, most still tend to have
use-cases, and I am mostly trying to avoid adding stuff that is
(completely) useless.
> RISC tenets:
> a) 1 word == 1 instruction
> b) 1 instructions flows down the pipeline in 1 cycle
> c) 1 instruction can cause 0 or 1 exception
> d) instruction encoding uses few patterns
> e) there is a large uniformly addressable register space
>
My case: 3 out of 5.
a) The 16/32-bit and bundle encodings break this one.
d) Some extra complexity exists due to the lack of an architectural Zero
Register and similar, and some instructions (early on) ended up with
both 2R and 3R encodings.
Early on, I wasn't confident, for example, that "ADD R4, R5" and "ADD
R5, R4, R5" would be semantically equivalent in all cases.
There were some other cases (Mostly 32-bit 2R Load/Store variants) which
were dropped due to being entirely redundant with the 3R encodings (or
which became redundant once predication was added).
Some other parts of the ISA also ended up being dropped and then later
re-added a few times before becoming more-or-less permanent (and some
other features are in limbo due to not really adding enough to
justify their existence).
> So where does My 66000 ISA stand with respect to these axioms and
> tenets::
>
> RISC axioms: My 66000 ISA embodies all of the RISC axioms
> RISC tenets: My 66000 ISA rejects ½ of RISC tenets
>
> With minor exceptions to both::
>
> My 66000 contains 32×64-bit general purpose registers. Some might
> think this is too few and a FP register file should be added. Looking
> at code such as BLASS, Livermore Loops, Linpack indicates otherwise
> -- as long as one assumes some hints of OoO pipelining. Looking at
> various C libraries this seems perfectly sufficient.
>
My case: 32|64 x 64-bit.
I am still on the fence as to whether 32 GPRs is "fully sufficient", or
whether 64 GPRs can offer enough gain (in certain use-cases) to justify
its existence. It "kinda helps" for TKRA-GL but is seemingly kinda moot
for pretty much everything else.
The way the encodings for the 64 GPR case are handled is a bit hacky,
but it was a tradeoff (I came up with something which could be done
without breaking binary compatibility or requiring a separate operating
mode). Ironically, everything still works OK so long as "most of the
code" sticks to only using the low 32 GPRs (otherwise, some of the seams
might start to show).
> My 66000 ISA contains 6 decoding patterns; 1 for each of
> {instructions with 16-bit immediates, instructions with 12-bit
> immediates, scaled memory reference, 2-operand reg-reg,
> 1-operand reg-reg, 3-operand reg-reg }
>
Hmm (8 major for 32-bit):
FZnm_ZeoZ //3R "Rm, Ro, Rn"
FZnm_ZeZZ //2R "Rm, Rn"
FZnm_Zeii //3RI (Imm9/Disp9), "Rm, Imm9, Rn" / "(Rm, Disp9), Rn"
FZnZ_Zeii //2RI (Imm10), "Imm10, Rn"
FZZZ_ZeoZ //1R (Ro treated as Rn for these)
FZZn_iiii //2RI (Imm16), "Imm16, Rn"
FZdd_Zddd //Disp20 (Branch)
FZii_iiii //"LDIz Imm24, R0"
Add a few more if one counts the 16-bit ops:
ZZnm //2R
ZZni //2RI (Imm4)
ZZnZ //1R
Znii //2RI (Imm8)
ZZdd //Disp8 (Branch)
The Jumbo and Op64 encodings may or may not be considered new forms,
however they don't actually add "new" instruction-forms per-se, but
rather modify the existing encodings in predefined ways (and reuse the
existing 32-bit decoder; just with more bits "glued on" to the instruction).
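For concreteness, each letter in these patterns can be read as one hex nibble of the 32-bit word. A minimal C sketch of pulling the register fields out of the FZnm_ZeoZ (3R) form, assuming nibble 7 is the high nibble and the n/m/o positions follow the pattern as written (this is my reading of the pattern, not an official decoder):

```c
#include <stdint.h>

/* Extract nibble i (0 = lowest) from a 32-bit instruction word. */
static unsigned nib(uint32_t w, int i) { return (w >> (i * 4)) & 0xFu; }

/* Register fields of the hypothetical FZnm_ZeoZ (3R) layout:
   nibble 5 = Rn, nibble 4 = Rm, nibble 1 = Ro. */
typedef struct { unsigned rn, rm, ro; } Fields3R;

static Fields3R decode_3r(uint32_t w) {
    Fields3R f;
    f.rn = nib(w, 5);
    f.rm = nib(w, 4);
    f.ro = nib(w, 1);
    return f;
}
```

The same nibble-extraction helper would cover the other 32-bit forms, since they mostly differ only in which nibbles hold registers versus immediate bits.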
One could potentially also interpret the 32-bit encodings as zero-padded
versions of a longer internal encoding space:
FEii_iiii_FZnm_Zeii //3RI, "Rm, Imm33, Rn"
...
With a few special cases, eg:
FEii_iiii_FAii_iiii //"LDIZ Imm48, R0"
FFii_iiii_FAii_iiii //"BRA Abs48"
There are more forms if one considers "minor" patterns, but these don't
really affect the instruction encoding so much as how the various parts
are interpreted and mapped to the internal pipeline:
Logically, each instruction is decoded as if it had:
3 read ports, 1 write port;
A 33-bit immediate/displacement field;
Op / Sub-Op;
...
This then combines with an outer stage that deals with the bundle as a
whole, mapping SIMD ops to two lanes, along with Abs48 and Imm64
encodings (where the immediate can't fit into a single pipeline lane).
The output of this then being the configuration for the entire pipeline.
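The normalized form described here (3 read ports, 1 write port, a 33-bit immediate, op/sub-op) can be sketched as a plain struct; the field names below are mine, not the actual pipeline signals:

```c
#include <stdint.h>

/* Hypothetical normalized decode output, per the description above. */
typedef struct {
    uint8_t  rs1, rs2, rs3;  /* up to 3 register read ports */
    uint8_t  rd;             /* 1 register write port */
    int64_t  imm;            /* sign-extended 33-bit immediate/displacement */
    uint16_t op, subop;      /* major op / sub-op */
} DecodedOp;

/* Sign-extend the low 33 bits of a raw immediate field. */
static int64_t sext33(uint64_t v) {
    return ((int64_t)(v << 31)) >> 31;
}
```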
> The 12-bit immediate format is used for shift instructions and
> for Predicate instructions and positioned such that predicate
> instructions are only 1-bit different than their corresponding
> branch instruction. This saves 6×16-bit immediate encodings.
>
Differs in my case:
Shifts and friends use Imm9 forms;
However, because one doesn't need all 9 bits for a typical shift, I also
kinda shoe-horned the SIMD shuffle instructions into the mix.
Or, in effect, shuffle can be imagined sort of like a conjoined twin
stuck onto the shift instruction (and a variable shift imagined as
masking off the bit that allows it to behave like a shuffle).
Or, one can also imagine that there could have been an alternate
universe where passing a sufficiently out-of-range value to the shift
instruction caused it to shuffle the value instead...
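That "alternate universe" can be illustrated with a toy model (purely illustrative; the real BJX2 encoding does not work this way): in-range amounts behave as a plain shift, while out-of-range amounts have their low bits reinterpreted as a byte-shuffle control:

```c
#include <stdint.h>

/* Toy model: amt < 32 is a plain left shift; otherwise the low 8 bits
   of amt act as a 4-byte shuffle control (2 bits selecting the source
   byte for each destination byte, pshufd-style). */
static uint32_t shift_or_shuffle32(uint32_t val, unsigned amt) {
    if (amt < 32)
        return val << amt;
    unsigned ctl = amt & 0xFFu;
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned src = (ctl >> (2 * i)) & 3u;
        r |= ((val >> (8 * src)) & 0xFFu) << (8 * i);
    }
    return r;
}
```

With an out-of-range amount whose low byte is 0xE4 (the identity control word, 3-2-1-0), the "shift" just returns its input unchanged; 0x1B reverses the bytes.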
Predicate instructions work very differently in my case, having their
own copy of the 32-bit encoding space which mirrors the format of the
normal opcode space (just replacing the WEX bit with a True/False bit),
and the encoding spots that would have normally encoded Imm24 and Jumbo
being repurposed as Predication+WEX / "PrWEX" (but only applying to a
subset of the ISA).
> Scaled memory reference, 1-operand, 2-operand, 3-operand
> all have access to 32-bit or 64-bit immediates/displacements
> in substitution for a register. This eliminates any need to use
> instructions or waste registers pasting constants together.
>
Via Jumbo, these can all expand to 33 bits.
The 64-bit cases are a bit more limited, but not usually a huge issue.
There are also some Imm56 encodings "on paper" (these are in a similar
limbo as the 48-bit instruction encodings).
Seemingly the vast majority of what one needs a larger immediate for can
be handled via Imm33, where, say, only about 4% of the constants
actually go outside of this limit (the vast majority of these being
either MMIO pointers or irrational floating-point constants).
The Imm56 cases look like they would be too rare to really be worth
bothering with at present.
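The "about 4% of constants fall outside Imm33" observation boils down to a simple predicate; the sample values in the test are illustrative (the large one standing in for an MMIO-style pointer):

```c
#include <stdint.h>
#include <stdbool.h>

/* Does a 64-bit constant fit in a sign-extended 33-bit immediate? */
static bool fits_imm33(int64_t x) {
    return x >= -(1LL << 32) && x <= (1LL << 32) - 1;
}
```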
> 1-operand, 2-operand, 3-operand instructions all have sign control
> over their operands. There is no SUB instruction My 66000 uses
> ADD Rd,Rs1,-Rs2 instead. The sign control eliminates most NEG
> instructions from execution. The 2-operand group allows the
> 5-bit register specifier to be used as a 6-bit sign extended
> immediate, making ADD Rd,#1,-Rs2 easily encoded.
>
No equivalent in my case.
> There are Compare instructions that return a bit-vector of everything
> the compare circuitry can determine, including range checks like:
> 0 < Rs1 <= Rs2, classifications {-infinity, -normal, -denormal, -zero,
> +zero, +denormal, +normal, +infinity, SNaN, QNaN} I remain tempted
> to add "any byte equal", "any halfword equal", "any word equal".
>
No equivalent.
I did the same thing as SuperH here:
CMPxx instructions twiddle the SR.T bit;
Branches / Predication / ... all operate off the SR.T bit.
Ironically, because of the way Verilog works, so much stuff hanging off
a single bit causes it to get something like 1000x more expensive.
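The SuperH-style scheme can be modeled in a few lines of C (names are illustrative): a compare produces the single T bit, and a conditional branch consumes only that bit:

```c
#include <stdint.h>
#include <stdbool.h>

/* CMPGT / CMPEQ: produce the SR.T bit. */
static bool cmp_gt(int64_t a, int64_t b) { return a > b; }
static bool cmp_eq(int64_t a, int64_t b) { return a == b; }

/* BT ("branch if true"): the next PC depends only on the T bit. */
static uint64_t branch_t(bool t, uint64_t target, uint64_t fallthru) {
    return t ? target : fallthru;
}
```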
> There are 2 kinds of conditional flow: branching and predication and
> each has 2 principle kinds of instructions:: condition is determined
> from a single bit in a register, or condition is determined by comparing
> a register with 0. In addition there are conditionless branches, jumps,
> and a special addition supporting PIC for method calls and switches.
> Compare-to-zero and branch can access certain HW know information
> that is not capable of being stored in a ISA register--this includes things
> like a query to the Memory Unit asking if it has seen any interference
> between the start of an ATOMIC sequence and "now". The exception
> and interrupts and std. return are also encoded here.
>
Internally, conditional branches are equivalent to predicated branches
in my case.
One partial extension is the RISC-V style compare-and-branch
instructions, which have sort of a "shortcut" between the ALU and the
branch logic (the ALU behaves as if it were performing a CMPxx, and the
EX1 unit as if it were doing a Branch, with the shortcut signal taking
the place of the SR.T bit).
These are effectively mandatory for RISC-V decoder support, but still
regarded as optional in BJX2 (and pretty rarely see much use in
practice, if compared with CMPxx+Bx).
It is possible that a cheaper alternative might have been to rig up
something in the decoder to allow the Bxx instruction to have been split
into a CMPxx+Bx pair in the pipeline (but, I lack a mechanism for this
at present).
> Memory reference instructions enable building of ATOMIC primitives
> that can touch as many as 8 cache lines of data in a single ATOMIC
> event. This is equivalent to the MIPS LL and SC except it operates
> over much larger chunks of data. This is sufficient to move an entry
> of a shared data structure from one place to another place in a single
> event. This minimizes the number of ATOMIC events that are needed,
> and comes with guarantees of forward progress.
>
No equivalent.
> The ST instruction can store a constant in either 5-bit sign extended
> form, or in 32-bit or 64-bit forms. No need to put a constant into a
> register in order to ST it to memory. This is along with the ability
> to use 32-bit or 64-bit displacement constants.
>
No equivalent.
Multi-op sequence required.
> There are 5 "special" memory reference instructions:: ENTER is used
> to setup a new stack, and save registers, EXIT is used to tear down the
> stack and restore registers, LDM loads multiple registers, STM stores
> multiple registers, and MM moves data from memory to memory.
> MM has the property that both cached and uncached memory smaller
> than a page is moved as single ATOMIC transfer. {PCIe can do this,
> so should CPUs attached to PCIe peripherals.} There is expected to
> be a sequencer in the memory unit that performs these out of the
> data-path.
>
No equivalent.
All this stuff needs to be handled manually via Load/Store sequences.
However, there is a 128-bit Load/Store that is used (primarily) to
load/store pairs of registers in prolog/epilog sequences. This was
actually its primary purpose, with 128-bit SIMD as a secondary use-case,
along with being faster for 'memcpy()' and similar than using 64-bit
load/store instructions.
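The memcpy() advantage can be sketched in plain C: each loop iteration below stands in for one 128-bit load/store pair, halving the number of memory operations versus 64-bit accesses (stand-in code, not the actual runtime's memcpy):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy in 16-byte chunks (two 64-bit registers per "access"),
   with a byte-granular tail. */
static void copy_pairs(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);  memcpy(&hi, s + 8, 8);
        memcpy(d, &lo, 8);  memcpy(d + 8, &hi, 8);
        d += 16; s += 16; n -= 16;
    }
    while (n--) *d++ = *s++;
}

/* Self-check helper: copy a known byte pattern and sum the result. */
static unsigned copy_check(size_t n) {
    unsigned char src[64], dst[64];
    for (size_t i = 0; i < sizeof src; i++) {
        src[i] = (unsigned char)i;
        dst[i] = 0;
    }
    copy_pairs(dst, src, n);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += dst[i];
    return sum;
}
```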
> The Floating Point group includes Transcendental instructions.
> Ln, LnP1, exp, expM1, SIN, COS, TAN, ATAN and some variants
> that are only 1 constant different in the calculations. Ln2 takes
> only 14 cycles, sin takes 19 cycles. These are included because
> they actually do improve performance.
>
No equivalent, nearly all math functions done in software in my case.
Originally, there were no FDIV or FSQRT instructions either, but these
exist now.
Current timings are:
FDIV: 130 cycles
FSQRT: 384 cycles
The trig functions generally run from around 500 to 1000 cycles or so
(via unrolled Taylor expansion).
There were some older versions (that came with the C library) which
didn't unroll anything (naive algorithm); these versions seemed to weigh
in at closer to around 50000-200000 clock cycles or so (in cases where
they didn't get stuck in an infinite loop).
I replaced these versions with the unrolled ones because they were
basically unusable.
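For reference, the unrolled-Taylor approach looks roughly like the sketch below (the term count and the absence of range reduction are my simplifications; the actual BJX2 routines may differ):

```c
#include <math.h>

/* sin(x) via an unrolled Taylor series up to the x^11 term, in Horner
   form: sin x = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!.
   Assumes |x| has already been range-reduced to roughly [-pi, pi]. */
static double taylor_sin(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 + x2 * (-1.0 / 5040
             + x2 * (1.0 / 362880 + x2 * (-1.0 / 39916800))))));
}
```

With six terms the truncation error near |x| = 1 is on the order of x^13/13!, i.e. well below single-precision noise.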
> Conversions between FP and FP or FP and INT are provided by
> 1 instruction (CVT) which has 49 variants to deal with 5 specified
> rounding modes and 1 implied rounding mode (current) any time
> a rounding could transpire. This falls into the category of "once
> you have the HW to do <say> ANINT (of FORTRAN) you have the
> 95% of the logic to do them all".
>
Separate instructions in my case.
Most FP<->FP cases are dedicated converters.
The INT<->FP cases are routed through FADD (noting that one can do some
exponent tricks and get the FADD logic to do most of the heavy lifting).
Rounding for the converters is generally hard-wired, usually either
Approx-RNE (scalar path), or Truncate (SIMD paths).
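The "exponent tricks" are presumably along the lines of the classic bias trick, sketched below in C: OR the integer into the mantissa of 2.0^52, then let an ordinary FP subtract do the normalization. This is the well-known software form of the technique, not necessarily BJX2's exact datapath:

```c
#include <stdint.h>
#include <string.h>

/* Convert uint32 -> double, letting FP addition do the heavy lifting:
   0x4330000000000000 is the bit pattern of 2^52, so the low 32 mantissa
   bits hold x exactly, and (2^52 + x) - 2^52 == x with no rounding. */
static double u32_to_f64(uint32_t x) {
    uint64_t bits = 0x4330000000000000ULL | x;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d - 4503599627370496.0;  /* 2^52 */
}
```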
> The exception model is based on message passing (as is SVCs),
> rather than wandering through the high-level OS exception
> dispatcher. This model supports threads (processes or tasks)
> that are paranoid of the OS looking at their data (such as banking
> applications running on a home PC), and can indeed restrict the
> OS from looking at the address space.
>
My case:
Exception Code: { Address, Tag }
Tag is a 16-bit number
(15:12): Category
(11: 8): Target Core
( 7: 0): Magic Number
The category is used to figure out which offset relative to VBR to
branch to:
VBR basically points somewhere in kernel memory;
Holds a small table of branch instructions;
Exception dispatch branches to one of these branch instructions.
When this happens, the CPU also basically saves PC and SR to special
registers, swaps the stack pointers, and switches to a special operating
mode (which blocks further interrupts and disables the MMU and similar).
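A toy model of the dispatch path just described (handler names and return values are made up; the real VBR-relative table holds branch instructions, not C function pointers):

```c
#include <stdint.h>
#include <stddef.h>

typedef int (*ExcHandler)(uint16_t tag);

static int handle_default(uint16_t tag) { (void)tag; return -1; }
static int handle_cat1(uint16_t tag)    { return tag & 0xFF; } /* magic */

/* Stand-in for the small table of branch instructions at VBR. */
static const ExcHandler vbr_table[16] = {
    handle_default, handle_cat1,
    /* remaining entries are NULL and fall back to handle_default */
};

static int dispatch_exception(uint16_t tag) {
    unsigned cat = (tag >> 12) & 0xFu;  /* bits (15:12): category */
    ExcHandler h = vbr_table[cat] ? vbr_table[cat] : handle_default;
    return h(tag);
}
```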
Interrupts can be routed along the Ring-Bus, as with memory requests,
intended so that CPU cores can throw interrupts at each other.
The 'RTE' instruction is actually implemented as another type of
"special" exception, which basically tries to pass control back to where
the exception came from (or, at least, where the relevant registers say
the exception came from).
> I/O devices are virtualized, and operate on the virtual address
> space of originating requestor. So while the I/O device can DMA
> directly into paranoid application address space, and while OS
> can verify the given space and bounds are acceptable, OS cannot
> look into that address space. This gets rid of the need of a secured
> mode of operation.
>
> Deferred procedure calls are handled as messages (argument
> setup + 1 instruction) with continuation. The messaging sub-system
> operates over both HyperVisor and GuestOS domains simultaneously.
> Anyone with a "method" can call that method and get a response
> even if that method is running under a different GuestOS.
>
No equivalent.
Basic IO range mapping in the physical address space.
No explicit hypervisor support, but shouldn't be too hard in-theory to
fake this part in software.
Would likely be easier than it would be for x86, given the use of a
software managed TLB.
> There is a 66-bit remapped address space--any thread can access
> 64-bits of the space. Sub-spaces are {DRAM, configuration, MMIO,
> and ROM} The address space is configured to efficiently transport
> requests over a significant network (ala HyperTransport and Intel
> equivalent). DRAM is cache coherent, configuration is strongly ordered,
> MMIO is sequentially consistent, ROM is "lax").
>
My case: 48 or 96 bit virtual, 48 bit physical.
MMIO is synchronous, the bridge to the MMIO bus will effectively "lock"
and not allow another request to pass until the former request has
completed.
All MMIO accesses are fully synchronous from the L1 cache down to the
target device (unlike normal memory), though this does mean that
accessing MMIO carries a fairly steep performance penalty relative to
normal memory accesses.
It is possible that there could be a sub-mode to mimic full 64-bit on
top of 96-bit, by treating the addressing as (32,64) rather than
(48,48). This mode would not allow shoving tag bits into the high-order
parts of base-registers though (since these bits would be interpreted as
part of the virtual address).
The Full-64 mode would effectively use 64-bit pointers, unlike 96-bit
requiring the use of 128-bit pointers. The high bits (from the quadrant
address) would effectively still function as an address-range selector
(as when using 48-bit addresses in 96-bit mode; the local 48-bit space
being a sort of isolated island within the much bigger 96-bit space).
Well, and B-Trees for address translation, because the memory overheads
of an 8-level page table would suck.
> The system repeater transports requests from chip to chip, and
> amalgamates coherence requests so that the originator counts
> responses from cores on his chip, and the number of chips in
> the system (rather than counting from every core).
>
> Memory management cannot be turned off--My 66000 imple-
> mentations come out of reset with the MMUs turned on. HostBridge
> is configured with a MMU/TLB that uses exactly the same tables as
> CPUs and can share tables as applicable. Levels in the virtual
> address space translations can be skipped! So an application as
> simple as 'cat' can be managed with a single page of translation
> overhead.
>
> Memory management is inherently HyperVisor/GuestOS. Privilege
> is determined by the assortment of root pointers in use on a per
> invocation basis.
>
> GuestOS can activate a thread (taking it from a waiting state to
> running in a core) in a single instruction and remotely. So can
> HyperVisor.
>
Very different in my case.
I can use B-Trees as page-tables, which saves memory, but this is mostly
because the table walking is done in software, and the CPU itself
doesn't really care what the page-table looks like.
Could almost mimic x86 segmented addressing, except that pages need to
be page-aligned. I guess if one had non-page-aligned address
translation, they could rig up the MMU to pretend to be an 8086 or
something, but this would make the TLB's job a little harder (and
require more bits in the TLBE).
Most obvious would be that such a mode would come at the expense of the
ASID, reinterpreting the ASID bits as a displacement to add to the
translated address. Would also require physical memory pages to be
contiguous, and probably some extra hackery to deal with A20 wrap-around
behavior (I guess one would have to look into how much MS-DOS software
depends on A20 wraparound working).
> Finally, there is no notion of one thread morphing into a different
> thread over a series of instructions manipulating control registers
> one by one. For example: an ISR cleanup handler takes a thread
> off a wait state queue, places it on a run state queue, and signals
> GuestOS to see what threads should be running "right now". This
> is all 1 instruction and 1 cycle as far as the core performing the
> instruction sees.
>
Yeah, no equivalent.
> My 66000 is not just another ISA, it is a rethink of most of the components
> that make up a system. A context switch from one thread to another
> within a single GuestOS is 10 cycles. A context switch from one thread
> to a thread under a different GuestOS remains 10 cycles. The typical
> current numbers are 1,000 cycles within GuestOS, and 10,000 cycles
> across GuestOSs.
>
> OH, and BTW, The FP transcendentals are patented.
I would assume you mean FP transcendentals in hardware (in whatever way
they are implemented), as opposed to in-general.
Their existence in things like "math.h" and so on would likely preclude
any sort of patent protection in the "in general" sense.
My case is very different; I have doubts about how well a lot of this could be
pulled off in a low-cost implementation. Best I can come up with at the
moment would effectively amount to faking it using lots of microcode or
a software-based emulation layer.
I also consider my ISA to be "fairly unique", albeit in different ways
(and a little more conservative in terms of implementation concerns).