It is a tradeoff.
As-is, I have a 96-bit address space, but this is likely overkill.
As noted, the registers themselves would be unstructured, but operations
on those registers could assume a certain amount of structure.
Sort of like how 4x Binary32 SIMD ops assume that the vector holds
4x Binary32 values.
As noted, pointer formats in my case are basically:
(47: 0): Address
(59:48): Type-Tag / Etc
(63:60): Top-Level Tag
But, for the most part, Load/Store ops, etc ignore the high 16 bits.
But, say, the top 4 bits are:
0000: Basic (Object) pointer
0001: Small Value Spaces
0010: Bound Array
0011: Bound Array
01xx: Fixnum (62 bit)
10xx: Flonum (62 bit)
1100: TagArray + Base Offset
1101: Dense Vector
1110: Typed Pointer
1111: Extended 60-bit Linear Address (Optional, 1)
1: This format would effectively map a 64-bit pointer to a range of
multiple "quadrants" (or, possibly a table of locations within the
larger 96-bit address space).
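As a rough sketch, decoding the 64-bit tagged-pointer layout above could
look like the following (helper names are made up for illustration; only
the field positions come from the description):

```c
#include <stdint.h>

// Illustrative accessors for the 64-bit tagged-pointer layout.
static inline uint64_t ptr_address(uint64_t p)  { return p & 0xFFFFFFFFFFFFull; }          // bits (47: 0)
static inline unsigned ptr_type_tag(uint64_t p) { return (unsigned)((p >> 48) & 0xFFF); }  // bits (59:48)
static inline unsigned ptr_top_tag(uint64_t p)  { return (unsigned)(p >> 60); }            // bits (63:60)

// Fixnum: top two bits are '01', low 62 bits hold the signed value.
static inline int is_fixnum(uint64_t p) { return (p >> 62) == 1; }
static inline int64_t fixnum_value(uint64_t p) {
    return ((int64_t)(p << 2)) >> 2;  // shift out the tag, sign-extend the 62-bit payload
}
```

Load/Store ops ignoring the high 16 bits then falls out naturally: they
just use `ptr_address()` and never look at the tag fields.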
There is a 128-bit format:
( 47: 0): Address (Low 48-bits)
( 59: 48): Type-Tag / Etc
( 63: 60): Top-Level Tag
(111: 64): Address (High 48-bits)
(127:112): Additional Tag Metadata
The formats for bounded-arrays are expanded somewhat.
The Fixnum and Flonum formats would expand to 124 bits in this case.
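One possible in-memory view of the 128-bit format (little-endian pair of
64-bit halves; struct and accessor names are made up for illustration,
only the bit positions come from the layout above):

```c
#include <stdint.h>

// Hypothetical view of the 128-bit tagged-pointer format.
typedef struct {
    uint64_t lo;   // (47:0) address low, (59:48) type tag, (63:60) top-level tag
    uint64_t hi;   // (111:64) address high 48 bits, (127:112) extra tag metadata
} TagPtr128;

static inline unsigned tp128_top_tag(TagPtr128 p) { return (unsigned)(p.lo >> 60); }
static inline uint64_t tp128_addr_lo(TagPtr128 p) { return p.lo & 0xFFFFFFFFFFFFull; }
static inline uint64_t tp128_addr_hi(TagPtr128 p) { return p.hi & 0xFFFFFFFFFFFFull; }
static inline unsigned tp128_meta(TagPtr128 p)    { return (unsigned)(p.hi >> 48); }
```

The full 96-bit address is then the concatenation of the two 48-bit
halves.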
Current format for function pointers and link register is:
( 0): Inter-ISA Bit
(47: 1): Address
(63:48): Mode Flags
If Addr(0) is 0, the high 16 bits are ignored (or Trap if the Link
Register is expected). If 1, the high bits encode the operating mode and
similar.
High bits:
* (63:56), Saved SR(15: 8), U0..U7
* (55:52), Saved SR(23:20), WX3, WX4, WM1, WM2
* (51:50), Saved SR(27:26), WXE, WX2
* (49:48), Saved SR( 1: 0), S and T
The WXn and WMn bits encode the operating mode (BJX2 vs XG2 vs RV64 vs
XG2RV).
U0..U7 are user-defined or context-dependent flag bits (2).
S and T are the values of the S and T flag bits (in the current form of
the ISA, these are saved across function calls as this makes it possible
to predicate blocks with function calls).
2: In an x86 emulator, it is possible that these could be used as a
stand-in for ALU status flags or similar. These are also preserved
across function calls. They may also be used for more complex
predication (predicating based on logical relations between U-bits).
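Pulling the pieces apart, a decode of the function-pointer / link-register
format could be sketched as below (accessor names invented; bit positions
per the layout above):

```c
#include <stdint.h>

// Illustrative decode of the function-pointer / link-register format.
static inline int      lr_inter_isa(uint64_t lr) { return (int)(lr & 1); }          // bit 0
static inline uint64_t lr_target(uint64_t lr)    { return lr & 0xFFFFFFFFFFFEull; } // bits (47:1), bit 0 cleared
// The fields below are only meaningful if the inter-ISA bit is set.
static inline unsigned lr_u_bits(uint64_t lr) { return (unsigned)(lr >> 56) & 0xFF; } // U0..U7 -> SR(15:8)
static inline unsigned lr_wx_wm(uint64_t lr)  { return (unsigned)(lr >> 50) & 0x3F; } // WXE/WX2/WX3/WX4/WM1/WM2
static inline unsigned lr_s_t(uint64_t lr)    { return (unsigned)(lr >> 48) & 0x3;  } // S and T -> SR(1:0)
```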
> I am not so sure about the slowness of a superscalar. I think it may have more
> to do with how full the FPGA is, the size and style of the design. I have been
> able to hit close to 50 MHz timing according to the tools. I think 50 MHz is on
> par with many other FPGA designs. Even if it is a few MHz slower the ability of
> OoO to hide latency may make it worth it.
>
Issue is mostly "net delay" (and to an extent, "fanout"):
The bulkier/wider/etc the logic gets, the slower it gets...
And, say, one can do a "simple 16-bit machine" and run it at 200 MHz.
But, not so much anything much bigger than said simple 16-bit machine.
Or a 32-bit machine running at 100 MHz (say, MicroBlaze falls in here).
But, a 3-wide 64-bit machine is limited to around 50 MHz.
Had gotten 1-wide variants running at 100 MHz, but not reliably.
A 1-wide core running at 75 MHz is a little easier to pull off though.
Making the caches bigger also makes them slower, etc.
But, going from 64-bit to 128-bit is likely to result in a notable
increase in area (particularly for superscalar or VLIW), which is likely
going to make the "net delay" issues a lot worse.
For 128-bit, one might end up, say, with a scalar machine that runs at
50 MHz, but if they want to go 2-wide, they need to drop it to 33 or 25 MHz.
On the other side, 32-bit machines, while one can get them running at
potentially 100 or (maybe) 150 MHz, are more limited in some areas (and
32-bit is starting to look a little "stale" at this point).
While a 16-bit core can get a higher clock-speed, a 16-bit machine is
too limited to really accomplish a whole lot (and pretty much anything
one can do to make it "actually useful" would either come at the expense
of reducing clock speed, or taking too many clock cycles to give it much
of an advantage).
But, yeah, the limitation for MHz on FPGA seems to be more about how
long it takes signals to propagate around the FPGA, rather than about
the speed of the individual LUTs and similar.
So, if a given piece of FPGA logic is small, it can be internally run at
a higher clock-speed. Though, for external IO, one is limited some in
that one can't drive the pins much faster than around 100 MHz without
using SERDES.
Decided to leave out stuff about clock speeds and external wiring
(signal integrity over wiring gets finicky as MHz increases).
> *****
> I cannot believe I ignored Cordic for so long, having discovered its beauty.
> Answering my own engineering question, 10-bits max micro-code it is; that
> should be overkill. It should not take very many, if any micro-code words for
> basic trig. Previously seven bits were used for Thor2021, but only about a half
> dozen non-math instructions were micro-coded.
>
> Having done some head scratching over cordic and how to calculate the tan,
> I am thinking of just providing a cordic instruction that takes all three
> arguments and calculates away, and then leave it up to the programmer to
> decide on operands and what is being calculated. I get how to calculate sin
> and cosine via cordic and I think tan finally. Specifying that x, y must be
> fractions between 0 and 1, and that the angle must be radians between 0
> and 2 pi.
>
> I gather that modern processor do not use Cordic, and use polynomial
> expansions instead.
>
Yeah.
I mostly used Taylor Series expansions and similar...
In my own fiddling, I didn't find much that is both faster than a Taylor
expansion and also gives similarly good accuracy.
"Lookup-Table and Interpolate" is faster, but falls short in terms of
accuracy.
A "reasonable balance" being to use a few table lookups followed by
cubic-spline interpolation.
This would "mostly be good enough" for use in games or similar, but the
C library doesn't really define any "fast but approximate" math
functions, leaving this sort of thing as non-standard extensions.
Say:
//do 'sin(x)', but faster and less accurate...
double sin_fast(double x);
One wouldn't do this with the normal "sin(x)" though as the assumption
is that these give an accurate value, rather than a fast value (and
programs that need "fast but inaccurate" sin/cos usually provide their
own lookup-table versions).
And, it is also not always clear where the optimal balance point should
be for "fast" variants (programs likely still using their own version if
the C library's is slower; or it being unusable if the accuracy isn't
good enough for a given calculation).
Faster but less accurate:
Single lookup, don't bother with interpolation.
Slower but more accurate:
Taylor expansion but with reduced stages (vs the full version).
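A reduced-stage Taylor version could be sketched as below (5 terms,
Horner-evaluated; assumes the argument has already been range-reduced to
roughly [-pi, pi], which a full version would handle first):

```c
// Taylor-series sin with a reduced number of stages:
// x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, evaluated Horner-style.
// Assumes x is already range-reduced to roughly [-pi, pi].
double sin_taylor(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 +
                x2 * (-1.0 / 5040 + x2 * (1.0 / 362880)))));
}
```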
> For Thor Cordic is calculated out to 61-bits as that is eight more than the
> 53-bit significand.
>
> 13k LUTs to implement parallel cordic. So, switched to sequential cordic:
> 1.5 k LUTs. Might be supportable.
>
Possible. Not sure how easy it would be to glue onto my shift-add unit.
As noted, my FPU doesn't do trigonometric functions itself though, but
leaves all this up to software.
From what I can gather, it doesn't really look like Cordic is
(particularly) likely to beat out a Taylor expansion in terms of speed.
But, if it were up to me, the fundamental FPU operations would mostly be:
FADD, FSUB, FMUL
Maybe: FMAC (Z=Z+X*Y)
I can do FDIV with the shift-add unit, but:
It is slower than approximate versions;
Say, a crude approximation followed by 1 or 2 Newton-Raphson stages.
It is only slightly faster than the "full Newton-Raphson version".
But, does give correct results in the low 4 bits of the result.
Software N-R seemingly unable to fully converge the last 4 bits (1).
Though, it being slightly faster in the generic "double x,y,z; z=x/y;"
case, and slightly more accurate, is at least "sort of useful".
So, ironically, the "fast" version is "do it in software".
1: Once it gets sufficiently close to the target value, "Brownian
Motion" seems to take over instead (and switches back to heading towards
the exact value whenever it gets outside of ~ +/- 7 ULP or so).
Software Shift-Add could give an exact FDIV, but would be slower than
either of the above.
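The "crude approximation followed by Newton-Raphson stages" scheme could
be sketched as below (a generic software model using frexp/ldexp for the
seed; an FPU or soft-float version would seed from the exponent and
mantissa bits directly, and as noted above the last few bits may still
wander rather than fully converge):

```c
#include <math.h>

// Division via reciprocal Newton-Raphson: seed a rough guess for 1/y,
// refine with r = r*(2 - y*r), then multiply by x.
// Assumes y > 0 and finite; each refinement step roughly doubles the
// number of correct bits.
double fdiv_nr(double x, double y) {
    int e;
    double m = frexp(y, &e);                        // y = m * 2^e, m in [0.5, 1)
    double r = ldexp(48.0/17 - (32.0/17) * m, -e);  // linear seed for 1/y (~1/17 rel. error)
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - y * r);                      // Newton-Raphson refinement
    return x * r;
}
```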