On 11/27/2023 3:17 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/26/2023 7:29 PM, Quadibloc wrote:
>>> On Sat, 25 Nov 2023 19:55:59 +0000, MitchAlsup wrote:
>>>
>>>> But Integer Multiply and Divide can share the FPU that does these.
>>>
>>> But giving each one its own multiplier means more superscalar
>>> goodness!
>
> Having two multipliers that serve both purposes means even more
> superscalar goodness for similar area cost. However, there is the
> issue of latency. The Willamette ships the integers over to the FPU
> for multiplication, and the result back, crossing several clock
> domains (at one clock loss per domain crossing), resulting in a
> 10-cycle latency for integer multiplication. I think that these days
> every high-performance core with real silicon invests into separate
> GPR and FP/SIMD (including integer SIMD) multipliers.
>
I ended up with different multipliers mostly because the requirements
are different...
>> In most code, FPU ops are comparably sparse
>
> In terms of executed ops, that depends very much on the code. GP
> cores have acquired SIMD cores primarily for FP ops, as can be seen by
> both SSE and supporting only FP at first, and only later adding
> integer stuff, because it cost little extra. Plus, we have added GPUs
> that are now capable of doing huge amounts of FP ops, with uses in
> graphics rendering, HPC and AI.
>
Yeah, this was not to say there is no FPU-dense code, or that FP-SIMD
is not useful, but rather that in most general code, ALU and LD/ST ops
tend to dominate by a fair margin.
Similarly, SIMD ops may still be useful, even if they are a relative
minority of the executed instructions (even in code sequences which are
actively using SIMD ops...).
Like, typically, for every SIMD op used, there are also things like:
The loads and stores to get the value from memory and put the results in
memory;
ALU ops to calculate the index into the array or similar that we are
loading from or storing to;
...
Well, along with other ops, like shuffles and similar to get the SIMD
elements into the desired order, etc.
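As a rough illustration (generic C, not BJX2 code), even a simple SAXPY-style loop spends most of its instructions on things other than the SIMD arithmetic itself:

```c
#include <stddef.h>

/* Illustrative sketch: a SAXPY loop. Per 4-element SIMD step, a typical
   compiler emits roughly 2 SIMD loads, 1-2 SIMD arithmetic ops, and
   1 SIMD store, plus scalar ALU ops for the index update and the loop
   branch -- so the actual SIMD math ops are a minority of the total. */
void saxpy(float *dst, const float *a, const float *b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * s + b[i];  /* load a, load b, mul-add, store */
}
```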
Like, some of this is why it is difficult to get anywhere near the
theoretical 200 MFLOPs of the SIMD unit, apart from very contrived
use-cases (such as running neural-net code), which had involved wonky
operations that combined a SIMD shuffle into the SIMD ADD/SUB/MUL ops.
For a lot of other use cases, I can just be happy enough that the SIMD
ops are "not slow".
>> Otherwise, one can debate whether or not having DIV/MOD in hardware
>> makes sense at all (and if they do have it, "cheap-ish" 68 cycle DIV is
>> at least "probably" faster than a generic software-only solution).
>
> That debate has been held, and MIPS has hardware integer divide, Alpha
> and IA-64 don't have a hardware integer divide; they both have FP
> divide instructions.
>
Technically, also, BJX2 has ended up having both Integer DIV and FDIV
instructions. But, they don't gain all that much, so are still left as
optional features.
The integer divide isn't very fast, but it doesn't matter if it isn't
used all that often.
The FDIV is effectively a boat anchor (around 122 clock cycles).
Though, mostly this was based on the observation that with some minor
tweaks, the integer divide unit could be made to perform floating-point
divide as well.
The main merit though (over a software N-R divider) is that it can
apparently give exact results (my N-R dividers generally can't converge
past the low order 4 bits).
> However, looking at more recent architectures, the RISC-V M extension
> (which is part of RV64G and RV32G, i.e., a standard extension) has not
> just multiply instructions (MUL, MULH, MULHU, MULHSU, MULW), but also
> integer divide instructions: DIV, DIVU, REM, REMU, DIVW, DIVUW, REMW,
> and REMUW. ARM A64 also has divide instructions (SDIV, UDIV), but
> RISC-V seems significant to me because there the philosophy seems to
> be to go for minimalism. So the debate has apparently come to the
> conclusion that for general-purpose architectures, you include an
> integer divide instruction.
>
Yeah.
I mostly ended up adding integer divide so that the RISC-V mode could
support the 'M' extension, and if I have it for RISC-V, may as well also
add it in BJX2 as its own optional extension.
Had also added a "FAZ DIV" special case that made the integer divide
faster, where over a limited range of input values, the integer divide
would be turned into a multiply by reciprocal.
So, say, DIV latency:
  64-bit: 68 cycles
  32-bit: 36 cycles
  32-bit FAZ: 3 cycles
As it so happens, FAZ also covers a similar range to that typically used
for a rasterizer, but is more limited in that it only handles a range of
values it can calculate exactly. For my hardware rasterizer, a similar
strategy was used, but extended to support bigger divisors at the
tradeoff that it becomes inexact with larger divisors.
However, adding a "Divide two integers quickly, but may give an
inaccurate result" instruction would be a bit niche (and normal C code
couldn't use it without an intrinsic or similar).
Partial observation is that, mostly, the actual bit patterns in the
reciprocals tend to be fairly repetitive, so it is possible to
synthesize a larger range of reciprocals using lookup tables and
shift-adjustments.
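To sketch the multiply-by-reciprocal idea (my own constants and bounds here, not the actual FAZ-DIV logic):

```c
#include <stdint.h>

/* Multiply-by-reciprocal divide sketch: with R = ceil(2^32 / d), the
   value (x * R) >> 32 equals x / d exactly whenever the rounding
   error x*(R*d - 2^32) stays below 2^32 -- which holds in particular
   for x and d each up to 16 bits, i.e. a "limited range" like FAZ. */
uint32_t fast_div(uint32_t x, uint32_t d)
{
    uint64_t r = ((1ULL << 32) + d - 1) / d;    /* R = ceil(2^32 / d) */
    return (uint32_t)(((uint64_t)x * r) >> 32);
}
```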
In the hardware rasterizer, I had experimented with a fixed-point 1/Z
divider for "more accurate" perspective-correct rasterization, but this
feature isn't cheap (fixed-point reciprocal is more complicated), and
ended up not using it for now (in favor of merely dividing S and T by Z
and then multiplying by the interpolated Z again during rasterization,
as a sort of less accurate "poor man's" version).
The "poor man's perspective correct" strategy doesn't eliminate the need
to subdivide primitives, but does allow the primitives to be somewhat
larger without having as much distortion (mostly relevant as my GL
renderer is still mostly CPU bound in the transform and subdivision
stages, *1).
In theory, proper perspective correct would entirely eliminate the need
to subdivide primitives, but would require clipping geometry to the
viewing frustum (otherwise, the texture coordinates freak out for
primitives crossing outside the frustum or crossing the near plane).
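A minimal sketch of the "poor man's" scheme (struct and names are hypothetical): S/Z and Z are interpolated linearly and multiplied back, instead of the true per-pixel divide by an interpolated 1/Z:

```c
/* "Poor man's" perspective sketch: at the vertices, S has already been
   divided by Z; across the span we linearly interpolate both S/Z and Z,
   then multiply them back together. Exact at the endpoints, only
   approximate in between (true correction would divide by an
   interpolated 1/Z instead). */
typedef struct { float s_over_z, t_over_z, z; } Vtx;

static float flerp(float a, float b, float t) { return a + (b - a) * t; }

/* approximate S at parameter t along a span from a to b */
float interp_s(Vtx a, Vtx b, float t)
{
    float soz = flerp(a.s_over_z, b.s_over_z, t);
    float z   = flerp(a.z, b.z, t);
    return soz * z;
}
```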
Approximate FDIV is possible, but I have typically used a different
strategy for the reciprocal.
*1: Though, ironically, this does also mean that, via multi-texturing,
it is semi-viable to also use lightmap lighting again in GLQuake (since
this doesn't add too much CPU penalty over the non-multitexture case).
However, dynamic lightmaps still aren't viable, as the CPU-side cost of
drawing the dynamic light-sources to the lightmaps, and then uploading
them, is still a bit too much.
I suspect, in any case, GLQuake was never really written with a 50MHz
CPU in mind.
Might have been a little easier if the games had poly-counts and draw
distances more on par with PlayStation games (say, if the whole scene
is kept under 500 polys, rather than 1k-2k polys).
Say, Quake being generally around 1k-2k polys for the scene, and roughly
300-500 triangles per alias model, ... Despite the scenes looking crude
with large flat surfaces, most of these surfaces had already been hacked
up a fair bit by the BSP algorithm (though, theoretically, some
annoyances could be reduced if the Quake1 maps were rebuilt using a
modified Q3BSP or similar, as the Quake3 tools natively support vertex
lighting and also don't hack up the geometry quite as much as the
original Quake1 BSP tool did, but alas...).
Wouldn't save much, as I still couldn't legally redistribute the data
files (and my GLQuake port isn't particularly usable absent using a
process to convert all the textures into DDS files and rebuilding the
PAK's and similar, so...).
To have something redistributable, would need to replace all of the
textures, sound effects, alias models, etc. An old (probably long dead)
"OpenQuartz" project had partly done a lot of this, but "got creative"
with the character models in a way I didn't like (would have preferred
something at least "in the same general spirit" as the original models).
Similar sort of annoyances as with FreeDoom, but at least FreeDoom
stayed closer to the original in these areas.
Also, possibly, I may need to rework some things in terms of how TKRA-GL
works to better match up with more recent developments (all this is
still a bit hacky; effectively linking the whole GL implementation to
the binary that uses GL).
Granted, it is also possible I may need to at some point move away from
linking the whole TestKern kernel to any programs that use it as well,
with the tradeoff that programs would no longer be able to launch "bare
metal" in the emulator (but would always need the kernel to also be
present).
>> For cases like divide-by-constant, also, typically it is possible to
>> turn it in the compiler into a multiply by reciprocal.
>
> But for that you want a multiply instruction, which in the RISC-V case
> means including the M extension, which also includes divide
> instructions. Multiplying by the reciprocal may still be faster.
>
Yeah.
For the relevant cases, widening multiply for 32-bit integers generally
has a 3 cycle latency in my case.
Whereas DIV will often have a 36 cycle latency in the general case.
So, the choice is obvious.
Granted, this still generally needs a shift instruction following the
multiply.
Having a "do a widening 32-bit multiply and then right-shift the result
by 32/33/34/35 bits" instruction would be possible in theory, but a
little wonky.
Could possibly make sense if it could also encode a 32-bit immediate:
DMULSH33S Rm, Imm32u, Rn //Int32: Rn=(Rm*Imm32)>>33
DMULSH33U Rm, Imm32u, Rn //UInt32: Rn=(Rm*Imm32)>>33
Probably don't really have the encoding space to pull this off though
(with a 64-bit encoding; could be doable in theory with a 96-bit
FE-FF-OP encoding though).
Though, as can be noted, this could likely borrow some of the logic from
the FAZ-DIV mechanism or similar.
It would allow handling "y=x/C;" cases in 3L/1T, but integer division
still isn't really all that common (at least, not enough to justify
going through all this just to save a right-shift op).
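For reference, the standard compiler transform being discussed, written out for a divide by the constant 10 (the magic constant is the well-known one for unsigned 32-bit divide-by-10):

```c
#include <stdint.h>

/* Divide-by-constant via widening multiply: x/10 becomes a widening
   multiply by the "magic" reciprocal 0xCCCCCCCD followed by a 35-bit
   right shift, valid for all 32-bit unsigned x. A fused DMULSH-style
   instruction would fold that trailing shift into the multiply. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```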
> - anton