On 1/13/2024 3:38 AM, Marcus wrote:
> On 2024-01-07 22:07, Brian G. Lucas wrote:
>> Several posters on comp.arch are running their cpu designs on FPGAs.
>
> Yes. I use it for my MRISC32 project. I implement a CPU (MRISC32-A1) as
> a soft processor for FPGA use. Furthermore I implement a kind of
> computer around the CPU, to provide ROM, RAM, I/O, graphics, etc.
>
> Here's a recent video of the computer running Quake:
>
>
> https://vimeo.com/901506667
>
> I use/target two different FPGA boards, but I mainly use one of them for
> development.
>
At least it is going fast...
If I run the emulator at 110 MHz:
  Software Quake gets ~ 10.4 fps.
  GLQuake gets 12.7 fps.
Though, GLQuake performance partly took a hit recently, as I have been
moving away from running the GL backend directly in the program and
instead running it via system calls.
I had partly integrated some features from the version that was stuck
onto the Quake engine into the other branch, which was modified to work
inside the TKGDI process (such as support for the rasterizer module).
Looking at the profiler output, it appears it is still doing a bit of the
GL rendering via the software span-drawing, though.
Though, in the process, it has gone from "hybrid poor man's perspective
correction" back to plain "affine texturing with dynamic subdivision",
with a comparatively finer subdivision.
So, proper perspective correction would involve:
  Divide ST coords by Z before rasterization;
  Interpolate as 1/Z;
  Dynamically calculate Z via "1/(interpolated 1/Z)";
  Scale ST coords by Z during rasterization.
Poor man's version:
  Divide ST coords by Z before rasterization;
  Interpolate as Z;
  Scale ST coords by Z during rasterization.
This version isn't as good as the proper one, and adds some issues of
its own vs affine.
Affine:
  Interpolate ST coords directly (no Z scaling).
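A rough per-pixel sketch of the three schemes in C (assumed for
illustration, not the engine's actual code; the inputs are whatever was
linearly interpolated across the span for the current pixel):

  /* Proper perspective correction: interpolate s/z, t/z and 1/z,
     recover z per pixel with a divide, then rescale. */
  static void texcoord_perspective(float s_over_z, float t_over_z,
                                   float inv_z, float *s, float *t)
  {
      float z = 1.0f / inv_z;
      *s = s_over_z * z;
      *t = t_over_z * z;
  }

  /* "Poor man's" version: interpolate s/z, t/z, and z itself (skipping
     the per-pixel 1/(1/z)), then rescale; cheaper, but less accurate. */
  static void texcoord_poor_mans(float s_over_z, float t_over_z,
                                 float z, float *s, float *t)
  {
      *s = s_over_z * z;
      *t = t_over_z * z;
  }

  /* Affine: interpolate s and t directly, no per-pixel scaling; larger
     primitives then need dynamic subdivision to hide the warping. */
  static void texcoord_affine(float s_lerp, float t_lerp,
                              float *s, float *t)
  {
      *s = s_lerp;
      *t = t_lerp;
  }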
However, larger primitives (*) with affine texturing need to be split
into smaller primitives during rendering, which adds cost in terms of
transform/projection; this seems to be a more significant part of the
overall cost when using the hardware rasterizer module.
Actual perspective correction could be better here, but the "quickly and
semi-accurately calculate 1/(1/Z)" part is a challenge.
*: At the moment, basically any triangle with a circumference larger
than ~ 21 pixels, or 28 pixels for a quad, will be subdivided. Going too
much bigger makes the affine warping a lot more obvious.
The Software Quake in this case is a modified version of the Quake C
software renderer:
  It was modified early on to use 16-bit pixels rather than 8-bit pixels;
    Initially this was YUV655, but it then went to RGB555.
  A few functions were rewritten in ASM, though still basically all
  scalar code; the bulk of the renderer is still C.
There is still some weirdness in a few places where the math assumes YUV,
which leads to things like the menu-background blend being the wrong
color (never got around to fixing this), ...
(The video there seemed to show a dithered effect, which is a little
different from a color blend.)
Did gain some alpha-blended effects (such as a translucent console),
because these seemed cool at the time, and they aren't too hard to pull
off with RGB pixels.
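For example, a minimal sketch (assumed, not the port's actual code; the
RGB555 channel layout is also an assumption) of blending two RGB555
pixels by an alpha factor, roughly what a translucent console needs per
pixel:

  #include <stdint.h>

  /* Blend src over dst with alpha in 0..31 (31 = fully src).
     Assumes R in bits 10-14, G in bits 5-9, B in bits 0-4. */
  static uint16_t blend_rgb555(uint16_t src, uint16_t dst, int alpha)
  {
      int sr = (src >> 10) & 31, sg = (src >> 5) & 31, sb = src & 31;
      int dr = (dst >> 10) & 31, dg = (dst >> 5) & 31, db = dst & 31;
      int r = (sr * alpha + dr * (31 - alpha)) / 31;
      int g = (sg * alpha + dg * (31 - alpha)) / 31;
      int b = (sb * alpha + db * (31 - alpha)) / 31;
      return (uint16_t)((r << 10) | (g << 5) | b);
  }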
Note that my GLQuake port is still faster than the Quake
software-renderer, even with software-rasterized OpenGL.
Does sort of imply a faster software renderer could still be possible...
Though, in my Doom port, I did eventually go from color-blending (for
things like screen flashes) to integrating the color flash into the
active "colormap" table (*), which is used every time a span or column is
drawn in Doom (not so much in SW Quake, where texturing+lighting is
precalculated via a "surface cache").
*: It is computationally faster to RGB-blend the current version of the
colormap table than to RGB-blend the final screen image (with
menus/status-bar/etc. drawn via the unblended colormap).
Though, I did once experiment with eliminating the colormap table
entirely in Doom and using purely RGB modulation (like one might do in a
GL-style rasterizer), but this was slower than using the colormap table.
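The cost argument, as a rough sketch (assumed code, not the actual port;
the resolution, colormap size, and 50/50 packed blend are illustrative):

  #include <stdint.h>

  #define CMAP_SIZE  256
  #define SCREEN_W   320
  #define SCREEN_H   200

  /* cheap 50/50 blend of two RGB555 pixels without unpacking channels */
  static inline uint16_t avg555(uint16_t a, uint16_t b)
  {
      return (uint16_t)(((a & 0x7BDE) >> 1) + ((b & 0x7BDE) >> 1));
  }

  /* ~256 blends per frame; spans/columns drawn afterwards pick up the
     tint automatically via the table */
  void flash_via_colormap(uint16_t *cmap, uint16_t flash)
  {
      for (int i = 0; i < CMAP_SIZE; i++)
          cmap[i] = avg555(cmap[i], flash);
  }

  /* ~64000 blends per frame if applied to the final image instead */
  void flash_via_screen(uint16_t *fb, uint16_t flash)
  {
      for (int i = 0; i < SCREEN_W * SCREEN_H; i++)
          fb[i] = avg555(fb[i], flash);
  }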
At the moment, Doom at least mostly holds over 20 fps (at 50 MHz), having
gained a few fps on average from a recent experimental optimization:
Temporary variables which are used exclusively as function-call inputs
can have their expression evaluated directly into the register
corresponding to the function argument, rather than first going to a
callee-save register and then being MOV'ed into the final argument
register.
Effect seems to be:
  Makes the binary 3% smaller;
  Makes Doom roughly 9% faster;
  Drops "MOV Reg,Reg" from being ~ 16% of total ops to ~ 13%;
  Causes the number of bundled instructions to drop by 1% though;
  ...
Note that this only applies to temporaries, not to expressions performed
via local variables or similar, which still use callee-save registers.
Had sort of hoped it would save more, but it seems like many of the MOVs
for function arguments are coming from local variables rather than
temporaries (and, unlike a temporary, the contents of a local variable
still need to be intact after the function call).
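A hypothetical illustration of the temporary-vs-local distinction (the
pseudo-assembly in the comments is made up for explanation; the register
names are not the actual BJX2 ABI):

  int bar(int v);

  int call_with_temporary(int x)
  {
      /* "x + 1" is a temporary used only as a call input, so it can be
         evaluated directly into the first argument register:
           old: ADD Rtmp, Rx, 1 ; MOV Rarg0, Rtmp ; call bar
           new: ADD Rarg0, Rx, 1 ; call bar                      */
      return bar(x + 1);
  }

  int call_with_local(int x)
  {
      /* "y" is a local variable whose value is still needed after the
         call, so it stays in a callee-save register and still needs a
         MOV into the argument register before the call.          */
      int y = x + 1;
      return bar(y) + y;
  }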
After a fair bit of debugging (to get the built program to not be
entirely broken), this change has a more obvious effect on the
performance of ROTT (which gets around 70% faster and ~ 6% smaller).
(Though, there is some other unresolved, less-recent bug that seems to
cause MIDI playback in ROTT to sound like broken garbage.)
Though, for ROTT this wasn't isolated from a few other recent
optimizations:
  Eliminating the initial condition check for "for()" loops of the form
      for(i=M; i<N; i++)
    when M and N are both constant and M<N (sketched after this list);
  Reworking "*ptr++" in the RIL-IR stage to eliminate an extra "MOV";
    This also eliminates an extra temporary (which manifested as the
    "MOV"); it involved detecting and handling this as a single operation
    and generating the RIL3 stack-operations in a different order.
    Didn't bother detecting/handling preincrement cases yet though.
  Making expressions like "x=*ptr;" not use an extra temporary;
  ...
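The "for()" case, as a small assumed example (not from the actual
codebase):

  /* With M=0, N=16 both constant and M<N, the body is known to run at
     least once, so the compiler can skip the entry test and only test
     at the bottom of the loop, roughly as if it were written: */
  void clear16(int *arr)
  {
      int i = 0;
      do {
          arr[i] = 0;
          i++;
      } while (i < 16);
  }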
Well, and other changes:
  Making the size limit for inline "memcpy()" smaller, and adding a
  copy-slide and generated memcpy's for the intermediate cases.
    Was:
      < 128 bytes: generate inline;
      < 512 bytes: maybe special-case inline (if speed-optimized).
    Now:
      < 64 bytes: generate inline;
      < 512 bytes: call a generated unrolled copy-loop/slide.
  This is mostly because handling larger cases inline is bulky:
  it takes around 512 bytes of ".text" to copy 512 bytes inline...
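For reference, a minimal sketch of the copy-slide idea in C (assumed for
illustration; the generated code is presumably plain ASM, and the chunk
size and limits here are made up): an unrolled run of fixed-size copies
that can be entered part-way through, Duff's-device style, so one shared
blob covers a range of sizes (shown here only up to 64 bytes).

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Copies the first (n/8)*8 bytes by jumping into an unrolled run of
     8-byte copies; a real slide would handle the tail bytes as well. */
  static void copy_slide(void *dst, const void *src, size_t n)
  {
      uint8_t *d = (uint8_t *)dst;
      const uint8_t *s = (const uint8_t *)src;

      switch (n / 8) {   /* enter the slide at the right depth */
      case 8: memcpy(d + 56, s + 56, 8);  /* fall through */
      case 7: memcpy(d + 48, s + 48, 8);  /* fall through */
      case 6: memcpy(d + 40, s + 40, 8);  /* fall through */
      case 5: memcpy(d + 32, s + 32, 8);  /* fall through */
      case 4: memcpy(d + 24, s + 24, 8);  /* fall through */
      case 3: memcpy(d + 16, s + 16, 8);  /* fall through */
      case 2: memcpy(d +  8, s +  8, 8);  /* fall through */
      case 1: memcpy(d +  0, s +  0, 8);  /* fall through */
      case 0: break;
      }
  }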
Some of this is basically a case of going through some debug ASM and
looking for "stupid instruction sequences", and trying to figure out
what causes them and how to fix it.
However, "obvious cases that save lots of instructions" are becoming
much less common.
And some other optimizations, such as constant propagation, would be a
lot more difficult to pull off... where, say, the value of a constant is
seen via a variable rather than a "#define" or similar. (My compiler
already has the simpler optimization of replacing expressions like "2+3"
with "5".)
The big problem with constant propagation is that whether or not a
constant can be propagated depends on local visibility and control flow
(and it would likely be of very limited effectiveness if it could not
cross boundaries between basic blocks).
For example, if it could not cross a basic-block boundary, it would still
have been N/A for the previous "for() loop" optimization (which in this
case was handled via AST-level pattern matching).
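A hypothetical example of the basic-block issue: here "n" is effectively
a constant, but the branch splits the function into several basic blocks
between the assignment and the use.

  void fill(int *arr, int flag)
  {
      int n = 16;                  /* constant, but held in a variable */
      if (flag)
          arr[0] = -1;             /* branch: a new basic block starts */
      for (int i = 0; i < n; i++)  /* propagating n==16 here means
                                      tracking it across those blocks */
          arr[i] = i;
  }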
Some of the remaining inefficiencies cross multiple levels of the
compiler, which is annoying...
Then there are a lot of things that GCC does that I have little idea how
to pull off at the moment.
For example, it assigns local variables to registers in ways that seem to
be localized yet flow across basic-block boundaries; currently BGBCC does
nothing of the sort (the closest it can do is rank the most-used
variables and statically assign them to registers for the scope of the
whole function, with anything else using spill-and-fill via the stack
frame).
Sadly, despite having 64 GPRs, I still have not entirely eliminated the
use of spill and fill. The mechanism that can eliminate spill-and-fill at
a function scale (by assigning everything to registers) is basically
defeated as soon as anything takes the address of a local variable or
similar (the whole function falls back to the generic strategy, and the
local variable in question goes over to not caching its value in a
register at all, instead using spill/fill every time that variable is
accessed, anywhere in the function...).
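A small made-up illustration of that failure case (the get_extra()
helper is hypothetical):

  void get_extra(int *out);        /* hypothetical helper */

  int sum_plus_extra(int *vals, int count)
  {
      int acc = 0;                 /* could otherwise live in a register */
      int tmp;
      for (int i = 0; i < count; i++)
          acc += vals[i];
      get_extra(&tmp);             /* &tmp taken: tmp is spilled/filled on
                                      every access, and the whole function
                                      drops back to the generic strategy */
      return acc + tmp;
  }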
...
>> I have several questions:
>> 1. Which particular FPGA chip? (not just family but the particular SKU)
>
> a) Intel Cyclone-V 5CEBA4F23C7N (my main development FPGA)
>
> b) Intel MAX 10 10M50DAF484 (this is the smaller one of the two)
>
Mostly still XC7A100T and XC7A200T.
The advantage of the latter in this case is that I can fit multiple
cores, where the single-core config uses around 70% of an XC7A100T.
With some limitations, I can sorta shoehorn it into an XC7S50, though not
with the entire feature set.
>> 2. On what development board?
>
> a) Terasic DE0-CV
>
> b) Terasic DE10-Lite
>
Had once looked into these, but didn't get them, as they weren't super
cheap and were different enough to require some porting effort.
Did at one point synthesize the BJX2 core in Quartus though...
>> 3. Using what tools?
>
> Development: Sublime Text + VS Code + GHDL + gtkwave (all free).
>
> Programming: Intel Quartus Prime Lite Edition, v19.1.0 (it's free).
>
All Verilog here...
It seems the dialect I am using is some sort of intermediate between
Verilog and SystemVerilog: Vivado accepts it as Verilog, but for Quartus
I needed to tell it that it was SystemVerilog.
Otherwise, it seems to work fine in Verilator and similar as well.
>>
>> Thanks,
>> brian
>
>