For the BJX2 core, mostly on an XC7A100T, I can fit a Double Precision
FPU (1x Binary64) and also a 4x Binary32 SIMD unit (albeit the latter
using a hard-wired truncate-only rounding mode).
Originally, I had done Binary32 SIMD by pipelining it through the main
FPU, but with a dedicated low-precision SIMD unit, I can get a fair bit
of a speedup: 20 MFLOPs -> 200 MFLOPs, at 50 MHz.
And I was able to stretch it enough to "more or less faithfully" handle
full Binary32 precision.
This is along with the MMU and the 3-wide execute unit.
Don't really have enough LUTs left over for a second core or any real
sort of GPU.
I could potentially fit dual cores and boost clock speeds to 75 MHz on
an XC7A200T, but this is expensive (and boards with these have been
mostly out-of-stock for a while).
An XC7K325T or similar would possibly allow quad-core and/or a dedicated
GPU, as well as 100 or 150MHz. But, like, I don't really have the money
for something like this... (And this is basically the largest and
fastest FPGA supported by Vivado WebPack).
I can also fit the BJX2 core onto an XC7S50, but need to scale it back
slightly to make it fit.
I have also used an XC7S25, but generally I can only seem to fit
simpler RISC-style cores on it. Typically no FPU or MMU.
On the XC7S25 or XC7A35T, one could make a strong case for RISC-V
though, as what one can fit on these FPGAs is pretty much in-line for a
simple scalar RISC-V core or similar (and a RISC-like subset of BJX2 has
no real practical advantage over RV64I or similar).
A stronger case could probably be made for RV32I or maybe RV32IM or
similar on this class of FPGA.
And, if I were doing a GPU on an FPGA, something akin to my current
BJX2-XG2 mode could make sense as a base (possibly with wider SIMD and
also SIMD'ing the memory loads and stores, *).
*: Say, loads where the index register is a vector encoding multiple
indices, each of which is loaded into a subset of the loaded vector. The
ISA would look basically the same as it is now, except nearly everything
would be "doubled".
ADD R16, R23, R39
Would actually add 128-bit vectors containing a pair of 64-bit values
(and the current 128-bit SIMD ops would effectively expand to being
8-wide 256-bit operations). Most Loads/Stores would also be doubled
(idea being to schedule loop iterations into each element of the vector;
with only a subset of ops being "directly aware" of the registers being
SIMD vectors; possibly with a mode flag to enable/disable side-effects
from the high-half of the vector).
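To make the idea a bit more concrete, here is a rough C sketch of the "doubled register" semantics (the names and layout here are my own, purely for illustration): each architectural register holds two 64-bit lanes, a plain ADD applies per lane, and a load whose index register is a vector becomes a two-element gather.

```c
#include <stdint.h>

/* Hypothetical model of a "doubled" register:
 * two 64-bit lanes, one per scheduled loop iteration. */
typedef struct { uint64_t lane[2]; } vreg;

/* ADD Rd, Rs, Rt: adds both 64-bit halves independently. */
static vreg v_add(vreg a, vreg b) {
    vreg r;
    r.lane[0] = a.lane[0] + b.lane[0];
    r.lane[1] = a.lane[1] + b.lane[1];
    return r;
}

/* Doubled load: the index register holds one index per lane,
 * and each lane of the destination is gathered from base[idx]. */
static vreg v_load_gather(const uint64_t *base, vreg idx) {
    vreg r;
    r.lane[0] = base[idx.lane[0]];
    r.lane[1] = base[idx.lane[1]];
    return r;
}
```

The point being that the instruction stream itself stays scalar-looking; only the register file and a few gather/scatter-aware loads and stores know the lanes exist.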
But, as noted, I would likely need a Kintex or similar to have the LUT
budget for something like this...
> Besides, if he can't fit something into device, it
> does not mean that experienced FPGA guy will have the same difficulties.
> But let's leave it aside and concentrate on devices.
> So, 23,360 logic cells and 80 DSP slices.
> For comparison, the biggest device in the same decade old Xilinx 7
> series is Virtex XC7VH870T with 876,160 logic cells and 2520 DSP slices.
A Virtex-7 is also several orders of magnitude more expensive...
If a chip costs more than a typical person will have during their
lifetime, it almost may as well not exist as far as they are concerned.
I guess a person can "rent" access to Virtex devices via remote cloud
servers. Still not very practical though.
> Newer high-end FPGA devices are bigger by another order of magnitude
> although not in DSP slices area: Virtex ULTRAScale+ XCVU19P has 8,938,000
> logic cells and 3,840 DSP slices. But that device is also not quite new.
> In more recent years Xilinx lost interest in traditional FPGAs and is trying
> to push "Adaptive Compute Acceleration Platform" (AGAPs). Some of those have
> rather insane amount of multipliers, so big that they stopped counting DSP
> slices and are now counting DSP Engines. I am too lazy to dig deeper and find
> out what it really means, but one thing is sure - there are a lot of compute
> resources in these devices. May be not as much as in leading edge GPUs,
> but it's the same order of magnitude.
> Were are not talking about 10 or 100 or 1000 FPUs on the high end AGAPs.
> More like 10,000.
These look, if anything, more relevant to "AI" and/or "bitcoin
mining" than to traditional FPGA use cases.
They look less relevant to me personally, as "do whole lots of FPU
math" isn't really the typical bottleneck in my projects.
Granted, if one does want a "thing that does lots of FPU math", this
could make sense.
>> Some of these FPGAs are more expensive than a CPU running at decent
>> but not extraordinary frequency.
> That's another understatement.
General case performance even on par with a RasPi is difficult...
Ironically, it is a little easier to compete with a RasPi for software
OpenGL, as the RasPi just sorta sucks at this (it effectively "face
plants" so hard as to offset its clock-speed advantage).
In a "performance per clock" sense at this task, my BJX2 core somewhat
beats out my Ryzen 7 for this as well. For the NN tests, it almost gets
almost a little absurd...
For more "general purpose" code, the BJX2 core kinda gets its crap
handed back to it though.
Though, in more realistic scenarios, it is hard to get an Artix-7
anywhere near the speeds of my desktop PC; it only comes close in
contrived scenarios (e.g., where the cost of emulating Binary16
FP-SIMD and similar on a PC via bit-twiddling exceeds the clock-speed
delta, ~74x).
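For reference, decoding Binary16 on a PC without hardware support means roughly the following per element (this is a standard half-to-float bit expansion, not code from my emulator); doing this per lane, per operation, both directions, is where the overhead piles up.

```c
#include <stdint.h>
#include <string.h>

/* Expand an IEEE 754 binary16 value to a host float via bit-twiddling. */
static float fp16_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                       /* signed zero */
        } else {
            /* subnormal: renormalize the mantissa */
            exp = 127 - 15 + 1;
            while (!(man & 0x400)) { man <<= 1; exp--; }
            man &= 0x3FF;
            bits = sign | (exp << 23) | (man << 13);
        }
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000 | (man << 13); /* Inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```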
This sort of thing sometimes poses issues for my emulator, as some
instructions are harder to emulate efficiently.
For example, some of the compressed texture instructions and similar
only "keep up" as they secretly cache recently decoded blocks. A direct
implementation of the approach used in the Verilog implementation would
be too slow to emulate in real time.
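A minimal sketch of the sort of decoded-block cache I mean (the sizes and the decode function here are placeholder stand-ins, not the actual emulator code): the full decode only runs on a tag miss, so hot textures mostly hit the cache.

```c
#include <stdint.h>

#define TCACHE_SIZE 256  /* direct-mapped; size is an arbitrary choice */

typedef struct {
    uint64_t block_addr;   /* tag; 0 doubles as "empty" here, so callers
                              are assumed to use nonzero block addresses */
    uint32_t texel[16];    /* decoded 4x4 block of RGBA texels */
} tcache_entry;

static tcache_entry tcache[TCACHE_SIZE];
static int decode_calls;   /* counts how often the slow path runs */

/* Stand-in for the actual (expensive) block-decode logic. */
static void decode_block(uint64_t addr, uint32_t out[16]) {
    decode_calls++;
    for (int i = 0; i < 16; i++)
        out[i] = (uint32_t)addr + (uint32_t)i;
}

/* Return decoded texels for a block, decoding only on a miss. */
static const uint32_t *get_decoded_block(uint64_t addr) {
    tcache_entry *e = &tcache[(addr >> 3) % TCACHE_SIZE];
    if (e->block_addr != addr) {
        decode_block(addr, e->texel);
        e->block_addr = addr;
    }
    return e->texel;
}
```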
Things like emulating cache latency are a double-edged sword: it isn't
super cheap to evaluate cache hits and misses, but the cache misses
reduce how much work the emulator needs to do to keep up.
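Roughly, the latency model amounts to something like the following (the line size and miss penalty here are made-up numbers for illustration): every access pays the tag check on the host side, but a miss charges many guest cycles while the emulator does almost no extra work for them.

```c
#include <stdint.h>

#define DC_LINES     256
#define LINE_SHIFT   4    /* 16-byte lines; an assumed geometry */
#define MISS_PENALTY 10   /* guest cycles charged per miss; assumed */

static uint64_t dc_tag[DC_LINES];  /* direct-mapped tag array */
static uint64_t guest_cycles;      /* emulated-time accumulator */

/* Model one data access: cheap on a hit, charge a stall on a miss. */
static void model_access(uint64_t addr) {
    uint64_t line = addr >> LINE_SHIFT;
    uint64_t *tag = &dc_tag[line % DC_LINES];
    if (*tag == line) {
        guest_cycles += 1;
    } else {
        *tag = line;
        guest_cycles += MISS_PENALTY;
    }
}
```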
But, yeah, otherwise it would appear that a 150 MHz BJX2 core would be
fast enough to run Quake 3 Arena and similar with software rasterized
OpenGL at "fairly playable" framerates...
But, this will likely need to remain "in theory", as I don't have $k to
drop on a Kintex board or similar to find out...
But, say, if some company or whatever wanted to throw $k my way (both
for the FPGA board, and for "cost of living" reasons), I could probably
make it happen (as it stands, I am divided in my efforts, also needing
to spend a chunk of time out in a machine shop).
Probably not going to happen though.
Though, compared with an "actual PC", it isn't very practical.
And, even a RasPi can run Quake 3 pretty easily (and a lot cheaper) if
one can make use of its integrated GPU (main annoyance being that it
uses GLES 2.0 rather than OpenGL 1.x).
For Quake 1, to get it as good as it is, I had to resort to some
trickery, like rewriting "BoxOnPlaneSide" and similar in ASM, ...
To get much more speed, I would likely need a differently organized 3D
engine: likely per-texture quad arrays which are rebuilt only when the
camera moves into a different PVS or similar (rather than walking the
BSP and similar every frame).
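The idea above, in sketch form (rebuild_quad_arrays here is a stub standing in for the real per-texture BSP walk; none of these names are from an actual engine):

```c
/* Rebuild cached per-texture quad arrays only when the camera's PVS
 * cluster changes; otherwise just resubmit the cached geometry. */
static int rebuilds;              /* counts how often the walk runs */
static int last_cluster = -1;

static void rebuild_quad_arrays(int cluster) {
    (void)cluster;
    rebuilds++;  /* would walk the BSP once, bucketing quads by texture */
}

static void draw_quad_arrays(void) {
    /* would submit one array per texture to the rasterizer */
}

static void draw_world(int cam_cluster) {
    if (cam_cluster != last_cluster) {
        rebuild_quad_arrays(cam_cluster);
        last_cluster = cam_cluster;
    }
    draw_quad_arrays();
}
```

So the expensive BSP traversal runs only on cluster transitions, and the per-frame cost drops to submitting already-built arrays.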