Jason Cong's future of high performance computing


JimBrakefield

Jan 28, 2023, 7:11:30 PM
https://www.youtube.com/watch?v=-XuMWvGUocI&t=123s
Starts at the 2 minute mark. He argues that computer performance
has plateaued and that FPGAs offer a route to higher performance.
At about 13 minutes he states his goal to turn software programmers
into FPGA application engineers. The rest of the talk is about how this
can be achieved. It's a technical talk to an advanced audience.

Given his previous accomplishments and those of the VAST group at UCLA,
https://vast.cs.ucla.edu/, they can probably pull it off.
That is, in five or ten years this is what "big iron" computing will look like?

MitchAlsup

Jan 28, 2023, 8:16:48 PM
On Saturday, January 28, 2023 at 6:11:30 PM UTC-6, JimBrakefield wrote:
> https://www.youtube.com/watch?v=-XuMWvGUocI&t=123s
> Starts at the 2 minute mark. He argues that computer performance
> has plateaued and that FPGAs offer a route to higher performance.
<
In my best Homer Simpson voice:: "Well Duh"
<
When CPU packages are limited to 100W of power, there is only a
maximum number of gates one can switch per unit time while staying
under that power limit. Technology advances are simply driving the
dissipation per gate down--after we maxed out package power consumption.
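<
(For context, the usual back-of-envelope relation: dynamic power
P ~= N × a × C × V^2 × f, for N gates with activity factor a, gate
capacitance C, supply voltage V, and clock frequency f. At a fixed
power budget, more switching gates or a higher clock has to be bought
with lower V or C.)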
<
> At about 13 minutes he states his goal to turn software programmers
> into FPGA application engineers. The rest of the talk is about how this
> can be achieved. It's a technical talk to an advanced audience.
<
Software people are trained to think in von Neumann terms:: do one thing and
then do one more thing; repeat until done. This is manifest in the way they
debug programs--by single stepping.
<
Hardware is not like this at all:: Nothing prevents all of the gates downstream
of a flip-flop from sensing their inputs and generating their outputs
simultaneously. HW designers often (perversely) write Verilog code backwards,
knowing that the compiler will rearrange the code by net-list dependency.
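<
A minimal sketch of what "backwards" means here (an illustration, not
from the talk): with nonblocking assignments, statement order inside a
clocked block is irrelevant, because the netlist comes from the data
dependencies:
<
module pipe2(input clk, input [7:0] d, output reg [7:0] q);
  reg [7:0] stage1;
  always @(posedge clk) begin
    q      <= stage1;   // consumer written first...
    stage1 <= d;        // ...producer last; the synthesized netlist is identical
  end
endmodule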
<
I should note: you cannot debug HW by single stepping !! as there is no definition
of what single stepping means at the gate level. No, HW designers use simulators
where they can stop at ½ clock intervals and then examine millions of signals--
some of them X (unknown value) and Z (high impedance).
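<
For the unfamiliar, a tiny sketch of those 4-state values in a Verilog
simulation (names are illustrative):
<
module tb;
  reg  drive_en = 1'b0;
  reg  val;                          // never assigned: simulates as X
  wire bus = drive_en ? val : 1'bz;  // undriven bus: simulates as Z
  initial #1 $display("val=%b bus=%b", val, bus);  // prints: val=x bus=z
endmodule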
<
I have serious doubts that one can teach 99% of software engineers to
think in ways that truly are concurrent--and the first thing that HW designers
have to come to grips with is that there is no single stepping; the minimal
advance is ½ clock--in which a billion gates can change their output signals.
>
> Given his previous accomplishments and those of the VAST group at UCLA,
> https://vast.cs.ucla.edu/, they can probably pull it off.
<
> That is, in five or ten years this is what "big iron" computing will look like?
<
My bet is that if his efforts succeed, it will be 20 years before whatever it
becomes is available in a computer you buy in "Best Buy" and take home to
use.
<
Then again, there is the problem of applications to make use of the new
capabilities. Where do these come from?
<
----------------------------------------------------------------------------------------------------------------------
<
When I started as a professional in the computer world, I went to a company
sponsored lecture about Artificial intelligence, and how it would revolutionize
what computers do and how they do it and how it was only 5 years off into
the future.........this was 1982! 40 years later (8× longer than stated) we are
on the cusp of AI being useful to the average person for something other
than a Google search. Yet the application has nothing to do with computing
but with another <nearly> daily activity:: driving.

Quadibloc

Jan 28, 2023, 8:22:47 PM
On Saturday, January 28, 2023 at 5:11:30 PM UTC-7, JimBrakefield wrote:
> Starts at the 2 minute mark. He argues that computer performance
> has plateaued and that FPGAs offer a route to higher performance.
> At about 13 minutes he states his goal to turn software programmers
> into FPGA application engineers.

At present, FPGAs can make _some_ computer programs more
efficient. If a program involves things like bit manipulation,
which an FPGA can do well but CPUs do poorly, it can be a good fit.

It would be possible to design FPGAs that are better suited to
problems that CPUs already do well, so that they could be done
even better on the FPGA. For example, if a problem involves a lot
of 64-bit floating-point arithmetic, put a lot of double-precision
FP ALUs in the FPGA. For some reason, such parts are not available
at the moment.

John Savard

MitchAlsup

Jan 28, 2023, 8:30:45 PM
I see it a bit differently:: FPGA applications should target things CPUs do
rather poorly.
<
As BGB indicates, he has had a hard time getting his FPU small enough
for his FPGA and it still runs slowly (compared to Intel or AMD CPUs).
So, in order to get speedup, you would need an FPGA that supports 10
FPUs (more likely 100 FPUs) and enough pins to feed it the BW it requires.
Some of these FPGAs are more expensive than a CPU running at decent
but not extraordinary frequency.

robf...@gmail.com

Jan 28, 2023, 11:37:49 PM
>At about 13 minutes he states his goal to turn software programmers
>into FPGA application engineers. The rest of the talk is about how this
>can be achieved. It's a technical talk to an advanced audience.

I do not think it's a great idea to turn software programmers into FPGA app
engineers. It requires very different thinking--essentially two skill sets. I suspect
most people would want to specialize in one area or the other. It may be good to
be able to identify where an FPGA solution could work better. But that is more
like finding a better algorithm.

>I should note: you cannot debug HW by single stepping !! as there is no definition
>of what single stepping means at the gate level. No, HW designers use simulators
>where they can stop at ½ clock intervals and then examine millions of signals--
>some of them X (unknown value) and Z (high impedance).

I single step through FPGA logic sometimes trying to find bugs, although sometimes
it does not work the best. As you say, output values are only valid at clock intervals.
Single step is available in the Vivado simulator. But there are a couple of caveats.
Stepping will occur sequentially in the same always statement, but once it hits the
end of the statement or another exit point it may jump around to another always
block seemingly at random. The best bet is to use it with a breakpoint, then single
step for only a few lines. Also, variables are not set until the clock edge occurs, so
one must single step through all possible steps, hit the clock edge, then look at all
the variables. One can set two breakpoints, one at each of two successive clock
edges, to see how variables changed.
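
A sketch of the kind of code where this shows up (the module itself is
just an illustration; the stepping behavior is as described above):

module step_demo(input clk, output reg out_valid);
  reg [1:0] state = 0;
  always @(posedge clk)
    state <= state + 1;            // breakpoint here: 'state' still shows
                                   // its value from before the clock edge
  always @(posedge clk)
    out_valid <= (state == 2'd3);  // ...and a step from the block above
                                   // may land here, not on the next line
endmodule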
******
I think FPGAs will always be at least an order of magnitude slower than a custom CPU
for many compute tasks. Each has its own area, I think. If an FPGA task were common
enough, I think it would eventually get implemented in custom logic rather than using
all lookup tables with switchable routing.
I think FPGAs are great for prototyping and one-off solutions; not so sure beyond that.

Terje Mathisen

Jan 29, 2023, 4:45:16 AM
NO, and once again, NO. An FPGA starts out with at least an order of
magnitude speed disadvantage, so you need problems where the algorithms
are really unsuited to a SW implementation.

The only way for the FPGA to become relevant in the mass market is if
it turns up as a standard feature in every Intel/Apple/ARM CPU;
otherwise it will forever be relegated to the narrow valley between what
you can do by just throwing a few more OoO cores at the problem, and
when you turn to full VLSI.

I.e. the FPGA is for prototyping and low-count custom systems. However, even
though cell phone towers have used FPGAs to allow relatively easy
(remote) upgrades to handle new radio protocols as they become
finalized, even those systems (in the 100K+ range of installations) will
put everything baseline into VLSI.

I also don't think you can make FPGA context switching even remotely
fast, further limiting possible usage scenarios.

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Michael S

Jan 29, 2023, 7:17:27 AM
FPGAs have *been* mass market for more than two decades.
But they are mass market not in the role of compute accelerator, and
I agree with you that they will never become mass market in that role.
In the previous century, mass market FPGAs were called simply FPGAs.
In this century they are called "low end" or similar derogatory names.
Wall Street does not care about them, just like Wall Street does not care
about micro-controllers, but like micro-controllers they are a cornerstone
of the industry. Well, I am exaggerating a little--somewhat less
than micro-controllers--but a cornerstone nevertheless.

Michael S

Jan 29, 2023, 7:48:21 AM
BGB plays with the Artix-7. Probably an XC7A25T, which has 23,360 logic cells
and 80 DSP slices. Besides, if he can't fit something into a device, it
does not mean that an experienced FPGA guy would have the same difficulties.
But let's leave that aside and concentrate on devices.
So, 23,360 logic cells and 80 DSP slices.
For comparison, the biggest device in the same decade-old Xilinx 7
series is the Virtex XC7VH870T, with 876,160 logic cells and 2,520 DSP slices.

Newer high-end FPGA devices are bigger by another order of magnitude,
although not in DSP slice count: the Virtex UltraScale+ XCVU19P has 8,938,000
logic cells and 3,840 DSP slices. But that device is also not quite new.

In more recent years Xilinx lost interest in traditional FPGAs and is trying
to push the "Adaptive Compute Acceleration Platform" (ACAP). Some of those have
a rather insane number of multipliers--so big that they stopped counting DSP
slices and are now counting DSP Engines. I am too lazy to dig deeper and find
out what that really means, but one thing is sure--there are a lot of compute
resources in these devices. Maybe not as much as in leading-edge GPUs,
but it's the same order of magnitude.

We are not talking about 10 or 100 or 1000 FPUs on the high-end ACAPs.
More like 10,000.

> Some of these FPGAs are more expensive than a CPU running at decent
> but not extraordinary frequency.

That's another understatement.

Scott Lurndal

Jan 29, 2023, 9:13:22 AM
JimBrakefield <jim.bra...@ieee.org> writes:
>https://www.youtube.com/watch?v=-XuMWvGUocI&t=123s
>Starts at the 2 minute mark. He argues that computer performance
>has plateaued and that FPGAs offer a route to higher performance.
>At about 13 minutes he states his goal to turn software programmers
>into FPGA application engineers. The rest of the talk is about how this
>can be achieved. It's a technical talk to an advanced audience.

FPGAs have been used for decades to provide higher performance
for certain workloads. That was one of the reasons Intel and
AMD each purchased one of the big FPGA vendors.

Although now, there are a number of custom ASIC houses that will
be happy to add custom logic to standard processor packages
(such as Marvell).

Scott Lurndal

Jan 29, 2023, 9:16:10 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Saturday, January 28, 2023 at 6:11:30 PM UTC-6, JimBrakefield wrote:
>> https://www.youtube.com/watch?v=-XuMWvGUocI&t=123s
>> Starts at the 2 minute mark. He argues that computer performance
>> has plateaued and that FPGAs offer a route to higher performance.
><

><
>I should note: you cannot debug HW by single stepping !! as there is no
>definition of what single stepping means at the gate level. No, HW
>designers use simulators where they can stop at ½ clock intervals and
>then examine millions of signals--some of them X (unknown value) and
>Z (high impedance).

Actually, that's how we debugged the Burroughs mainframes, by stepping a
single cycle at a time (using an external "maintenance processor" to
drive the processor logic using scan chains). This was late 70's.

Modern systems include JTAG facilities for similar debuggability. Yes,
a lot happens each cycle, but a good scan chain will include enough
flops to make debug rather straightforward.

MitchAlsup

Jan 29, 2023, 11:54:51 AM
The point was that single stepping a SW program sees a single state
change per instruction, whereas single clocking a CPU sees thousands
if not tens of thousands of state changes per smallest stepping.

Quadibloc

Jan 29, 2023, 12:07:44 PM
On Saturday, January 28, 2023 at 6:30:45 PM UTC-7, MitchAlsup wrote:

> I see it a bit differently:: FPGA applications should target things CPUs do
> rather poorly.
> <
> As BGB indicates, he has had a hard time getting his FPU small enough
> for his FPGA and it still runs slowly (compared to Intel or AMD CPUs).
> So, in order to get speedup, you would need an FPGA that supports 10
> FPUs (more likely 100 FPUs) and enough pins to feed it the BW it requires.
> Some of these FPGAs are more expensive than a CPU running at decent
> but not extraordinary frequency.

I feel that FPGAs won't really take off unless they are good at applications
that are common - and those are ones well adapted to CPUs.

So I was advocating FPGAs with real FPUs as a component, not synthesizing
an FPU on an FPGA, which is much slower.

John Savard

MitchAlsup

Jan 29, 2023, 12:31:45 PM
The IP for which might cost even more than the FPGA it goes in.
>
> John Savard

BGB

Jan 29, 2023, 12:31:55 PM
For the BJX2 core, mostly on an XC7A100T, I can fit a Double Precision
FPU (1x Binary64) and also a 4x Binary32 SIMD unit (albeit the latter
using a hard-wired truncate-only rounding mode).


Originally, had done Binary32 SIMD by pipelining it through the main
FPU, but with a dedicated low-precision SIMD unit, can get a fair bit of
a speedup: 20 MFLOPs -> 200 MFLOPs, at 50MHz.
And was able to stretch it enough to "more or less faithfully" handle
full Binary32 precision.
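
(That 200 MFLOPs figure is consistent with 4 Binary32 lanes × 50 MHz ×
1 op per lane per cycle.)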


This is along with the MMU and the 3-wide execute unit.

Don't really have enough LUTs left over for a second core or any real
sort of GPU.

I could potentially fit dual cores and boost clock speeds to 75 MHz on
an XC7A200T, but this is expensive (and boards with these have been
mostly out-of-stock for a while).

An XC7K325T or similar would possibly allow quad-core and/or a dedicated
GPU, as well as 100 or 150MHz. But, like, I don't really have the money
for something like this... (And this is basically the largest and
fastest FPGA supported by Vivado WebPack).


I can also fit the BJX2 core onto an XC7S50, but need to scale it back
slightly to make it fit.


I have also used an XC7S25 as well, but generally I can only seem to fit
simpler RISC style cores on this. Typically no FPU or MMU.


On the XC7S25 or XC7A35T, one could make a strong case for RISC-V
though, as what one can fit on these FPGAs is pretty much in-line for a
simple scalar RISC-V core or similar (and a RISC-like subset of BJX2 has
no real practical advantage over RV64I or similar).

A stronger case could probably be made for RV32I or maybe RV32IM or
similar on this class of FPGA.



And, if I were doing a GPU on an FPGA, something akin to my current
BJX2-XG2 mode could make sense as a base (possibly with wider SIMD and
also SIMD'ing the memory loads and stores, *).

*: Say, loads where the index register is a vector encoding multiple
indices, each of which is loaded into a subset of the loaded vector. The
ISA would look basically the same as it is now, except nearly everything
would be "doubled".

So, say:
ADD R16, R23, R39
Would actually add 128-bit vectors containing a pair of 64-bit values
(and the current 128-bit SIMD ops would effectively expand to being
8-wide 256-bit operations). Most Loads/Stores would also be doubled
(idea being to schedule loop iterations into each element of the vector;
with only a subset of ops being "directly aware" of the registers being
SIMD vectors; possibly with a mode flag to enable/disable side-effects
from the high-half of the vector).
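
(Presumably, e.g., ADD R16, R23, R39 would then behave roughly like
R16=R23+R39 together with R17=R24+R40, each register name addressing a
64-bit pair; the exact pairing is an illustrative reading, not a stated
detail.)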

But, as noted, I would likely need a Kintex or similar to have the LUT
budget for something like this...


> Besides, if he can't fit something into a device, it
> does not mean that an experienced FPGA guy would have the same difficulties.
> But let's leave that aside and concentrate on devices.
> So, 23,360 logic cells and 80 DSP slices.
> For comparison, the biggest device in the same decade-old Xilinx 7
> series is the Virtex XC7VH870T, with 876,160 logic cells and 2,520 DSP slices.
>

A Virtex-7 is also several orders of magnitude more expensive...

If a chip costs more than a typical person will have during their
lifetime, it almost may as well not exist as far as they are concerned.

...

I guess a person can "rent" access to Virtex devices via remote cloud
servers. Still not very practical though.


> Newer high-end FPGA devices are bigger by another order of magnitude,
> although not in DSP slice count: the Virtex UltraScale+ XCVU19P has 8,938,000
> logic cells and 3,840 DSP slices. But that device is also not quite new.
>
> In more recent years Xilinx lost interest in traditional FPGAs and is trying
> to push the "Adaptive Compute Acceleration Platform" (ACAP). Some of those have
> a rather insane number of multipliers--so big that they stopped counting DSP
> slices and are now counting DSP Engines. I am too lazy to dig deeper and find
> out what that really means, but one thing is sure--there are a lot of compute
> resources in these devices. Maybe not as much as in leading-edge GPUs,
> but it's the same order of magnitude.
>
> We are not talking about 10 or 100 or 1000 FPUs on the high-end ACAPs.
> More like 10,000.
>

Hmm...

These look, if anything, more relevant to "AI" and/or "bitcoin
mining" than to traditional FPGA use cases.

They look less relevant personally, as "do whole lots of FPU math" isn't
really the typical bottleneck in my projects.


Granted, if one does want "thing that does lots of FPU math", this could
make sense.


>> Some of these FPGAs are more expensive than a CPU running at decent
>> but not extraordinary frequency.
>
> That's another understatement.


General case performance even on par with a RasPi is difficult...

Ironically, it is a little easier to compete with a RasPi for software
OpenGL, as the RasPi just sorta sucks at this (it effectively "face
plants" so hard as to offset its clock-speed advantage).


In a "performance per clock" sense at this task, my BJX2 core somewhat
beats out my Ryzen 7 for this as well. For the NN tests, it almost gets
almost a little absurd...

For more "general purpose" code, the BJX2 core kinda gets its crap
handed back to it though.



Though, in more realistic scenarios, it is hard to get an Artix-7 anywhere
near the speed of my desktop PC; it only happens in contrived
scenarios (where the cost of emulating Binary16 FP-SIMD and similar on a PC
via bit-twiddling is higher than the clock-speed delta, ~74x).

This sort of thing sometimes poses issues for my emulator, as some
instructions are harder to emulate efficiently.

For example, some of the compressed texture instructions and similar
only "keep up" as they secretly cache recently decoded blocks. A direct
implementation of the approach used in the Verilog implementation would
be too slow to emulate in real time.


Things like emulating cache latency are a double-edged sword: it isn't
super cheap to evaluate cache hits and misses, but the cache misses
reduce how much work the emulator needs to do to keep up.



But, yeah, otherwise it would appear that a 150 MHz BJX2 core would be
fast enough to run Quake 3 Arena and similar with software rasterized
OpenGL at "fairly playable" framerates...

But, this will likely need to remain "in theory", as I don't have $k to
drop on a Kintex board or similar to find out...


But, say, if some company or whatever wanted to throw $k my way (both
for the FPGA board, and for "cost of living" reasons), I could probably
make it happen (otherwise, I am divided in my efforts, also
needing to spend a chunk of time out in a machine shop).

Probably not going to happen though.

...


Though, compared with an "actual PC", it isn't very practical.

And, even a RasPi can run Quake 3 pretty easily (and a lot cheaper) if
one can make use of its integrated GPU (main annoyance being that it
uses GLES 2.0 rather than OpenGL 1.x).


For Quake 1, to get it as good as it is, had to resort to some trickery
like rewriting "BoxOnPlaneSide" and similar using ASM, ...

To get much more speed, would likely need a differently organized 3D engine.


Likely, per-texture quad arrays which are rebuilt only when the camera
moves into a different PVS or similar (rather than walking the BSP and
similar every frame).

JimBrakefield

Jan 29, 2023, 12:38:36 PM
On Sunday, January 29, 2023 at 11:07:44 AM UTC-6, Quadibloc wrote:
The Intel-Altera X series devices offer single-precision add/multiply, with no denorm support.
The AMD-Xilinx Versal series offers single-precision add/multiply in its SIMD/RISC cores.

Most of these chips have five-digit price tags, except the three-digit Arria X and Cyclone X GX families.

Some of these series and families offer 10K+ DSP units, 1M+ LUTs, 50+MB block RAM,
and HBM.

MitchAlsup

Jan 29, 2023, 12:51:29 PM
On Sunday, January 29, 2023 at 11:31:55 AM UTC-6, BGB wrote:
> On 1/29/2023 6:48 AM, Michael S wrote:

> > BGB plays with the Artix-7. Probably an XC7A25T, which has 23,360 logic cells
> > and 80 DSP slices.
> For the BJX2 core, mostly on an XC7A100T, I can fit a Double Precision
> FPU (1x Binary64) and also a 4x Binary32 SIMD unit (albeit the latter
> using a hard-wired truncate-only rounding mode).
>
But there is some reason you don't correctly compute FMULD--like
using 3/4 of a multiplier tree or something. Was this due to lack of
gates (LUTs), lack of DSPs, or the perceived unnecessity of getting
the right answer?

JimBrakefield

Jan 29, 2023, 12:54:35 PM
There are some issues that microprocessors have:
The high energy cost of access to main memory;
The lack of a way to utilize dozens of cores and threads within most
programming languages--i.e. parallelism is unsolved and difficult.

BGB

Jan 29, 2023, 12:57:19 PM
Depends on what one expects from an FPU.

Binary32 units could make sense alongside (or as an extension of) the
existing DSPs. Binary64 units would likely be a little more of a stretch.

Another balancing act of this would be to have "just enough" FPUs.

But, say, if an FPGA could have, say:
4x Binary64 MAC
32x Binary32 MAC

In addition to, say, 80k+ LUTs, 8Mb of BRAM, ... This could be a "pretty
nice" FPGA (particularly if it had a 32-bit DRAM interface, etc, *).


*: The Artix-7 boards mostly use a 16-bit RAM interface, apart
from a few smaller boards using QSPI SRAMs and similar. Seemingly only
higher-end boards have a 32-bit RAM interface.

A rare few boards also use 8-bit DDR or SDRAM.


Could also be interesting if an FPGA board could utilize an M.2 SSD
interface or similar.


> John Savard

Anton Ertl

Jan 29, 2023, 1:15:48 PM
sc...@slp53.sl.home (Scott Lurndal) writes:
>FPGAs have been used for decades to provide higher performance
>for certain workloads.

What workloads?

My impression is that if a workload is structured such that an FPGA
beats software on something like a CPU or a GPGPU, and if the
workload's performance is important enough, or there are enough
customers for the workload, people may prototype on FPGA, but they
then go for custom silicon for another speedup by an order of
magnitude and for a similar reduction in marginal cost. For FPGAs
this leaves only prototypes, and low-volume uses where performance is
not paramount. There is still enough volume there for significant
revenue for Xilinx and Altera. But the idea that you switch to FPGA
for performance is absurd.

For HPC, CPUs and GPGPUs look fine. HPC performs memory accesses,
where FPGAs provide no advantage, and FP operations (FLOPs) where the
custom logic of CPUs and GPGPUs beats FPGAs clearly. You may think
that FPGA provides an advantage in passing data from one FLOP to
another, but CPUs have optimized the case of passing the result of one
instruction to another quite well, with their bypass networks. So
even if you have a field programmable FPU array (FPFA), I doubt that
you will see an advantage over CPUs. And I expect that it's similar
for GPGPUs.

To compare the performance of lower-grade hardware (but still
full-custom rather than FPGA) to software on a high-performance CPU, I
ran the rv8-bench <https://github.com/michaeljclark/rv8-bench/> C
programs (compiling the C++ program failed) on a VisionFive 1 (1GHz
U74 cores) and compared the results to those on the RV8 simulator
running on a Core i7-5557U (3.4GHz Broadwell) taken from
<https://michaeljclark.github.io/bench>:

           aarch64   rv64g   rv64g    rv64g   AMD64
           qemu      qemu    rv8      U74     Broadwell
aes         1.31      2.16    1.49     3.30    0.32
dhrystone   0.98      0.57    0.20     1.109   0.10
miniz       2.66      2.21    1.53     8.766   0.77
norx        0.60      1.17    0.99     1.974   0.22
primes      2.09      1.26    0.65    18.686   0.60
qsort       7.38      4.76    1.21     5.218   0.64
sha512      0.64      1.24    0.81     2.048   0.24

The U74 column contains user time (total CPU time is higher by 0-20%);
I am not sure what the other results are. The interesting columns here are
the rv8 column and the U74 column, but also the rv64g-qemu column; they
show that both qemu and rv8 software emulation of RV64G on a 2015-vintage
high-end laptop CPU from Intel beat the custom silicon implementation
on the VisionFive 1. For an FPGA implementation you cannot expect a
1GHz clock rate; from what I hear, 200MHz would be a good number. So
with the same microarchitecture you get a result that's even slower
than the VisionFive 1, by a factor of ~5.

One might naively think that architecture implementation is the kind
of workload where FPGAs beat software on high-end CPUs, but that's
obviously not the case.

>That was one of the reasons both Intel and
>AMD have each purchased on of the big FPGA guys.

If so, IMO they did it for the wrong reason. I was certainly
wondering about the huge amount of money that AMD spent on Xilinx.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Jan 29, 2023, 1:40:58 PM
Combination of factors.

I could get a full FMUL result, but it would come at the expense of
spending more LUTs, more DSPs, and having a higher latency.

Though, by themselves, the DSPs aren't as much of an issue, as the FPGA
I am using has "more than enough" DSPs, but I am basically at the limit
of timing latency (sneeze on the thing too hard and it fails timing).


The fraction of a bit of rounding error is "not worth it" if it means
making FMUL slower (and a fairly obvious increase in terms of
resource cost).

And, say, I don't really want a 7 or 8-cycle FMUL...



Likewise for the hard-wired truncate on the SIMD ops:
For most of my use-cases, it straight up "doesn't matter".


I did have a reason to go from a truncated 24-bit floating point format
to full width Binary32, but mostly this is because:
Quake has physics glitches if things are calculated using a truncated
format;
This allowed making the default Binary32 SIMD faster;
My BtMini2 engine was "obviously broken" (*1) if one takes the camera
64km from the origin with 24-bit floating point (but was OK doing this
with Binary32);
...

But, rounding error doesn't really make much of a difference.


*1: At 64k from the origin, the map geometry turns into a jittering,
dog-chewed mess.

But, this is not a huge surprise when the effective ULP was ~1 meter.
There is a lot less jitter when the ULP is closer to 0.8cm.

But, whether or not rounding was performed (or correct), the effective
ULP would still be 0.8 cm.
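
(Sanity check on those numbers, assuming ~16 stored mantissa bits for
the truncated 24-bit format and 23 for Binary32: at 64k units from the
origin the binary exponent is 16, so ULP = 2^16 × 2^-16 = 1 unit for
the former, and 2^16 × 2^-23 = 1/128 unit, i.e. ~0.8cm if a unit is a
meter, for the latter.)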


Similarly for SIMD having MUL and ADD but no MAC:
While in theory I could do a MAC, I can't do it in 3 clock cycles;
the moment I need more than 3 cycles, doing 3C/1T (3-cycle latency,
1-per-cycle throughput) is broken.
...

MitchAlsup

Jan 29, 2023, 1:48:57 PM
On Sunday, January 29, 2023 at 12:15:48 PM UTC-6, Anton Ertl wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
> >FPGAs have been used for decades to provide higher performance
> >for certain workloads.
> What workloads?
>
> My impression is that if a workload is structured such that an FPGA
> beats software on something like a CPU or a GPGPU, and if the
> workload's performance is important enough, or there are enough
> customers for the workload, people may prototype on FPGA, but they
> then go for custom silicon for another speedup by an order of
> magnitude and for a similar reduction in marginal cost. For FPGAs
> this leaves only prototypes, and low-volume uses where performance is
> not paramount. There is still enough volume there for significant
> revenue for Xilinx and Altera. But the idea that you switch to FPGA
> for performance is absurd.
>
> For HPC CPUs and GPGPUs look fine. HPC performs memory accesses,
> where FPGAs provide no advantage, and FP operations (FLOPs) where the
> custom logic of CPUs and GPGPUs beats FPGAs clearly. You may think
> that FPGA provides an advantage in passing data from one FLOP to
> another, but CPUs have optimized the case of passing the result of one
> instruction to another quite well, with their bypass networks. So
<
I am going to push back here.
<
Take Interpolation* performed in a GPU. Interpolation takes the {x,y,z,w}^3
coordinates of a triangle and identifies the coordinates of a series of
pixels this triangle maps to. This takes 29 FP calculations (last time I
looked) per pixel, yet GPUs produce 8-to-32 of these pixels per cycle--for
an effective throughput of 232 spFLOPs per cycle (29 × 8, at the low end)
from 1 fixed-function unit; and a GPU contains one of these interpolators
per Shader Core.
<
There is no way SW is competitive, here (instruction count)--nor are FPGAs
(frequency).
<
(*) Interpolation is a part of rasterization.
<
-----------------------------------------------------------------------------------------------------------------------
<
Secondarily, integer arithmetic is but 8-11% of CPU power dissipation.
FP is even lower, leaving the majority of energy consumption in a) the
clock tree, b) fetch-decode-issue, c) schedule-execute, d) retire; none
of which adds to the bottom line of performance--it is just that they
provide the infrastructure* on which instructions can be executed
with considerable width.
<
(*) Think of a pipeline like a conveyor belt or an assembly line.
<

BGB

Jan 29, 2023, 2:08:34 PM
VLSI/ASIC only really makes sense if one has a mountain of money to burn.


>> I also don't think you can make FPGA context switching even remotely
>> fast, further limiting possible usage scenarios.
>>
>> Terje
>>
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"
>
> FPGAs have *been* mass market for more than two decades.
> But they are mass market not in the role of compute accelerator, and
> I agree with you that they will never become mass market in that role.
> In the previous century, mass market FPGAs were called simply FPGAs.
> In this century they are called "low end" or similar derogatory names.

Makes sense.


Say:
an FPGA on a standalone board with an SDcard slot, VGA port, and similar;
an FPGA board meant to fit a "DIP40" form factor or similar;
an FPGA on a PCIe or M.2 card with no external IO interfaces.

These represent somewhat different use cases...


There are FPGA boards for PCIe and M.2, but personally, I have not as
much use for them.

For most general computational tasks, a Ryzen or similar is going to run
circles around whatever one can put on an Artix.


> Wall Street does not care about them just like Wall Street does not care
> about micro-controllers, but like micro-controllers they are a cornerstone
> of the industry. Well, I am exaggerating a little, somewhat less
> then micro-controllers, but a cornerstone nevertheless.

Yeah.

A world without microcontrollers would likely more resemble the 60s or
70s than it does the modern world. Even desktop PCs as we know them
could not exist without microcontrollers.

There is also a non-zero overlap between FPGAs and microcontrollers.

...

EricP

Jan 29, 2023, 2:46:15 PM
Scott Lurndal wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Saturday, January 28, 2023 at 6:11:30 PM UTC-6, JimBrakefield wrote:
>>> https://www.youtube.com/watch?v=-XuMWvGUocI&t=123s
>>> Starts at the 2 minute mark. He argues that computer performance
>>> has plateaued and that FPGAs offer a route to higher performance.
>> <
>
>> <
>> I should note: you cannot debug HW by single stepping !! as there is no
>> definition of what single stepping means at the gate level. No, HW
>> designers use simulators where they can stop at ½ clock intervals and
>> then examine millions of signals--some of them X (unknown value) and
>> Z (high impedance).
>
> Actually, that's how we debugged the Burroughs mainframes, by stepping a
> single cycle at a time (using an external "maintenance processor" to
> drive the processor logic using scan chains). This was late 70's.

Luxury! I used a dual trace oscilloscope and an In-Circuit Emulator.


EricP

Jan 29, 2023, 2:46:15 PM
People have also been investigating C/C++-to-FPGA synthesis
for quite some time (a quick search finds references back to 1996)
in order to make FPGAs more accessible to the general market.

The result is probably not as efficient as hand-written Verilog,
but if it works, users may not care.

Michael S

Jan 29, 2023, 3:09:14 PM
Note that programming FPGAs in Verilog is an almost exclusively
US trait. The rest of the world does it in VHDL.

Quadibloc

Jan 29, 2023, 4:11:06 PM
Wouldn't that be an argument that the cost of CPUs and GPUs
would (also) be prohibitive?

I am being serious here. An FPGA that included a large number of
full-bore 64-bit floating point ALUs could indeed be designed to
accelerate the inner loops of a lot of programs, particularly in
scientific computing, which is the field that makes the most use
of HPC.

That might still be a special-purpose device, but no more so - and
from some viewpoints, considerably less so - than the typical FPGA,
which seems only to be applicable to things which are otherwise
difficult to do on a CPU.

I suppose a joke to the effect that a special-purpose computing
device is one that's good for somebody else's purpose might fit in
here.

John Savard

MitchAlsup

Jan 29, 2023, 5:07:11 PM
On Sunday, January 29, 2023 at 3:11:06 PM UTC-6, Quadibloc wrote:
> On Sunday, January 29, 2023 at 10:31:45 AM UTC-7, MitchAlsup wrote:
> > On Sunday, January 29, 2023 at 11:07:44 AM UTC-6, Quadibloc wrote:
>
> > > So I was advocating FPGAs with real FPUs as a component, not synthesizing
> > > an FPU on an FPGA, which is much slower.
>
> > The IP to which might cost even more than the FPGA it goes in.
<
> Wouldn't that be an argument that the cost of CPUs and GPUs
> would (also) be prohibitive?
<
This falls into the category where there might be excellent engineering
reasons that something should be done with an FPGA added to a
system, but actually getting there is impractical (licensing,
legal, intellectual-property $$$s, ...)

Thomas Koenig

Jan 29, 2023, 5:17:48 PM

Scott Lurndal

Jan 29, 2023, 6:03:55 PM
Have you priced out a 3nm mask recently?


MitchAlsup

Jan 29, 2023, 7:04:57 PM
22nm and 14nm are not that expensive right now.

Anton Ertl

Jan 30, 2023, 3:39:35 AM
Michael S <already...@yahoo.com> writes:
>Note that programming FPGAs in Verilog is an almost exclusively
>US trait. The rest of the world does it in VHDL.

Bernd Paysan from Europe wrote b16(-dsp) and b16-small in Verilog
<https://github.com/forthy42/b16-small>. It has been used in custom
silicon, not (to my knowledge) in FPGA, but does that make a
difference?

From the HOPL talk about Verilog, my impression is: around 2000 all
the buzz was for VHDL, and the word was that Verilog was doomed. Verilog
survived and won in large projects because it was designed for efficient
implementation of simulators, while the design of VHDL necessarily
leads to less efficiency. For large projects this efficiency is very
important, while for smaller projects the VHDL simulators are fast
enough.

Michael S

Jan 30, 2023, 10:23:08 AM
On Monday, January 30, 2023 at 10:39:35 AM UTC+2, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >Note that programming FPGAs in Verilog is an almost exclusively
> >US trait. The rest of the world does it in VHDL.
> Bernd Paysan from Europe wrote b16(-dsp) and b16-small in Verilog
> <https://github.com/forthy42/b16-small>. It has been used in custom
> silicon, not (to my knowledge) in FPGA, but does that make a
> difference?
>

It absolutely does.
FPGA development and ASIC development are different cultures.
Naturally, use of FPGAs for ASIC prototyping is part of ASIC culture.

I could imagine that "FPGAs as compute accelerators" is yet another
culture, if there are enough people involved to form a culture--
likely with a different set of preferred tools. I know nothing about
it except that I know that it does not really work. But even that
knowledge is not 1st hand.

Scott Lurndal

Jan 30, 2023, 12:15:35 PM
Michael S <already...@yahoo.com> writes:
>On Monday, January 30, 2023 at 10:39:35 AM UTC+2, Anton Ertl wrote:
>> Michael S <already...@yahoo.com> writes:
>> >Note that programming FPGAs in Verilog is an almost exclusively
>> >US trait. The rest of the world does it in VHDL.
>> Bernd Paysan from Europe wrote b16(-dsp) and b16-small in Verilog
>> <https://github.com/forthy42/b16-small>. It has been used in custom
>> silicon, not (to my knowledge) in FPGA, but does that make a
>> difference?
>>

>
>I could imagine that "FPGAs as compute accelerators" is yet another
>culture if there are enough people involved to form the culture.
>Likely with different set of preferred tools. I know nothing about
>it except that I know that it does not really work. But even that
>knowledge is not 1st hand.

You "know that it does not really work". But not from first-hand
experience. So, what data (other than off-hand anecdotal data) do
you have to support your position?

https://www.researchgate.net/publication/354063174_FPGA-based_HPC_accelerators_An_evaluation_on_performance_and_energy_efficiency
"Results show that while FPGAs struggle to compete in absolute
terms with GPUs on memory- and compute- intensive kernels,
they require far less power and can deliver nearly the same
energy efficiency."

https://ieeexplore.ieee.org/document/9556357

"FPGAs are already known to provide interesting speedups in
several application fields, but to estimate their expected
performance in the context of typical HPC workloads is not
straightforward."

https://evision-systems.com/high-performance-computing/

FPGAs have been prominent at SC for the last twenty years;
see the program for SC22, e.g.


Advances in FPGA Programming and Technology for HPC

"FPGAs have gone from niche components to being a central
part of many data centers worldwide to being considered for
core HPC installations. The last year has seen tremendous advances in
FPGA programmability and technology, and FPGAs for general HPC is
apparently within reach."

Task Scheduling on FPGA-Based Accelerators without Partial Reconfiguration

etc.

John Dallman

Jan 30, 2023, 3:13:12 PM