For the BJX2 core, mostly on an XC7A100T, I can fit a Double Precision
FPU (1x Binary64) and also a 4x Binary32 SIMD unit (albeit the latter
using a hard-wired truncate-only rounding mode).
Originally, I had done Binary32 SIMD by pipelining it through the main
FPU, but with a dedicated low-precision SIMD unit, I can get a fair bit
of a speedup: 20 MFLOPs -> 200 MFLOPs, at 50 MHz.
And I was able to stretch it enough to "more or less faithfully" handle
full Binary32 precision.
This is along with the MMU and the 3-wide execute unit.
Don't really have enough LUTs left over for a second core or any real
sort of GPU.
I could potentially fit dual cores and boost clock speeds to 75 MHz on
an XC7A200T, but this is expensive (and boards with these have been
mostly out-of-stock for a while).
An XC7K325T or similar would possibly allow quad-core and/or a dedicated
GPU, as well as 100 or 150MHz. But, like, I don't really have the money
for something like this... (And this is basically the largest and
fastest FPGA supported by Vivado WebPack).
I can also fit the BJX2 core onto an XC7S50, but need to scale it back
slightly to make it fit.
I have also used an XC7S25, but generally I can only seem to fit
simpler RISC-style cores on it. Typically no FPU or MMU.
On the XC7S25 or XC7A35T, one could make a strong case for RISC-V
though, as what one can fit on these FPGAs is pretty much in-line for a
simple scalar RISC-V core or similar (and a RISC-like subset of BJX2 has
no real practical advantage over RV64I or similar).
A stronger case could probably be made for RV32I or maybe RV32IM or
similar on this class of FPGA.
And, if I were doing a GPU on an FPGA, something akin to my current
BJX2-XG2 mode could make sense as a base (possibly with wider SIMD and
also SIMD'ing the memory loads and stores, *).
*: Say, loads where the index register is a vector encoding multiple
indices, each of which is loaded into a subset of the loaded vector. The
ISA would look basically the same as it is now, except nearly everything
would be "doubled".
ADD R16, R23, R39
Would actually add 128-bit vectors containing a pair of 64-bit values
(and the current 128-bit SIMD ops would effectively expand to being
8-wide 256-bit operations). Most Loads/Stores would also be doubled
(idea being to schedule loop iterations into each element of the vector;
with only a subset of ops being "directly aware" of the registers being
SIMD vectors; possibly with a mode flag to enable/disable side-effects
from the high-half of the vector).
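To make the idea a bit more concrete, here is a rough C sketch of the "doubled register" semantics (the names and layout here are my own, purely for illustration): each architectural register holds two 64-bit lanes, a plain ADD applies per lane, and a load whose index register is a vector becomes a two-element gather.

```c
#include <stdint.h>

/* Hypothetical model of a "doubled" register:
 * two 64-bit lanes, one per scheduled loop iteration. */
typedef struct { uint64_t lane[2]; } vreg;

/* ADD Rd, Rs, Rt: adds both 64-bit halves independently. */
static vreg v_add(vreg a, vreg b) {
    vreg r;
    r.lane[0] = a.lane[0] + b.lane[0];
    r.lane[1] = a.lane[1] + b.lane[1];
    return r;
}

/* Doubled load: the index register holds one index per lane,
 * and each lane of the destination is gathered from base[idx]. */
static vreg v_load_gather(const uint64_t *base, vreg idx) {
    vreg r;
    r.lane[0] = base[idx.lane[0]];
    r.lane[1] = base[idx.lane[1]];
    return r;
}
```

The point being that the instruction stream itself stays scalar-looking; only the register file and a few gather/scatter-aware loads and stores know the lanes exist.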
But, as noted, I would likely need a Kintex or similar to have the LUT
budget for something like this...
> Besides, if he can't fit something into device, it
> does not mean that experienced FPGA guy will have the same difficulties.
> But let's leave it aside and concentrate on devices.
> So, 23,360 logic cells and 80 DSP slices.
> For comparison, the biggest device in the same decade old Xilinx 7
> series is Virtex XC7VH870T with 876,160 logic cells and 2520 DSP slices.
A Virtex-7 is also several orders of magnitude more expensive...
If a chip costs more than a typical person will have during their
lifetime, it almost may as well not exist as far as they are concerned.
I guess a person can "rent" access to Virtex devices via remote cloud
servers. Still not very practical though.
> Newer high-end FPGA devices are bigger by another order of magnitude
> although not in DSP slices area: Virtex ULTRAScale+ XCVU19P has 8,938,000
> logic cells and 3,840 DSP slices. But that device is also not quite new.
> In more recent years Xilinx lost interest in traditional FPGAs and is trying
> to push "Adaptive Compute Acceleration Platform" (AGAPs). Some of those have
> rather insane amount of multipliers, so big that they stopped counting DSP
> slices and are now counting DSP Engines. I am too lazy to dig deeper and find
> out what it really means, but one thing is sure - there are a lot of compute
> resources in these devices. May be not as much as in leading edge GPUs,
> but it's the same order of magnitude.
> Were are not talking about 10 or 100 or 1000 FPUs on the high end AGAPs.
> More like 10,000.
These look, if anything, more relevant to "AI" and/or "bitcoin
mining" than to traditional FPGA use cases.
They look less relevant to me personally, as "do whole lots of FPU
math" isn't really the typical bottleneck in my projects.
Granted, if one does want a "thing that does lots of FPU math", this
could make sense.
>> Some of these FPGAs are more expensive than a CPU running at decent
>> but not extraordinary frequency.
> That's another understatement.
General case performance even on par with a RasPi is difficult...
Ironically, it is a little easier to compete with a RasPi for software
OpenGL, as the RasPi just sorta sucks at this (it effectively "face
plants" so hard as to offset its clock-speed advantage).
In a "performance per clock" sense at this task, my BJX2 core somewhat
beats out my Ryzen 7 for this as well. For the NN tests, it almost gets
almost a little absurd...
For more "general purpose" code, the BJX2 core kinda gets its crap
handed back to it though.
Though, in more realistic scenarios, it is hard to get an Artix-7
anywhere near the speeds of my desktop PC; it only comes close in
contrived scenarios (e.g., where the cost of emulating Binary16
FP-SIMD and similar on a PC via bit-twiddling exceeds the clock-speed
delta, ~74x).
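For reference, decoding Binary16 on a PC without hardware support means roughly the following per element (this is a standard half-to-float bit expansion, not code from my emulator); doing this per lane, per operation, both directions, is where the overhead piles up.

```c
#include <stdint.h>
#include <string.h>

/* Expand an IEEE 754 binary16 value to a host float via bit-twiddling. */
static float fp16_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                       /* signed zero */
        } else {
            /* subnormal: renormalize the mantissa */
            exp = 127 - 15 + 1;
            while (!(man & 0x400)) { man <<= 1; exp--; }
            man &= 0x3FF;
            bits = sign | (exp << 23) | (man << 13);
        }
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000 | (man << 13); /* Inf / NaN */
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```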
This sort of thing sometimes poses issues for my emulator, as some
instructions are harder to emulate efficiently.
For example, some of the compressed texture instructions and similar
only "keep up" as they secretly cache recently decoded blocks. A direct
implementation of the approach used in the Verilog implementation would
be too slow to emulate in real time.
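A minimal sketch of the sort of decoded-block cache I mean (the sizes and the decode function here are placeholder stand-ins, not the actual emulator code): the full decode only runs on a tag miss, so hot textures mostly hit the cache.

```c
#include <stdint.h>

#define TCACHE_SIZE 256  /* direct-mapped; size is an arbitrary choice */

typedef struct {
    uint64_t block_addr;   /* tag; 0 doubles as "empty" here, so callers
                              are assumed to use nonzero block addresses */
    uint32_t texel[16];    /* decoded 4x4 block of RGBA texels */
} tcache_entry;

static tcache_entry tcache[TCACHE_SIZE];
static int decode_calls;   /* counts how often the slow path runs */

/* Stand-in for the actual (expensive) block-decode logic. */
static void decode_block(uint64_t addr, uint32_t out[16]) {
    decode_calls++;
    for (int i = 0; i < 16; i++)
        out[i] = (uint32_t)addr + (uint32_t)i;
}

/* Return decoded texels for a block, decoding only on a miss. */
static const uint32_t *get_decoded_block(uint64_t addr) {
    tcache_entry *e = &tcache[(addr >> 3) % TCACHE_SIZE];
    if (e->block_addr != addr) {
        decode_block(addr, e->texel);
        e->block_addr = addr;
    }
    return e->texel;
}
```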
Things like emulating cache latency are a double-edged sword: it isn't
super cheap to evaluate cache hits and misses, but the cache misses
reduce how much work the emulator needs to do to keep up.
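Roughly, the latency model amounts to something like the following (the line size and miss penalty here are made-up numbers for illustration): every access pays the tag check on the host side, but a miss charges many guest cycles while the emulator does almost no extra work for them.

```c
#include <stdint.h>

#define DC_LINES     256
#define LINE_SHIFT   4    /* 16-byte lines; an assumed geometry */
#define MISS_PENALTY 10   /* guest cycles charged per miss; assumed */

static uint64_t dc_tag[DC_LINES];  /* direct-mapped tag array */
static uint64_t guest_cycles;      /* emulated-time accumulator */

/* Model one data access: cheap on a hit, charge a stall on a miss. */
static void model_access(uint64_t addr) {
    uint64_t line = addr >> LINE_SHIFT;
    uint64_t *tag = &dc_tag[line % DC_LINES];
    if (*tag == line) {
        guest_cycles += 1;
    } else {
        *tag = line;
        guest_cycles += MISS_PENALTY;
    }
}
```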
But, yeah, otherwise it would appear that a 150 MHz BJX2 core would be
fast enough to run Quake 3 Arena and similar with software rasterized
OpenGL at "fairly playable" framerates...
But, this will likely need to remain "in theory", as I don't have $k to
drop on a Kintex board or similar to find out...
But, say, if some company or whatever wanted to throw $k my way (both
for the FPGA board, and for "cost of living" reasons), I could probably
make it happen (as it stands, I am divided in my efforts, also needing
to spend a chunk of time out in a machine shop).
Probably not going to happen though.
Though, compared with an "actual PC", it isn't very practical.
And, even a RasPi can run Quake 3 pretty easily (and a lot cheaper) if
one can make use of its integrated GPU (main annoyance being that it
uses GLES 2.0 rather than OpenGL 1.x).
For Quake 1, to get it as good as it is, I had to resort to some
trickery, like rewriting "BoxOnPlaneSide" and similar in ASM, ...
To get much more speed, I would likely need a differently organized 3D
engine: likely per-texture quad arrays which are rebuilt only when the
camera moves into a different PVS or similar (rather than walking the
BSP and similar every frame).
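The idea above, in sketch form (rebuild_quad_arrays here is a stub standing in for the real per-texture BSP walk; none of these names are from an actual engine):

```c
/* Rebuild cached per-texture quad arrays only when the camera's PVS
 * cluster changes; otherwise just resubmit the cached geometry. */
static int rebuilds;              /* counts how often the walk runs */
static int last_cluster = -1;

static void rebuild_quad_arrays(int cluster) {
    (void)cluster;
    rebuilds++;  /* would walk the BSP once, bucketing quads by texture */
}

static void draw_quad_arrays(void) {
    /* would submit one array per texture to the rasterizer */
}

static void draw_world(int cam_cluster) {
    if (cam_cluster != last_cluster) {
        rebuild_quad_arrays(cam_cluster);
        last_cluster = cam_cluster;
    }
    draw_quad_arrays();
}
```

So the expensive BSP traversal runs only on cluster transitions, and the per-frame cost drops to submitting already-built arrays.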