
The Computer of the Future


Quadibloc
Feb 15, 2022, 5:48:03 AM
I noticed that in another thread I might have seemed to have contradicted myself.

So I will clarify.

In the near term, in two or three years, I think that it's entirely possible that we
will have dies that combine four "GBOoO" performance cores with sixteen in-
order efficiency cores, and chips that have four of those dies in a package, to
give good performance on both number-crunching and database workloads.

In the _longer_ term, though, when Moore's Law finally runs out of steam
and chips are even bigger... even if putting more than 64 or 128 cores
on a chip becomes possible with the most advanced silicon process
attainable, I think the number of cores will top out, and instead
of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
cores (of modest size) for the efficiency-core contingent.

That's because eventually they'll bump into constraints with memory
bandwidth, but there's still some headroom left.

I do think that eventually a Cray-like vector architecture is something that
should be considered as we look for ways to make chips more powerful.
After all, we've gone from MMX all the way to AVX-512 on the one hand,
and on the other hand, efforts have been made to make GPU computing
more versatile and flexible.

Today, some Intel chips slow down their clock rates when doing AVX-512
operations. This reduces, but does not eliminate, the performance
increase in going from AVX-256 to AVX-512.

What I'm thinking a chip of the future, aimed at the ultimate in high
performance, might do is this:

It would have vector instructions similar to those of the Cray-1 or its
successors.

These instructions would use floating-point ALUs that run more slowly
than the chip's regular main floating-point ALU, organized into something
that _somewhat_ resembles a GPU (but not exactly, so as to be versatile
and flexible enough to handle everything a vector supercomputer can do).

To avoid the problem current Intel chips have of E-cores that can't handle
AVX-512 instructions, I think it might be sensible to take a leaf out of
Bulldozer's book.

Let's have a core complex that looks like this:

One performance core.
One long (Cray-like) vector unit.
Four to eight efficiency cores.

So if you _have_ vector instructions in your cores, when you switch from
the performance cores to the efficiency cores, you just leave the vector
unit turned on (instead of trying to copy the contents of its big vector
registers).
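To make the sharing concrete, here is a rough data-structure sketch in C of
the core complex I have in mind; all the names and counts are illustrative
placeholders, not a real API:

/* Illustrative sketch only: a data-structure view of the proposed core complex.
   Names, types, and counts are hypothetical. */
#include <stdint.h>

#define N_EFF_CORES 8              /* "four to eight" efficiency cores */
#define VREG_COUNT  8              /* the Cray-1 had eight vector registers */
#define VREG_ELEMS  64             /* 64 elements per vector register */

struct cpu_core;                   /* scalar core state, details elided */

struct vector_unit {
    double v[VREG_COUNT][VREG_ELEMS];  /* the big Cray-like vector register file */
    int    powered_on;
};

struct core_complex {
    struct cpu_core   *performance_core;              /* one "GBOoO" core */
    struct cpu_core   *efficiency_core[N_EFF_CORES];  /* in-order cores */
    struct vector_unit shared_vectors;                /* shared by the whole complex */
};

/* Migrating a thread from the performance core to an efficiency core moves only
   the scalar architectural state; the shared vector unit (and the contents of
   its large vector registers) is simply left powered on. */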

John Savard

MitchAlsup
Feb 15, 2022, 10:42:40 AM
On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
> I noticed that in another thread I might have seemed to have contradicted myself.
>
> So I will clarify.
>
> In the near term, in two or three years, I think that it's entirely possible that we
> will have dies that combine four "GBOoO" performance cores with sixteen in-
> order efficiency cores, and chips that have four of those dies in a package, to
> give good performance on both number-crunching and database workloads.
>
> In the _longer_ term, though, when Moore's Law finally runs out of steam,
> and chips are even bigger... instead of putting more than 64 or 128 cores
> on a chip, if that becomes possible with the most advanced silicon
> process attainable... I think the number of cores will top out, and instead
> of having 4,096 in-order cores on a chip, we'll see perhaps 64 out-of-order
> cores (of modest size) instead for the efficiency core contingent.
>
> That's because eventually they'll bump into constraints with memory
> bandwidth, but there's still some headroom left.
>
> I do think that eventually a Cray-like vector architecture is something that
> should be considered as we look for ways to make chips more powerful.
<
CRAYs were designed to consume memory bandwidth, something you said
will top out in the paragraph above; to consume as much memory bandwidth
as someone can afford to build, while tolerating the ever-growing latency
measured in cycles.
<
> After all, we've gone from MMX all the way to AVX-512 on the one hand,
These are not CRAY-like.
> and on the other hand, efforts have been made to make GPU computing
> more versatile and flexible.
These are neither CRAY-like nor MMX/AVX-like,
and they grossly morph from generation to generation.
>
> Today, some Intel chips slow down their clock rates when doing AVX-512
> operations. This reduces, but does not eliminate, the performance
> increase in going from AVX-256 to AVX-512.
>
> What I'm thinking a chip of the future, aimed at the ultimate in high
> performance might do is this:
>
> It would have vector instructions similar to those of the Cray-I or its
> successors.
<
VVM provides everything CRAYs do, and allows the HW to organize itself
AVX-fashion, all from a scalar instruction set, and without dragging
ever-larger register files around.
>
> These instructions would use floating point ALUs that run more slowly
> than the regular main floating-point ALU on the chip, which are organized
> into something that _somewhat_ resembles a GPU (but not exactly, so as
> to be versatile and flexible enough to handle everything a vector
> supercomputer can do).
<
You simply FAIL to understand the model of the GPU. They are not:: really
wide SIMD, they are really wide SIMT, almost as if thousands of "threads"
were on a barrel scheduler. On cycle[k] they perform 32 instructions for
threads[m..m+31]; on cycle[k+1] they perform 32 instructions from
threads[x..x+31]. Threads[m] and threads[x] can be from entirely different
"draw calls", running different code under different MMU tables,...
All this "different" is there to tolerate the latency of memory. Your typical
LD may take 100 trips around the barrel. So you need other threads to
operate while waiting.
<
GPUs satisfy high IPC only on embarrassingly parallel calculation patterns;
patterns which do not contain branches.
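To illustrate the barrel-scheduling idea in code, here is a toy C simulation:
one warp of 32 lanes issues per cycle, and a warp waiting on a ~100-cycle load
simply isn't eligible until its data returns. Warp count, latency, and the
1-in-4 load mix are made-up illustrative numbers, not any particular GPU.

/* Toy model of a barrel-scheduled SIMT machine: enough resident warps hide a
   long memory latency.  All numbers are illustrative. */
#include <stdio.h>

#define WARPS       48      /* resident warps */
#define WARP_WIDTH  32      /* lanes that issue together */
#define MEM_LATENCY 100     /* cycles until a load's data returns */
#define CYCLES      10000

int main(void)
{
    int  ready_at[WARPS] = {0};   /* cycle at which each warp may issue again */
    int  issues[WARPS]   = {0};   /* instructions issued so far by each warp */
    long lane_ops = 0, idle = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int picked = -1;
        for (int w = 0; w < WARPS; w++)          /* pick the first ready warp */
            if (ready_at[w] <= cycle) { picked = w; break; }

        if (picked < 0) { idle++; continue; }    /* no warp ready: slot wasted */

        lane_ops += WARP_WIDTH;                  /* 32 lane-operations this cycle */
        issues[picked]++;
        /* pretend every 4th instruction of a warp is a load that stalls it */
        ready_at[picked] = (issues[picked] % 4 == 0) ? cycle + MEM_LATENCY
                                                     : cycle + 1;
    }
    printf("%ld lane-ops issued, %ld idle cycles out of %d\n",
           lane_ops, idle, CYCLES);
    return 0;
}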

Brett
Feb 15, 2022, 1:54:46 PM
Branches used to matter, but now the herd of chickens is so big that memory
bandwidth is the limit.

Quadibloc
Feb 15, 2022, 3:07:41 PM
On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:

> CRAYs were designed to consume memory bandwidth, something you said
> will top out in the paragraph above. To consume as much memory bandwidth
> as someone can afford to build. Consume this Bandwidth while tolerating
> the ever growing latency measured in cycles..

This puzzles me. The Cray-1 architecture worked like modern RISC architectures:
one did loads and stores from memory, and then arithmetic in the register file.
The idea was to do as much arithmetic within the register file as possible, in order
to get as much work as possible done within the constraint of the memory bandwidth.

> VVM provides everything CRAYs do, and allows the HW to organize itself
> AVX fashion all from a scalar instruction set, and without dragging ever
> larger register files around.

It's true that without an explicit register file, one is allowed to have implementations
of different sizes, which use a cache instead to conserve memory bandwidth. And
the SX-6, for example, didn't even have a cache.

I am thinking that a CRAY-style machine still needs something like VVM as well,
because the vector registers might have 64 or 256 elements, while the vectors
being processed may be far longer than that.

> You simply FAIL to understand the model of the GPU. They are not:: really
> wide SIMD, they are really wide SIMT almost as if thousands of "threads"
> on a barrel scheduler. On cycle[k] they perform 32 instructions for
> threads[m..m+31], on cycle[k+1] they perform 32 instructions from
> threads[x..x+31]. Threads[k] and thread[x] can be from entirely different
> "draw calls" with running different code under different MMU tables,...
> All this "different" is there to tolerate the latency of memory. Your typical
> LD may take 100 trips around the barrel. So you need other threads to
> operate while waiting.

Yes, that is an important point. Memory latency is a serious constraint in
modern architectures, so an architecture that can take full advantage of
memory bandwidth despite that latency is useful.

John Savard

MitchAlsup
Feb 15, 2022, 3:48:59 PM
On Tuesday, February 15, 2022 at 2:07:41 PM UTC-6, Quadibloc wrote:
> On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
>
> > CRAYs were designed to consume memory bandwidth, something you said
> > will top out in the paragraph above. To consume as much memory bandwidth
> > as someone can afford to build. Consume this Bandwidth while tolerating
> > the ever growing latency measured in cycles..
<
> This puzzles me. The CRAY-I architecture worked like modern RISC architectures:
> one did loads and stores from memory, and then arithmetic in the register file.
<
Memory was uncached and 14 cycles away (10 in the Cray 1-S).
Vectors were used so the latency of uncached access was hidden under the
64 memory requests (which took 78 cycles of latency).
<
Can your proposed system read 4 different cache lines in 78 cycles ?
<
> The idea was to do as much arithmetic within the register file as possible, in order
> to get as much work as possible done within the constraint of the memory bandwidth.
<
The idea was to get 64 or 128 arithmetic operations done while waiting out
the 78 cycles before you can start arithmetic again.
<
> > VVM provides everything CRAYs do, and allows the HW to organize itself
> > AVX fashion all from a scalar instruction set, and without dragging ever
> > larger register files around.
<
> It's true that without an explicit register file, one is allowed to have implementations
> of different sizes, which use a cache instead to conserve memory bandwidth. And
> the SX-6, for example, didn't even have a cache.
<
None of the Crays had a cache, and the S-register file of Cray-2 was a failure
(compilers could hardly program it)
<
Later Crays could have 2 LDs and a ST pending on memory at the same time
(192 individual references), and let us postulate that a modern CRAY would run
at 5 GHz. That is 120 GB/s per core. I will let you multiply by the number of cores
you want.
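Spelling that arithmetic out as a back-of-the-envelope check, using the
numbers just given:

/* Back-of-the-envelope check of the 120 GB/s-per-core figure above. */
#include <stdio.h>

int main(void)
{
    double refs_per_cycle = 3.0;   /* 2 loads + 1 store pending per cycle */
    double bytes_per_ref  = 8.0;   /* 64-bit words */
    double clock_ghz      = 5.0;   /* postulated modern clock rate */

    printf("%.0f GB/s per core\n", refs_per_cycle * bytes_per_ref * clock_ghz);
    return 0;
}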
<
Vector machines used banked memory systems. The cores would spew out 3
references per cycle continuously. How many banks is your purported computer
architecture going to have? Unless it is way above 64 banks, you have no chance.
Most computers today ship with 1 or 2 DRAM DIMMs. This is where the volume is;
this is where you should be designing.
<
Vector machines came into existence for a narrow range of applications that require
high bandwidth memory and high FP calculation rates where the data sets had no
chance of fitting into cache.
<
Vector machines fell out of favor when memory latency got so large that the vector
size no longer covered memory latency. NEC hung in for a while by changing the
length of the vectors from 64->128 then to 256. At this point the vector register
file access time became problematic in pipelining.
<
Cray-like vectors had run their course.

Thomas Koenig
Feb 15, 2022, 4:09:12 PM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> I noticed that in another thread I might have seemed to have contradicted myself.
>
> So I will clarify.
>
> In the near term, in two or three years, I think that it's entirely possible that we
> will have dies that combine four "GBOoO" performance cores with sixteen in-
> order efficiency cores, and chips that have four of those dies in a package, to
> give good performance on both number-crunching and database workloads.

Who actually needs number crunching?

I certainly do, but even in the company I work in (which is in
the chemical industry, so rather technical) the number of people
actually running code which depends on floating-point execution
speed is rather small, probably in the low single-digit percent
range of all employees.

That does not mean that floating point is not important :-) but that
most users would not notice if they had a CPU with, let's say, a
reasonably efficient software emulation of floating point numbers.
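For a feel of what such an emulation involves, here is a toy single-precision
multiply done purely with integer operations; it deliberately ignores zero,
infinity, NaN, denormals, and correct rounding (it truncates), so it is a
cost sketch rather than a usable library routine:

#include <stdint.h>

/* Toy soft-float multiply: fine for ordinary normalized inputs, but skips all
   the special cases and rounding a real emulation library must handle. */
uint32_t soft_fmul(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t  exp  = (int32_t)((a >> 23) & 0xFF)
                  + (int32_t)((b >> 23) & 0xFF) - 127;   /* re-biased sum */
    uint64_t ma   = (a & 0x7FFFFFu) | 0x800000u;         /* restore hidden 1 */
    uint64_t mb   = (b & 0x7FFFFFu) | 0x800000u;

    uint64_t prod = ma * mb;                 /* 48-bit significand product */
    if (prod & (1ull << 47)) {               /* product >= 2.0: renormalize */
        prod >>= 1;
        exp  += 1;
    }
    return sign | ((uint32_t)exp << 23) | (uint32_t)((prod >> 23) & 0x7FFFFFu);
}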

Such a CPU would look horrible in SPECfp, and the savings from removing
floating point from a general-purpose CPU are probably not that great, so
it is not done; and I think that, as an intensive user of floating point,
I have to be grateful for that.

Hmm, come to think of it, that is the first positive thing about
SPEC that has occurred to me in quite a few years...

Scott Smader
Feb 15, 2022, 4:52:05 PM
Wow. That reads like a great teaser for an upcoming "The History of Computer Architecture" by Mitch Alsup, and I'm eager to read the whole story.

I hope it's something you might consider. In your spare time.

Quadibloc
Feb 16, 2022, 5:02:07 AM
On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
> On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:

> > After all, we've gone from MMX all the way to AVX-512 on the one hand,

> These are not CRAY like.

That's true. One of the problems you've noted with the current
vector architectures is that they keep changing the instruction
set in order to make the vectors bigger.

A Cray-like architecture makes the vectors _much_ bigger than even
in AVX-512, so in my naivete, I would have thought that this would
allow the constant changes to stop for a while.

As it became possible to put more and more transistors on a chip,
at first, the best way to make use of those extra transistors was
obvious. Give the computer the ability to do 16-bit arithmetic,
not just 8-bit arithmetic. Add floating-point in hardware.

Adding Cray-like vector instructions seemed to me like the final
natural step in this evolution. Existing vector instructions kept
getting wider, so this would take that to its limit.

This doesn't mean I expect to solve all supercomputing problems
that way. I don't claim to have a magic wand with which to solve the
memory latency issue.

However, today's microprocessors *do* have L3 caches that are as
big as the memory of the original Cray-1. So, while that wouldn't
help in solving the problems that the supercomputers of today are
working on, it _would_ make more arithmetic operations available,
say, to video game writers.

I'm envisaging a chip that tries to increase memory bandwidth, but
only within bounds suitable to a "consumer" product. Just as heatsinks
have grown much bigger than people would have expected back in the
days of the 386 processor, I'm thinking we could go with having 512
data lines going out of a processor, with four signal levels to further
double memory bandwidth.

This is all predicated on the assumption that, given that lithography
has reached its ultimate limits, and so Moore's Law is over, people
are desperate for more performance. They don't know how to do
parallel programming well, and so they're desperate for things that
make parallelism more palatable - like out-of-order execution and
vectors.

GPU hardware is apparently the best way to get the most FLOPs on
a chip. It may be a bad fit for a Cray-like ISA, but the native GPU
design is a bad fit for programmers. And no two GPUs are alike.

Exactly how a modified GPU design aimed at simulating a Cray
or multiple Crays in parallel working on different problems might
look is not clear to me, but I presume that if one can put a bunch
of ALUs on a chip, and one can organize that to look like a GPU
or like a Xeon Phi (but with RISC instead of x86), it could also be
organized to look like something in between adapted to a
Cray-like instruction set.

Since Crays used 64-element vector registers for code in loops
that handled vectors with more than 64 elements, that these loops
might well be... augmented... by means of something looking a
bit like your VVM is also not beyond the bounds of imagination.
(But if you're using something like VVM, why have vector instructions?
Reducing decoding overhead!)
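To make that strip-mining pattern concrete, this is the classic shape of such
a loop in C, with the 64-element strip matching the Cray register length; the
inner loop is what would map onto one batch of vector-register instructions
(an illustrative sketch, not any particular compiler's output):

/* Strip-mined DAXPY: the outer loop walks a long vector in 64-element strips
   (one Cray-style vector register's worth); the inner loop is the part a
   single batch of vector instructions would cover.  Illustrative only. */
void daxpy_stripmined(long n, double a, const double *x, double *y)
{
    enum { VL = 64 };                             /* vector register length */
    for (long i = 0; i < n; i += VL) {
        long m = (n - i < VL) ? (n - i) : VL;     /* last strip may be short */
        for (long j = 0; j < m; j++)              /* one vector operation's worth */
            y[i + j] = a * x[i + j] + y[i + j];
    }
}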

Of course, though, my designs will have scalar floating-point
instructions, short vector instructions (sort of like AVX-256),
and long vector instructions (like a Cray)... because they're
intended to illustrate what an architecture burdened with a
rather large amount of legacy stuff carried over would look like.
But because it was designed on a clean sheet of paper, it only gets
one kind of short vector to support, rather than several like an x86.

And there would be a somewhat VVM-like set of
vector of vector wrapper instructions that could be wrapped
around *any* of them.

Which combination does it make sense to use? Why, that's
outlined in the Application Notes for the particular device
implementing the ISA that you're using. So the same ISA
serves supercomputers, servers, desktop PCs, and smartphones;
software is tailored to where in this food chain it's being used,
but it shares as much as it can...

John Savard

MitchAlsup
Feb 16, 2022, 1:25:10 PM
On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
> On Tuesday, February 15, 2022 at 8:42:40 AM UTC-7, MitchAlsup wrote:
> > On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
>
> > > After all, we've gone from MMX all the way to AVX-512 on the one hand,
>
> > These are not CRAY like.
> That's true. One of the problems you've noted with the current
> vector architectures is that they keep changing the instruction
> set in order to make the vectors bigger.
>
> A Cray-like architecture makes the vectors _much_ bigger than even
> in AVX-512, so in my naivete, I would have thought that this would
> allow the constant changes to stop for a while.
<
Point of Order::
CRAY vectors are processed in a pipeline, 1, 2, or 4 units of work per unit time.
AVX vectors are processed en masse <however wide> per unit time.
These are VASTLY different things.
<
>
> As it became possible to put more and more transistors on a chip,
> at first, the best way to make use of those extra transistors was
> obvious. Give the computer the ability to do 16-bit arithmetic,
> not just 8-bit arithmetic. Add floating-point in hardware.
>
> Adding Cray-like vector instructions seemed to me like the final
> natural step in this evolution. Existing vector instructions kept
> getting wider, so this would take that to its limit.
>
> This doesn't mean I expect to solve all supercomputing problems
> that way. I don't claim to have a magic wand with which to solve the
> memory latency issue.
>
> However, today's microprocessors *do* have L3 caches that are as
> big as the memory of the original Cray I.
<
But with considerably LOWER concurrency.
A CRAY might have 64 memory banks (NEC up to 245 banks)
.....Each bank might take 5-10 cycles to perform 1 request
.....but there can be up to 64 requests being performed.
<
At best, a modern L3 can be doing 3 things:
Receiving write data,
Routing Data around the SRAM matrix,
Sending out read data.
<
There is nothing fundamental about the difference, but L3 caches are
not built to have the concurrency of CRAY's banked memory.
<
> So, while that wouldn't
> help in solving the problems that the supercomputers of today are
> working on, it _would_ make more arithmetic operations available,
> say, to video game writers.
<
In principle, yes; in practice, not so much.
>
> I'm envisaging a chip that tries to increase memory bandwidth, but
> only within bounds suitable to a "consumer" product. Just as heatsinks
> have grown much bigger than people would have expected back in the
> days of the 386 processor, I'm thinking we could go with having 512
> data lines going out of a processor. With four signal levels to further
> double memory bandwidth.
<
PCIe 6.0 uses a 16 GHz clock to send 4 bits per wire per cycle, using
double data rate and PAM4 modulation, and achieves 64 GT/s per wire in
each direction. So 4 pins (true/complement out, true/complement in) provide
8 GB/s out and 8 GB/s in.
<
Now, remember from yesterday our 120 GB/s per core. You will need
10 of these 4-pin lanes to support the inbound bandwidth and 5 to support
the outbound bandwidth.
<
But hey, if you want to provide 512 pins, I'm sure you can find some use
for this kind of bandwidth. {But try dealing with the heat.}
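Putting the lane arithmetic in one place (using the 8 GB/s-per-direction figure
above and the 120 GB/s per core from yesterday; the 2-load : 1-store split is
my assumption):

/* Rough pin arithmetic for the figures quoted above; the 2:1 load/store
   split of the 120 GB/s is an assumption. */
#include <stdio.h>

int main(void)
{
    double per_lane_gb = 64.0 / 8.0;        /* 64 Gb/s each way per 4-pin lane = 8 GB/s */
    double inbound     = 120.0 * 2.0 / 3.0; /* 80 GB/s of loads  */
    double outbound    = 120.0 * 1.0 / 3.0; /* 40 GB/s of stores */

    printf("lanes needed: %.0f in, %.0f out\n",
           inbound / per_lane_gb, outbound / per_lane_gb);   /* 10 in, 5 out */
    return 0;
}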
>
> This is all predicated on the assumption that, given that lithography
> has reached its ultimate limits, and so Moore's Law is over, people
> are desperate for more performance. They don't know how to do
> parallel programming well, and so they're desperate for things that
> make parallelism more palatable - like out-of-order execution and
> vectors.
<
More pins wiggling faster has always provided more bandwidth.
Being able to absorb the latency has always been the problem.
{That and paying for it: $$$ and heat}
>
> GPU hardware is apparently the best way to get the most FLOPs on
> a chip. It may be a bad fit for a Cray-like ISA, but the native GPU
> design is a bad fit for programmers. And no two GPUs are alike.
<
GPUs are evolving like CPUs were evolving from 1948 to 1980.
GPUs are being modified each generation in order to address
bad performance characteristics of last-generation GPUs.
Tolerance for the divergence found in ray-tracing applications
is the modern addition, which required a pretty fundamental
change in how WARPs are organized and reorganized over
time. The Gen[-2] instruction set provided no concept of WARP
reorganization; we are just coming to grips with what needs
fixing in Gen[-1] while kicking Gen[0] out the door and designing
Gen[+1].
>
> Exactly how a modified GPU design aimed at simulating a Cray
> or multiple Crays in parallel working on different problems might
> look is not clear to me, but I presume that if one can put a bunch
> of ALUs on a chip, and one can organize that to look like a GPU
> or like a Xeon Phi (but with RISC instead of x86), it could also be
> organized to look like something in between adapted to a
> Cray-like instruction set.
>
> Since Crays used 64-element vector registers for code in loops
> that handled vectors with more than 64 elements, that these loops
> might well be... augmented... by means of something looking a
> bit like your VVM is also not beyond the bounds of imagination.
> (But if you're using something like VVM, why have vector instructions?
> Reducing decoding overhead!)
<
Exactly! Let each generation of HW give the maximum performance
it can while the application code remains constant.
<
Secondly: If you want wide vector performance, you need to be organized
around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
The width appropriate for one generation is not necessarily appropriate for
the next--so don't expose width through the ISA.
<
Machines that can afford 4 FMACs per core will have enough area that
performing multiple iterations of a loop per cycle is an easily recognized
pattern. I happened to make this discovery considerably simpler with my
LOOP instruction.
>
> Of course, though, my designs will have scalar floating-point
> instructions, short vector instructions (sort of like AVX-256),
> and long vector instructions (like a Cray)... because they're
> intended to illustrate what an architecture burdened with a
> rather large amount of legacy stuff carried over. But because it
> was designed on a clean sheet of paper, it only gets one
> kind of short vectors to support, rather than several like an x86.
>
> And there would be a somewhat VVM-like set of
> vector of vector wrapper instructions that could be wrapped
> around *any* of them.
<
Question: If you have VVM and VVM performs as well as CRAY
vectors running Matrix300, why have the CRAY vector state
or bloat your ISA with CRAY vector instructions?

Quadibloc
Feb 16, 2022, 6:24:14 PM
On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:

> > A Cray-like architecture makes the vectors _much_ bigger than even
> > in AVX-512, so in my naivete, I would have thought that this would
> > allow the constant changes to stop for a while.
> <
> Point of Order::
> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
> AVX vectors are processed en massé <however wide> per unit time.
> These are VASTLY different things.

Yes. However, a faster Cray-like machine can be implemented
with as many floating-point ALUs as an AVX-style vector unit, or more.

So you could have, say, AVX-512 with eight 64-bit floats across, and
then switch to Cray in the next generation with sixteen ALUs, and then
stick with Cray with thirty-two ALUs in the generation after that.

> > However, today's microprocessors *do* have L3 caches that are as
> > big as the memory of the original Cray I.

> But with considerably LOWER concurrency.
> A CRAY might have 64 memory banks (NEC up to 245 banks)
> .....Each bank might take 5-10 cycles to perform 1 request
> .....but there can be up to 64 requests being performed.

But if the cache is *on the same die*, having more wires connecting it
to the CPU isn't much of a problem?

> At best modern L3 can be doing 3:
> Receiving write data,
> Routing Data around the SRAM matrix,
> Sending out read data.
> <
> There is nothing fundamental about the difference, but L3 caches are
> not build to have the concurrency of CRAYs banked memory.

So we seem to be in agreement on this point.

> PCIe 6.0 uses 16 GHz clock to send 4 bits per wire per cycle using
> double data rate and PAM4 modulation; and achieves 64GTs per wire
> each direction. So 4 pins: true-comp out, true comp in: provide 8GB/s
> out and 8GB/s in.

Of course, PCIe 6.0 is a complicated protocol, while interfaces
like DDR5 to DRAM are kept simple by comparison.

> But hey, if you want to provide 512 pins, I sure you can find some use
> for this kind of bandwidth. {but try dealing with the heat.}

Presumably chips implemented in the final evolution of 1nm or whatever
will run slightly cooler.

I had thought that the number of CPUs in the package was what governed
the heat, and using more pins for data would not be too bad. If that's not
true, then, yes, this would be *one* fatal objection to my concepts.

> More pins wiggling faster has always provided more bandwidth.
> Being able to absorb the latency has always been the problem.
> {That and paying for it: $$$ and heat}

Of course, the way to absorb latency is to do something else while
you're waiting. So now you need bandwidth for the first thing you
were doing, and now more bandwidth for the something else
you're doing so as to make the latency less relevant.

This sounds like a destructive paradox. But since latency is
fundamentally unfixable (until you make the transistors and
wires faster) while you can have more bandwidth if you pay for
it, the idea of having the amount of bandwidth you needed in
the first place, times eight or so, almost makes sense.
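This is essentially Little's Law: the data that must be in flight equals
bandwidth times latency, so a fatter pipe at the same latency needs
proportionally more outstanding requests. A small worked example with
illustrative numbers:

/* Little's Law for memory: outstanding bytes = bandwidth x latency.
   Numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    double bw_gb_s    = 120.0;   /* sustained bandwidth target */
    double latency_ns = 100.0;   /* DRAM round trip */
    double line_bytes = 64.0;    /* cache line size */

    double in_flight = bw_gb_s * latency_ns;        /* GB/s * ns = bytes */
    printf("%.0f bytes (%.1f cache lines) must be in flight\n",
           in_flight, in_flight / line_bytes);      /* 12000 bytes, 187.5 lines */
    return 0;
}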

> > Exactly how a modified GPU design aimed at simulating a Cray
> > or multiple Crays in parallel working on different problems might
> > look is not clear to me, but I presume that if one can put a bunch
> > of ALUs on a chip, and one can organize that to look like a GPU
> > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > organized to look like something in between adapted to a
> > Cray-like instruction set.

....and a Cray-like instruction set could be like a later generation of
the Cray, with more and longer vector registers, and in other ways
it could move to being more GPU-like if that was needed to fix some
flaws.

> > Since Crays used 64-element vector registers for code in loops
> > that handled vectors with more than 64 elements, that these loops
> > might well be... augmented... by means of something looking a
> > bit like your VVM is also not beyond the bounds of imagination.
> > (But if you're using something like VVM, why have vector instructions?
> > Reducing decoding overhead!)

> Exactly! Let each generation of HW give the maximum performance
> if can while the application code remains constant.

I'm glad you approve of something...

> Secondly: If you want wide vector performance, you need to be organized
> around ¼, ½ 1 cache line per clock out of the cache and back into the cache.
> The width appropriate for one generation is not necessarily appropriate for
> the next--so don't expose width through ISA.

Of course, IBM's take on a Cray-like architecture avoided that pitfall, by
excluding the vector width from the ISA spec, making it model-dependent,
so that's definitely possible.

> > Of course, though, my designs will have scalar floating-point
> > instructions, short vector instructions (sort of like AVX-256),
> > and long vector instructions (like a Cray)... because they're
> > intended to illustrate what an architecture burdened with a
> > rather large amount of legacy stuff carried over. But because it
> > was designed on a clean sheet of paper, it only gets one
> > kind of short vectors to support, rather than several like an x86.

> > And there would be a somewhat VVM-like set of
> > vector of vector wrapper instructions that could be wrapped
> > around *any* of them.

> Question: If you have VVM and VVM performs as well as CRAY
> vectors running Matrix300, why have the CRAY vector state
> or bloat your ISA with CRAY vector instructions?

Now, that's a very good question.

Possible answer 1:
This is only included in the ISA because the spec is meant to
illustrate possibilities, and would be omitted in any real-world
CPU.

Possible answer 2:
The idea is that VVM wrapped around scalar floating-point
instructions works well for vectors that are "this" long;

VVM wrapped around AVX-style vector instructions works for
vectors that are 4x longer, in proportion to the number of floats
in a single AVX vector...

VVM wrapped around Cray-style vector instructions is intended
for vectors that are 64x longer than VVM wrapped around scalar
instructions.

Assume VVM around scalar handles vectors somewhat longer
than Cray without VVM. Then what we've got is a range of options,
each one adapted to how long your vectors happen to be. (And to
things like granularity, because using VVM around Cray for vectors
2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)

Possible answer 3:
And I might fail to implement VVM well enough to avoid some
associated overhead.

John Savard

Quadibloc
Feb 16, 2022, 6:32:06 PM
Actually, there's _another_ point I would need to raise here.

In your design, VVM is the primary way to create vector operations.

So if additional ALU resources are brought into play for vector
computation, using VVM would invoke them.

On the other hand, I have a prejudice against instruction fusion,
as I feel it complicates decoding.

And so I'm thinking in terms of a "dumb" implementation of VVM.
It reduces loop overhead, it permits data forwarding and stuff like
that...

but it _doesn't_ tell the scalar floating-point instructions to start
using the ALUs that belong to the AVX instructions or the Cray
instructions.

So if you've got a 65,536-element vector, you can *choose* to
process it using VVM around scalar, VVM around AVX, or
VVM around Cray.

What this will *change* is how many instances of your program
can be floating around on your CPU at a given time running
concurrently - instead of some sitting idle on disk while a handful
are actually running.

The VVM around Cray program would be the one that could finish
faster in wall clock time if it happened to be given the run of the
entire CPU without having to share.

John Savard

Quadibloc
Feb 16, 2022, 6:39:57 PM
On Wednesday, February 16, 2022 at 4:32:06 PM UTC-7, Quadibloc wrote:

> Actually, there's _another_ point I would need to raise here.
>
> In your design, VVM is the primary way to create vector operations.
>
> So if additional ALU resources are brought into play for vector
> computation, using VVM would invoke them.
>
> On the other hand, I have a prejudice against instruction fusion,
> as I feel it complicates decoding.
>
> And so I'm thinking in terms of a "dumb" implementation of VVM.
> It reduces loop overhead, it permits data forwarding and stuff like
> that...
>
> but it _doesn't_ tell the scalar floating-point instructions to start
> using the ALUs that belong to the AVX instructions or the Cray
> instructions.
>
> So if you've got a 65,536-element vector, you can *choose* to
> process it using VVM around scalar, VVM around AVX, or
> VVM around Cray.
>
> What this will *change* is how many instances of your program
> can be floating around on your CPU at a given time running
> concurrently - instead of some sitting idle on disk while a handful
> are actually running.
>
> The VVM around Cray program would be the one that could finish
> faster in wall clock time if it happened to be given the run of the
> entire CPU without having to share.

But surely, instead of being a _versatile_ design, that's a badly
flawed design!

Don't you just want the program code to say "I want to process
a 65,536-element vector", with the CPU then doing so in the
best way possible given whatever the load on the system happens
to be at a given time?

Now, *that's* a big advantage of a _proper_ implementation of
VVM.

(Nothing, of course, _prevents_ my bloated ISA, in its
top-end implementation, from having one. In which case, the
AVX-style and Cray-style vector instructions would be entirely
superfluous for _that_ implementation. But presumably VVM
is not so hard to implement that it would make any sense not
to do it that way in the first place.)

John Savard

Quadibloc
Feb 16, 2022, 6:50:17 PM
In my overly complicated ISA, though, even with a good VVM
implementation, where VVM-around-scalar float instructions
would be able to use the Cray-style ALUs, it would *not* be able to
touch or disturb the state of the Cray-style vector registers.

So those vector instructions would miss out on one possible
resource for buffer storage, possibly having a slight impact
on their performance, which means that these different types
of really long vector instructions would still have _some_
difference.

John Savard

Quadibloc
Feb 16, 2022, 6:59:54 PM
On Wednesday, February 16, 2022 at 4:50:17 PM UTC-7, Quadibloc wrote:

> In my overly complicated ISA, though, even with a good VVM
> implementation, where VVM-around-scalar float instructions
> would be able to use the Cray-style ALUs, it would *not* be able to
> touch or disturb the state of the Cray-style vector registers.

Oh, and by the way: to prevent waste in extreme low-end
implementations, where there is only a scalar floating-point ALU,
and the AVX-like and Cray-like vector registers are faked, being
located in DRAM...

it should be noted in the ISA description that when AVX-like
vector instructions or Cray-like vector instructions are placed
within a VVM wrapper, *the associated register contents are
not guaranteed to be anything in particular on exit*.

They may get trashed, or the CPU may not bother to touch them,
depending on whether or not using them helps.

John Savard

JimBrakefield
Feb 16, 2022, 7:14:13 PM
On Tuesday, February 15, 2022 at 4:48:03 AM UTC-6, Quadibloc wrote:
Since the topic is "computers of the future", and Intel has opened up their Tofino product,
it could happen that Ethernet switches become "vector processors" of Ethernet frames?
E.g. the data is shipped to whichever switch can process it at pipeline rates?
This uses existing infrastructure and augments it with specialized high-data-rate processing.

MitchAlsup
Feb 16, 2022, 7:54:47 PM
On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> > On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>
> > > A Cray-like architecture makes the vectors _much_ bigger than even
> > > in AVX-512, so in my naivete, I would have thought that this would
> > > allow the constant changes to stop for a while.
> > <
> > Point of Order::
> > CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
> > AVX vectors are processed en massé <however wide> per unit time.
> > These are VASTLY different things.
<
> Yes. However, a faster Cray-like machine can be implemented
> with as many, or more, floating ALUs than an AVX-style vector unit.
>
> So you could have, say, AVX-512 with eight 64-bit floats across, and
> then switch to Cray in the next generation with sixteen ALUs, and then
> stick with Cray with thirty-two ALUs in the generation after that.
<
Yes, but no matter how wide the CRAY implementation was, the register
file seen by instructions had the same shape.
<
Every time AVX grows in width SW gets a different model to program to.
<
That is the difference: CRAY exposed a common model, AVX exposed
an ever-changing model.
<
> Presumably chips implemented in the final evolution of 1nm or whatever
> will run slightly cooler.
<
Logic may, pins will not.
>
> I had thought that the number of CPUs in the package was what governed
> the heat, and using more pins for data would not be too bad. If that's not
> true, then, yes, this would be *one* fatal objection to my concepts.
<
For the power cost of sending 1 byte from DRAM to the pins of the processor
you can read ~{64 to 128} bytes from nearby on-die SRAM.
<
> > More pins wiggling faster has always provided more bandwidth.
> > Being able to absorb the latency has always been the problem.
> > {That and paying for it: $$$ and heat}
<
> Of course, the way to absorb latency is to do something else while
> you're waiting. So now you need bandwidth for the first thing you
> were doing, and now more bandwidth for the something else
> you're doing so as to make the latency less relevant.
>
> This sounds like a destructive paradox. But since latency is
> fundamentally unfixable (until you make the transistors and
> wires faster) while you can have more bandwidth if you pay for
> it, the idea of having the amount of bandwidth you needed in
> the first place, times eight or so, almost makes sense.
<
> > > Exactly how a modified GPU design aimed at simulating a Cray
> > > or multiple Crays in parallel working on different problems might
> > > look is not clear to me, but I presume that if one can put a bunch
> > > of ALUs on a chip, and one can organize that to look like a GPU
> > > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > > organized to look like something in between adapted to a
> > > Cray-like instruction set.
<
> ....and a Cray-like instruction set could be like a later generation of
> the Cray, with more and longer vector registers, and in other ways
> it could move to being more GPU-like if that was needed to fix some
> flaws.
<
Since you mentioned later CRAYs--they do Gather/Scatter LDs/STs:
64 addresses to 64 different memory locations in 64 clocks per port,
with 3 of these ports.
The width of the SIMD engine is hidden from the expression of
use found in the code. Byte loops would tend to run 64-iterations
per cycle, HalfWord loops 32 iterations per cycle, Word Loops
16-iterations per cycle, DoubleWord loops 8-iterations per cycle.
<
At this width, you are already at the limit of cache access, and likely
DRAM, so I don't foresee a need for more than 1 cache line of
calculation: you can't feed it from memory.
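As a quick check, the iteration rates quoted above are just one cache line
per cycle divided by the element size (assuming 64-byte lines):

/* Iterations per cycle when the loop is fed one 64-byte cache line per cycle. */
#include <stdio.h>

int main(void)
{
    const int   line_bytes = 64;
    const int   size[]     = { 1, 2, 4, 8 };
    const char *name[]     = { "byte", "halfword", "word", "doubleword" };

    for (int i = 0; i < 4; i++)
        printf("%-10s loops: %2d iterations/cycle\n",
               name[i], line_bytes / size[i]);      /* 64, 32, 16, 8 */
    return 0;
}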
>
> VVM wrapped around AVX-style vector instructions works for
> vectors that are 4x longer, in proportion to the number of floats
> in a single AVX vector...
<
VVM supplants any need for AVX contortions. VVM supplies the
entire memory reference and calculation set of AVX with just 2
instructions.
>
> VVM wrapped around Cray-style vector instructions is intended
> for vectors that are 64x longer than VVM wrapped around scalar
> instructions.
<
VVM supplants any need for CRAY-like vectors without the register file
and without a zillion vector instructions.
>
> Assume VVM around scalar handles vectors somewhat longer
> than Cray without VVM. Then what we've got is a range of options,
> each one adapted to how long your vectors happen to be. (And to
> things like granularity, because using VVM around Cray for vectors
> 2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
<
VVM does not expose width or depth to software--this is the point you
are missing. Thus, SW has a common model over time. Users can go
about optimizing the hard parts of their applications rather than things
like byte copy. Byte copy runs just as fast as AVX copy on My 66000.
<
Why add all the gorp when you can wave your hands and make all
the cruft vanish?
>
> Possible answer 3:
> And I might fail to implement VVM well enough to avoid some
> associated overhead.
<
Not my problem.
>
> John Savard

MitchAlsup
Feb 16, 2022, 8:03:09 PM
VVM loops are not fused. So, you have nothing to worry about.
A VVM loop is a contract of what the loop must perform, and
what the register state needs to be after the loop completes.
>
> And so I'm thinking in terms of a "dumb" implementation of VVM.
> It reduces loop overhead, it permits data forwarding and stuff like
> that...
<
>
> but it _doesn't_ tell the scalar floating-point instructions to start
> using the ALUs that belong to the AVX instructions or the Cray
> instructions.
<
Because there ARE NO AVX or CRAY instructions, the programmer never got
misinformed as to what way to do this or that. HW gets to make those
choices on an implementation-by-implementation basis. The old SW just
keeps getting faster and remains near optimal across many generations.
>
> So if you've got a 65,536-element vector, you can *choose* to
> process it using VVM around scalar, VVM around AVX, or
> VVM around Cray.
<
Just code it up, let the compiler VVM the inner loops, and smile all
the way to the bank.
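For illustration, this is the kind of plain scalar inner loop being talked
about; under VVM the compiler just brackets the loop body with the two
loop-marking instructions and the hardware runs as many iterations per cycle
as its width allows, with no vector types or intrinsics in the source:

/* An ordinary scalar loop; a VVM-aware compiler vectorizes this as-is. */
void scale(long n, double a, const double *restrict src, double *restrict dst)
{
    for (long i = 0; i < n; i++)
        dst[i] = a * src[i];
}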
>
> What this will *change* is how many instances of your program
> can be floating around on your CPU at a given time running
> concurrently - instead of some sitting idle on disk while a handful
> are actually running.
<
You are not sharing these VVM computational resources across core
boundaries. Bulldozer pretty much demonstrated that you don't want to do
that.
>
> The VVM around Cray program would be the one that could finish
> faster in wall clock time if it happened to be given the run of the
> entire CPU without having to share.
<
Doing CRAY-1 stuff, VVM should be just as efficient as a CRAY.
In fact it can perform Gather/Scatter as a side effect of the fact
that it vectorizes LOOPs, not instructions.
<
For the addition of 2 instructions, you get a shot at CRAY performance,
get a shot at AVX performance, and did not waste 1000 opcodes to
get there.
>
> John Savard

Quadibloc
Feb 17, 2022, 12:09:16 AM
On Wednesday, February 16, 2022 at 6:03:09 PM UTC-7, MitchAlsup wrote:

> For the addition of 2 instructions, you get a shot CRAY performance,
> get a shot at AVX performance, and did not waste 1000 opcodes to
> get there.

I think I see where I've gone wrong now.

I've assumed that VVM might be difficult to implement, so I needed
an instruction set that would serve smaller-scale implementations.

And, even worse, I've assumed that because VVM lacks the big
vector registers of a Cray, it's constrained to do operations in memory,
which, of course, exacts a huge penalty. But that's silly: a loop with
Cray instructions in it will take input from memory and put output to
memory, while putting intermediate results in vector registers... and
a VVM loop would do exactly the same thing, except using ordinary
registers.

I mean, either operand forwarding works, or you've got problems
that have to be solved before you can be doing anything with vectors.

John Savard

Terje Mathisen
Feb 17, 2022, 3:16:21 AM
MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>>> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>>
>>>> A Cray-like architecture makes the vectors _much_ bigger than even
>>>> in AVX-512, so in my naivete, I would have thought that this would
>>>> allow the constant changes to stop for a while.
>>> <
>>> Point of Order::
>>> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
>>> AVX vectors are processed en massé <however wide> per unit time.
>>> These are VASTLY different things.
> <
>> Yes. However, a faster Cray-like machine can be implemented
>> with as many, or more, floating ALUs than an AVX-style vector unit.
>>
>> So you could have, say, AVX-512 with eight 64-bit floats across, and
>> then switch to Cray in the next generation with sixteen ALUs, and then
>> stick with Cray with thirty-two ALUs in the generation after that.
> <
> Yes but no mater how wide the CRAY implementation was, the register
> file seen by instructions had the same shape.
> <
> Every time AVX grows in width SW gets a different model to program to.
> <
> That is the difference CRAY exposed a common model, AVX exposed
> an ever changing model.

This is the main advantage of a two-stage model, as used by C#/DotNet/JVM:


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen
Feb 17, 2022, 3:19:25 AM
MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 5:24:14 PM UTC-6, Quadibloc wrote:
>> On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
>>> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
>>
>>>> A Cray-like architecture makes the vectors _much_ bigger than even
>>>> in AVX-512, so in my naivete, I would have thought that this would
>>>> allow the constant changes to stop for a while.
>>> <
>>> Point of Order::
>>> CRAY vectors are processed in a pipeline, 1,2,4 units of work per unit time.
>>> AVX vectors are processed en massé <however wide> per unit time.
>>> These are VASTLY different things.
> <
>> Yes. However, a faster Cray-like machine can be implemented
>> with as many, or more, floating ALUs than an AVX-style vector unit.
>>
>> So you could have, say, AVX-512 with eight 64-bit floats across, and
>> then switch to Cray in the next generation with sixteen ALUs, and then
>> stick with Cray with thirty-two ALUs in the generation after that.
> <
> Yes but no mater how wide the CRAY implementation was, the register
> file seen by instructions had the same shape.
> <
> Every time AVX grows in width SW gets a different model to program to.
> <
> That is the difference CRAY exposed a common model, AVX exposed
> an ever changing model.

(Hit Send too quickly. :-()
With a two-stage model like Mill/DotNet/JVM you can have vector
operations in your C# source code that get turned into the best/widest
available instruction set during JIT/AOT first run.

This supposedly works today; I haven't tried it yet myself.

Terje

BGB
Feb 17, 2022, 4:24:20 AM
On 2/15/2022 3:09 PM, Thomas Koenig wrote:
> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> I noticed that in another thread I might have seemed to have contradicted myself.
>>
>> So I will clarify.
>>
>> In the near term, in two or three years, I think that it's entirely possible that we
>> will have dies that combine four "GBOoO" performance cores with sixteen in-
>> order efficiency cores, and chips that have four of those dies in a package, to
>> give good performance on both number-crunching and database workloads.
>
> Who actually needs number crunching?
>
> I certainly do, even in the the company I work in (wich is in
> the chemical industry, so rather technical) the number of people
> actually running code which depends on floating point execution
> speed is rather small, probably in the low single digit percent
> range of all employees.
>

Probably depends. IME, integer operations tend to dominate, but in some
workloads, floating point tends to make its presence known.


> That does not mean that floating point is not important :-) but that
> most users would not notice if they had a CPU with, let's say, a
> reasonably efficient software emulation of floating point numbers.
>

Probably depends on "reasonably efficient".
At, say, ~20 cycles, probably only a minority of programs will notice.
At ~500 cycles, probably nearly everything will notice.

One big factor is having integer operations which are larger than the
floating-point values being worked with.


As noted, having an FPU which does ADD/SUB/MUL and a few conversion ops
and similar is for the most part "sufficient" for most practical uses.

Granted, having all the rest could be "better", but is more expensive.


> Such a CPU would look horrible in SPECfp, and the savings from removing
> floating point from a general purpose CPU are probably not that great so
> it is not done, and I think that as an intensive user of floating point,
> I have to be grateful for that.
>

A quick look shows that, at least in my case, the FPU costs less than the L1 D$.

For a 1-wide core, it may be tempting to omit the FPU and MMU for cost
reasons. For a bigger core, omitting them may not be worthwhile if
anything actually uses them.


Or, at least within the limits of an FPU which is cost-cut enough to
where the LSB being correctly rounded is a bit hit or miss.


> Hmm, come to think of it, that is the fist positive thing about
> SPEC that occurred to me in quite a few years...

...

Ivan Godard
Feb 17, 2022, 8:43:22 AM
If you don't really need the bandwidth, but have the pincount in the
socket, can't you get less heat by, say, driving the pins eight at a time
at an eighth of the clock? (Please forgive my HW ignorance.)

Ivan Godard
Feb 17, 2022, 8:52:27 AM
Yes - when the hardware can recognize that a loop is VVMable and doesn't
do it scalar. It's clear that a lot of simple (micro-benchmark) loops
can be recognized with acceptably complex recognizer logic. My concern
about VVM is that the efforts for auto-vectorization (which I do realize
is not the same as VVM) have shown that simple-enough loops are not as
common in the wild as they are in benchmarks.

I think the problem is really a linguistic one: our programming
languages have caused us to think about our programs in control flow
terms, when they are more naturally (and more efficiently) thought of
in dataflow terms.

Hence, streamers.

Ivan Godard
Feb 17, 2022, 8:56:51 AM
That's all Mill has.

Stefan Monnier
Feb 17, 2022, 9:19:23 AM
> Cray instructions in it will take input from memory and put output to
> memory, while putting intermediate results in vector registers... and
> a VVM loop would do exactly the same thing, except using ordinary
> registers.

There is the important difference that Cray vector registers are
significantly bigger than My 66000 registers, so really if you look at
the ISA itself, VVM matches the behavior of those CPUs that had
in-memory vectors rather than vector registers.

IIRC none of those vector processors used caches, so maybe VVM can be
compared to Cray-style in-register vectors except the vector registers
are stored in the L1 cache (tho it might not give quite as much read
bandwidth, admittedly).


Stefan

Quadibloc
Feb 17, 2022, 9:55:57 AM
On Thursday, February 17, 2022 at 6:52:27 AM UTC-7, Ivan Godard wrote:

> I think the problem is really a linguistic one: our programming
> languages have caused us to think about our programs in control flow
> terms, when they are more naturally (and more efficiently) thought of
> in dataflow terms.

That does raise another issue.

It's my opinion that the ENIAC was originally a dataflow computer.

A dataflow machine makes very efficient use of its available ALUs.

That's a good thing. But it doesn't seem to be as general-purpose as
a von Neumann machine; you set it up to process multiple streams
of data in a way somewhat analogous to what punched card machines
did with their data, so it doesn't work well with problems that don't
fit into that paradigm.

I may have already mentioned 'way back that VVM could be considered
a way to sugar-coat the description of a dataflow setup. (And I tend
to prefer making it explicit what you're really telling the computer to do.)

John Savard

MitchAlsup
Feb 17, 2022, 10:17:05 AM
Point of order:
<
The actual calculation units of Opteron were, indeed, smaller than the
data storage area of the DCache (or ICache).
<
But the FP register file of Opteron was larger than FMUL and FADD
combined as DECODE did some renaming to smooth the flow of
calculations.
<
Also note the LD and ST buffers of Opteron were larger than the
storage (64KB 2-banked) of Opteron.
<
So it depends if you mean "L1 D$" as L1-tag, L1-TLB, L1 data,
or whether you mean "L1 D$" as L1-tag, L1-TLB, L1 data, LD buffer, ST buffer.
And we still do not have the area of the miss buffer.
>
> For a 1-wide core, it may be tempting to omit the FPU and MMU for cost
> reasons. For a bigger core, omitting them may not be worthwhile if
> anything actually uses them.
<
Not any more. When you can fit 16 GBOoOs in a chip, an LBIO is 1/12 the size,
and that LBIO already has an FMAC fully pipelined, there is no reason to
leave out this kind of functionality unless your market is "lowest
possible power"; and even then, this decision causes you grief wrt IMUL and IDIV.
<
When you could put ~200 LBIOs on a die, making the LBIO 30% smaller
is like taking the air conditioning and power windows out of your car.

MitchAlsup
Feb 17, 2022, 10:18:53 AM
Sure, heat is essentially proportional to GTs×pins

MitchAlsup
Feb 17, 2022, 10:22:54 AM
On Thursday, February 17, 2022 at 8:19:23 AM UTC-6, Stefan Monnier wrote:
> > Cray instructions in it will take input from memory and put output to
> > memory, while putting intermediate results in vector registers... and
> > a VVM loop would do exactly the same thing, except using ordinary
> > registers.
<
Point of misunderstanding:
<
VVM uses register specifiers to determine data flow within a loop.
It does not actually use GPRs while performing the calculations!
GPRs get written only when there is an interrupt/exception, or at
loop termination based on the "outs" clause of the VEC instruction.
<
This is the relaxation that enables SIMD VVM--not overburdening
the registers themselves.
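As an illustration of that register contract, take an ordinary C reduction:
inside the loop the loads, multiply, and add forward values to one another,
and only the loop-carried state (the accumulator and index here) has to be
materialized in GPRs at loop exit or on an interrupt. (Written in C for
illustration, not My 66000 assembly.)

/* Dot product: under the scheme described above, values inside the loop flow
   operation-to-operation without round-tripping through the register file;
   only `acc` and `i` need to land in GPRs at loop termination or interrupt. */
double dot(long n, const double *x, const double *y)
{
    double acc = 0.0;                  /* live-out of the loop */
    for (long i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}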
<
> There is the important difference that Cray vector registers are
> significantly bigger than My 66000 registers, so really if you look at
> the ISA itself, VVM matches the behavior of those CPUs that had
> in-memory vectors rather than vector registers.
<
This needs a lot of justification before I'll buy it.

BGB
Feb 17, 2022, 2:20:46 PM
Yeah, I don't have FPU registers.

I dropped them initially (along with the original FMOV instructions)
because they added a fair bit of cost.


I have more recently re-added an "FMOV.S" instruction, which in this
form is implemented as a normal Load/Store with Single<->Double
converters glued on. A case could be made for only doing it load-side,
as this is both cheaper and more common.

Otherwise, loading/storing Single values requires using explicit
conversion ops (with the FADD/FMUL units only operating natively on
Double). Well, it is either that, or add (proper) single-precision ops.


I was counting the D$ and TLB separately, because in my case they are
represented as different modules, and the Vivado netlist view also keeps
them separate.


Some modules just disappear though, like the instruction decoder tends
to get absorbed into other things.

Between FADD and FMUL, FADD is larger.
Just the TLB (by itself) is larger than the FADD (in terms of LUTs).


Or, a few costs (from the netlist, aggressively rounded):
FPU : ~ 5k (FADD/FMUL: ~ 2k each, ~ 1k other, *1).
TLB : ~ 3k (4-way, 256 x 4 x 128b)
L1 D$ : ~ 7k
L1 I$ : ~ 2k
ALU : ~ 4k (~ 1.3k per lane)
GPR RF: ~ 2k (64x64b, 6R3W)
...
L2 : ~ 10k (256K, 2-way, 64B cache lines, *2)
DDR : ~ 7k (64B cache lines)
VRAM : ~ 7k (RAM backed, structurally similar to an L1 cache)


*1: This is for Double precision with SIMD.
SIMD seems to have minimal effect on cost.
Enabling GFPX (widens FADD and FMUL to 96 bit, S.E15.M80) does add a
more significant cost increase.

The SIMD operations are internally implemented using pipelining:
Extract element from vector, convert to internal format;
Feed through FADD or FMUL;
Convert back to vector format, store into output vector;
Final result is the output vector.


*2: The size of the L2 cache and DDR PHY vary significantly based on the
use of 16B or 64B cache lines (along the L2<->DRAM interface). The
bigger lines are better for performance as they allow for (more
efficient) burst transfers.

The ringbus continues to operate with 16B logical transfers in either
case (so the L2 is presented to the ringbus in terms of 16B lines).

The RAM in this case is a DDR2 module with a 16-bit bus interface being
run at 50MHz (with 'DLL disabled'). Some other boards use RAM with LPDDR
or QSPI interfaces, but my Verilog code doesn't currently support these (*).


This build was with UTLB disabled, where UTLB adds a smaller 1-way TLB
to the L1 D$. This can (on a hit) allow requests to skip over the main
TLB and go (more directly) onto the L2 ring.

This was not done for the I$, as the I$ miss rate tends to be low enough
(relative to the D$) on average that it isn't likely to gain much.


*: One of the boards I bought, I thought was going to have QSPI RAM, but
the model I got turned out to not have any RAM (the RAM came with the
XC7A35T variant but not the XC7S25 variant).


>>
>> For a 1-wide core, it may be tempting to omit the FPU and MMU for cost
>> reasons. For a bigger core, omitting them may not be worthwhile if
>> anything actually uses them.
> <
> Not any more. When you can fit 16 GBOoOs in a chip and a LBIO is 1/12
> and that LBIO already has a FMAC fully pipelined, there is no reason to
> leave out this kind of functionality unless your market is "lowest
> possible power" and still, this decision causes you grief wrt IMUL and IDIV.
> <
> When you could put ~200 LBIOs in a die, making the LBIO 30% smaller
> is like taking the air conditioning and power windows out of your car.

Yeah.


I have little idea how many BJX2 cores could fit into an ASIC
implementation. In any case, probably a whole lot more than on an Artix-7.

It can make sense mostly for the lower-end Spartan chips (eg: XC7S25),
and is probably necessary for ICE40 based designs.

Stephen Fuld

unread,
Feb 17, 2022, 3:38:42 PM2/17/22
to
On 2/17/2022 11:20 AM, BGB wrote:

big snip

> Or, a few costs (from the netlist, argressively rounded):
>   FPU   : ~ 5k (FADD/FMUL: ~ 2k each, ~ 1k other, *1).
>   TLB   : ~ 3k (4-way, 256 x 4 x 128b)
>   L1 D$ : ~ 7k
>   L1 I$ : ~ 2k

Why does the L1 D$ cache take 3.5X as much as the L1 I$?



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Terje Mathisen

unread,
Feb 17, 2022, 4:37:39 PM2/17/22
to
Stephen Fuld wrote:
> On 2/17/2022 11:20 AM, BGB wrote:
>
> big snip
>
>> Or, a few costs (from the netlist, argressively rounded):
>>   FPU   : ~ 5k (FADD/FMUL: ~ 2k each, ~ 1k other, *1).
>>   TLB   : ~ 3k (4-way, 256 x 4 x 128b)
>>   L1 D$ : ~ 7k
>>   L1 I$ : ~ 2k
>
> Why does the L1 D$ cache take 3.5X as much as the L1 I$?

I$ is effectively read-only, vs D$ which has to handle writes all the time?

BGB

unread,
Feb 17, 2022, 5:06:10 PM2/17/22
to
On 2/17/2022 2:38 PM, Stephen Fuld wrote:
> On 2/17/2022 11:20 AM, BGB wrote:
>
> big snip
>
>> Or, a few costs (from the netlist, argressively rounded):
>>    FPU   : ~ 5k (FADD/FMUL: ~ 2k each, ~ 1k other, *1).
>>    TLB   : ~ 3k (4-way, 256 x 4 x 128b)
>>    L1 D$ : ~ 7k
>>    L1 I$ : ~ 2k
>
> Why does the L1 D$ cache take 3.5X as much as the L1 I$?
>

Probably because it is:
Byte Addressable vs Word Addressable;
Supports a wider fetch (128-bit vs 96-bit);
Supports both Load and Store operations (vs just Fetch);
Also deals with MMIO operations (different behavior on the bus);
...

There is a lot of extra logic needed to deal with memory stores, dirty
cache lines, storing stuff back to RAM, ... which is effectively
irrelevant to the I$.


The I$ has logic to deal with things like instruction length and
instruction bundles, but this seems to be comparably much smaller than
the logic needed by the D$ to deal with things like byte-aligned Load
and Store.

...

Not shown in this stat is Block-RAM cost, where:
L1 D$ is 32K, organized as 1024 x 2 x 16B.
Each 16B line has: 128-bits of data and 108* bits of metadata (tag).
L1 I$ is 16K, organized as 512 x 2 x 16B.
Each 16B line has: 128-bits of data and 72 bits of metadata (tag).

Cache lines are organized as Even/Odd pairs (or A/B), mostly to deal
with Load/Store crossing a 16B boundary. An "Aligned Only" cache could
skip this step (I$ would still need it though to be able to support
variable-length / variable-alignment bundles).

*: In a newer/experimental version of the L1 D$ which adds epoch
flushing and some other tweaks (the prior version was 88 bits). However,
this version still has some bugs and I haven't released the code yet.


The D$ would actually be a lot more expensive if I did not impose an 8B
alignment restriction for 128-bit Load/Store. This would effectively
double the size of the Extract/Insert logic; as the 128b Load/Store
works by bypassing the logic normally used for Extract/Insert of values
64-bits or less.

The Extract/Insert logic fetches 16B from a 8B aligned position, and
then selects a byte-aligned 8B from this, and then zero or sign-extends
it to the requested length (for Load).

For Store, it combines this value with the value being stored (based on
the store width), inserts it (byte aligned) back into the 16B block, and
then builds a set of Store cache-lines which re-insert this block.

For 128-bit store, the stored value replaces the 16B block, which is
then inserted into the store cache lines.

If the request is Store type (and not MMIO), then the newly-updated
cache-lines may be stored back to the cache-line arrays (with the Dirty
Flag being Set). On a Cache Miss, the dirty line will then be sent out on the bus
before requesting the replacement cache line.
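
A plain-C model of that Extract/Insert path (a sketch of the description
above; assumes a little-endian host, and the names and the 0..8 byte
offset range are illustrative, not taken from the actual Verilog):

#include <stdint.h>
#include <string.h>

/* Load path: take the 16B window (two 8B-aligned halves), pick a
   byte-aligned 8B value at 'off' (0..8), then zero- or sign-extend the
   low 'size' bytes (size = 1, 2, 4, or 8). */
static uint64_t dc_extract(const uint8_t win[16], int off, int size, int sx)
{
    uint64_t v = 0;
    memcpy(&v, win + off, 8);                      /* byte-aligned 8B select */
    if (size < 8) {
        uint64_t mask = ((uint64_t)1 << (8 * size)) - 1;
        v &= mask;                                 /* zero-extend */
        if (sx && ((v >> (8 * size - 1)) & 1))
            v |= ~mask;                            /* sign-extend */
    }
    return v;
}

/* Store path: merge the low 'size' bytes of 'val' into the window at
   'off'; the updated window is what goes back into the cache-line pair. */
static void dc_insert(uint8_t win[16], int off, int size, uint64_t val)
{
    memcpy(win + off, &val, (size_t)size);
}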

...

BGB

unread,
Feb 17, 2022, 6:22:30 PM2/17/22
to
On 2/17/2022 3:37 PM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 2/17/2022 11:20 AM, BGB wrote:
>>
>> big snip
>>
>>> Or, a few costs (from the netlist, argressively rounded):
>>>   FPU   : ~ 5k (FADD/FMUL: ~ 2k each, ~ 1k other, *1).
>>>   TLB   : ~ 3k (4-way, 256 x 4 x 128b)
>>>   L1 D$ : ~ 7k
>>>   L1 I$ : ~ 2k
>>
>> Why does the L1 D$ cache take 3.5X as much as the L1 I$?
>
> I$ is effectively read-only, vs D$ which has to handle writes all the time?
>

This is probably most of it.

There is a whole lot of logic which just goes poof and disappears when
the cache is read-only.

Also, in general, the logic in the I$ is simpler than in the D$.


The main unique thing the I$ needs to do is look at the instruction bits
and figure out how long the bundle is.

So, say (15:12):
0111: w32b=1, Wx=(11) //XGPR
1001: w32b=1, Wx=(11) //XGPR
1110: w32b=1, Wx=((11:9)==101) //PrWEX
1111: w32b=1, Wx=(10)
Else: w32b=0, Wx=0

Then look at this across several instruction words,
w32b=0,Wx=0: 16-bit
w32b=1,Wx=0: 32-bit
(w32b=1,Wx=1), (w32b=0,Wx=0): 48-bit (unused)
(w32b=1,Wx=1), (w32b=1,Wx=0): 64-bit
(w32b=1,Wx=1), (w32b=1,Wx=1), (w32b=1,Wx=0): 96-bit

The actual logic for this part, while not particularly concise, doesn't
have a particularly high LUT cost.
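
Spelled out as C (a sketch that just transcribes the table above; field
and function names are illustrative):

#include <stdint.h>

/* Classify one 16-bit instruction word: is it part of a 32-bit word,
   and is its wide/WEX continuation bit set? */
static void classify(uint16_t w, int *w32b, int *wx)
{
    switch ((w >> 12) & 0xF) {
    case 0x7: *w32b = 1; *wx = (w >> 11) & 1;         break; /* XGPR  */
    case 0x9: *w32b = 1; *wx = (w >> 11) & 1;         break; /* XGPR  */
    case 0xE: *w32b = 1; *wx = (((w >> 9) & 7) == 5); break; /* PrWEX */
    case 0xF: *w32b = 1; *wx = (w >> 10) & 1;         break;
    default:  *w32b = 0; *wx = 0;                     break;
    }
}

/* Bundle length in bytes, given the first three 16-bit words. */
static int bundle_len(const uint16_t w[3])
{
    int w32b[3], wx[3];
    for (int i = 0; i < 3; i++)
        classify(w[i], &w32b[i], &wx[i]);

    if (!w32b[0]) return 2;   /* 16-bit */
    if (!wx[0])   return 4;   /* 32-bit */
    if (!w32b[1]) return 6;   /* 48-bit form, listed as unused */
    if (!wx[1])   return 8;   /* 64-bit */
    return 12;                /* 96-bit; third word assumed (w32b=1, Wx=0) */
}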


The I$ doesn't care all that much what the instructions "actually do"
(this part is for the Decode and Execute stages to figure out).
Similarly, the decode stage has to sort out all the stuff with Jumbo
Prefixes (the I$ mostly ignores these, treating them like a special case
of the normal bundle encoding).

...


Some of my other ISA designs (imagined) would have gone over to,
effectively, (15:13):
0zz: 16b
10z: 16b
110: 32b (Wx=0)
111: 32b (Wx=1)

For something like WEX3W, this would allow determining bundle length by
looking at 6-bits:
{ (15:13), (47:45) }
{ (31:29), (63:61) }
...

Though, I have gone back and forth between using the high bits of each
16-bit word (as what BJX2 currently does), or the low order bits of each
word (like RISC-V).

The RISC-V ordering makes more sense if one assumes a (consistent)
little-endian ordering, but the high-bits scheme makes more sense for
laying out the encodings in hexadecimal (even if the resultant encoding
is mixed-endian).

The bit layout doesn't make any real difference to an FPGA though.


Partial limiting factors (that have partly kept BJX2 alive), is the
issue that I can't really fit 6 bit registers, predication, and WEX,
into a 32-bit instruction word, while still having a good amount of
encoding space.

Say:
zzzz-zzzz tttt-ttss ssss-nnnn nnqq-pw11

pw:
00: 32-bit, Unconditional
01: 32-bit, WEX, Unconditional
10: 32-bit, Predicated
11: 32-bit, WEX, Predicated
qq (p=0):
00: Block 0
01: Block 1
10: Block 2
11: Block 3
qq (p=1):
00: Block 0 ?T
01: Block 0 ?F
10: Block 0 ?ST
11: Block 0 ?SF


The opcode space would work out considerably smaller than what I
currently have (to compensate, would need to move over to using 40 or 48
bit instructions, which is also undesirable).

Another option being to gain one or two bits by dropping the possibility
of 16-bit instructions.
zzzz-zzzz tttt-ttss ssss-nnnn nnzz-qqpw
This could be at least roughly break-even in terms of opcode space.



The tradeoff being that, with the current ISA, one has to mix and match
a little:
Predicates + WEX, R0..R31 only.
WEX + R0..R63, no predicates.
Predicates + R0..R63, no WEX

But, did hack it more recently:
Predicates + WEX + R0..R63; 2-wide bundle (2 ops in 96 bits).
(The encoding here is a bit "questionable", but oh well).

Kinda ugly, basically works though...


A lot of this hair magically disappears if one assumes a subset which
only allows R0..R31 (all these extra special-case encodings become
"invalid" and the implementation can ignore their existence).

...

Tim Rentsch

unread,
Mar 6, 2022, 8:46:53 AM3/6/22
to
Ivan Godard <iv...@millcomputing.com> writes:

[...]

> I think the problem is really a linguistic one: our programming
> languages have caused us to think about our programs in control
> flow terms, [...]

Some of us. Not everyone, fortunately.

John Levine

unread,
Mar 6, 2022, 12:50:47 PM3/6/22
to
According to Ivan Godard <iv...@millcomputing.com>:
>I think the problem is really a linguistic one: our programming
>languages have caused us to think about our programs in control flow
>terms, when they are more naturally (and more efficiently) thought of
>in dataflow terms.

I blame von Neumann. Look at the First Draft of a Report on the EDVAC
and it's brutally serial, one step after another, one word at a time
loaded or stored from the iconoscope.

Compare it to the lovely ENIAC where you could plug anything into
anything else, cables permitting, and the data flowed as soon as it
was ready. Yeah, it was a little harder to program and only five
people in the world could do it, but you can't have everything.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Anton Ertl

unread,
Mar 6, 2022, 1:42:08 PM3/6/22
to
John Levine <jo...@taugh.com> writes:
>According to Ivan Godard <iv...@millcomputing.com>:
>>I think the problem is really a linguistic one: our programming
>>languages have caused us to think about our programs in control flow
>>terms, when they are more naturally (and more efficiently) thought of
>>in dataflow terms.
>
>I blame von Neumann. Look at the First Draft of a Report on the EDVAC
>and it's brutally serial, one step after another, one word at a time
>loaded or stored from the iconoscope.
>
>Compare it to the lovely ENIAC where you could plug anything into
>anything else, cables permitting, and the data flowed as soon as it
>was ready. Yeah, it was a little harder to program and only five
>people in the world could do it, but you can't have everything.

You can, and you could a quarter century ago: You specify the code in
a sequential way (so more than five people in the world can program
it), and the magic of OoO execution means that data flows as soon as
it is ready.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Terje Mathisen

unread,
Mar 7, 2022, 6:12:16 AM3/7/22
to
I tend to think much more in data flow terms, as I've written before:

An optimal program is like a creek finding its path down from the
mountain, always searching for the path of least resistance.

Tom Gardner

unread,
Mar 7, 2022, 6:43:35 AM3/7/22
to
On 17/02/22 13:52, Ivan Godard wrote:
> I think the problem is really a linguistic one: our programming languages have
> caused us to think about our programs in control flow terms, when they are  more
> naturally (and more efficiently) thought of in dataflow terms.

Not all languages, of course.

The commercially and technically important class of hardware
description languages (e.g. Verilog, VHDL) have a very large
component of data flowing (streaming) through "wires" connecting
"processes/FSMs" and event based computation. To anyone used
to "thinking in hardware" that, and its associated parallelism
is easy and natural.

Softies find a marked "impedance mismatch" when they encounter
HDLs, and try to "write Fortran" in the HDL. Hence there are
now moves to have more procedural concepts and constructs,
e.g. SystemC. The impedance mismatch is lower, but it is still
easy to create something that is poorly "synthesisable", i.e.
can't be translated into hardware/dataflow.

Marcus

unread,
Mar 7, 2022, 7:48:44 AM3/7/22
to
On 2022-02-15, Thomas Koenig wrote:
> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> I noticed that in another thread I might have seemed to have contradicted myself.
>>
>> So I will clarify.
>>
>> In the near term, in two or three years, I think that it's entirely possible that we
>> will have dies that combine four "GBOoO" performance cores with sixteen in-
>> order efficiency cores, and chips that have four of those dies in a package, to
>> give good performance on both number-crunching and database workloads.
>
> Who actually needs number crunching?
>
> I certainly do, even in the the company I work in (wich is in
> the chemical industry, so rather technical) the number of people
> actually running code which depends on floating point execution
> speed is rather small, probably in the low single digit percent
> range of all employees.
>
> That does not mean that floating point is not important :-) but that
> most users would not notice if they had a CPU with, let's say, a
> reasonably efficient software emulation of floating point numbers.

The problem is that AFAIK there exists no reasonably efficient software
emulation of IEEE-754 floating-point (where "reasonable" would be
something like 10x slower than using an FPU).

I recently spent some time optimizing function argument sanity checks in
an API layer of an embedded software. For instance some arguments were
orthonormal matrices, and the API made a crude check for that. Normally
this is not a problem, but in this case the API layer ran on an ARM M4
without an FPU, and so a single argument check could take more than
0.1 ms, which is a considerable time period in a real time system where
all work must finish in less than 16 ms.

It would be interesting to see an FP standard that is optimized for
software implementation, and/or CPUs with "software FP aid"
instructions (fast handling of NaNs and special cases,
(de)normalization instructions - like CLZ on steroids, etc), so that
common FP operations can be implemented with 5-20 instructions, for
instance.
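
As a rough illustration of the "CLZ on steroids" half of that in plain C
(a sketch only; __builtin_clzll is the GCC/Clang builtin, assumed
available, and it maps to a single CLZ-type instruction where one exists):

#include <stdint.h>

/* Normalize a nonzero significand so its top set bit lands at bit 63,
   adjusting the exponent by the amount shifted. */
static uint64_t fp_normalize(uint64_t frac, int *exp)
{
    int lz = __builtin_clzll(frac);   /* undefined for frac == 0 */
    *exp -= lz;
    return frac << lz;
}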

/Marcus

JimBrakefield

unread,
Mar 7, 2022, 8:52:14 AM3/7/22
to
I call HDL writers "time lords" as they can operate across both time and space.

Ivan Godard

unread,
Mar 7, 2022, 10:03:51 AM3/7/22
to
+1!

Terje Mathisen

unread,
Mar 7, 2022, 11:43:51 AM3/7/22
to
40-50 clock cycles for FADD/FMAC is just about doable.
>
> I recently spent some time optimizing function argument sanity checks in
> an API layer of an embedded software. For instance some arguments were
> orthonormal matrices, and the API made a crude check for that. Normally
> this is not a problem, but in this case the API layer ran on an ARM M4
> without an FPU, and so a single argument check could take more than
> 0.1 ms, which is a considerable time period in a real time system where
> all work must finish in less than 16 ms.

Could you do the checking in integer domain, or would that be meaningless?
>
> It would be interesting to see an FP standard that is optimized for
> software implementation, and/or CPU:s with "software FP aid"
> instructions (fast handling of NaN:s and special cases,
> (de)normalization instructions - like CLZ on steroids, etc), so that
> common FP operations can be implemented with 5-20 instructions, for
> instance.

We did quite a bit of work on sw/hw fp codesign for an absolute minimum
Mill; you are correct that it helps to have a small number of hw
helpers. In particular you need a saturating shift right in order to
implement the sticky bit for proper rounding.

It also helps to have a fast classifier for one or two inputs,
optionally sorting a pair of inputs by magnitude (needed for FADD/FSUB).
You do all the special cases in parallel with the normal calculations,
then select either that normal result or the special value at the very end.

Doing it this way you can get into a 5x slower speed ballpark.
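
A pure-software stand-in for that saturating/sticky shift might look like
this (a sketch only, not the actual Mill operation):

#include <stdint.h>

/* Right-shift x by s while OR-ing every shifted-out bit into bit 0
   (the "sticky" bit), with the shift count saturated so that s >= 64
   collapses the whole value into sticky. */
static uint64_t shr_sticky(uint64_t x, unsigned s)
{
    if (s == 0)  return x;
    if (s >= 64) return x != 0;
    uint64_t lost = x << (64 - s);      /* bits that fall off the end */
    return (x >> s) | (lost != 0);
}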

MitchAlsup

unread,
Mar 7, 2022, 12:43:54 PM3/7/22
to
On Monday, March 7, 2022 at 5:43:35 AM UTC-6, Tom Gardner wrote:
> On 17/02/22 13:52, Ivan Godard wrote:
> > I think the problem is really a linguistic one: our programming languages have
> > caused us to think about our programs in control flow terms, when they are more
> > naturally (and more efficiently) thought of in dataflow terms.
> Not all languages, of course.
>
> The commercially and technically important class of hardware
> description languages (e.g. Verilog, VHDL) have a very large
> component of data flowing (streaming) through "wires" connecting
> "processes/FSMs" and event based computation. To anyone used
> to "thinking in hardware" that, and its associated parallelism
> is easy and natural.
>
> Softies find a marked "impedance mismatch" when they encounter
> HDLs, and try to "write Fortran" in the HDL.
<
Many Verilog "subroutines" can have their "assignment statements"
placed backwards and the subroutine still does the same job!
You HAVE to give up the von Neumann paradigm that one thing
happens, then another and another--they all happen based on
dependencies (some of which are not "in scope".)

MitchAlsup

unread,
Mar 7, 2022, 12:48:46 PM3/7/22
to
All you need is a native integer 2× as wide as the FP you simulate,
Find First 1 over 2×, and 2× Shifts, and some rounding instruction
also over 2×, and extract/insert exponents and fractions.
<
The problem is there is no 2× designed to emulate FP; they are all
designed to emulate 2× integers.
>
> /Marcus

Tim Rentsch

unread,
Mar 8, 2022, 7:58:20 AM3/8/22
to
Terje Mathisen <terje.m...@tmsw.no> writes:

> Tim Rentsch wrote:
>
>> Ivan Godard <iv...@millcomputing.com> writes:
>>
>> [...]
>>
>>> I think the problem is really a linguistic one: our programming
>>> languages have caused us to think about our programs in control
>>> flow terms, [...]
>>
>> Some of us. Not everyone, fortunately.
>
> I tend to think much more in data flow terms, as I've written before:
>
> An optimal program is like a creek finding its path down from the
> mountain, always searching for the path of least resistance.

It's an interesting simile. Note by the way that creeks are
greedy algorithms, in the sense of being only locally optimal,
not necessarily globally optimal.

To me the more interesting question is how does this perspective
affect how the code looks? If reading your programs, would I see
something that looks pretty much like other imperative code, or
would there be some distinguishing characteristics that would
indicate your different thought mode? Can you say anything about
what those characteristics might be? Or perhaps give an example
or two? (Short is better if that is feasible.)

Quadibloc

unread,
Mar 9, 2022, 2:33:02 AM3/9/22
to
On Sunday, March 6, 2022 at 10:50:47 AM UTC-7, John Levine wrote:
> According to Ivan Godard <iv...@millcomputing.com>:
> >I think the problem is really a linguistic one: our programming
> >languages have caused us to think about our programs in control flow
> >terms, when they are more naturally (and more efficiently) thought of
> >in dataflow terms.

> I blame von Neumann. Look at the First Draft of a Report on the EDVAC
> and it's brutally serial, one step after another, one word at a time
> loaded or stored from the iconoscope.

> Compare it to the lovely ENIAC where you could plug anything into
> anything else, cables permitting, and the data flowed as soon as it
> was ready. Yeah, it was a little harder to program and only five
> people in the world could do it, but you can't have everything.

After the ENIAC, though, eventually more people learned how to
program in dataflow terms. Because that's how analog computers
were programmed.

And what about punched-card accounting machines?

Originally, the victory of the von Neumann machine came about
for a very simple reason: such a machine required far less hardware.
One ALU for a program of any length or complexity, instead of as many
arithmetic stages as there were steps in the problem.

But the other problem is that a lot of uses were found for computers.
Dataflow seems to be good for one thing: turning a stack of input numbers
into a stack of output numbers. Now that programs are written to work in
a GUI instead of from a command line, however, a paradigm that's at least
analogous to dataflow is needed on the highest level, instead of just in
the innermost loops.

Also, while a computer could be designed to be a dataflow engine on
one level, there is the brutal slowness of the interface to main memory,
to external DRAM.

So it would be simple enough to design a dataflow processor. You
put a pile of ALUs on a chip. You have a lot of switchable connections
linking them. And you also put a pile of separate memories in the chip,
so that you can link multiple inputs and outputs to the computation.

Some of those memories could be used as look-up tables instead of
input or output hoppers.

The problem is: how useful would such a processor be? Assume it
shares a die with a conventional processor, which is used to set it up
for problems. How often will setting it up for a problem be faster than
just having the serial processor do the problem?

I have tried to think of a way to define a dataflow instruction for
what is otherwise a conventional processor. Basically, the instruction
performs a vector operation, but with multiple opcodes, with all the
data forwarding specified.

John Savard

Terje Mathisen

unread,
Mar 9, 2022, 6:29:12 AM3/9/22
to
First example: Use (very) small lookup tables to get rid of branches,
typically trying to turn code state machines into data state machines.
This is driven by the fact that modern CPUs are much better at dependent
loads than unpredictable branches.
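
A minimal illustration of that table-driven style (a sketch, not code
from any of the projects mentioned):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Count whitespace-separated words with a 2-state machine driven by a
   256-entry table: no data-dependent branches in the inner loop. */
static size_t count_words(const char *s, size_t n)
{
    uint8_t in_word[256];
    memset(in_word, 1, sizeof in_word);
    in_word[' '] = in_word['\t'] = in_word['\n'] = in_word['\r'] = 0;

    size_t count = 0;
    unsigned state = 0;                           /* 0 = in gap, 1 = in word */
    for (size_t i = 0; i < n; i++) {
        unsigned next = in_word[(uint8_t)s[i]];   /* dependent load     */
        count += next & ~state & 1;               /* count gap -> word  */
        state = next;
    }
    return count;
}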

I typically try to write code that can run in SIMD mode, i.e. using
tables/branchless/predicated ops to handle any single-lane alternate paths.

Such code can also much more often scale across multiple cores/systems
even if that means that I have to do some redundant work across the
boundaries. I.e. when I process lidar data I can make the problem almost
embarrassingly parallelizable by splitting the input into tiles with
35-50m overlap: This is sufficient to effectively eliminate all edge
artifacts when I generate contours and vegetation classifications.

The most fun is when I find ways to remove all internal branching
related to exceptional data, i.e. anything which can impede that nicely
flowing stream of water going downhill.

BTW, looking for both local and global optimal paths/solutions is very
similar to what we do when competing in orienteering: What is my running
speed going to be along these paths vs going cross country? How is that
affected by the density of the vegetation and the amount of elevation
gain/loss? How much more time will I need to spend on the navigation
part when taking a more risky direct route without obvious intermediate
features to use for feedback? (In the latter case I also need to
incorporate the expected loss of time by not hitting the control
perfectly and then having to search and/or recover my position by
running to the nearest large/obvious feature.)

Thomas Koenig

unread,
Mar 9, 2022, 6:36:57 AM3/9/22
to
Quadibloc <jsa...@ecn.ab.ca> schrieb:

> So it would be simple enough to design a dataflow processor. You
> put a pile of ALUs on a chip. You have a lot of switchable connections
> linking them. And you also put a pile of separate memories in the chip,
> so that you can link multiple inputs and outputs to the computation.
>
> Some of those memories could be used as look-up tables instead of
> input or output hoppers.

Sounds like an FPGA to me.

> The problem is: how useful would such a processor be? Assume it
> shares a die with a conventional processor, which is used to set it up
> for problems. How often will setting it up for a problem be faster than
> just having the serial processor do the problem?

> I have tried to think of a way to define a dataflow instruction for
> what is otherwise a conventional processor. Basically, the instruction
> performs a vector operation, but with multiple opcodes, with all the
> data forwarding specified.

Use an HDL.

If you want, you can have a softcore for your CPU and define special
instructions for your special needs. I don't think it is easy
to modify the FPGA programming on the fly.

JimBrakefield

unread,
Mar 9, 2022, 9:24:28 AM3/9/22
to
It is possible to modify block RAM on the fly,
and it can be used to contain micro-code.

Quadibloc

unread,
Mar 9, 2022, 9:47:46 AM3/9/22
to
On Wednesday, March 9, 2022 at 4:36:57 AM UTC-7, Thomas Koenig wrote:

> If you want, you can have a softcore for your CPU and define special
> instructions for your special needs. I don't think it is easy
> to modify the FPGA programming on the fly.

It depends what FPGA you use. Some use fusible links, others use
logic gates that are controlled by data stored in flip-flops. I'm thinking of
something a bit less flexible, but with wiring that is somewhat like that in
the latter type of FPGA. Thus, one doesn't have components to replace
arbitrary Boolean logic, one just makes connections between adders,
multipliers, and dividers.

Over the course of a single program, the dataflow logic would likely
get set up differently several times, because a computer program
doesn't normally consist of just one inner loop.

The idea is _not_ to have a computer designed for one "special need",
but to have a computer that does what computers do now, just faster,
due to the use of dataflow circuitry - designed to do _better_ than just
deducing data flows from regular code with a limited window, as done
in out-of-order designs.

However, I have my doubts that this idea can be made to work. Regular
FPGAs, as you suggest, do have their uses. I don't quarrel with that,
but I am looking to satisfy a different use case.

John Savard

Brian G. Lucas

unread,
Mar 9, 2022, 11:03:53 AM3/9/22
to
For an attempt at a dataflow co-processor to an Arm-9, see
<www.microarch.org/micro36/html/pdf/may-ReconfigStreaming.pdf>
I was the lead architect. It obviously was not successful but we
did have working silicon. The memory system we had to use was way
too slow to keep up.

brian



Ivan Godard

unread,
Mar 9, 2022, 11:23:00 AM3/9/22
to
Sounds a lot like the data plane of a network control processor.

MitchAlsup

unread,
Mar 9, 2022, 1:00:23 PM3/9/22
to
The problem with data flow is managing to constrain the width of
execution congruent with the width of the resources. While conventional
machines struggle to find parallelism, data flow machines struggled to
bound parallelism so as to avoid overflowing the "reservation stations"
of the machine.

BGB

unread,
Mar 9, 2022, 3:17:47 PM3/9/22
to
On 3/9/2022 8:47 AM, Quadibloc wrote:
> On Wednesday, March 9, 2022 at 4:36:57 AM UTC-7, Thomas Koenig wrote:
>
>> If you want, you can have a softcore for your CPU and define special
>> instructions for your special needs. I don't think it is easy
>> to modify the FPGA programming on the fly.
>
> It depends what FPGA you use. Some use fusible links, others use
> logic gates that are controlled by data stored in flip-flops. I'm thinking of
> something a bit less flexible, but with wiring that is somewhat like that in
> the latter type of FPGA. Thus, one doesn't have components to replace
> arbitrary Boolean logic, one just makes connections between adders,
> multipliers, and dividers.
>

Most are SRAM based, some are Flash based.

Fusible links seems to be more a thing for one-time-programmable PAL and
CPLD devices.



> Over the course of a single program, the dataflow logic would likely
> get set up differently several times, because a computer program
> doesn't normally consist of just one inner loop.
>
> The idea is _not_ to have a computer designed for one "special need",
> but to have a computer that does what computers do now, just faster,
> due to the use of dataflow circuitry - designed to do _better_ than just
> deducing data flows from regular code with a limited window, as done
> in out-of-order designs.
>
> However, I have my doubts that this idea can be made to work. Regular
> FPGAs, as you suggest, do have their uses. I don't quarrel with that,
> but I am looking to satisfy a different use case.
>

I guess it could be possible to make an FPGA that, instead of being
built around 1-bit interconnections, is mostly 4 and 8 bit
interconnections (Say, 4-bit mostly for local connections, and 8-bit for
longer-distance pathways).


This could potentially be used to make the routing cheaper, though would
be much better suited for larger coarse-grain signals, rather than
fine-grain signals.


So, LUT elements are primarily 8->4 bit, ADD units are 8+8=>8 (With
CarryIn/CarryOut).

Adders could be in groups of 4, so that they can use 4b inputs/outputs
for carry signaling, with "bit-transform" units placed in the "carry-in
/ carry-out" paths to allow readily feeding the carry signals back into
to other adders within the same unit (this unit being usable as a
32+32->32 ADD, with 1-bit left for carry-in / carry-out).

Ideally, the unit could also be configured for SUB/AND/OR/XOR as well.



Would be less efficient in some ways: things like bit-shuffling, etc,
would now require using LUTs, and emulating smaller LUTs would be
inefficient.

It is likely that, along with LUT and FF elements, one would also need
bit-transform elements 2x4->2x4 with a configurable mapping between the
input and output bits, likely glued onto the output side of each LUT
(between the LUT and a 4b FF).

The bit-transform elements could allow for more efficiently emulating
smaller LUTs, like 4->1 or 4->2 (and would operate in a similar way to
an SSE shuffle with two inputs and two outputs).


Would likely need to widen DSPs to 20-bit inputs (A20*B20+C48), or
narrow to 16 (A16*B16+C40).

Or, maybe a DSP unit that could be configured as, say:
Two 16 or 20 bit multiply MUL or MAC units;
One 32-bit MUL;
One 32-bit FMUL or FMAC.


Though, another option would be to use a mixed approach, with mostly
1-bit local routing (would function more like in existing FPGAs), but
then use coarser grain 4 and 8 bit signaling for moving signals between
distant parts of the FPGA.

Say, for example, each group of 16 or 64 CLBs has internal 1-bit
routing, with a 4-bit interconnect to other groups of CLBs.



Some of this is based on the thought that larger multi-bit signals and
operations tend to be a lot more common than 1-bit operations, so that some
loss of efficiency for 1-bit operations could be acceptable.

...

MitchAlsup

unread,
Mar 9, 2022, 3:36:39 PM3/9/22
to
Adders should be built in units of 9-bits. The "extra bit" can be used to
clip-carry (inputs = 0 and 0), propagate carry (1 0 or 0 1), or insert carry
(1 1). Thus, you can perform SIMD arithmetic in a std integer adder.
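
The same idea has a well-known software analogue (the SWAR trick); a
sketch of an 8 x 8-bit packed add that clips carries at lane boundaries,
not the 9-bit hardware scheme itself:

#include <stdint.h>

/* Add eight packed bytes per 64-bit word without letting carries cross
   lane boundaries: hold the lane MSBs out of the add, then patch them
   back in with XOR so no carry-out propagates into the next lane. */
static uint64_t paddb(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ull;   /* per-lane MSBs */
    uint64_t sum = (x & ~H) + (y & ~H);
    return sum ^ ((x ^ y) & H);
}
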
>
> Ideally, the unit could also be configured for SUB/AND/OR/XOR as well.
>
>
>
> Would be less efficient in some ways, things like bit-shuffling, etc,
> would now require using LUTs, and emulating smaller LUTs would be
> inefficient.
>
> It is likely that, along with LUT and FF elements, one would also need
> bit-transform elements 2x4->2x4 with a configurable mapping between the
> input and output bits, likely glued onto the output side of each LUT
> (between the LUT and a 4b FF).
<
4×4 or 8×8 bit-matrix-multiply

BGB

unread,
Mar 9, 2022, 4:01:55 PM3/9/22
to
Granted, this does carry the unverified assumption that 4b or 8b paths
would be cheaper than a bunch of independent 1-bit paths.


The number of signal wires would be similar, as would the number of
transistors needed for the crossbar itself. It would mostly affect the
the cost of configuring the crossbar (eg: controlling 4 or 8 MOSFETs
with a single input signal).

It would mostly reduce the number of SRAM cells, and the number of
signals being used in the crossbar for configuring said SRAM cells.


>>
>> This could potentially be used to make the routing cheaper, though would
>> be much better suited for larger coarse-grain signals, rather than
>> fine-grain signals.
>>
>>
>> So, LUT elements are primarily 8->4 bit, ADD units are 8+8=>8 (With
>> CarryIn/CarryOut).
>>
>> Adders could be in groups of 4, so that they can use 4b inputs/outputs
>> for carry signaling, with "bit-transform" units placed in the "carry-in
>> / carry-out" paths to allow readily feeding the carry signals back into
>> to other adders within the same unit (this unit being usable as a
>> 32+32->32 ADD, with 1-bit left for carry-in / carry-out).
> <
> Adders should be built in units of 9-bits. The "extra bit" can be used to
> clip-carry (inputs = 0 and 0), propagate carry (1 0 or 0 1), or insert carry
> (1 1). Thus, you can perform SIMD arithmetic in a std integer adder.

Issue is 9b would not fit into a pathway built on 4-bit interconnects.

But, ideally, one wants to minimize the number of 4-bit paths being used
to carry 1 or 2 bit signals, hence the idea of a bit-transform.


>>
>> Ideally, the unit could also be configured for SUB/AND/OR/XOR as well.
>>
>>
>>
>> Would be less efficient in some ways, things like bit-shuffling, etc,
>> would now require using LUTs, and emulating smaller LUTs would be
>> inefficient.
>>
>> It is likely that, along with LUT and FF elements, one would also need
>> bit-transform elements 2x4->2x4 with a configurable mapping between the
>> input and output bits, likely glued onto the output side of each LUT
>> (between the LUT and a 4b FF).
> <
> 4×4 or 8×8 bit-matrix-multiply

Yeah, could probably be done this way.

An 8x8 multiply could do it in one step, but would need 64 control bits.
In effect it reduces to a bunch of AND and OR operations.

The number of control bits would be higher with this than doing it SSE
style, but this could also implement a few simple logic operations
between groups of bits.

...

Tim Rentsch

unread,
Mar 9, 2022, 4:08:42 PM3/9/22
to
Interesting. Thank you for the examples.

> The most fun is when I find ways to remove all internal branching
> related to exceptional data, i.e. anything which can impede that
> nicely flowing stream of water going downhill.

In that sentence the word "which" should be "that". The reason is
"that" is restrictive, whereas "which" in non-restrictive. In
other words you mean the rest of the sentence after "which" to
constrain the word "anything", i.e., to restrict what is covered.
This distinction is important in technical writing; unfortunately
even native speakers sometimes get it wrong.

MitchAlsup

unread,
Mar 9, 2022, 4:16:03 PM3/9/22
to
A pair of 4-bit channels with a 1-bit separator makes one 9-bit channel.
A single 8-bit channel with a 1-bit separator makes 1 9-bit channel.
<
MIPS R3000 used this and you can see it in the die photo. The other parts
of the channel are used for Vdd/Gnd ties and on-die capacitors. We contemplated
using a channel such as this for signal repeaters (double inverters back to back)
for wires jumping over the "data path" at 35u.
> >>
> >> Ideally, the unit could also be configured for SUB/AND/OR/XOR as well.
> >>
> >>
> >>
> >> Would be less efficient in some ways, things like bit-shuffling, etc,
> >> would now require using LUTs, and emulating smaller LUTs would be
> >> inefficient.
> >>
> >> It is likely that, along with LUT and FF elements, one would also need
> >> bit-transform elements 2x4->2x4 with a configurable mapping between the
> >> input and output bits, likely glued onto the output side of each LUT
> >> (between the LUT and a 4b FF).
> > <
> > 4×4 or 8×8 bit-matrix-multiply
> Yeah, could probably be done this way.
>
> An 8x8 multiply could do it in one step, but would need 64 control bits.
> In effect it reduces to a bunch of AND and OR operations.
>
> The number of control bits would be higher with this than doing it SSE
> style, but this could also implement a few simple logic operations
> between groups of bits.
<
You started with configurable mappings. Many of these use constants
as one input of the BMM.

Terje Mathisen

unread,
Mar 10, 2022, 8:08:16 AM3/10/22
to
The core idea there is that it can often make sense to do extra work as
long as that removes the cost of maintaining exact boundaries. A similar
idea was the building block for my julian day to Y-M-D conversion:

The math to do so is quite complicated, so instead I make a very quick
approximate guess and do the reverse (very fast!) calculation which ends
with a branchless adjustment if the guess was off-by-one.
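
A minimal sketch of that guess-then-fix-up shape for the year part
(illustrative only, not the actual code; proleptic Gregorian, day 0 =
0001-01-01; the fix-up is two branchless compares because the linear
guess can be off by one in either direction):

#include <stdint.h>

/* Days before the start of year y (y >= 1), proleptic Gregorian. */
static int64_t days_before_year(int64_t y)
{
    int64_t p = y - 1;
    return 365 * p + p / 4 - p / 100 + p / 400;
}

/* Day number -> year: guess from the 400-year average (146097 days),
   then nudge by at most one step each way. */
static int64_t year_from_days(int64_t d)      /* d >= 0 */
{
    int64_t y = (d * 400) / 146097 + 1;       /* quick approximate guess */
    y -= (d <  days_before_year(y));          /* guess was one too high  */
    y += (d >= days_before_year(y + 1));      /* guess was one too low   */
    return y;
}
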
>
>> The most fun is when I find ways to remove all internal branching
>> related to exceptional data, i.e. anything which can impede that
>> nicely flowing stream of water going downhill.
>
> In that sentence the word "which" should be "that". The reason is
> "that" is restrictive, whereas "which" in non-restrictive. In
> other words you mean the rest of the sentence after "which" to
> constrain the word "anything", i.e., to restrict what is covered.
> This distinction is important in technical writing; unfortunately
> even native speakers sometimes get it wrong.
>

OK, that is subtle but noted!

I'll hide behind my standard "I only started to learn English in 5th
grade" excuse. :-)

My own kids started in 1st or 2nd grade, but I'll claim that they became
fluent over a couple of years when I read Harry Potter to them every
evening.

Tim Rentsch

unread,
Mar 10, 2022, 11:11:30 AM3/10/22
to
I am intrigued. Would you mind sharing the code (either posting
or sending to me via email, whichever you think more appropriate)?

>>> The most fun is when I find ways to remove all internal branching
>>> related to exceptional data, i.e. anything which can impede that
>>> nicely flowing stream of water going downhill.
>>
>> In that sentence the word "which" should be "that". The reason is
>> "that" is restrictive, whereas "which" in non-restrictive. In
>> other words you mean the rest of the sentence after "which" to
>> constrain the word "anything", i.e., to restrict what is covered.
>> This distinction is important in technical writing; unfortunately
>> even native speakers sometimes get it wrong.
>
> OK, that is subtle but noted!
>
> I'll hide behind my standard "I only started to learn English in 5th
> grade" excuse. :-)

No excuse needed. I routinely give such comments to people whose
first language is other than English, not with the intention of
faulting a bad usage but with the idea of helping them better
learn the subtleties and bear traps of English. If anyone should
give an excuse, I should, because my own other-language skills are
so limited. I admire anyone who speaks more than one language, as
I have found it very very difficult to do so.

Bill Findlay

unread,
Mar 10, 2022, 11:55:16 AM3/10/22
to
On 10 Mar 2022, Terje Mathisen wrote
(in article <t0ct7p$muk$1...@gioia.aioe.org>):
You have no need of an excuse, because your original wording was correct,
and is exactly how most British English users would put it.

Rentsch is wrong to think that using 'that' means anything different.
His claim comes straight out of Strunk & White's usage guide,
which has been debunked by English grammar experts, such as Geoff Pullum.

I personally prefer 'that' to 'which' in restrictive clauses,
but that is a matter of aesthetics, not grammatical correctness.
--
Bill Findlay


Terje Mathisen

unread,
Mar 11, 2022, 6:30:02 AM3/11/22
to
This is actually more interesting than I immediately thought:

First, is there a difference between US and UK English here?

Looking at it now, to me using "anything which can" vs "anything that
can" does have those two different meanings: The first one is in fact
more inclusive than the second, i.e. "which" also includes stuff that
only incidentally or as a side effect causes this, while "that" implies
that it is more of a primary result.

This was actually what I was trying to express.

Regards,

Terje Mathisen

unread,
Mar 11, 2022, 6:38:24 AM3/11/22
to
Thanks, see my previous response to Tim. :-)

Michael S

unread,
Mar 11, 2022, 7:27:57 AM3/11/22
to
Not being an expert of English English or of American English or of any other English variation...
In this particular case I'd use 'which' rather than 'that', because I don't like the look of the same word 'that' appearing twice just two words apart, in different meanings, which probably makes things worse.

There is another reason for me to often prefer 'which' over 'that': it's OK to put a comma before 'which', but a comma before 'that' is considered bad grammar.
I like commas, but sometimes (it depends on context) dislike annoying readers with bad grammar.

Michael S

unread,
Mar 11, 2022, 7:38:51 AM3/11/22
to
Thinking about it, not "probably": different meanings certainly make things worse.
Repeated (recurring?) 'that' in the same meaning, like that, is OK with me:

This is the farmer sowing his corn,
That kept the cock that crow'd in the morn,
That waked the priest all shaven and shorn,
That married the man all tatter'd and torn,
That kissed the maiden all forlorn,
That milk'd the cow with the crumpled horn,
That tossed the dog,
That worried the cat,
That killed the rat,
That ate the malt
That lay in the house that Jack built.

David Brown

unread,
Mar 11, 2022, 7:53:36 AM3/11/22
to
The point of language is to communicate. Unless you can be absolutely
sure that the reader will distinguish between "which" and "that",
interpreting the difference exactly as you intended, then you cannot
communicate any semantic information by making the distinction. So if
you need to communicate this information to a wider audience, you have
to do it in a different way - expressing yourself in a different manner
or adding more text.

I am a native English speaker, well versed in technical writing, and
with an above-average interest in and knowledge of grammar. (AFAIK Tim
has far more experience in technical writing than me, however.) I would
not infer anything from the use of "which" or "that" in this context, I
would not assume any writer intended any implication, and I would not
assume any reader would infer anything from it.

Language changes over time, it varies from place to place (though I
don't think this particular case is an American versus British issue),
and it varies with context. Little-used and subtle implications do not
last.

Of course it is best to pick the most appropriate word at the time. But
that does not necessarily mean picking the word that has, in some
circles, particular undertones. "Which" can well be considered a
better choice to avoid the duplication of the word "that" (as Michael
mentioned). Personally, I'd probably write "anything that can impede
the nicely flowing stream".

Oh, and I'd have a comma after "i.e." :-)

(Despite knowing full well that typesetting rules, like grammar, change
over time, and that small details rarely matter, I still find myself
mentally nit-picking typographical errors. Perhaps it is because I read
"The TeXbook" many years ago.)

Michael S

unread,
Mar 11, 2022, 8:14:38 AM3/11/22
to
It depends.
For many, if not for most, posters on comp.arch, the main point is to express themselves.
Communication is secondary.



Ivan Godard

unread,
Mar 11, 2022, 1:37:25 PM3/11/22
to
Even long-standing language notions can die. I use both the conditional -
"if I were a rich man..." instead of "if I was a rich man..." - and
distinguish "shall" from "will" - simple future instead of intentional.
Both the conditional and the use of shall have largely disappeared in my
lifetime.

But then, I've sometimes been told I have a somewhat antiquated style :-)


Tim Rentsch

unread,
Mar 11, 2022, 1:58:27 PM3/11/22
to
Apparently Mr Findlay is offering himself as an authority on English
grammar.

I am not offering myself as an authority. I mean only to report
what I have heard and read elsewhere. If I remember correctly, I
first heard this rule from a tech writer where I was working many
years ago. Some years later I did read The Elements of Style,
although I don't have any specific memory of it mentioning this
distinction (but of course it almost certainly did). Other tech
writers I have talked to over the years have concurred on the
that/which distinction.

Consulting a handful of online dictionaries for "that", all agreed
on "that" clauses being restrictive or defining.

Consulting a handful of online dictionaries for "which", the results
were divided. Some clearly said that "which" is non-restrictive,
some didn't say, and at least one said "which" can be used in place
of "that".

It's possible that there is some correlation between the different
views and the question of American English and British English. I
did not investigate that question. (I speak and write American
English, although at times I may inadvertently employ some British
English phrasing without realizing it.)

A web search for "that vs which" (without the quotes) turned up a
lot of hits (I don't know how many). Looking at the first page of
results (I use duckduckgo as my search engine), all agreed on the
principle of using "that" for restrictive or defining clauses and
using "which" for non-restrictive clauses. I think the mildest
support was in a page on grammarbook.com; it said this:

NOTE: We feel that maintaining the distinction between that
and which in essential and nonessential phrases and clauses
is useful, even though the principle is sometimes disregarded
by experienced writers.

After seeing all this, I offer the following advice.

If you mean to introduce a restrictive or defining clause, always
use "that", because it's never wrong, and will never confuse a
reader about what is meant.

If you mean to introduce a non-restrictive clause, which gives
only additional information and does not otherwise affect the
noun or phrase being referenced, use "which". To emphasize the
non-restrictiveness, commas can be used before and after the
"which" clause. Using both "which" and delimiting commas is
also, AFAICT, never wrong, and the commas in particular should
remove any confusion about the intended meaning not being
restrictive.

If you want to allow some ambiguity or the possibility of confusion
or misunderstanding, feel free to use "which" in place of "that"
in introducing a restrictive clause. In my own writing I usually
try to remove as much ambiguity, and also potential confusion or
misunderstanding, as I can. But different writers have different
preferences.

Final thought: I encourage anyone who is interested to consult
any dictionaries, style guides, essays on that vs which, and so
forth, that they can easily find, to get a broader sense of the
question.

Tim Rentsch

unread,
Mar 11, 2022, 2:02:14 PM3/11/22
to
Michael S <already...@yahoo.com> writes:

> For many, if not for most, posters on comp.arch, the main point is
> to express themselves. Communication is secondary.

They want to express themselves but don't care if they are
communicating? A profound idea. Maybe that explains my
reactions to many postings.

Tim Rentsch

unread,
Mar 11, 2022, 2:16:26 PM3/11/22
to
I suspect there is, but I don't have any direct evidence for that
conclusion.

More generally, I have just responded to Mr Findlay's comments,
and you may find my comments there to be of interest.

> Looking at it now, to me using "anything which can" vs "anything that
> can" does have those two different meanings: The first one is in fact
> more inclusive than the second, i.e. "which" also includes stuff that
> only incidentally or as a side effect causes this, while "that" implies
> that it is more of a primary result.

To me the key word is "anything". You don't mean literally anything,
but "anything subject to the limitations of the following clause" (at
least, that's what I think you mean). So I'm not sure what difference
you are trying to explain here.

> This was actually what I was trying to express.

Which one, the first or the second? (And I'm still not sure what you
think the difference is between them.)

Bill Findlay

unread,
Mar 11, 2022, 4:57:50 PM3/11/22
to
On 11 Mar 2022, Tim Rentsch wrote
(in article <86h784b...@linuxsc.com>):
To the contrary, I defer to the expertise of Professor Geoff Pullum,
one of the authors of "The Cambridge Grammar of the English Language".
See, for example: <https://languagelog.ldc.upenn.edu/nll/?p=4357>

> I am not offering myself as an authority.

!

--
Bill Findlay


George Neuner

unread,
Mar 11, 2022, 8:32:10 PM3/11/22
to
On Wed, 9 Mar 2022 11:36:53 -0000 (UTC), Thomas Koenig
<tko...@netcologne.de> wrote:

>If you want, you can have a softcore for your CPU and define special
>instructions for your special needs. I don't think it is easy
>to modify the FPGA programming on the fly.

With the right architecture, it can be relatively easy.

First, small to medium sized FPGA can be reconfigured very rapidly:
for many parts that are likely to be used as 'accelerator' co-processors
it takes less than a millisecond for a full reconfiguration.

Second, most FPGAs large enough to support soft cores now support
partial reconfiguration while running. Essentially you can load a
single 'full' binary or some (model dependent) number of 'slice'
binaries which allow you to reprogram one area of the chip while other
areas continue to execute. The details obviously are model specific,
and, of course, the external hardware must be able to drive the chip
for slice programming if that is desired.



In the late 90s I worked on software support for a prototype FPGA
based array processor. It was conceived originally as an image
processing accelerator, but the implementation ended up general enough
to be used as a 'compute server' suitable for more general array
processing.

The board had a 133MHz PCI bus interface, from 1 to 4 'compute element'
FPGAs, a programmable interconnect that enabled using the FPGAs
individually or in any combination, a high-speed banked DRAM having
2-dimensional (rectangular block) addressing ability that was able to
support 32-bit 2R/1W access simultaneously from all FPGAs at full
speed, and a DSP that provided board control, execution of user
scripts (described below), and general FP support.
[Larger FPGAs of that era could implement simple FP functions and even
some compositions of simple functions, but complex FP pretty much was
out of the question - so the DSP was necessary. The board was
designed with the FPGAs as plug-in modules to accommodate expansion and
(to a point) switching to larger FPGAs as they became affordable.]

Rather than allow arbitrary user FPGA programs, we implemented a
library of FPGA based array processing functions and a scripting
language to use them. The compiler/runtime directed allocations in
various memories: DRAM, FPGA parameter and scratch SRAMs, etc., and
handled the complicated work of setting up function parameter blocks,
kicking the FPGAs to configure and execute, extracting results, and
cleaning up afterward. The compiler supported chaining execution of
multiple FPGA functions - as opposed to returning control to the
script after every function - and automatically forwarded result to
satisfy data dependencies between parameter blocks within the chain.

Each parameter block included two identifiers that indicated which
configuration binary to use and which function in that binary to
execute. Kicking the FPGA to execute caused a hardware monitor
(itself a tiny FPGA) to look at those identifiers and reconfigure the
compute FPGA if the binary identifier didn't match. Each FPGA had a
configuration cache holding (model dependent) 10 or more binaries. A
typical binary implemented ~10..15 functions and could be loaded from
the cache in less than 2ms.
[At that time there was no partial reconfiguration, and for a part
having 600..800 CLBs, full reconfiguration in under 2ms was quite
fast.]

A service on the host cooperated with the board runtime to partition
the available FPGA resources (if desired), download user scripts and
configuration binaries, and implemented a bi-directional messaging
service to allow host programs to interact with running scripts. With
multiple FPGAs installed on the board, multiple scripts could be run
simultaneously [hence the 'compute server' aspect].


The company had a number of industrial QA/QC products and was looking
to improve on performance versus affordable SIMD processors. As a
proof of concept we implemented an expensive image inspection normally
performed using optimized SSE2 on a 1GHZ Pentium-4. Excluding image
capture, the image processing on the Pentium took ~600ms. Our
prototype board - using a single Xilinx Virtex-100 - performed the
same processing in just ~20ms including 2 reconfigurations.

Unfortunately, it never turned into a commercial product ... CPU
speeds were rapidly improving and it was thought that the proprietary
accelerator - even if it could be sold as a separate product - would
not remain cost effective for long. But for a couple of years it was
fun to work on (and with).

YMMV,
George

Terje Mathisen

unread,
Mar 12, 2022, 7:02:13 AM3/12/22
to
To Tim and everyone else who contributed to this exchange on language,
Thank you!

I did learn something, what I'm most happy about is that I seem to be
getting sufficiently fluid that I subconsciously choose exactly how to
express myself, without thinking about all the details. :-)

Terje
PS. Just getting rid of Covid here, something like 20% of the Norwegian
population seems to be going through it more or less at the same time,
thankfully with very little serious consequences. It seems like we are
getting out of these two years (I sent everyone home from the office in
Oslo exactly two years ago today) with something like 25-30% reduction
in death rate because all the covid restrictions have removed two full
flu seasons.

Terje Mathisen

unread,
Mar 12, 2022, 7:10:01 AM3/12/22
to
We discussed several of these approaches at the time, here on c.arch;
every single time the main problem was effectively Moore's Law: Each
time somebody had developed another FPGA accelerator board for function
X, I (or anyone else like me) could take the same algorithm, more or
less, turn it into hand-optimized x86 asm, and either immediately or
within two years, that SW would be just as fast as the FPGA and far cheaper.

Programming-wise it sounds sort of similar to the Cell Broadband Engine
of the Playstation: Could deliver excellent results but required
hardcore programming support from a quite small pool of
available/competent programmers.

Terje

David Brown

unread,
Mar 12, 2022, 9:37:28 AM3/12/22
to
In Norwegian, "jeg skal" (the same root as "I shall") is the future
tense - "I shall" or "I am going to". "Jeg vil" (the same root as "I
will") is "I want" or "I want to".

> But then, I've sometimes been told I have a somewhat antiquated style :-)
>
>

I think the change to language is inevitable - but I don't always think
it makes the language better. I agree with your usage in these
examples, and follow them too. However, you have to be careful in
expecting others to understand the distinction you make.

There was a time when you could separate "educated" people from
"uneducated" people. The educated class had spent long hours in the
classroom studying grammar and language, and Latin (several of the rules
of English grammar were imported directly from Latin in perhaps the 18th
or 19th centuries). A lot of technical or scientific writing was only
ever of interest within this group, and you could rely on a higher
minimum level of language.

Now, I would say, we have a more level field. People mix across a wider
range of backgrounds. There is more communication (and movement)
between countries, with people living and working in their second or
third language. In most aspects, this is a good thing IMHO. But it
does mean that some of the subtler aspects of language are lost.

David Brown

unread,
Mar 12, 2022, 9:43:01 AM3/12/22
to
On 11/03/2022 13:38, Michael S wrote:
> On Friday, March 11, 2022 at 2:27:57 PM UTC+2, Michael S wrote:
>> Not being an expert of English English or of American English or of any other English variation...
>> In this particular case I'd use 'which' rather than 'that'. Because I don't like a look of the same word 'that' appearing twice just two words apart. In different meanings, which probably makes things worse.
>
> Thinking about it, not "probably". different meanings certainly make things worse.
> Repeated (recurring?) 'that' in the same meaning, like that, is o.k with me:
>
Here's a grammar challenge for you regarding repeated words - can you
punctuate the following?

Peter where John had had had had had had had had had had had the
examiners approval


(Hint - it's not just one sentence.)

Stephen Fuld

unread,
Mar 12, 2022, 11:11:48 AM3/12/22
to
On 3/12/2022 4:09 AM, Terje Mathisen wrote:
> George Neuner wrote:
>> On Wed, 9 Mar 2022 11:36:53 -0000 (UTC), Thomas Koenig
>> <tko...@netcologne.de> wrote:
>>
>>> If you want, you can have a softcore for your CPU and define special
>>> instructions for your special needs.  I don't think it is easy
>>> to modify the FPGA programming on the fly.
>>
>> With the right architecture, it can be relatively easy.
>>

snip

>> Unfortunately, it never turned into a commercial product ... CPU
>> speeds were rapidly improving and it was thought that the proprietary
>> accelerator - even if it could be sold as a separate product - would
>> not remain cost effective for long.  But for a couple of years it was
>> fun to work on (and with).
>
> We discussed several of these approaches at the time, here on c.arch,
> every single time the main problem was effectively Moore's Law: Each
> time somebody had developed another FPGA accelerator board for function
> X, I (or anyone else like me) could take the same algorithm, more or
> less, turn it into hand-optimized x86 asm, and either immediately or
> within two years, that SW would be just as fast as the FPGA and far
> cheaper.

Agreed. However, with the current situation of CPU not getting (much)
faster, and, of course larger capacity FPGAs, I wonder if that is still
true. Or does using graphics chips for this provide similar functionality?

> Programming-wise it sounds sort of similar to the Cell Broadband Engine
> of the Playstation: Could deliver excellent results but required
> hardcore programming support from a quite small pool of
> available/competent programmers.

Yes. But perhaps it could be that this pool of programmers provides a
"library" of low-level primitives that mere mortals can use fairly easily.

I don't know if any of this is real - just wondering.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

unread,
Mar 12, 2022, 12:33:25 PM3/12/22
to
On 3/12/2022 6:02 AM, Terje Mathisen wrote:
> To Tim and everyone else who contributed to this exchange on language,
> Thank you!
>
> I did learn something, what I'm most happy about is that I seem to be
> getting sufficiently fluid that I subconsciously choose exactly how to
> express myself, without thinking about all the details. :-)
>
> Terje
> PS. Just getting rid of Covid here, something like 20% of the Norwegian
> population seems to be going through it more or less at the same time,
> thankfully with very little serious consequences. It seems like we are
> getting out of these two years (I sent everyone home from the office in
> Oslo exactly two years ago today) with something like 25-30% reduction
> in death rate because all the covid restrictions have removed two full
> flu seasons.
>

Where I am living, even asking people to wear masks in public is
apparently asking too much, so most have not been bothering. It is a
little awkward when one is part of the minority who does actually bother
with this...


...

Well, there is this, and also the annoyance of being surrounded by YECs
(Young Earth Creationists), but not actually believing in this (most
evidence is against it; even linguistic and historical evidence is
against it; like some sort of bizarre consequence of word choice in the
KJV or something...).

Or, like the awkwardness of being generally leaning towards being
theistic, but not wanting to take a strong stance on anything outside
what can be observed and measured, but kinda leaning against
interpretations built strongly on supernatural events (say, we assume
people wrote what they believed happened, but stories can be
embellished, and events misinterpreted).

The question of religious identity becomes a bit fuzzy though when one
allows for a "well, it is what the people who wrote it believed
happened" stance, rather than a "it actually happened this way, and the
various creeds/etc are describing actual events". Will not say it didn't
happen though, as I don't take a "supernatural events don't happen"
stance either.

Not much evidence to support the notion that direct supernatural
intervention into normal peoples' lives is really a thing though, seems
like more of a "rare thing for special or historically significant
events" thing. A lot of what people may experience (as supernatural
events) may be more likely due to psychological or neurological
explanations or similar (just because one may experience something
doesn't mean it is real).

Though, the standard for evidence is particularly low in these circles
(and then there are a lot of people who take the "I believe it so that
makes it real" interpretation), ...

...

Anton Ertl

unread,
Mar 12, 2022, 12:38:27 PM3/12/22
to
Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> However, with the current situation of CPU not getting (much)
>faster, and, of course larger capacity FPGAs, I wonder if that is still
>true.

What is still true is that softcores are still a factor of 10 (typical
numbers I remember are 250MHz for fast-clocked softcores) slower than
custom-designed silicon CPUs, and I guess the same is true for other
logic functions implemented in FPGAs. So the niche for FPGA stuff
seems quite small to me: Functions that are not implemented in custom
silicon, take many steps in software, yet can be implemented with few
steps in hardware (and FPGAs); i.e., something like new crypto
algorithms that have not found their way into hardware accelerators
yet.

At least as far as performance is relevant; a larger niche for FPGAs
is helper chips for boards that do not have enough volume for
full-custom chips.

At least that's how the situation looks to me.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

unread,
Mar 12, 2022, 12:47:54 PM3/12/22
to
Possibly similar in my case, though as can be noted, I am autistic, so
my interpretation and usage of language semantics may be potentially
non-standard even in areas where there would seem to be agreement.

Also, half wonders if that example were intended to be a "Fiddler on the
Roof" reference, or if this merely happened by chance...

BGB

unread,
Mar 12, 2022, 1:17:24 PM3/12/22
to
On 3/12/2022 11:25 AM, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>> However, with the current situation of CPU not getting (much)
>> faster, and, of course larger capacity FPGAs, I wonder if that is still
>> true.
>
> What is still true is that softcores are still a factor of 10 (typical
> numbers I remember are 250MHz for fast-clocked softcores) slower than
> custom-designed silicon CPUs, and I guess the same is true for other
> logic functions implemented in FPGAs. So the niche for FPGA stuff
> seems quite small to me: Functions that are not implemented in custom
> silicon, take many steps in software, yet can be implemented with few
> steps in hardware (and FPGAs); i.e., something like new crypto
> algorithms that have not found their way into hardware accelerators
> yet.
>

Also, 250 MHz would require either a high end FPGA or a very small core
(probably 16-bit).

It seems likely though, that a hybrid could be made, where FPGA logic
was glued on at the level of customizable CPU instructions, without the
CPU itself being implemented via the FPGA. In effect treating it like a
user-customizable coprocessor.


If this were limited to small enough "granules", it is possible the FPGA
logic could be run at or near the CPU's native clock speed, or with some
downsampling (say, 4x or 8x). So each instruction implemented via FPGA
logic would be significantly slower than if it were native silicon, but
still faster than if it were expressed as a series of normal CPU instructions.
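
To make the programming model a bit more concrete, here is a minimal C
sketch of how such an FPGA-mapped custom instruction might be exposed to
software. The "crc32.fpga" mnemonic and the HAVE_FPGA_CRC32 feature macro
are entirely hypothetical; by default the code compiles and runs using
only the plain-C fallback, so it is a sketch of the dispatch pattern, not
of any real extension.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical FPGA-mapped custom instruction: one byte step of a CRC-32
 * update.  On hardware with such an extension the #ifdef branch would
 * emit the custom opcode; here it is only a placeholder, so the portable
 * fallback below is what actually runs. */
static inline uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
#ifdef HAVE_FPGA_CRC32            /* hypothetical feature macro */
    uint32_t out;
    __asm__("crc32.fpga %0, %1, %2" : "=r"(out) : "r"(crc), "r"(byte));
    return out;
#else
    /* Plain-C fallback: bitwise CRC-32 (reflected, poly 0xEDB88320). */
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    return crc;
#endif
}

int main(void)
{
    const uint8_t msg[] = "123456789";
    uint32_t crc = 0xFFFFFFFFu;
    for (int i = 0; i < 9; i++)
        crc = crc32_step(crc, msg[i]);
    printf("crc32 = %08x\n", crc ^ 0xFFFFFFFFu);   /* standard check value cbf43926 */
    return 0;
}

Whether dispatching such small granules at a 4x or 8x clock downsample
actually beats a tuned software loop would, of course, depend on how
tightly the fabric can be coupled to the pipeline.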


> At least as far as performance is relevant; a larger niche for FPGAs
> is helper chips for boards that do not have enough volume for
> full-custom chips.
>
> At least that's how the situation looks to me.
>

Could be.

At present, they seem to come in one of several form factors:
Standalone boards, programmable via USB or similar, various connectors
(commonly "PMOD");
Boards that go inside a PC, typically via PCI-e, with little or no
external connectivity (and typically very expensive);
Boards that go inside a PC in an M.2 slot or similar.


The PC-internal boards seem more relevant for "accelerators", whereas the
standalone boards are more relevant for those of us wanting to experiment
with making custom CPU SoCs, or for embedded control applications.

While an FPGA is more expensive than a normal microcontroller, it can do
some things a microcontroller can't (e.g., I recently saw where someone
used an FPGA to implement a custom digital oscilloscope + logic analyzer).

...

JimBrakefield

unread,
Mar 12, 2022, 2:12:24 PM3/12/22
to
If power consumption is the metric, FPGAs have several advantages:

1) Power consumption is proportional to computation as unused FPGA fabric has low power consumption. FPGAs are designed to have low static power?
2) FPGAs tend to require less memory traffic. Memory data flow is more efficient, e.g. fewer loads and stores required. And memories are scaled in size to their usage.
3) Much computation is done outside of hard or soft cores and requires no instruction pipeline. In many cases a simple state machine suffices.
4) Moore's law still applies to FPGAs. E.g., they continue to get faster and more power efficient.
5) Single chip hybrid hard core(s) + FPGA fabric are readily available in many different combinations.
6) There is little internal fragmentation. Data sizes (and FPGA resources) are set to data requirements, not the register size.

Michael S

unread,
Mar 12, 2022, 2:33:51 PM3/12/22
to
I am not going to try.
The correct use or even adequate understanding of simple "had had" is already above my English language skills; "had had had" - more so.

Michael S

unread,
Mar 12, 2022, 2:57:40 PM3/12/22
to
On Saturday, March 12, 2022 at 7:38:27 PM UTC+2, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> > However, with the current situation of CPU not getting (much)
> >faster, and, of course larger capacity FPGAs, I wonder if that is still
> >true.
> What is still true is that softcores are still a factor of 10 (typical
> numbers I remember are 250MHz for fast-clocked softcores) slower than
> custom-designed silicon CPUs,

You can push Nios2f to 300MHz on an industrial-grade Stratix-10 FPGA.
Probably 10% faster on a top commercial-grade Stratix-10 FPGA, esp. smaller models.
But that will only work with your core running either from relatively small tightly-coupled memories (something like 128KB for both code and data)
or from even smaller instruction and data caches.
Also, don't forget that apart from the 10x slowdown factor in frequency, relative to the best "hard" CPUs there is at least a 4x slowdown in IPC. Maybe more than that, I didn't measure. Besides, the exact factor strongly depends on the application. Maybe the simplest methodology that would not be totally misleading is to consider Nios2f IPC to be 80% of an Arm Cortex-A35.
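
Putting the two factors together as a back-of-envelope figure (the ratios
below are just the ones quoted in this exchange, not measurements):

#include <stdio.h>

int main(void)
{
    /* Rough softcore-vs-hard-CPU estimate using the ratios above:
     * ~10x clock deficit and >=4x IPC deficit (both approximate). */
    double freq_ratio = 10.0;   /* e.g. ~3 GHz hard CPU vs ~300 MHz Nios2f */
    double ipc_ratio  = 4.0;    /* at least 4x lower IPC, per the above   */

    printf("combined slowdown ~ %.0fx\n", freq_ratio * ipc_ratio);  /* ~40x */
    return 0;
}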

> and I guess the same is true for other
> logic functions implemented in FPGAs.

Not necessarily. I can think of many functions that can be usefully run (on the same industrial-grade Stratix-10) at above 400MHz.
Simpler ones at 450 MHz.

> So the niche for FPGA stuff
> seems quite small to me: Functions that are not implemented in custom
> silicon, take many steps in software, yet can be implemented with few
> steps in hardware (and FPGAs); i.e., something like new crypto
> algorithms that have not found their way into hardware accelerators
> yet.
>
> At least as far as performance is relevant; a larger niche for FPGAs
> is helper chips for boards that do not have enough volume for
> full-custom chips.
>
> At least that's how the situation looks to me.
>

That sounds about right.
FPGAs are *extremely* useful embedded devices that make the majority of today's "long tail"
embedded computing industry possible at all. But FPGA-based "general-purpose" compute accelerators
are 99.9% hype and 0.1% substance. That's being generous.

Michael S

unread,
Mar 12, 2022, 3:07:10 PM3/12/22
to
It depends on what you consider low.
Certainly, no FPGA is going to consume 1mW when idle.
Big FPGA is not going to consume 100mW either. But 1-2 watts are possible, except in certain unfortunate families, like Arria-2, luckily not likely to be used for new projects.

> 2) FPGAs tend to require less memory traffic. Memory data flow is more efficient, e.g. fewer loads and stores required. And memories are scaled in size to their usage.
> 3) Much computation is done outside of hard or soft cores and requires no instruction pipeline. In many cases a simple state machine suffices.
> 4) Moore's law still applies to FPGAs. E.g., they continue to get faster and more power efficient.

More efficient - yes.
Bigger - yes. But the biggest FPGA offerings are already MCM (tiles) under the hood.
Faster - I am not sure. Quite possibly we will not see anything faster than Stratix-10 (available for ~4 years) for quite some time.
I mean, from the Big 2. I don't follow Achronix and similar exotic players.

MitchAlsup

unread,
Mar 12, 2022, 5:10:06 PM3/12/22
to
On Saturday, March 12, 2022 at 11:38:27 AM UTC-6, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> > However, with the current situation of CPU not getting (much)
> >faster, and, of course larger capacity FPGAs, I wonder if that is still
> >true.
<
> What is still true is that softcores are still a factor of 10 (typical
> numbers I remember are 250MHz for fast-clocked softcores) slower than
> custom-designed silicon CPUs, and I guess the same is true for other
> logic functions implemented in FPGAs. So the niche for FPGA stuff
> seems quite small to me: Functions that are not implemented in custom
> silicon, take many steps in software, yet can be implemented with few
> steps in hardware (and FPGAs); i.e., something like new crypto
> algorithms that have not found their way into hardware accelerators
> yet.
<
A couple of points:
a) Anything programmable is going to be no faster than DIV-5 and more likely
to be DIV-20 (CPUs are 5GHz, FPGA is 250 MHz).
b) You are not going to get "your function unit" inside of a CPU, you are going
to have to put it outside of the CPU and likely outside of the L2 cache.
<
This has the consequences:
1) that if you are going to have such a function unit, and it is going to perform,
then you are going to have to feed it 1,000 units of work at a minimum;
2) you will need to give it more than 1 microsecond to do its job.
<
A difficult programming model for the synchronous von Neumann paradigm.
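
A back-of-envelope illustration of where those numbers come from (the
per-item times below are illustrative assumptions, not measurements):
with a ~1 microsecond round trip to a unit sitting outside the L2, the
offload only wins once the per-item savings have amortized that overhead.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: fixed cost of reaching a function unit
     * outside the L2 (round trip), and per-item costs on the CPU versus
     * on the programmable unit. */
    double overhead_ns = 1000.0;  /* ~1 us to get to/from the unit        */
    double cpu_ns      = 2.0;     /* assumed ns per item done on the CPU  */
    double acc_ns      = 1.0;     /* assumed ns per item done on the unit */

    /* Break-even N where N*cpu_ns >= overhead_ns + N*acc_ns */
    double n = overhead_ns / (cpu_ns - acc_ns);
    printf("break-even batch size ~ %.0f items\n", n);   /* ~1000 here */
    return 0;
}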

Ivan Godard

unread,
Mar 12, 2022, 6:47:18 PM3/12/22
to
Rather than memorialize MSG, why not supply hardware continuations
(which need much the same stuff I expect) instead?

MitchAlsup

unread,
Mar 12, 2022, 7:02:10 PM3/12/22
to
I did not realize I was memorializing MSG.
I don't see how you got that from what I wrote.
<
But assume I was in some remote sense::
<
In essence MSG passes ½ cache lines to target (the arguments)
Continuations pass 4 cache lines to target (an entire register set)
<
MSG recipient uses 80% of his own registers (4 arguments + 16 his).
Continuation recipient uses few of his own registers.
<
So, I don't see how the choices are remotely similar unless you are looking from
far enough away that you can't see the overhead from <ahem> the overhead.
<
The point I was making is that you are not going to get a function unit of your
design close enough to the CPU of their design to make use of a 1-command-
at-a-time interface philosophy (co-processor). It is more array-processor-like,
or GPU-like, than a function unit embedded in a CPU.

JimBrakefield

unread,
Mar 12, 2022, 7:47:33 PM3/12/22
to
I'm showing Stratix-10 at the "IA" 14nm node and Versal at the "AX" 7nm node. Going to smaller process nodes should reduce wire delay if nothing else. Further process nodes will take place slowly?
|>Big FPGA is not going to consume 100mW either.
Agreed, compared to 10 watts of an idle x86-64 however?
Of course if the ASIC is ten times faster, guess the power contest is ~tied.
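
As a rough energy version of the "~tied" claim (all four numbers below
are only assumptions for illustration): energy per job is power times
time, so a part that burns more watts but finishes proportionally sooner
lands in the same ballpark.

#include <stdio.h>

int main(void)
{
    /* Illustrative only: energy per job = power * time. */
    double fpga_watts = 2.0,  fpga_seconds = 10.0;  /* slower, lower power    */
    double asic_watts = 20.0, asic_seconds = 1.0;   /* 10x faster, 10x power  */

    printf("FPGA: %.0f J per job\n", fpga_watts * fpga_seconds);
    printf("ASIC: %.0f J per job\n", asic_watts * asic_seconds);
    return 0;
}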

Quadibloc

unread,
Mar 12, 2022, 10:42:50 PM3/12/22
to
On Thursday, March 10, 2022 at 9:55:16 AM UTC-7, Bill Findlay wrote:

> You have no need of an excuse, because your original wording was correct,
> and is exactly how most British English users would put it.
>
> Rentsch is wrong to think that using 'that' means anything different.
> His claim comes straight out of Strunk & White's usage guide,
> which has been debunked by English grammar experts, such as Geoff Pullum.
>
> I personally prefer 'that' to 'which' in restrictive clauses,
> but that is a matter of aesthetics, not grammatical correctness.

I certainly was unaware of any difference between "that" and "which"
in terms of being restrictive. I was not aware that this myth even existed.
Of course, if it existed in _Fowler_ and not just _Strunk & White_, then I
might have to reconsider my position.

John Savard

Quadibloc

unread,
Mar 12, 2022, 10:52:44 PM3/12/22
to
On Friday, March 11, 2022 at 4:30:02 AM UTC-7, Terje Mathisen wrote:

> Looking at it now, to me using "anything which can" vs "anything that
> can" does have those two different meanings: The first one is in fact
> more inclusive than the second, i.e. "which" also includes stuff that
> only incidentally or as a side effect cause this, while "that" implies
> that it is more of a primary result.

To determine whether Strunk & White have _anything_ to back them up,
we need to ask some basic questions. What is "which", and what is "that"?

Which is the neuter counterpart of who.

A person who can read English. A computer which comes with a FORTRAN
compiler.

But we also say:

Who saw what happened? What computers are still available?

So in some cases, the neuter counterpart of who is _what_ instead of which.

But you don't use "what" where you would use "which", except if one's command
of grammar is weak:

My grandpappy had a rifle whut could shoot the paint off'n a barn door.

The most familiar use of "that" is as a pronoun, similar to "this" - _or_ as an article.

I was looking for this. Will that arrive on time?
I was looking for this book. Will that parcel arrive on time?

One can use that instead of "who" or "which":

A person that can read English. A computer that comes with a FORTRAN compiler.

A computer either comes with a FORTRAN compiler, or it doesn't. A person is either
able to read English or not.

So it's very difficult for me to see how "A computer that comes with a FORTRAN compiler"
and "A computer which comes with a FORTRAN compiler" can fail to describe the same
set of objects. In both cases, only computers can be members of the set, and only computers
in possession of a FORTRAN compiler.

John Savard

Quadibloc

unread,
Mar 12, 2022, 10:55:14 PM3/12/22
to
On Friday, March 11, 2022 at 11:37:25 AM UTC-7, Ivan Godard wrote:

> Even long-standing language notion can die. I use both the conditional -
> "if I were a rich man..." instead of "if I was a rich man...", and
> distinguish "shall" from "will" - simple future instead of intentional.
> Both the conditional and and use of shall have largely disappeared in my
> lifetime.

Of course, "Were I a rich man..." is the subjunctive.

And of course "If I were a rich man..." got a new lease on life from Fiddler
on the Roof.

John Savard

Quadibloc

unread,
Mar 12, 2022, 11:23:38 PM3/12/22
to
Having looked in Fowler, I see that there _is_ a difference between "that" and
"which", and this actual difference may be the source of the confusion in
Strunk & White.

It is preferable to use "which" instead of "that" in a sentence such as the
following:

I always buy the genuine IBM PC, which comes with a shorter version of
the advanced BASIC interpreter that makes use of the BASIC in ROM which
those computers have, instead of a clone.

The first "which" immediately following PC, according to Fowler, should
not be replaced with "that", since coming with BASICA instead of GW-BASIC
is true of _all_ genuine IBM PCs, and thus the following clause is not intended
to impose a constraint on "the genuine IBM PC", but is descriptive of a characteristic
they all have.

So in this sense, 'that' is restrictive and 'which' is non-restrictive.

Or perhaps it is more accurate to say that "which" may be used in either
a restrictive or non-restrictive manner, but "that" should only be used in
a restrictive context.

Actually, though, despite Fowler perhaps being in agreement with Strunk & White,
I have to admit I wouldn't have seen anything odd if "that" were used instead
in that position.

John Savard

Quadibloc

unread,
Mar 12, 2022, 11:30:44 PM3/12/22
to
On Saturday, March 12, 2022 at 9:23:38 PM UTC-7, Quadibloc wrote:

> Actually, though, despite Fowler perhaps being in agreement with Strunk & White,
> I have to admit I wouldn't have seen anything odd if "that" were used instead
> in that position.

And _since_ I think of "which" as a neuter counterpart to some of the uses
of "who", I can see why I'm not in company with Fowler _as well as_ Strunk
& White.

Obviously, one uses "who" in both the restrictive and non-restrictive cases.

An author who rejoices in the surname of Smith...
A black author, who has first-hand experience of racism...

The first phrase is _clearly_ restrictive, whereas the second phrase is
likely at least intended to be thought of as non-restrictive (the author
has experienced racism personally *because he is black*, rather than
simply being a black author who happens to have also personally
experienced racism).

One wouldn't think of using the neuter-appearing "that" in order
to indicate that the first phrase is restrictive.

John Savard

Quadibloc

unread,
Mar 12, 2022, 11:37:08 PM3/12/22
to
On Saturday, March 12, 2022 at 9:30:44 PM UTC-7, Quadibloc wrote:

Come to think of it, it could be argued that

> An author who rejoices in the surname of Smith...

is non-restrictive, since "an author" is _singular_, and thus we might have
one particular author in mind, such as, say, E. E. "Doc" Smith.

So instead an example of a restrictive phrase would be

Those authors who rejoice in the surname of Smith...

John Savard

Ivan Godard

unread,
Mar 12, 2022, 11:59:40 PM3/12/22
to
There seems to have been a difference of opinion among linguists
regarding the conditional. During earlier centuries there was an
academic push to force-fit English into the grammatical structure of
Latin which (apropos this topic) had a subjunctive. Later, as more
far-flung languages were recognized (and not immediately consigned to
the heap of barbarous (look up the derivation of "barbarian" someday for
a giggle) mumblings unfit for English tongues), it was realized that
English was rooted in the Germanic branch of PIE and didn't have Latin
grammar despite the best efforts of Public Schools.

That second look has led to abandoning many strictures - about splitting
infinitives for example - that were artificial Latinizations, among
which was calling the conditional irrealis a subjunctive. However, many
writers (and textbooks) still bandy "subjunctive" about, frequently with
wagging fingers.

For far more than you really want see
https://en.wikipedia.org/wiki/English_subjunctive, especially the
section "Variant terminology and misconceptions".

Tom Gardner

unread,
Mar 13, 2022, 6:08:26 AM3/13/22
to
These are one sentence, apparently :) I've seen code that is as
difficult to parse and understand as these :(

"In Middle English, there was no difference between & and and, and
and and &, and this situation persists in the present day."

"Wouldn't the sentence 'I want to put two hyphens between the words
Pig and And, and And and Whistle in my Pig-And-Whistle sign' have
been clearer if quotation marks had been placed before Pig, and
between Pig and and, and and and And, and And and and, and and and
And, and And and and, and and and Whistle, and after Whistle?"

Thomas Koenig

unread,
Mar 13, 2022, 6:10:32 AM3/13/22
to
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> George Neuner wrote:

[...]

>> Unfortunately, it never turned into a commercial product ... CPU
>> speeds were rapidly improving and it was thought that the proprietary
>> accelerator - even if it could be sold as a separate product - would
>> not remain cost effective for long. But for a couple of years it was
>> fun to work on (and with).
>
> We discussed several of these approaches at the time, here on c.arch,
> every single time the main problem was effectively Moore's Law: Each
> time somebody had developed another FPGA accelerator board for function
> X, I (or anyone else like me) could take the same algorithm, more or
> less, turn it into hand-optimized x86 asm, and either immediately or
> within two years, that SW would be just as fast as the FPGA and far cheaper.
>
> Programming-wise it sounds sort of similar to the Cell Broadband Engine
> of the Playstation: Could deliver excellent results but required
> hardcore programming support from a quite small pool of
> available/competent programmers.

It might also be argued that programming hand-optimized x86 assembly
also needs a special kind of programmer :-)

I know of at least one case where an intern wrote a time-critical
piece of software (a Hough transform for recognizing straight lines)
in AVX2 assembler, leading to a significant speedup.
He was a bit miffed when the company didn't use the code, but since
he was the only person in the company who could have maintained that
piece of code, it's understandable - it is the nature of interns to
leave :-)
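
For anyone who hasn't seen one, here is roughly what the scalar version of
such a Hough transform looks like in plain C (a generic textbook sketch,
not the intern's code): the per-edge-pixel loop over angles is exactly the
kind of inner loop that maps nicely onto AVX2.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Generic (rho, theta) Hough accumulator for straight lines.
 * edge[] is a w*h binary edge map; acc[] receives n_theta * n_rho bins. */
static void hough_lines(const uint8_t *edge, int w, int h,
                        uint32_t *acc, int n_theta, int n_rho)
{
    double diag = sqrt((double)w * w + (double)h * h);
    double *sin_t = malloc(n_theta * sizeof *sin_t);
    double *cos_t = malloc(n_theta * sizeof *cos_t);

    for (int t = 0; t < n_theta; t++) {          /* precompute angle tables */
        double theta = t * 3.14159265358979323846 / n_theta;
        sin_t[t] = sin(theta);
        cos_t[t] = cos(theta);
    }

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (!edge[y * w + x])
                continue;
            /* This inner loop (a multiply-add plus a scaled bin index per
             * angle) is the part worth hand-vectorizing. */
            for (int t = 0; t < n_theta; t++) {
                double rho = x * cos_t[t] + y * sin_t[t];
                int r = (int)((rho + diag) * (n_rho - 1) / (2.0 * diag));
                acc[t * n_rho + r]++;
            }
        }

    free(sin_t);
    free(cos_t);
}

int main(void)
{
    enum { W = 64, H = 64, NT = 180, NR = 128 };
    static uint8_t edge[W * H];
    static uint32_t acc[NT * NR];

    for (int i = 0; i < W && i < H; i++)   /* toy input: one diagonal line */
        edge[i * W + i] = 1;

    hough_lines(edge, W, H, acc, NT, NR);

    uint32_t best = 0, best_i = 0;
    for (uint32_t i = 0; i < NT * NR; i++)
        if (acc[i] > best) { best = acc[i]; best_i = i; }
    printf("strongest bin: theta index %u, rho index %u, votes %u\n",
           best_i / NR, best_i % NR, best);
    return 0;
}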