On Wednesday, February 16, 2022 at 11:25:10 AM UTC-7, MitchAlsup wrote:
> On Wednesday, February 16, 2022 at 4:02:07 AM UTC-6, Quadibloc wrote:
> > A Cray-like architecture makes the vectors _much_ bigger than even
> > in AVX-512, so in my naivete, I would have thought that this would
> > allow the constant changes to stop for a while.
> <
> Point of Order::
> CRAY vectors are processed in a pipeline, 1, 2, or 4 units of work per unit time.
> AVX vectors are processed en masse <however wide> per unit time.
> These are VASTLY different things.
Yes. However, a faster Cray-like machine can be implemented
with as many, or more, floating ALUs than an AVX-style vector unit.
So you could have, say, AVX-512 with eight 64-bit floats across, and
then switch to Cray in the next generation with sixteen ALUs, and then
stick with Cray with thirty-two ALUs in the generation after that.
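The distinction can be sketched in plain C: both routines below compute the same DAXPY, but the loop shapes mirror the two execution models (the 8 and 64 are the AVX-512-with-doubles and CRAY-1 register widths; the function names are made up for illustration).

```c
#include <stddef.h>

/* AVX-style: the ISA fixes an 8-double width, and one instruction
   applies to all 8 lanes at once (the inner loop is conceptually a
   single cycle of parallel hardware). */
void daxpy_simd8(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (size_t lane = 0; lane < 8; lane++)   /* lanes in parallel */
            y[i + lane] += a * x[i + lane];
    for (; i < n; i++)                            /* scalar remainder  */
        y[i] += a * x[i];
}

/* CRAY-style: the ISA fixes a 64-element vector *register*, but each
   vector instruction streams its elements through a pipeline at 1, 2,
   or 4 results per clock -- throughput is an implementation property,
   not a lane count baked into the code. */
void daxpy_cray64(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i += 64) {
        size_t vl = (n - i < 64) ? n - i : 64;    /* strip-mined strip */
        for (size_t e = 0; e < vl; e++)           /* elements pipelined */
            y[i + e] += a * x[i + e];
    }
}
```

Either way the results are identical; only the hardware cost of scaling the width differs.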
> > However, today's microprocessors *do* have L3 caches that are as
> > big as the memory of the original Cray I.
> But with considerably LOWER concurrency.
> A CRAY might have 64 memory banks (NEC up to 245 banks)
> .....Each bank might take 5-10 cycles to perform 1 request
> .....but there can be up to 64 requests being performed.
But if the cache is *on the same die*, having more wires connecting it
to the CPU isn't much of a problem?
> At best modern L3 can be doing 3:
> Receiving write data,
> Routing Data around the SRAM matrix,
> Sending out read data.
> <
> There is nothing fundamental about the difference, but L3 caches are
> not built to have the concurrency of CRAY's banked memory.
So we seem to be in agreement on this point.
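To put rough numbers on that concurrency gap, using the figures quoted above with an assumed 8-cycle bank-busy time:

```c
/* With 64 banks each busy ~8 cycles per request, sustained throughput
   is banks / busy-time, independent of any one bank's latency --
   versus roughly 3 concurrent activities in a typical L3. */
double sustained_requests_per_cycle(int banks, int bank_busy_cycles) {
    return (double)banks / bank_busy_cycles;
}
```

With those inputs the banked memory sustains 8 requests per cycle, which is the kind of concurrency an L3 doing "3 things at once" does not approach.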
> PCIe 6.0 uses a 16 GHz clock to send 4 bits per wire per cycle, using
> double data rate and PAM4 modulation, and achieves 64 GT/s per wire
> in each direction. So 4 pins: true/comp out, true/comp in: provide 8GB/s
> out and 8GB/s in.
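As a check, the arithmetic behind those figures (variable names are mine):

```c
/* PCIe 6.0 per-wire rate: 16 GHz clock x 2 (double data rate)
   x 2 bits per symbol (PAM4) = 64 Gb/s, i.e. 8 GB/s per direction. */
double pcie6_wire_gbps(void) {
    double clock_ghz = 16.0;
    int ddr = 2;              /* two transfers per clock */
    int pam4_bits = 2;        /* 4 levels encode 2 bits  */
    return clock_ghz * ddr * pam4_bits;   /* Gb/s on the wire */
}
```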
Of course, PCIe 6.0 is a complicated protocol, while interfaces
like DDR5 to DRAM are kept simple by comparison.
> But hey, if you want to provide 512 pins, I'm sure you can find some use
> for this kind of bandwidth. {but try dealing with the heat.}
Presumably chips implemented in the final evolution of 1nm or whatever
will run slightly cooler.
I had thought that the number of CPUs in the package was what governed
the heat, and that using more pins for data would not add too much. If that's
not true, then, yes, this would be *one* fatal objection to my concepts.
> More pins wiggling faster has always provided more bandwidth.
> Being able to absorb the latency has always been the problem.
> {That and paying for it: $$$ and heat}
Of course, the way to absorb latency is to do something else while
you're waiting. But then you need bandwidth for the first thing you
were doing, plus more bandwidth for the something else you're doing
to make the latency less relevant.
This sounds like a destructive paradox. But since latency is
fundamentally unfixable (until you make the transistors and
wires faster), while you can have more bandwidth if you pay for
it, the idea of providing the amount of bandwidth you needed in
the first place, times eight or so, almost makes sense.
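Little's law makes the trade concrete: the work you must keep in flight equals bandwidth times latency, so at fixed latency, every doubling of target bandwidth doubles the concurrency you have to find. A sketch with assumed round numbers:

```c
/* Little's law: bytes in flight = bandwidth x latency.
   At 64 GB/s (= 64 bytes/ns) and 100 ns memory latency, you must
   keep 6400 bytes -- one hundred 64-byte cache lines -- outstanding. */
double bytes_in_flight(double bw_gbytes_per_s, double latency_ns) {
    return bw_gbytes_per_s * latency_ns;   /* GB/s is the same as bytes/ns */
}
```

A hundred outstanding cache lines is why "just do something else while you wait" ends up being a bandwidth problem as much as a scheduling one.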
> > Exactly how a modified GPU design aimed at simulating a Cray
> > or multiple Crays in parallel working on different problems might
> > look is not clear to me, but I presume that if one can put a bunch
> > of ALUs on a chip, and one can organize that to look like a GPU
> > or like a Xeon Phi (but with RISC instead of x86), it could also be
> > organized to look like something in between adapted to a
> > Cray-like instruction set.
....and a Cray-like instruction set could be like a later generation of
the Cray, with more and longer vector registers; in other ways
it could move toward being more GPU-like if that was needed to fix
some flaws.
> > Since Crays used 64-element vector registers for code in loops
> > that handled vectors with more than 64 elements, that these loops
> > might well be... augmented... by means of something looking a
> > bit like your VVM is also not beyond the bounds of imagination.
> > (But if you're using something like VVM, why have vector instructions?
> > Reducing decoding overhead!)
> Exactly! Let each generation of HW give the maximum performance
> it can while the application code remains constant.
I'm glad you approve of something...
> Secondly: If you want wide vector performance, you need to be organized
> around ¼, ½, or 1 cache line per clock out of the cache and back into the cache.
> The width appropriate for one generation is not necessarily appropriate for
> the next--so don't expose width through ISA.
Of course, IBM's take on a Cray-like architecture avoided that pitfall by
excluding the vector length from the ISA spec, making it model-dependent,
so that's definitely possible.
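In C terms, the model-dependent approach looks like strip-mining by a width the code asks the machine for rather than one it hardcodes. Here `vec_len()` is a stand-in for whatever instruction reports the machine's section size, and the 16 is an arbitrary placeholder so the sketch runs:

```c
#include <stddef.h>

/* Stand-in for an instruction reporting this model's vector register
   length in elements; the constant is purely a placeholder. */
static size_t vec_len(void) { return 16; }

/* The binary never encodes a width: it strips by whatever length the
   current model reports, so the same code exploits wider future models. */
void scale_vla(size_t n, double a, double *x) {
    size_t vl = vec_len();
    for (size_t i = 0; i < n; i += vl) {
        size_t m = (n - i < vl) ? n - i : vl;
        for (size_t e = 0; e < m; e++)
            x[i + e] *= a;
    }
}
```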
> > Of course, though, my designs will have scalar floating-point
> > instructions, short vector instructions (sort of like AVX-256),
> > and long vector instructions (like a Cray)... because they're
> > intended to illustrate what an architecture burdened with a
> > rather large amount of legacy stuff carried over. But because it
> > was designed on a clean sheet of paper, it only gets one
> > kind of short vectors to support, rather than several like an x86.
> > And there would be a somewhat VVM-like set of
> > vector of vector wrapper instructions that could be wrapped
> > around *any* of them.
> Question: If you have VVM and VVM performs as well as CRAY
> vectors running Matrix300, why have the CRAY vector state
> or bloat your ISA with CRAY vector instructions?
Now, that's a very good question.
Possible answer 1:
This is only included in the ISA because the spec is meant to
illustrate possibilities, and would be omitted in any real-world
CPU.
Possible answer 2:
The idea is that VVM wrapped around scalar floating-point
instructions works well for vectors that are "this" long;
VVM wrapped around AVX-style vector instructions works for
vectors that are 4x longer, in proportion to the number of floats
in a single AVX vector...
VVM wrapped around Cray-style vector instructions is intended
for vectors that are 64x longer than VVM wrapped around scalar
instructions.
Assume VVM around scalar instructions handles vectors somewhat longer
than a Cray without VVM does. Then what we've got is a range of options,
each one adapted to how long your vectors happen to be. (And to
things like granularity, because using VVM around Cray for vectors
2x the Cray size presumably wouldn't be "bad" in terms of efficiency.)
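A toy way to state answer 2: each tier delivers a different number of doubles per wrapped instruction (1 for scalar, 4 for AVX-256, 64 for Cray), and you pick the widest tier your vector length can amortize. The cutoff rule below is invented purely to show the shape of the argument, not a real cost model:

```c
/* Doubles delivered per VVM-wrapped instruction in each proposed tier. */
enum { TIER_SCALAR = 1, TIER_AVX = 4, TIER_CRAY = 64 };

/* Toy selection rule: take the widest tier once the vector is at least
   two strips long, so a partial final strip is amortized. */
int pick_tier(long n) {
    if (n >= 2 * TIER_CRAY) return TIER_CRAY;
    if (n >= 2 * TIER_AVX)  return TIER_AVX;
    return TIER_SCALAR;
}
```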
Possible answer 3:
And I might fail to implement VVM well enough to avoid some
associated overhead.
John Savard