On Thursday, February 2, 2023 at 11:56:55 AM UTC-6, Stephen Fuld wrote:
> On 2/2/2023 9:26 AM, Thomas Koenig wrote:
> > Thomas Koenig <tko...@netcologne.de> schrieb:
> >
> >> Also, VVM faithfully reproduces a special ordering at the loop,
> >> the one that it was written in. This is obviously ideal for
> >> languages which write out explicit loops, like C, but does not
> >> offer improvement for languages (like Fortran :-) which operate
> >> more on vectors or arrays.
> >
> > Actually, that should read
> >
> > "does not offer additional improvement". Having VVM for Fortran would
> > already speed up a lot of code.
<
> I get what you are saying, but I am not sure it is true. Specifically,
> since the CPU can do only one add at a time (due to the dependency),
> it doesn't offer the advantage of operating on multiple elements
> simultaneously, as it does on loops without loop-carried dependencies.
<
If we are talking about the following loop (snatched from above)::
<
void add (int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
<
The only thing preventing 2-4-8 adds from being performed at the same
time is whether (or not) a[] actually aliases b[] or c[]. If/when there is no
aliasing, this can be performed as wide as the implementation has
resources. If aliasing is actually taking place, then one is limited by
the actual data-flow.
<
Note that in the case above::
<
add( a, a, &a[37], n );
<
does not perform any ACTUAL aliasing up to execution widths of
32 per cycle (8 per cycle FP due to latency).
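<
Whether the wide execution is legal thus reduces to a dependence-distance
check between the store stream and each load stream. A minimal sketch of
that check in C (the function name is made up, and the pointer subtraction
is only defined when both pointers land in the same array, as in the
add(a, a, &a[37]) call above)::
<
```c
#include <stddef.h>

/* Max lanes of dst[i] = f(src[i]) that may run at once. If dst is d
   elements ahead of src (0 < d < n), lane i reads a value written d
   iterations earlier, so at most d lanes can execute together; any
   other layout leaves only the machine's width as the limit.
   NOTE: dst - src is only defined within a single array. */
int lane_limit(const int *dst, const int *src, int n, int machine_width)
{
    ptrdiff_t d = dst - src;
    if (d <= 0 || d >= (ptrdiff_t)n)   /* no flow dependence in window */
        return machine_width;
    return d < (ptrdiff_t)machine_width ? (int)d : machine_width;
}
```
<
For add(a, a, &a[37]) the destination trails both sources, so the check
permits the full machine width -- consistent with the claim that widths
up to 32 see no actual aliasing.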
>
> But it may be, and Mitch is the expert here, that the other advantages
> of VVM, specifically, the large "prefill" buffers (I'm sorry, I forget
> what they are officially called) offer an advantage over pure scalar code.
<
Consider that a processor is composed of many function units, each
having a characteristic latency:: integer is 1 cycle, LDs are 3, STs are 1
cycle, FP is 4 cycles. Now consider that there is 1 FU of each kind and
that the LOOP instruction executes in 1 cycle. Finally consider that the
underlying machine is 1-wide and in-order (classic 1st generation RISC).
Cache width is 128-bits--not 1st generation RISC, but suitable for mis-
aligned DoubleWord access at full speed.
>
Here, loops are bound by the number of times a FU is used within the
loop. DAXPY has 2 LDs, 1 ST, and 1 FMAC per iteration. LDs and STs
are performed 128 bits per access, or 2 iterations per access.
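<
For reference, DAXPY in C looks like the following (a standard
formulation, not lifted from this thread); the 2 LDs, 1 FMAC, and 1 ST
can be counted directly in the loop body::
<
```c
/* y[i] = a*x[i] + y[i] -- per iteration: load x[i] and y[i] (2 LDs),
   one fused multiply-add (FMAC), store y[i] (1 ST), plus the
   ADD-CMP-BLT of loop control */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```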
<
On my 1-wide in-order machine, we are reading 128 bits per access for the
2 LDs, and writing 128 bits for the ST, every 3 (2*) cycles, so with a
single FMAC unit we can perform 2 LDs, 1 FMAC, 1 ST, and the
equivalent of 1×(ADD-CMP-BLT) every 3 cycles (7 instructions in 3 cycles
= 2+1/3 IPC). 2.33 IPC is not bad for a 1-wide machine, being about
3.5× the performance of the scalar code all by itself.
<
(*) The nature of cache lines having 1 tag and 4 QuadWord columns
enables the 2 LDs and the ST all to be performed in 2 cycles, since
the cache line is 4 accesses wide and we need 3 tag accesses every
4 cycles. {This may require an additional QuadWord of buffering for
STs.} In order to do this, the column being stored cannot be a column
being read by either of the 2 LDs. 7 effective instructions in 2 cycles
is 3.5 IPC, and 5× what scalar code could perform.
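<
The instruction-per-cycle arithmetic above is easy to check
mechanically; a trivial sketch, with the counts taken from this post
and a made-up helper name::
<
```c
/* DAXPY body under VVM: 2 LDs + 1 FMAC + 1 ST plus the equivalent
   of ADD-CMP-BLT loop control = 7 instructions per iteration */
enum { DAXPY_INSTS = 2 + 1 + 1 + 3 };

/* IPC when one iteration's worth of work retires every 'cycles' cycles */
double vvm_ipc(int insts, int cycles)
{
    return (double)insts / (double)cycles;
}
```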
<
Also note: if the C (or Fortran) source has manually unrolled this loop 4
times, that just multiplies the number of instructions in the loop by
4 while leaving a single LOOP instruction for iteration control.
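<
A sketch of what that 4× manual unrolling looks like at the source
level (hypothetical code, with a scalar cleanup loop for the remainder;
the unrolled body would still sit between one loop top and one backward
branch)::
<
```c
/* DAXPY manually unrolled 4x: four times the instructions in the
   body, but still only one backward branch per 4 elements */
void daxpy4(int n, double a, const double *x, double *y)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        y[i]   = a * x[i]   + y[i];
        y[i+1] = a * x[i+1] + y[i+1];
        y[i+2] = a * x[i+2] + y[i+2];
        y[i+3] = a * x[i+3] + y[i+3];
    }
    for (; i < n; i++)              /* remainder */
        y[i] = a * x[i] + y[i];
}
```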
<
> BTW, whether you call what VVM does for this loop "vectorizing" is, I
> think, a matter of one's definition of vectorizing, specifically whether
> the definition must include operating on more than one element
> simultaneously or not.
>
In the same way a CRAY-like vector machine does not need fetch
bandwidth while its vector instructions are being performed, a VVM
implementation does not need the Fetch-Parse-Decode stages of
the pipeline to be active while the loop is being performed.